
Comments (4)

rasa commented on May 20, 2024 4

@adoyle-h @oschwald @polarathene
Thank you for the detailed error report. It made it extremely easy to craft the fix. PR coming soon!

from editorconfig-checker.

oschwald commented on May 20, 2024 4

Thank you for fixing this!


polarathene commented on May 20, 2024 2

It seems that sub as the entire file content is sufficient to reproduce this.


Reproduction

Reproduction via the Docker image:

$ docker run --rm --tty \
    --volume "${PWD}:/ci:ro" \
    --workdir "/ci" \
    'mstruebing/editorconfig-checker:2.7.1' \
    ec -config "/ci/test/linting/.ecrc.json"

Config: {ShowVersion:false Help:false DryRun:false Path:/ci/test/linting/.ecrc.json Version: Verbose:false Debug:true IgnoreDefaults:true SpacesAftertabs:false NoColor:false Exclude:[^test/bats/ ^test/test_helper/bats-(assert|support) \.git/] AllowedContentTypes:[text/ application/octet-stream application/ecmascript application/json application/x-ndjson application/xml +json +xml] PassedFiles:[] Disable:{EndOfLine:false Indentation:false InsertFinalNewline:false TrimTrailingWhitespace:false IndentSize:false MaxLineLength:false} Logger:{Verbosee:false Debugg:true NoColor:false}}
AddToFiles: filePath: /ci/test.txt, contentType: text/plain
Could not decode the IBM420_rtl encoded file: /ci/test.txt
unrecognized charset IBM420_rtl

.ecrc.json (config used, not required to reproduce):

{
  "Debug": true,
  "IgnoreDefaults": true,
  "Exclude": [
    "^test/bats/",
    "^test/test_helper/bats-(assert|support)",
    "\\.git/"
  ]
}

File that will fail:

sub

c2Vj and caV both trigger it too, as does much longer file content like the SMTP example in the investigation below.


Investigation

Upstream (chardet) is responsible for detecting this charset

It looks like a package (chardet) is used to detect this charset, and it does so via a lookup table with the char (byte) as the key. It was added to chardet in 2012 with little detail to reference beyond that (there have not been many changes to the file since).

That table covers the equivalent regex range that you can plug into grep (-P for perl regex, which supports the hex notation):

grep -P '[\x40\x42-\x49\x51\x52\x55-\x59\x62-\x69\x70-\x79\x80-\xA0\xA2-\xB5\xB8-\xBF\xCB\xCD\xCF\xDA-\xDF\xEA-\xEF\xFB-\xFE]' filename_here

In my file that triggered this (the gibberish is base64 encoded, this is for SMTP tests):

EHLO mail
AUTH LOGIN
c29tZS51c2VyQGxvY2FsaG9zdC5sb2NhbGRvbWFpbg==
c2VjcmV0
QUIT

we can see that over 50% of it is matched (highlighted in red in the original screenshot, not reproduced here).
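To sanity-check that figure without grep, here is a rough Python sketch (not part of chardet or editorconfig-checker) that counts how many bytes of a sample fall inside the ranges from the grep expression above. The names RANGES, in_table and matched_fraction are made up for illustration, and this only mirrors the regex, not chardet's actual ngram scoring.

```python
# Inclusive byte ranges copied from the grep expression above.
RANGES = [
    (0x40, 0x40), (0x42, 0x49), (0x51, 0x52), (0x55, 0x59),
    (0x62, 0x69), (0x70, 0x79), (0x80, 0xA0), (0xA2, 0xB5),
    (0xB8, 0xBF), (0xCB, 0xCB), (0xCD, 0xCD), (0xCF, 0xCF),
    (0xDA, 0xDF), (0xEA, 0xEF), (0xFB, 0xFE),
]


def in_table(byte: int) -> bool:
    """Return True if the byte value falls inside one of the ranges."""
    return any(lo <= byte <= hi for lo, hi in RANGES)


def matched_fraction(data: bytes) -> float:
    """Fraction of bytes in data that fall inside the ranges (0.0 if empty)."""
    if not data:
        return 0.0
    return sum(in_table(b) for b in data) / len(data)


if __name__ == "__main__":
    # The SMTP test content from this comment, with newlines included.
    sample = (
        b"EHLO mail\n"
        b"AUTH LOGIN\n"
        b"c29tZS51c2VyQGxvY2FsaG9zdC5sb2NhbGRvbWFpbg==\n"
        b"c2VjcmV0\n"
        b"QUIT\n"
    )
    print(f"{matched_fraction(sample):.0%} of bytes fall in the ranges")
```

Note that a plain byte count like this cannot tell caV and c2V apart (both have two of three bytes in range), so it does not explain why only one of them triggers the detection.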

Many other characters, when used as an index into the lookup table, would map to 0x40. There is an array of 64 elements for IBM420_rtl that provides sequences of 3 bytes matching those in the lookup table values. I'm not quite sure why caV is sufficient to match this encoding while c2V is not, yet c2Vj is. The chars that change should all map to 0x40? So something else is involved there. It might be based on the hit rate of ngrams in the content 🤷‍♂️

IBM420_rtl as an encoding result in editorconfig-checker seems unintentional

Pretty sure that's the culprit regardless. Its usage in editorconfig-checker dates from the commit that introduced it. That commit message mentions:

chardet's DetectBest() returns multiple charsets with the same Confidence level. Unfortunately, it sometimes returns a different charset.
This commit fixes this by sorting the results returned by DetectAll() and returning the first alphabetically.
This could certainly be improved.

Since this encoding is not one of the accepted encodings, we hit this error. It is still being returned by the detection call, and the results get sorted afterwards. The results probably should have been filtered to the accepted encodings only, since having more than one accepted encoding among them would still be acceptable? utf8 is probably in there; the alphabetical sorting just prioritizes other encodings that were found. Sorting by confidence might have been more reliable (assuming utf8 is there and had a higher confidence level).

This was contributed in June 2022 for #206 for v2.6.0, and has had no changes since.


Fixing it

I am not a Go dev, so that's about as much detective work as I can do 😅

@rasa and @mstruebing were both involved in the June 2022 PR for this change; they may know how to take the findings above and fix it.

Probably just requires filtering out unsupported encodings before sorting, and failing if no supported encodings remain?
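A rough sketch of that idea, written in Python rather than Go and not taken from editorconfig-checker's code. Candidate, SUPPORTED, pick_alphabetical, pick_supported and the confidence values are all hypothetical. The charsets the checker can actually decode are not listed in this thread, so the allow-list is a placeholder.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One detection result: a charset name and a confidence score."""
    charset: str
    confidence: int


# Hypothetical allow-list; the real set of decodable charsets is an assumption.
SUPPORTED = {"utf-8", "iso-8859-1", "windows-1252"}


def pick_alphabetical(candidates: List[Candidate]) -> str:
    """Behaviour described in the commit message: first charset alphabetically."""
    return min(c.charset for c in candidates)


def pick_supported(candidates: List[Candidate]) -> str:
    """Drop unsupported charsets, then prefer the highest confidence."""
    usable = [c for c in candidates if c.charset.lower() in SUPPORTED]
    if not usable:
        raise ValueError("no supported charset among the detection results")
    # Highest confidence first; ties broken alphabetically so the result is stable.
    usable.sort(key=lambda c: (-c.confidence, c.charset))
    return usable[0].charset


if __name__ == "__main__":
    # Made-up detection results, for illustration only.
    results = [
        Candidate("UTF-8", 80),
        Candidate("IBM420_rtl", 80),
        Candidate("ISO-8859-1", 40),
    ]
    print(pick_alphabetical(results))
    print(pick_supported(results))
```

With these made-up results the alphabetical pick lands on IBM420_rtl, while the filtered pick never returns a charset that cannot be decoded and fails explicitly when nothing supported remains.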


UPDATE: PR to resolve: #284 🎉


oschwald commented on May 20, 2024

I am also seeing this on certain ASCII files.

