Comments (4)
@adoyle-h @oschwald @polarathene
Thank you for the detailed error report. It made it extremely easy to craft the fix. PR coming soon!
from editorconfig-checker.
Thank you for fixing this!
`sub` as the file content is sufficient for your reproduction it seems.
Reproduction
Reproduction via the Docker image:
$ docker run --rm --tty \
    --volume "${PWD}:/ci:ro" \
    --workdir "/ci" \
    'mstruebing/editorconfig-checker:2.7.1' \
    ec -config "/ci/test/linting/.ecrc.json"
Config: {ShowVersion:false Help:false DryRun:false Path:/ci/test/linting/.ecrc.json Version: Verbose:false Debug:true IgnoreDefaults:true SpacesAftertabs:false NoColor:false Exclude:[^test/bats/ ^test/test_helper/bats-(assert|support) \.git/] AllowedContentTypes:[text/ application/octet-stream application/ecmascript application/json application/x-ndjson application/xml +json +xml] PassedFiles:[] Disable:{EndOfLine:false Indentation:false InsertFinalNewline:false TrimTrailingWhitespace:false IndentSize:false MaxLineLength:false} Logger:{Verbosee:false Debugg:true NoColor:false}}
AddToFiles: filePath: /ci/test.txt, contentType: text/plain
Could not decode the IBM420_rtl encoded file: /ci/test.txt
unrecognized charset IBM420_rtl
`.ecrc.json` (config used, not required to reproduce):
{
  "Debug": true,
  "IgnoreDefaults": true,
  "Exclude": [
    "^test/bats/",
    "^test/test_helper/bats-(assert|support)",
    "\\.git/"
  ]
}
File that will fail:
sub
`c2Vj` and `caV` both trigger it too, as does much longer file content like below.
Investigation
Upstream (`chardet`) is responsible for detecting this charset
It looks like a detection package is used for this charset, and it works via a lookup table with the char (byte) as the key. It was added to `chardet` in 2012 with few details to reference beyond that (the file has not changed much since).
That table covers the equivalent regex range that you can plug into `grep` (`-P` for Perl regex, which supports the hex notation):
grep -P '[\x40\x42-\x49\x51\x52\x55-\x59\x62-\x69\x70-\x79\x80-\xA0\xA2-\xB5\xB8-\xBF\xCB\xCD\xCF\xDA-\xDF\xEA-\xEF\xFB-\xFE]' filename_here
In my file that triggered this (the gibberish is base64 encoded, this is for SMTP tests):
EHLO mail
AUTH LOGIN
c29tZS51c2VyQGxvY2FsaG9zdC5sb2NhbGRvbWFpbg==
c2VjcmV0
QUIT
we can see that over 50% of the characters are matched (highlighted in red in the original screenshot).
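The same check can be sketched in Go instead of `grep`. This only mirrors the byte ranges from the character class above and counts how many bytes of the sample file fall inside them; it is not how `chardet` itself scores content. The names `byteRange`, `inRanges`, `countMatches`, and `sample` are made up for this illustration, and the sample is assumed to end with a trailing newline.

```go
package main

import "fmt"

// byteRange is an inclusive range of byte values (hypothetical helper type).
type byteRange struct{ lo, hi byte }

// ranges copies the character class from the grep command in this report.
var ranges = []byteRange{
	{0x40, 0x40}, {0x42, 0x49}, {0x51, 0x52}, {0x55, 0x59},
	{0x62, 0x69}, {0x70, 0x79}, {0x80, 0xA0}, {0xA2, 0xB5},
	{0xB8, 0xBF}, {0xCB, 0xCB}, {0xCD, 0xCD}, {0xCF, 0xCF},
	{0xDA, 0xDF}, {0xEA, 0xEF}, {0xFB, 0xFE},
}

// sample is the SMTP test file quoted in this report.
const sample = "EHLO mail\nAUTH LOGIN\n" +
	"c29tZS51c2VyQGxvY2FsaG9zdC5sb2NhbGRvbWFpbg==\n" +
	"c2VjcmV0\nQUIT\n"

// inRanges reports whether b falls inside any of the listed ranges.
func inRanges(b byte) bool {
	for _, r := range ranges {
		if b >= r.lo && b <= r.hi {
			return true
		}
	}
	return false
}

// countMatches returns how many bytes of data fall inside the ranges.
func countMatches(data []byte) int {
	n := 0
	for _, b := range data {
		if inRanges(b) {
			n++
		}
	}
	return n
}

func main() {
	data := []byte(sample)
	n := countMatches(data)
	fmt.Printf("%d of %d bytes matched (%.1f%%)\n",
		n, len(data), 100*float64(n)/float64(len(data)))
}
```

Counting raw bytes like this avoids any locale handling that `grep -P` may apply to values above 0x7F.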
Many other characters, when used as an index into the lookup table, would map to `0x40`. There is an array of 64 elements for `IBM420_rtl` that provides sequences of 3 bytes matching those in the lookup table values. Not quite sure exactly why `caV` is sufficient to match this encoding, while `c2V` is not, but `c2Vj` is. These chars that change should all map to `0x40`? So something else is involved there. Might be based on hit rate of content with ngrams 🤷♂️
`IBM420_rtl` as an encoding result in `editorconfig-checker` seems unintentional
Pretty sure that's the culprit regardless. Its usage can be found in `editorconfig-checker` from the commit that introduced it. That commit message mentions:
`chardet`'s `DetectBest()` returns multiple charsets with the same Confidence level. Unfortunately, it sometimes returns a different charset.
This commit fixes this by sorting the results returned by `DetectAll()` and returning the first alphabetically.
This could certainly be improved.
Since the encoding is not supported, we hit this error because it's not one of the accepted encodings, yet it is still returned from here and only sorted afterwards. The results probably should have been filtered to the accepted encodings only, since having more than one accepted encoding in the results is presumably still acceptable? `utf8` is probably in there, but the alphabetical sort would prioritize other encodings found. Sorting by confidence might have been more reliable (assuming `utf8` is there and had a higher confidence level).
This was contributed in June 2022 for #206 for `v2.6.0`, and has had no changes since.
Fixing it
I am not a Go dev, so that's about as much detective work as I can do 😅
@rasa or @mstruebing were both involved in the June 2022 PR for this change, so they may know how to take the findings above and fix it.
Probably just requires filtering out unsupported encodings before sorting, and failing if no supported encodings remain?
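A rough Go sketch of that idea, to make the suggestion concrete. This is not the code from `editorconfig-checker` or from the eventual PR: the `detection` type, the `pickCharset` function, the `supported` set, and the confidence values are all hypothetical stand-ins for whatever the detector actually returns.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// detection is a hypothetical stand-in for one charset detection result.
type detection struct {
	Charset    string
	Confidence int
}

var errNoSupportedCharset = errors.New("no supported charset detected")

// pickCharset drops unsupported charsets first, then prefers the highest
// confidence, falling back to alphabetical order so ties stay deterministic.
func pickCharset(results []detection, supported map[string]bool) (string, error) {
	var candidates []detection
	for _, r := range results {
		if supported[r.Charset] {
			candidates = append(candidates, r)
		}
	}
	if len(candidates) == 0 {
		return "", errNoSupportedCharset
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		if candidates[i].Confidence != candidates[j].Confidence {
			return candidates[i].Confidence > candidates[j].Confidence
		}
		return candidates[i].Charset < candidates[j].Charset
	})
	return candidates[0].Charset, nil
}

func main() {
	// Made-up results: values chosen only to illustrate the selection.
	results := []detection{
		{"IBM420_rtl", 10},
		{"ISO-8859-1", 10},
		{"UTF-8", 15},
	}
	supported := map[string]bool{"UTF-8": true, "ISO-8859-1": true}

	charset, err := pickCharset(results, supported)
	fmt.Println(charset, err)
}
```

Filtering before sorting means an unsupported name such as `IBM420_rtl` can never win just because it sorts first, and the explicit error covers the case where nothing usable remains.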
UPDATE: PR to resolve: #284 🎉
I am also seeing this on certain ASCII files.