chardet / chardet Goto Github PK
View Code? Open in Web Editor NEWPython character encoding detector
License: GNU Lesser General Public License v2.1
Python character encoding detector
License: GNU Lesser General Public License v2.1
Here's a demonstration of an example where the detection return a charset that triggers an unicode error
text =b'xxxx), xx xxxxxx xx xxxxxx xx\xe9\xe9x \xe0 xxxxx x xxxx \nxxxxx\xe9 xx xx xxx xxxxxxx\xe9.\n\n*__*\n\nxx *xxx, x. xxxxx*, xxxxx xxxxxxx xx xxxxxx xxxx x\xe9xxxxxx \xe0 xxxxxxxx \nxxxx xxx xxxxxxxxx xxxxxx\xe9xx xxx xxx xxxxxxxxxxxxxxx xxxx xx xx xx xx \nxxxxxxx xx\xe9x\xe9xxxxx :\n\n- xx xxxxxx x\\xxxxxxxxxxxx xxxxxxxxxx \xe0 x\\xxxxxxxxxxxxxx\xe9 xx xxxx xx xxxx : xxx \nxx ; xxx (xxxx) / xxx (xxxx) xxxxxx.\n\n- xxx x\xe9xxxxxx xxx xxxxxxxxxxxx xxxx xxxx : xx xxxx\xe9xxxxxxx xxxxxxxxx \n(xx xxx xxx \\xxxxx) / xxxxxxxxx (xx xxx xxx \\xxxxx) xxx xxxxxxx x\\xxxxxxxx xxxxxxxxx\xe9 \nxx xxx.\n\n- xxx x\xe9xxxxxx xxx xxxxxx xx xxxxxxxxx xxxx xxxx : xx xxxx\xe9xxxxx xxxxx \nxxx x\xe9xxxxxx xxxx xx xxxxxxx (xx xxx xxx \\xxxxx) xx xxxxxx xxxx xx xxxxxx (xx \nxxx xxx \\xxxxx) x\\xxxxxxxxxxxxx xxx xx xx\xfbx xxxx xxxxxxxxx xxx xxxxxx (xxxxxxxxx \nxxxx xxxxxx, xxxxx xx xxxxxxxxx xxxx xxxxxxxxx ...) xx x\\xxxxxxxxxxx xx \nxxxxxxxxxx xxx\xe9xxxxxxx xxxx xxxxxxxx xxxxxxx.\n\n- xxx xxxxxx x\\xxxxxxxxxxxxxxxx : xx xxxx\xe9xxxxx xx xxxxxxxx xx\xe9xxxx xxxxx \nxxxx xx xxxx xxxxxxx\xe9x xx xxxxxxx xxxxx\xe8'
chardet.detect(text) # => windows-1255
chardet.decode('windows-1255')
# => UnicodeDecodeError: 'charmap' codec can't decode byte 0xfb in position 724: character maps to <undefined>
chardet.detect(text) # => windows-1252
chardet.decode('windows-1252') # => works
Something nice would be:
Meaningwhile, here's my solution:
try:
return part.decode(charset)
except UnicodeDecodeError:
detector = UniversalDetector()
detector.feed(part)
detector.close()
try:
return part.decode(detector.result['encoding'])
except UnicodeDecodeError as e:
for prober in detector._charset_probers:
if prober.get_confidence() > detector.MINIMUM_THRESHOLD:
try:
return part.decode(prober.charset_name)
except UnicodeDecodeError:
pass
raise e
Hi Guys,
Does this return information about the line feed of a file e.g.
LF+CR
LF
CR
CR+LF
The docs from https://github.com/chardet/chardet/tree/master/docs should be hosted on a web site, to make them more accessible.
https://readthedocs.org/ maybe?
I noticed that the filter_without_english_characters function in chardet simply replaces any english alphabetical character with a space character. This might lead to inaccuracies in our confidence. I tried mimicking the behavior of Mozilla's implementation more closely and this led to a decrease in the count of failing unit tests from 28 to 25.
I wanted to know your thoughts on this and you can also check out my changes over here. .
Had an "interesting" discussion in the Stackoverflow Python room (bookmarked transcript here)
There's an article compatibility between the iphone and lgpl that has some analysis as well as links to other resources, but it appears there's not really a conclusion except the owner could assert rights, but in the spirit of things most likely wouldn't...
I'm just wondering what the stance is from the author(s) here?
File "/root/w3af/w3af/core/data/misc/encoding.py", line 89, in smart_unicode
guessed_encoding = chardet.detect(s)['encoding']
File "/usr/local/lib/python2.7/dist-packages/chardet/__init__.py", line 24, in detect
u.feed(aBuf)
File "/usr/local/lib/python2.7/dist-packages/chardet/universaldetector.py", line 119, in feed
if prober.feed(aBuf) == constants.eFoundIt:
File "/usr/local/lib/python2.7/dist-packages/chardet/charsetgroupprober.py", line 59, in feed
st = prober.feed(aBuf)
File "/usr/local/lib/python2.7/dist-packages/chardet/utf8prober.py", line 52, in feed
codingState = self._mCodingSM.next_state(c)
File "/usr/local/lib/python2.7/dist-packages/chardet/codingstatemachine.py", line 44, in next_state
byteCls = self._mModel['classTable'][ord(c)]
Original bug report at the w3af project which uses chardet==2.1.1
We currently detect EUC-TW
pretty well, but it's not actually supported by Python. Most users would expect that
result = chardet.detect(some_bytes)
try:
some_bytes.decode(result['encoding'])
except UnicodeDecodeError:
print('Oops. chardet detected the wrong encoding')
would always work, but the decode line can actually fail with a LookupError
too because of encodings that aren't supported by Python.
Using d5d0812, I am seeing multiple failues to correctly detect utf-8. Some examples: "gebührenfrei", "exámple", "naïve", "sie hören", "This is a cat 😸". (Strings that are ASCII except for a single character seem to be particularly troublesome.)
See also sv24-archive/charade#24, sv24-archive/charade#25. d5d0812 seems to be doing slightly "better" than python-chardet-2.0.1-7.fc20.noarch, for what it's worth (fewer confidence = 0.99 detections, though still wrong).
I also see 27 failed unit tests. Please let me know if this is known and/or if I should paste the complete error log here.
This feels strange to be posing as a question, since I'm one of the co-maintainers, but @sigmavirus24 and @erikrose, do you know if it's okay/legal for us to change the license of chardet? Because it was started by Mark Pilgrim I feel like it's kind of a nebulous question, because he's not someone you can just email, and he has nothing to do with development anymore. I would really like to change the license to at least be MPL, since that's what the C++ version is, and our setup currently mirrors that code pretty closely.
I'm not a fan of the LGPL and feel weird having a project I work on use it.
I'm just putting this here as an announcement that I will shortly be renaming the main branches as follows:
master
➡️ stable
develop
➡️ master
This should prevent the common problem where people don't change the target to develop
even though we're trying to use git-flow.
The following string raises the titular exception when run chardet.detect is run on it using 9e419e9
b'\xfe\xcf'
Here's the full stack trace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/david/crap/chardet/chardet/__init__.py", line 30, in detect
u.feed(byte_str)
File "/home/david/crap/chardet/chardet/universaldetector.py", line 189, in feed
if prober.feed(byte_str) == ProbingState.found_it:
File "/home/david/crap/chardet/chardet/charsetgroupprober.py", line 63, in feed
state = prober.feed(byte_str)
File "/home/david/crap/chardet/chardet/mbcharsetprober.py", line 75, in feed
char_len)
File "/home/david/crap/chardet/chardet/chardistribution.py", line 82, in feed
if 512 > self._char_to_freq_order[order]:
IndexError: tuple index out of range
This one doesn't actually come from Hypothesis but from a fuzzing experiment I was running which it occurred to me would be applicable to chardet.
I use chardet to detect the codec of the 'Cinecitt%C3%A0%20Make', like this
import urlparse
import chardet
a = 'Cinecitt%C3%A0%20Make'
b = urlparse.unquote(a)
chardet.detect(b)
the result is {'confidence': 0.814286076190637, 'encoding': 'ISO-8859-2'}
but '%c3%a0' is an utf8 code of character 'à'
is there something wrong?
Hello ,
I have a situation where I'm trying to detect encoding of ISO-8859-2 file.
In [1]: import chardet
In [2]: chardet.__version__
Out[2]: '2.2.1'
In [3]: chardet.detect(file('iso_file.csv', mode='rb').read())
Out[3]: {'confidence': 0.8727101643152726, 'encoding': 'ISO-8859-2'}
As you can see it's properly detected.
But after pip install -U chardet
In [16]: import chardet
In [17]: chardet.__version__
Out[17]: '2.3.0'
In [18]: chardet.detect(file('iso_file.csv', mode='rb').read())
Out[18]: {'confidence': 1.0, 'encoding': 'UTF-8-SIG'}
Can you provide some details around what was changed in new version that would trigger incorrect behaviour and what I should do from my side to help library better recognize encoding?
Since people sometimes know what the possible encodings they may receive may be, it would be nice if we supported filtering the set of predictions to only include those. It might help alleviate some inaccuracies with short strings.
Given the following string:
u'\x000'.encode('utf-16')
chardet.detect as of 2.3.0 reports this as 'UTF-32LE' with a confidence of 1.0, but attempting to decode it as such fails with
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 4-5: truncated data
I found this bug using Hypothesis. I'd be happy to submit a pull request adding the test that found it if you'd like me to, though it is of course currently failing.
From wiki:
Like UTF-8, GB18030 is a superset of ASCII and can represent the whole range of Unicode code points; in addition, it is also a superset of GB2312.
If GB18030 is a superset of GB2312, is it OK to replace GB2312 by GB18030?
Currently, the following 27 unit tests fail. We need to figure that out and fix them.
.FFF.FF..FFF........F............................................................................................................................................F.......FFFFFFF.FFFFFF.......................................F....F.FF.........................................................................................................................................................
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.bus.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.cmm.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.fin.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrt.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'windows-1253' != 'iso-8859-7'
- windows-1253
+ iso-8859-7
: Expected iso-8859-7, but got 'windows-1253' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/disabled.gr.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.spo.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.mrk.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'iso-8859-7'
- iso-8859-2
? ^
+ iso-8859-7
? ^
: Expected iso-8859-7, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/iso-8859-7-greek/naftemporiki.gr.wld.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'utf-8'
- iso-8859-2
+ utf-8
: Expected utf-8, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/utf-8/bom-utf-8.srt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'maccyrillic' != 'iso-8859-6'
- maccyrillic
+ iso-8859-6
: Expected iso-8859-6, but got 'MacCyrillic' in /home/travis/build/erikrose/chardet/tests/iso-8859-6-arabic/_chromium_ISO-8859-6_with_no_encoding_specified.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'latin1'
- iso-8859-2
+ latin1
: Expected latin1, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_2.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'tis-620' != 'latin1'
- tis-620
+ latin1
: Expected latin1, but got 'TIS-620' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_4.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'ascii' != 'latin1'
- ascii
+ latin1
: Expected latin1, but got 'ascii' in /home/travis/build/erikrose/chardet/tests/latin1/_mozilla_bug638318_text.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'latin1'
- iso-8859-2
+ latin1
: Expected latin1, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_3.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'ibm855' != 'latin1'
- ibm855
+ latin1
: Expected latin1, but got 'IBM855' in /home/travis/build/erikrose/chardet/tests/latin1/_ude_1.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1252'
- iso-8859-2
+ windows-1252
: Expected windows-1252, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1252/github_bug_9.txt
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1252'
- iso-8859-2
+ windows-1252
: Expected windows-1252, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1252/_mozilla_bug421271_text.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'ibm855' != 'windows-1250'
- ibm855
+ windows-1250
: Expected windows-1250, but got 'IBM855' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.pressreview.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1250'
- iso-8859-2
+ windows-1250
: Expected windows-1250, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.learningenglish.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'windows-1255' != 'windows-1250'
- windows-1255
? ^
+ windows-1250
? ^
: Expected windows-1250, but got 'windows-1255' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-7' != 'windows-1250'
- iso-8859-7
+ windows-1250
: Expected windows-1250, but got 'ISO-8859-7' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/objektivhir.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1250'
- iso-8859-2
+ windows-1250
: Expected windows-1250, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1250-hungarian/bbc.co.uk.hu.forum.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'maccyrillic' != 'windows-1256'
- maccyrillic
+ windows-1256
: Expected windows-1256, but got 'MacCyrillic' in /home/travis/build/erikrose/chardet/tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'windows-1251' != 'iso-8859-2'
- windows-1251
+ iso-8859-2
: Expected iso-8859-2, but got 'windows-1251' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/cigartower.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-7' != 'iso-8859-2'
- iso-8859-7
? ^
+ iso-8859-2
? ^
: Expected iso-8859-2, but got 'ISO-8859-7' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/escience.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'koi8-r' != 'iso-8859-2'
- koi8-r
+ iso-8859-2
: Expected iso-8859-2, but got 'KOI8-R' in /home/travis/build/erikrose/chardet/tests/iso-8859-2-hungarian/shamalt.uw.hu.xml
======================================================================
FAIL: runTest (__main__.TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test.py", line 34, in runTest
self.file_name))
AssertionError: 'iso-8859-2' != 'windows-1254'
- iso-8859-2
+ windows-1254
: Expected windows-1254, but got 'ISO-8859-2' in /home/travis/build/erikrose/chardet/tests/windows-1254-turkish/_chromium_windows-1254_with_no_encoding_specified.html
----------------------------------------------------------------------
Ran 384 tests in 109.871s
FAILED (failures=27)
With a certain file, charted detects encoding as windows-1255. I would have thought it might be the file itself but my IDE for python development detected the encoding properly.
File link:
http://simple.podnapisi.net/fr/ppodnapisi/download/i/2292027/k/b5a3b7904326647fce0270ef3f441de7b73663af
Hi!
Is it possible to do a windows-1250 detection? Current implementation returns "windows-1252" for text encoded in windows-1250. Same question goes for "ISO-8859-2" vs. "ISO-8859-1".
Hi,
I notice that older versions (pre-2.3.0) have been removed from https://pypi.python.org/pypi/chardet
Removing older versions will break people's deployments. Could you please keep this in mind when making new releases?
Thanks in advance,
Kees
Recently had an issue with chardet usage in the in the requests module where it was incorrectly detecting the encoding on a JSON Blob. I discovered the response.apparent_encoding
and that chardet is used to set it.
I was able to identify what in my data was causing the wrong detection and distill it down to the occurrence of these simple strings.
$ cat test_chardet
~{,
~},
$ file test_chardet
test_chardetect: ASCII text
$ chardetect test_chardet
test_chardetect: HZ-GB-2312 with confidence 0.99
Originally it was a JSON Blog of all ASCII characters encoded as UTF-8. As a work around I have set up the service to specify the following in the header -
{'Content-Type': 'application/json; charset=utf-8'}
Which will have requests set the encoding properly.
I can add ASCII characters to the file, white space, quotes and numbers. It will still detect as HZ-GB-2312.
I just check the source code from upstream. They commented clearly that We use gb18030 to replace gb2312, because 18030 is a superset.
If the upstream already corrected this problem, there is no reason to hold the problem here. Please reopen and merge 33. It's funny to write the following code.
if encoding == 'GB2312':
encoding = 'GB18030'
The following string is detected as having a None encoding despite being a valid string in one of chardet's supported encodings:
u'<\xa0'.encode('iso-8859-7')
This remains true even if you pad it with ascii, it's not a length issue.
This behaviour is present in 9e419e9 (I just only was able to find it once #63 was fixed).
Shall I send you a pull request with the test that is finding these? It's not very complicated.
UPD: Deleted python2.7 example because it was not working properly. See a comment below for a better test case.
This is all on Debian GNU/Linux unstable with the current master:
$ python3.4 -c "import chardet; print(chardet.detect(u'é'.encode('utf-8')))"
{'confidence': 0.73, 'encoding': 'windows-1252'}
$ python3.4 -c "import chardet; print(chardet.detect(u'éé'.encode('utf-8')))"
{'confidence': 0.7525, 'encoding': 'utf-8'}
The second line should be utf-8
as well, not windows-1252
.
Sorry if there is already an issue related to this, but I'm not sure what is cause yet.
I've got an ISO-8859-1
string detected as Windows-1252
, although those two encoding differ in only 32 characters, mystring.decode('windows-1252')
fail to decode the content, that's why I'm filing this issue.
If you need a test case you can query whois data for brasil.gov.br
the string will contain ISO-8859-1
encoded data however will be detect as Windows-1252
with 99% confidence.
It appears to be the case that the presence of '\x1b' character (the escape key) in an ascii string prevents it from being recognised as ascii. e.g. for any n the following returns an encoding of None:
b'0' * n + b'\x1b'
This behaviour is present in master at ff1d917
Hi! I just compared your github codebase to the your PyPi package (installed via pip) and noticed that the code there seems to be out of date (even though it gives the same version number).
Would be cool to have an up-to-date version on pypi since there's a problem with ascii escape sequences in the pypi package that doesn't seem to exist with the github version :)
Hi guys,
I recently took over a project that make use of chardet @ 73ab963 which I guess is 1.1.
Do you have any pointer on big breaking changes since then I should be aware of?
Thx !
Currently, this function has been listed as a TO-DO. I was looking over at the source from Mozilla and it seems that there could be a bug in that.
From what I can tell, the original intention of this function was to remove all markup tags. Its used in the LatinProber and I imagine that the idea is to remove all markup tags - which will probably contain english alphabets/words - so that we can avoid skewing our confidence incorrectly.
The current behavior though is not that. A simple example:
<some tag> outside <some tag>
returns
tag outside tag
It includes parts of text within a tag if there are multiple words separated by any kind of punctuation in the tag. I can look into this, by I wanted to know your thoughts on this first?
Version: chardetect-script.py 2.3.0 on Windows environment; Python 3.4.3 (native) and in Cygwin shell (same in other shells)
Here the content of the misdetected script:
#!/usr/bin/env python3
# coding: utf-8
#
################################################################################
__version__ = '1.0'
__author__ = 'ü'
output of chardet:
$ chardetect uu.py
uu.py: ISO-8859-2 with confidence 0.7916670185186749
output of file (Cygwin):
$ file uu.py
uu.py: a /usr/bin/env python3 script, UTF-8 Unicode text executable, with CRLF line terminators
If i read the problematic line by open(file, 'rb').readlines()
i get
b"__author__ = '\xc3\xbc'\r\n"
Am i getting something wrong?
Latest release is very old - 2014-10-07. There are some useful updates since this date.
I have a page of text in ASCII with a single Microsoft-apostrophe chr(8217)
detected as ISO-8859-2.
#1. Create problematic sample
>>> s = 'today' + chr(8217) + 's research'
>>> s
'today’s research'
>>> b = s.encode('windows-1252')
>>> b
b'today\x92s research'
#2. Attempt to decode it
>>> chardet.detect(b)
{'encoding': 'ISO-8859-2', 'confidence': 0.8060609643099236}
>>> b.decode('ISO-8859-2')
'today\x92s research'
#3. Now try the correct encoding
>>> b.decode('windows-1252')
'today’s research'
This text is very typical of anything created using a Microsoft editor. Furthermore, latest version of Firefox detects it correctly. I am using Python 3.3. Any help is appreciated.
I tried to change the universaldetector.py file, but it still returned utf-8.
if aBuf[:3] == codecs.BOM:
# EF BB BF UTF-8 with BOM
self.result = {'encoding': "UTF-8-SIG", 'confidence': 1.0}
Output:
>>> chardet.detect(open('test.txt').read())
{'confidence': 0.99, 'encoding': 'utf-8'}
I am using chardet as part of a web crawler written in python3. I noticed that over time (many hours), the program consumes all memory. I narrowed down the problem to a single call of chardet.detect() method for certain web pages.
After some testing, it seems that chardet has problem with some special input and I managed to get a sample of such an input. It consumes on my machine about 220 MB of memory (however, the input is 2.5 MB) and takes about 1:22 minutes to process (in contrast to 43 ms when the file is truncated to about 2 MB). It seems not to be limited to python3, in python2 the memory consumption is even worse (312 MB).
Fedora release 20 (Heisenbug) x86_64
chardet-2.2.1 (via pip)
python3-3.3.2-11.fc20.x86_64
python-2.7.5-11.fc20.x86_64
I cannot attach any files to this issue so I uploaded them to my dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know of a better place where to put it if necessary. Here is an overview of the content and the results:
setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
python3 -m timeit -s "$setup" 'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 43 ms per loop
python3 -m timeit -s "$setup" 'chardet.detect(html[:2543483])'
# produces: 1 loops, best of 3: 1min 22s per loop
python3 mem_leak_test.py
# produces:
# Good input left 2.65 MB of unfreed memory.
# Bad input left 220.16 MB of unfreed memory.
python -m timeit -s "$setup" 'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 41.7 ms per loop
python -m timeit -s "$setup" 'chardet.detect(html[:2543483])'
# produces: 10 loops, best of 3: 111 sec per loop
python mem_leak_test.py
# produces:
# Good input left 3.00 MB of unfreed memory.
# Bad input left 312.00 MB of unfreed memory.
import resource
import chardet
import gc
mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
html = open("mem_leak_html.txt", "rb").read()
def test(desc, instr):
gc.collect()
mem_start = mem_use()
chardet.detect(instr)
gc.collect()
mem_used = mem_use() - mem_start
print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))
test('Good input', html[:2543482])
test('Bad input', html[:2543483])
Hello,
when I was working on my new language models for Central European countries I found old error in the sbcharsetprober.py (or .cpp) file.
I've looked around on the internet and I've found only ONE developer/contributor (PyYoshi): who corrected this amazing error (There are some other bugs are corrected and added many new language models).
In the code of all forks I've found (python, cpp, ...) "-1" is missing (The part of source code below is already corrected):
// Order is in [1-64] but we want 0-63 here.
order = mModel->charToOrderMap[(unsigned char)aBuf[i]] -1
if (order < SYMBOL_CAT_ORDER)
mTotalChar++;
if (order < SAMPLE_SIZE)
{
I've spent half a day to understand why my new lang models give very low
confidence value for tested text. With adding "-1" confident values are
normal.
If you can, please, post this info to other chardet developers.
Many thanks.
Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.
The problems I see with the current approach are:
So if we're agreement that the current approach is bad, how do we want to fix it?
I propose that we:
setup.py install
process, convert the files to pickled dictionaries.chardet.detect
to cache its UniversalDetector
object so that we don't constantly create new prober objects and reload the pickles.The only problem I see with this approach is that it will slow down import chardet
, but loading pickles is usually pretty fast.
@sigmavirus24, what do you think?
The following has no encoding detected:
u'\u020d\x1b'.encode('utf-8')
Again, not a length issue. Padding with ascii doesn't change anything. The initial character is needed, I believe purely because it prevents the fix in #63 from applying by preventing the string from being considered as ascii.
Behaviour is present in 9e419e9
Since this project was created, Sphinx has become the de-facto standard for publishing [Python] documentation on the web.
Thus, should reformat docs into Sphinx format
I just found a test for Arabic.
https://github.com/chardet/chardet/blob/master/tests/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html
According to this link, cp1256 is also an important encoding.
http://w3techs.com/technologies/overview/character_encoding/all
I got the original codec from url "FAHR%E2%80%A2WERK", it's "FAHR•WERK" in utf-8.
But when I use chardet.detect, the result is {'confidence': 0.73, 'encoding': 'windows-1252'}
and the confidence of the utf-8 is only 0.505.
And I think there's sth wrong with utf8prober. So I look into utf8prober.py, and find code below:
elif coding_state == MachineState.start:
if self.coding_sm.get_current_charlen() >= 2:
self._num_mb_chars += 1
It seems only multibyte character can be judged as an utf-8 character.
So codec like "FAHR%E2%80%A2WERK" get a very low confidence.
In this case , I think we should judge the the single byte character also as an utf-8 character.
So I change the code into:
elif coding_state == MachineState.start:
if self.coding_sm.get_current_charlen() >= 1:
self._num_mb_chars += 1
and the result is {'confidence': 0.99, 'encoding': 'utf-8'}
I presume this is because it's covered by Windows 1252, but whereas other ISO-8859 encodings are mentioned next to their Windows superset, ISO-8859-1 is not.
If in fact the issue is more subtle, that would also be worth documenting!
It might be useful to have some method on the UniversalDetector class that populates the confidence of every encoding for a given string. This can maybe help out figuring why we are failing on those 25 other test cases and also help if any encoding is to be added / deleted. Thoughts?
>>> chardet.detect('"ULTIMA ATUALIZACAO";"17/03/2014 04:01"\r\n"ANO";"MES";"SENADOR";"TIPO_DESPESA";"CNPJ_CPF";"FORNECEDOR";"DOCUMENTO";"DATA";"DETALHAMENTO";"VALOR_REEMBOLSADO"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"05.914.650/0001-66";"CERON - CENTRAIS EL\xc9TRICAS DE ROND\xd4NIA S.A.";"45216633";"11/01/11";"";"47,65"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"05.914.650/0001-66";"CERON - CENTRAIS EL\xc9TRICAS DE ROND\xd4NIA S.A.";"4542061";"18/01/11";"";"196,67"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"004.948.028-63";"GILBERTO PISELO DO NASCIMENTO";"01";"12/01/11";"";"5000"\r\n"2011";"1";"ACIR GURGACZ";"Aluguel de im\xf3veis para escrit\xf3rio pol\xedtico, compreendendo despesas concernentes a eles.";"76.535.764/0001-43";"OI BRASIL TELECOM S.A.";"963011";"14/01/11";"";"480,59"\r\n"2011";"1";"ACIR GURGACZ";"Aquisi\xe7\xe3o de ma')
{'confidence': 0.99, 'encoding': 'windows-1251'}
Ages ago, I filed a bug that got erroneously closed by a commit. I just stumbled on it again today, so I'm moving it over from the old location so we can see about getting it going again.
Here's the text of the old bug:
I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library
My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252." It would be great if chardet internalized this behavior.
Basically, browsers don't use a number of character encodings, and instead map to other ones instead. Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that are used by browsers.
This might make sense as an option rather than default functionality....not sure, but I'd love to see this be added.
Is there still an appetite for this kind of an issue? Basically, I think—and ages ago, committer @dan-blanchard agreed—that chardet should never return ISO-8859 and should always return cp1252 instead.
Due to the way IronPython 2.7.5 natively handles strings, I'm encountering errors with passing arguments to chardet's init in third-party packages using chardet (requests for example).
Specifically, this section of code causes problematic behaviour:
def detect(byte_str):
if (PY2 and isinstance(byte_str, unicode)) or (PY3 and
not isinstance(byte_str,
bytes)):
raise ValueError('Expected a bytes object, not a unicode object')
Can we perhaps add some way of detecting if we're using IronPython and then a check in init, which would pass bytes(string) to u.feed() instead?
For instance, in compat.py:
import platform
if (platform.python_implementation() == 'IronPython'):
IPY = True
else:
IPY = False
in init.py:
from compat import IPY
if (PY2 and not IPY and isinstance(byte_str, unicode)) or (PY3 or IPY and not isinstance(byte_str, bytes)):
raise ValueError('Expected a bytes object, not a unicode object')
I have a file in UTF-8 with BOM with Dos\Windows lines end and first string empty:
EF BB BF 0D 0A 23 69 6E .......(text)
info = chardet.detect(raw)
There is error with Unix line end (\n) also :(
It's needed to say, that I read file this way:
file_open = open(file_path, "r") # rb also error
raw = file_open.read()
Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
I want to help for Turkish language detection but I don't understand what order or how I supposed to put things to lang*model.py.
Turkish is Latin5 8-bit single-byte coded Language
http://en.wikipedia.org/wiki/ISO-8859-9
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-128.pdf (code table is on page 16)
I only recently discovered that there's a substantially faster version of chardet
, cChardet
, that's just a Cython wrapper around uchardet-enhanced
.
According to their benchmarks it's about 2800 times faster, so if we're only doing the same things they are, maybe we should recommend people who are using CPython use that.
python 2.7.11
chardet 2.3.0
cchardet-1.1.1-cp27-cp27m-win_amd64.whl (md5)
window server 2008 r2 enterprise
system language: chinese simple (操作系统显示语言:简体中文)
test.py is the python code
testfile.cs is the input file (open by notepad, [save as(另存) ] show the encode(编码) is ANSI
testfile.chardet.cs is the output file: decode by chardet.detect(raw)['encoding'] ,then , encode by utf8
testfile.cchardet.cs is the output file: decode by cchardet.detect(raw)['encoding'] ,then , encode by utf8
testfile.GB2312.cs is the output file: decode by 'GB2312' ,then , encode by utf8.
testfile.GB2312.cs is the RIGHT.
test_chardet.zip
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.