charade's People

Contributors

byroot, dcramer, erikrose, joetsoi, lukasa, mindw, miso-belica, puzzlet, ralphbean, sigmavirus24

charade's Issues

Charade 1.0.3 badly identifies ISO-8859-15 as IBM855

Hello,
The charade version above (1.0.3) identifies an ISO-8859-15 file as IBM855.

rui@rui-SatelliteI7:/Transferências$ file Bones.S09E09.HDTV.X264-LOL.srt
Bones.S09E09.HDTV.X264-LOL.srt: C++ source, ISO-8859 text, with CRLF line terminators
rui@rui-SatelliteI7:/Transferências$ charade Bones.S09E09.HDTV.X264-LOL.srt
Bones.S09E09.HDTV.X264-LOL.srt: IBM855 with confidence 0.972957810694

Can you please check?

Fails to identify cp1252 (aka Windows-1252)

I have a very small test file that gets incorrectly identified as ISO-8859-2 (http://en.wikipedia.org/wiki/ISO/IEC_8859-2). What makes this interesting is that the non-ASCII characters in the test file are invalid characters in ISO-8859-2, so ISO-8859-2 is not even close:

0x93, 0x94, 0x97, 0x96

I wasn't able to attach a txt file for some reason, so here is a Python repr (from Python 2.x) of the file contents.

Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open(r'charade\cp1252_test.txt', 'rb')
>>> test_str = f.read()
>>> f.close()
>>> test_str
'Then he said, \x93The names Bod, James Bond.\x94\r\nto be \x93me\x94\r\nSpam, beans, spam \x96 served every day\r\nbeans, spam, beans, \x97 served every other day\r\n'

I have a larger (real) file if this demo one is not suitable.
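The point about those four bytes can be checked with the standard library alone. The following is a minimal sketch, not part of charade; `describe` is a hypothetical helper introduced only for this illustration. It decodes the bytes with Python's `cp1252` and `iso8859_2` codecs and reports what each one produces.

```python
import unicodedata

# The four non-ASCII bytes quoted in the report.
SUSPECT_BYTES = bytes([0x93, 0x94, 0x96, 0x97])


def describe(data, encoding):
    """Decode `data` and return the Unicode name of each character.

    Control characters have no Unicode name, so for those the general
    category (for example "<Cc>") is returned instead.
    """
    names = []
    for ch in data.decode(encoding):
        fallback = "<" + unicodedata.category(ch) + ">"
        names.append(unicodedata.name(ch, fallback))
    return names


as_cp1252 = describe(SUSPECT_BYTES, "cp1252")
as_latin2 = describe(SUSPECT_BYTES, "iso8859_2")

print(as_cp1252)  # typographic quotes and dashes
print(as_latin2)  # control characters only
```

With Python's codecs, the bytes decode to printable punctuation under cp1252 but only to C1 control characters under ISO-8859-2, which is why ISO-8859-2 looks like an implausible guess for this text.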

Charade==1.0.3 incorrectly identifies UTF-8 buffer as TIS-620

The short German phrase "Sie hören" is incorrectly detected as TIS-620 (Thai), even though the correct encoding appears to be UTF-8.

>>> buf = b'Sie h\xc3\xb6ren'
>>> charade.detect(buf)
{'confidence': 0.99, 'encoding': 'TIS-620'}
>>> print buf.decode('TIS-620')
Sie hรถren
>>> print buf.decode('utf8')
Sie hören

Charade==1.0.3 incorrectly identifies UTF-8 buffer as ISO-8859-2

The German word "gebührenfrei" is incorrectly detected as ISO-8859-2, even though the correct encoding appears to be UTF-8.

>>> buf = b'geb\xc3\xbchrenfrei'
>>> charade.detect(buf)
{'confidence': 0.6946821700961592, 'encoding': 'ISO-8859-2'}
>>> print buf.decode('ISO-8859-2')
gebĂźhrenfrei
>>> print buf.decode('utf8')
gebührenfrei
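Until the detector handles short UTF-8 buffers like the two above, a caller could try a strict UTF-8 decode first and only fall back to detection when that fails. This is a workaround sketch, not how charade works internally; `utf8_first` and its `fallback` parameter are hypothetical names made up for this example.

```python
def is_valid_utf8(buf):
    """Return True if `buf` decodes as strict UTF-8."""
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_non_ascii(buf):
    """Return True if any byte in `buf` is outside the ASCII range."""
    return any(b > 0x7F for b in buf)


def utf8_first(buf, fallback):
    """Prefer UTF-8 when the non-ASCII bytes form valid UTF-8.

    `fallback` is any callable taking the buffer and returning an
    encoding name; it is only consulted when the UTF-8 check fails
    or the buffer is pure ASCII.
    """
    if has_non_ascii(buf) and is_valid_utf8(buf):
        return "utf-8"
    return fallback(buf)


# Stand-in fallback for the demo; a real caller might pass
# lambda b: charade.detect(b)['encoding'] instead.
print(utf8_first(b"Sie h\xc3\xb6ren", lambda b: "unknown"))
print(utf8_first(b"geb\xc3\xbchrenfrei", lambda b: "unknown"))
```

The check is cheap because random single-byte text with high bytes rarely happens to be well-formed UTF-8, but it is a heuristic and can still be wrong for very short inputs.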

26 Failing tests

I need to determine why they are failing.

Required:

  • I need to brush up on character encoding in general

Support CP949 (Windows-949)

CP949 is a superset of EUC-KR (Korean) with extra characters defined. Almost all web pages that declare themselves as EUC-KR can safely be assumed to be in CP949, and many potentially are, since CP949 has been the default locale encoding of the Korean version of MS Windows.

Here is the usage stats on the web, according to Google: http://googleblog.blogspot.kr/2010/01/unicode-nearing-50-of-web.html

We can support this by:

  • renaming EUCKR-related classes and constants to CP949
  • and patching the byte-sequence state machine in mbcssm.py

The frequency table should be the same, since the supplemented characters are the most infrequent.
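The superset relationship can be illustrated with Python's own `euc_kr` and `cp949` codecs. This is only a sketch using the standard library, not charade code; `decodes_as` is a hypothetical helper, and the two sample syllables are my own choice for the demonstration.

```python
def decodes_as(data, encoding):
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


plain = "한"     # a syllable in KS X 1001, so both encodings share its bytes
extended = "똠"  # a syllable outside KS X 1001, only in the CP949 extension

# Characters from the common repertoire get identical byte sequences.
same_bytes = plain.encode("euc_kr") == plain.encode("cp949")

# The extension syllable uses bytes that a strict EUC-KR decoder rejects.
ext_bytes = extended.encode("cp949")

print(same_bytes)
print(decodes_as(ext_bytes, "cp949"), decodes_as(ext_bytes, "euc_kr"))
```

The extension characters use lead or trail bytes below 0xA1, which is the kind of byte range a state machine written for plain EUC-KR would treat as an error.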

Support GB18030

GB18030 is a superset of Chinese encoding GB2312, which charade already supports.

Like #10, we can support this by:

  • renaming GB2312-related classes and constants to GB18030
  • and patching the byte-sequence state machine in mbcssm.py
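As a rough illustration of what the state machine would need to accept, the sketch below uses Python's built-in `gb2312` and `gb18030` codecs (not charade) to compare a common character with one that GB18030 encodes in four bytes. The sample characters and the `decodes_as` helper are assumptions made for this example.

```python
def decodes_as(data, encoding):
    """Return True if `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A common character: two bytes, identical in GB2312 and GB18030.
two_byte = "中".encode("gb18030")
shared = two_byte == "中".encode("gb2312")

# A supplementary-plane character: GB18030 uses a four-byte sequence
# whose second and fourth bytes are ASCII digits (0x30-0x39).
four_byte = "\U0001F600".encode("gb18030")

print(shared, len(two_byte), len(four_byte))
print(decodes_as(four_byte, "gb2312"))
```

A GB2312-only state machine sees a digit byte where it expects a trail byte in the high range, so the four-byte form is the main structural addition.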

Release 1.0.0

Hey @kennethreitz

If you could add me to the package as a maintainer, I would be ever grateful. My PyPI username is graffatcolmingov.

String compatibility issues

It seems that any unicode object passed to charade causes issues in Python 2. Since Python 3 str objects are unicode objects (if I remember correctly), the same problem applies there: the regular expressions are compiled as bytes patterns, so they cannot be used on str input.


Example code:

import charade
import requests

r = requests.get('http://export.yandex.ru/weather-ng/forecasts/26686.xml')
res = charade.detect(r.text)

Related: kennethreitz/requests#928
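One way to make this failure explicit is to reject already-decoded text before detection runs. The following is a sketch of such a guard, not charade's actual behaviour; `ensure_bytes` is a hypothetical helper name.

```python
def ensure_bytes(data):
    """Return `data` as bytes, or raise if it is already decoded text.

    Encoding detection only makes sense on raw bytes; a str (unicode in
    Python 2) has no encoding left to detect.
    """
    if isinstance(data, (bytes, bytearray)):
        return bytes(data)
    raise TypeError(
        "expected bytes or bytearray, got %s" % type(data).__name__
    )


print(ensure_bytes(bytearray(b"raw payload")))
```

In the example above, that would mean passing the raw body (`r.content` in requests) rather than `r.text`, which requests has already decoded.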

Use codecs module where possible

The codecs module has the following data (which is also used in universaldetector.py):

    BOM = '\xff\xfe'
    BOM32_BE = '\xfe\xff'
    BOM32_LE = '\xff\xfe'
    BOM64_BE = '\x00\x00\xfe\xff'
    BOM64_LE = '\xff\xfe\x00\x00'
    BOM_BE = '\xfe\xff'
    BOM_LE = '\xff\xfe'
    BOM_UTF16 = '\xff\xfe'
    BOM_UTF16_BE = '\xfe\xff'
    BOM_UTF16_LE = '\xff\xfe'
    BOM_UTF32 = '\xff\xfe\x00\x00'
    BOM_UTF32_BE = '\x00\x00\xfe\xff'
    BOM_UTF32_LE = '\xff\xfe\x00\x00'
    BOM_UTF8 = '\xef\xbb\xbf'

We could probably swap out the hard constants in the file for this (and possibly elsewhere if possible).
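A sketch of what that swap might look like, using only constants that exist in the standard `codecs` module. `sniff_bom` and the returned labels are hypothetical and do not mirror the detector's real code or naming. The UTF-32 marks are checked before the UTF-16 ones because `BOM_UTF32_LE` begins with the same two bytes as `BOM_UTF16_LE`.

```python
import codecs

# Order matters: the UTF-32 LE mark starts with the UTF-16 LE mark,
# so the longer marks must be tested first.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return a label for the byte order mark at the start of `data`.

    Returns None when the buffer does not start with a known mark.
    """
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"no mark here"))
```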

UTF8 '…' is misdetected

b'\xe2\x80\xa6' is UTF-8 '…'. chardet mistakes it for Big5, and decoding then fails with a UnicodeDecodeError. charade takes it to be ISO-8859-2, which is also wrong, but the Python codec does not detect the error.
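The difference between the two failure modes can be reproduced with Python's codecs alone. This is an illustrative sketch with a hypothetical `try_decode` helper; it does not call chardet or charade.

```python
ELLIPSIS_UTF8 = b"\xe2\x80\xa6"


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


results = {
    enc: try_decode(ELLIPSIS_UTF8, enc)
    for enc in ("utf-8", "big5", "iso8859_2")
}

for enc, text in results.items():
    # repr() keeps the control character visible in the output.
    print(enc, repr(text))
```

A multi-byte codec such as Big5 can reject an impossible byte pair, whereas Python's ISO-8859-2 codec maps every byte value to some character, so a wrong single-byte guess decodes silently to garbage.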
