chardet's People

Contributors

dcramer, erikrose

chardet's Issues

Add html5 character encoding mappings

I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library

My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252."

Basically, browsers don't use a number of character encodings and map them to other ones instead. Browsers did this unofficially for a while, but it's now enshrined in the HTML5 spec:

http://dev.w3.org/html5/spec/single-page.html#character-encodings-0

Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that are used by browsers. This might make sense as an option rather than default functionality. I'm not sure, but I'd love to see this added.

If this is a feature that'd be accepted, I'd be happy to put it together in a pull request, but I need guidance as to the design that'd be accepted.
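To make the request concrete, here is a minimal sketch of the kind of post-processing step described above. The names `BROWSER_OVERRIDES` and `browser_encoding` are hypothetical and not part of chardet; the table is only a small subset of the labels that browsers treat as windows-1252, and a real option would need the full list from the spec.

```python
# Hypothetical helper, not part of chardet's API.
# Small subset of labels that browsers decode as windows-1252 (cp1252).
BROWSER_OVERRIDES = {
    "ascii": "cp1252",
    "us-ascii": "cp1252",
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
}


def browser_encoding(detected):
    """Map a detected encoding name to the one a browser would use.

    Names without an override are returned unchanged; None (no guess)
    is passed through.
    """
    if detected is None:
        return None
    return BROWSER_OVERRIDES.get(detected.strip().lower(), detected)


if __name__ == "__main__":
    # Bytes 0x93/0x94 are curly quotes in cp1252 but control codes in ISO-8859-1.
    raw = b"\x93quoted\x94"
    print(raw.decode(browser_encoding("ISO-8859-1")))
```

In use, the encoding name reported by the detector would be passed through `browser_encoding` before decoding, which keeps the override separate from detection and easy to make optional.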

Math bug in latin1prober

(dcramer/erikrose - first, thanks for taking an interest in chardet - good to see someone is rescuing this useful package from oblivion).

Just noticed a serious logic bug in latin1prober. At line 133, you'll see:

 confidence = (self._mFreqCounter[3] / total) - (self._mFreqCounter[1] * 20.0 / total)

The problem with this is that self._mFreqCounter[3] and total are both integers, so under Python 2 integer division the first term is 0 whenever self._mFreqCounter[3] != total, meaning that confidence is 0 for all such cases. In other words, even one "unlikely" character transition in a document of any length will produce a confidence of 0. This is certainly not the behavior of the original Mozilla code, which does this all in floating point.

An example document which shows this problem can be found at: http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML

A simple fix would be to change this line of code to:

 confidence = (self._mFreqCounter[3] - self._mFreqCounter[1] * 20.0) / total

which obviates the multiple divisions as well.
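The difference can be seen in a small standalone sketch. It is written for Python 3, so `//` stands in for Python 2's integer `/`. The function names are made up for illustration, and the assumptions that `total` is the sum of the counters and that negative results are clamped to zero are mine, not confirmed against the prober's source.

```python
def confidence_buggy(freq):
    """Original expression, with // emulating Python 2 integer division."""
    total = sum(freq)  # assumption: total is the sum of all counters
    if total == 0:
        return 0.0
    raw = (freq[3] // total) - (freq[1] * 20.0 / total)
    return max(raw, 0.0)  # assumption: negative confidence is clamped to 0


def confidence_fixed(freq):
    """Proposed expression: a single floating-point division."""
    total = sum(freq)
    if total == 0:
        return 0.0
    raw = (freq[3] - freq[1] * 20.0) / total
    return max(raw, 0.0)


if __name__ == "__main__":
    # Index 1 counts "unlikely" transitions, index 3 counts "likely" ones.
    counts = [0, 1, 0, 999]
    print(confidence_buggy(counts))  # one unlikely transition zeroes the result
    print(confidence_fixed(counts))
```

With 999 likely transitions and a single unlikely one, the first function collapses to zero while the second stays close to 1, which matches the behavior described above.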

As an aside, the "0.5" confidence multiplier (a few lines later in the code) is a wild guess (as the original authors note, this is sort of a hacky Latin-1 detector anyway). I've had better experience with 0.8 (for example, the document given in this report is incorrectly detected as 'iso-8859-2', which wreaks havoc on the accents, until the multiplier reaches about 0.8), though your mileage may vary.

UnicodeDecodeError in chardet

Hi,

When trying to guess the encoding for http://www.debian.org/index.fr.html, I get this traceback:

Traceback (most recent call last):
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/callbacks.py", line 1265, in _callCommand
    self.callCommand(command, irc, msg, *args, **kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/utils/python.py", line 86, in g
    f(self, *args, **kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 77, in callCommand
    super(Web, self).callCommand(command, irc, msg, *args, **kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/utils/python.py", line 86, in g
    f(self, *args, **kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/callbacks.py", line 1246, in callCommand
    method(irc, msg, *args, **kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/commands.py", line 1054, in newf
    f(self, irc, msg, args, *state.args, **state.kwargs)
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py", line 203, in title
    title = utils.web.htmlToText(parser.title.strip())
  File "/home/progval/packages/lib/python2.7/site-packages/supybot/utils/web.py", line 199, in htmlToText
    u.feed(s)
  File "/usr/lib/python2.7/dist-packages/chardet/universaldetector.py", line 115, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "/usr/lib/python2.7/dist-packages/chardet/charsetgroupprober.py", line 59, in feed
    st = prober.feed(aBuf)
  File "/usr/lib/python2.7/dist-packages/chardet/sjisprober.py", line 67, in feed
    self._mContextAnalyzer.feed(self._mLastChar[2 - charLen :], charLen)
  File "/usr/lib/python2.7/dist-packages/chardet/jpcntx.py", line 145, in feed
    order, charLen = self.get_order(aBuf[i:i+2])
  File "/usr/lib/python2.7/dist-packages/chardet/jpcntx.py", line 176, in get_order
    if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 0: ordinal not in range(128)

With this stack:

Frame _callCommand in /home/progval/packages/lib/python2.7/site-packages/supybot/callbacks.py at line 1282
                    args = ([],)
                       e = UnicodeDecodeError('ascii', '\x81', 0, 1, 'ordinal not in range(128)')
                    name = 'title'
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                     cap = False
                 command = ['web', 'title']
                  kwargs = {}
                     msg = IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html'))
                     irc = <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>


Frame g in /home/progval/packages/lib/python2.7/site-packages/supybot/utils/python.py at line 88
                    lock = <_RLock owner=None count=0>
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                    args = (['web', 'title'], <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>, IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html')), [])
                       f = <function callCommand at 0x3cee7d0>
                  kwargs = {}


Frame callCommand in /home/progval/packages/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py at line 79
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                    args = ([],)
                 command = ['web', 'title']
                  kwargs = {}
                     msg = IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html'))
                     irc = <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>


Frame g in /home/progval/packages/lib/python2.7/site-packages/supybot/utils/python.py at line 88
                    lock = <_RLock owner=None count=0>
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                    args = (['web', 'title'], <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>, IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html')), [])
                       f = <function callCommand at 0x240f140>
                  kwargs = {}


Frame callCommand in /home/progval/packages/lib/python2.7/site-packages/supybot/callbacks.py at line 1246
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                    args = ([],)
                 command = ['web', 'title']
                  kwargs = {}
                     msg = IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html'))
                     irc = <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>
                  method = <bound method Web.title of <Web Web <Web.plugin.Web object at 0x3d18650>>>


Frame newf in /home/progval/packages/lib/python2.7/site-packages/supybot/commands.py at line 1063
                       f = <function title at 0x3d19758>
                specList = [<getopts for {'no-filter': ''}>, <context for httpUrl>]
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                    args = []
                   state = State(args=[[], 'http://www.debian.org/index.fr.html'], kwargs={}, channel=None)
                  kwargs = {}
                     msg = IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html'))
                     irc = <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>
                    spec = <supybot.commands.Spec object at 0x3d18f10>


Frame title in /home/progval/packages/lib/python2.7/site-packages/supybot/plugins/Web/plugin.py at line 203
                    args = []
                     url = 'http://www.debian.org/index.fr.html'
                    text = u'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html lang="fr">\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n  <title>Debian -- Le syst\xe8me d\'exploitation universel </title>\n  <link rev="made" href="mailto:[email protected]">\n  <meta name="Description" content="Debian est un syst\xe8me d\'exploitation et une distribution de logiciels libres. Elle est d\xe9velopp\xe9e et mise \xe0 jour gr\xe2ce au travail de nombreux utilisateurs qui offrent leur temps et leurs efforts.">\n  <meta name="Generator" content="WML 2.0.12 (16-Apr-2008)">\n  <meta name="Modified" content="2013-06-11 20:16:59">\n  <meta name="viewport" content="width=device-width">\n  <meta name="mobileoptimized" content="300">\n  <meta name="HandheldFriendly" content="true">\n<link rel="alternate" type="application/rss+xml"\n title="Actualit\xe9s Debian" href="News/news">\n<link rel="alternate" type="application/rss+xml"\n title="Nouvelles du projet Debian" href="News/weekly/dwn">\n<link rel="alternate" type="application/rss+xml"\n title="Annonces de s\xe9curit\xe9 Debian (titres seuls)" href="security/dsa">\n<link rel="alternate" type="application/rss+xml"\n title="Annonces de s\xe9curit\xe9 Debian (r\xe9sum\xe9s)" href="security/dsa-long">\n<link href="./debhome.css" rel="stylesheet" type="text/css">\n  <link href="./debian-fr.css" rel="stylesheet" type="text/css" media="all">\n  <link rel="shortcut icon" href="favicon.ico">\n  <meta name="Keywords" content="debian, GNU, linux, unix, open source, libre, DFSG">\n<link rel="search" type="application/opensearchdescription+xml" title="Recherche sur le site web Debian" href="./search.fr.xml">\n</head>\n<body>\n<div id="header">\n   <div id="upperheader">\n   <div id="logo">\n  <a href="./" title="Accueil Debian"><img src="./Pics/openlogo-50.png" alt="Debian"></a>\n  </div> <!-- end logo -->\n\t<div id="searchbox">\n\t\t<form name="p" method="get" 
action="http://search.debian.org/cgi-bin/omega">\n\t\t<p>\n<input type="hidden" name="DB" value="fr">\n\t\t\t<input name="P" value="" size="27">\n\t\t\t<input type="submit" value="Recherche">\n\t\t</p>\n\t\t</form>\n\t</div>   <!-- end sitetools -->\n </div> <!-- end upperheader -->\n<!--UdmComment-->\n<div id="navbar">\n<p class="hidecss"><a href="#content">Sauter le menu</a></p>\n<ul>\n   <li><a href="intro/about">\xc0 propos de Debian</a></li>\n   <li><a href="distrib/">Obtenir Debian</a></li>\n   <li><a href="./support">Assistance</a></li>\n   <li><a href="./devel/">Le coin du d\xe9veloppeur</a></li>\n</ul>\n</div> <!-- end navbar -->\n\t<p id="breadcrumbs">&nbsp; </p>\n</div> <!-- end header -->\n<!--/UdmComment-->\n<div id="content">\n<span class="download"><a\nhref="http://cdimage.debian.org/debian-cd/7.0.0/multi-arch/iso-cd/debian-7.0.0-amd64-i386-netinst.iso">T\xe9l\xe9charger Debian\xa07.0<em>(installation par le r\xe9seau, PC 32 et 64\xa0bits)</em></a> </span>\n<div id="splash" style="text-align: center;">\n       <h1>Debian</h1>\n</div>\n<div id="intro">\n<p><a href="http://www.debian.org/">Debian</a> est un syst\xe8me d\'exploitation\n<a href="intro/free">libre</a> pour votre ordinateur. 
Un syst\xe8me d\'exploitation\nest la suite des programmes de base et des utilitaires qui permettent \xe0 un\nordinateur de fonctionner.\n</p>\n<p>Debian est bien plus qu\'un simple syst\xe8me d\'exploitation&nbsp;:\nil contient plus de 37500\n<a href="distrib/packages">paquets</a>&nbsp;; les paquets sont des composants\nlogiciels pr\xe9compil\xe9s con\xe7us pour s\'installer facilement sur votre machine.\n<a href="intro/about">Suite...</a>\n</div>\n<div id="hometoc">\n<!--UdmComment-->\n<ul id="hometoc-cola">\n  <li><a href="intro/about">\xc0&nbsp;propos&nbsp;de&nbsp;Debian</a>\n    <ul>\n      <li><a href="./social_contract">Notre contrat social</a></li>\n      <li><a href="./intro/free">Logiciel libre</a></li>\n      <li><a href="./partners/">Partenaires</a></li>\n      <li><a href="./donations">Dons</a></li>\n      <li><a href="./contact">Nous contacter</a></li>\n    </ul>\n  </li>\n  <li><a href="./intro/help">Aider Debian</a></li>\n</ul>\n<ul id="hometoc-colb">\n  <li><a href="distrib/">Obtenir Debian</a>\n    <ul>\n      <li><a href="distrib/netinst'
                    self = <Web Web <Web.plugin.Web object at 0x3d18650>>
                  parser = <Web.plugin.Title instance at 0x4111d88>
                 optlist = []
                     msg = IrcMsg(prefix="ProgVal!progval@pdpc/supporter/student/progval", command="PRIVMSG", args=('#supybot-fr', 'Limnoria: web title http://www.debian.org/index.fr.html'))
                     irc = <supybot.callbacks.NestedCommandsIrcProxy object at 0x410cc90>
                    size = 4096


Frame htmlToText in /home/progval/packages/lib/python2.7/site-packages/supybot/utils/web.py at line 199
                 chardet = <module 'chardet' from '/usr/lib/python2.7/dist-packages/chardet/__init__.pyc'>
              tagReplace = ' '
                       u = <chardet.universaldetector.UniversalDetector instance at 0x4111a28>
                       s = u"Debian -- Le syst\xe8me d'exploitation universel"


Frame feed in /usr/lib/python2.7/dist-packages/chardet/universaldetector.py at line 115
                    aBuf = u"Debian -- Le syst\xe8me d'exploitation universel"
                    aLen = 45
                    self = <chardet.universaldetector.UniversalDetector instance at 0x4111a28>
                  prober = <chardet.mbcsgroupprober.MBCSGroupProber instance at 0x4111170>


Frame feed in /usr/lib/python2.7/dist-packages/chardet/charsetgroupprober.py at line 59
                    aBuf = u"Debian -- Le syst\xe8me d'exploitation universel"
                    self = <chardet.mbcsgroupprober.MBCSGroupProber instance at 0x4111170>
                  prober = <chardet.sjisprober.SJISProber instance at 0x4115320>
                      st = 2


Frame feed in /usr/lib/python2.7/dist-packages/chardet/sjisprober.py at line 67
                    aBuf = u"Debian -- Le syst\xe8me d'exploitation universel"
                       i = 0
                    self = <chardet.sjisprober.SJISProber instance at 0x4115320>
                    aLen = 45
                 charLen = 1
             codingState = 0


Frame feed in /usr/lib/python2.7/dist-packages/chardet/jpcntx.py at line 145
                       i = 0
                    aBuf = [u'D']
                    aLen = 1
                    self = <chardet.jpcntx.SJISContextAnalysis instance at 0x41153f8>


Frame get_order in /usr/lib/python2.7/dist-packages/chardet/jpcntx.py at line 176
                    aStr = [u'D']
                    self = <chardet.jpcntx.SJISContextAnalysis instance at 0x41153f8>

Regards,
Valentin
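The frames above show that `aBuf` is a unicode object (`u"Debian -- Le syst\xe8me ..."`) by the time it reaches the detector, and the failing line compares its elements against byte-string literals such as '\x81', which Python 2 tries to decode as ASCII. That suggests the detector expects raw bytes. A possible workaround on the caller's side, sketched here for Python 3 with a hypothetical helper name and an assumed UTF-8 default, is to make sure the input is bytes before feeding it:

```python
def ensure_bytes(data, encoding="utf-8"):
    """Return data as bytes, encoding it first if it is text.

    Hypothetical helper; the UTF-8 default is an assumption, not
    something the report specifies.
    """
    if isinstance(data, str):
        return data.encode(encoding)
    return bytes(data)


if __name__ == "__main__":
    title = "Debian -- Le syst\xe8me d'exploitation universel"
    payload = ensure_bytes(title)
    print(type(payload).__name__, len(payload))
    # The detector would then be fed bytes, e.g. detector.feed(payload)
```

Note that text which has already been decoded no longer needs encoding detection, so in this case it may be simpler to skip the detector entirely.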
