Comments (3)

kostalski commented on August 24, 2024

Hi @samuelchen @Gallaecio ,

The source of the encoding detection problem seems to be the invalid input HTML itself, not w3lib. The meta tag is malformed: the input has <meta httpequiv="ContentType" ..., but the valid form (per W3C) is <meta http-equiv="Content-Type" ... (the hyphens are missing). Because of that, w3lib does not detect the declared encoding.

beautifulsoup4 detects 'gbk' because it uses a naive regex as a fallback for encoding detection (lib: beautifulsoup4, file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').

For @samuelchen's problem, w3lib could be made more forgiving/lenient by updating (lib: w3lib, file: w3lib/encoding.py):
From: _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')
To: _HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type')

After this fix, w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;)
What do you think, @Gallaecio?
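
To illustrate the effect of the proposed change, here is a minimal sketch. Note that TEMPLATE below is a simplified stand-in written for this example, not w3lib's actual _TEMPLATE, and STRICT/LENIENT are hypothetical names; only the two pattern fragments come from the From/To lines above.

```python
import re

# Simplified stand-in for an attribute-matching template.
# Assumption: this is NOT the real w3lib pattern, just enough to show the idea.
TEMPLATE = r'''%s\s*=\s*["']?\s*%s\s*["']?'''

# Current (strict) fragments vs. the proposed lenient ones.
STRICT = re.compile(TEMPLATE % ("http-equiv", "Content-Type"), re.I)
LENIENT = re.compile(TEMPLATE % (r"http-?equiv", r"Content-?Type"), re.I)

valid = '<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
malformed = '<meta httpequiv="ContentType" content="text/html; charset=gbk" />'

results = {
    "strict/valid": STRICT.search(valid) is not None,
    "strict/malformed": STRICT.search(malformed) is not None,
    "lenient/valid": LENIENT.search(valid) is not None,
    "lenient/malformed": LENIENT.search(malformed) is not None,
}

for name, matched in results.items():
    print(name, matched)
```

The optional hyphen (-?) keeps the valid spelling matching while also accepting the malformed one, which is why the change should not affect pages that already declare the attribute correctly.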

More details below.


Details

I was able to reproduce the issue with the provided settings:

  • Python 3.7.9
  • libs:
    -- beautifulsoup4==4.9.3
    -- html5lib==1.1
    -- lxml==4.6.1
    -- w3lib==1.22.0

Test Python script:

from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))

Script output:

html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk

BeautifulSoup detects the encoding only with the 'lxml' parser, via its fallback encoding detection:
lib: beautifulsoup4
file: bs4/dammit.py
line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
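
As a rough illustration of why that fallback still finds the charset, the quoted pattern can be run directly against the malformed meta tag. This is only a sketch: it compiles the pattern on str with re.I for simplicity, which may differ from how bs4 applies it internally, and meta_re/detected are names made up for this example.

```python
import re

# Pattern as quoted above from bs4/dammit.py.
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'
# Assumption: compiled on str, case-insensitive, for this illustration only.
meta_re = re.compile(html_meta, re.I)

malformed = '<meta httpequiv="ContentType" content="text/html; charset=gbk" />'

# The regex only looks for "charset=" somewhere inside a <meta ...> tag,
# so the misspelled httpequiv attribute does not matter to it.
match = meta_re.search(malformed)
detected = match.group(1) if match else None
print(detected)
```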

from w3lib.

kostalski commented on August 24, 2024

Ok @samuelchen, no problem 👍

samuelchen commented on August 24, 2024

@kostalski Thank you for the feedback. I am not able to recall why that HTML had httpequiv="ContentType". I am not sure whether it could have been converted by other parts of Scrapy or whether it is original. I am sorry about this; it was too long ago to remember.
Btw, GB18030 is compatible with GBK.
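
A quick way to check that compatibility claim on the data from this issue, as a sketch using only Python's built-in codecs and the title bytes from the test script above (variable names are illustrative):

```python
# Title bytes taken from the <TITLE> element in the test script above.
title_bytes = b'\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc'

# Decode the same bytes with both codecs.
as_gbk = title_bytes.decode('gbk')
as_gb18030 = title_bytes.decode('gb18030')

# If GB18030 is a superset of GBK, both decodes agree and the
# text re-encodes to the original bytes under gb18030.
same_text = as_gbk == as_gb18030
roundtrip = as_gbk.encode('gb18030') == title_bytes

print(same_text, roundtrip)
```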
