Scrapy can not auto detect GBK html encoding about w3lib HOT 3 OPEN

scrapy commented on August 24, 2024

Scrapy can not auto detect GBK html encoding

from w3lib.

Comments (3)

kostalski commented on August 24, 2024 3

Hi @samuelchen @Gallaecio ,

The source of the encoding detection problem seems to be the invalid input HTML itself, not w3lib. The HTML contains an invalid meta tag: <meta httpequiv="ContentType" ..., whereas the valid form (per W3C) is <meta http-equiv="Content-Type" ... (the dash characters are missing). Because of that, w3lib does not detect the declared encoding.

beautifulsoup4 detects the 'gbk' encoding because it uses a naive regex for fallback encoding detection (lib: beautifulsoup4 file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').
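
To illustrate why that fallback finds the charset even with the malformed attribute names, here is a minimal sketch that applies the quoted regex directly with Python's `re` module. It only imitates the fallback step: the `re.I` flag and the `find_declared_charset` helper are assumptions made for illustration, not bs4's actual code path.

```python
import re

# Pattern copied from the bs4/dammit.py line quoted above, encoded to bytes.
# The re.I flag is an assumption made for this sketch.
HTML_META = re.compile(
    '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'.encode("ascii"),
    re.I,
)


def find_declared_charset(markup):
    """Hypothetical helper: return the first charset the regex finds, or None."""
    match = HTML_META.search(markup)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# The malformed tag from the report: no dashes in the attribute name or value.
broken = b'<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
print(find_declared_charset(broken))
```

The regex never looks at the http-equiv attribute at all; it only needs a meta tag that contains charset=..., which is why the missing dashes do not matter to it.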

For @samuelchen's problem, w3lib could be updated to be more forgiving/lenient by changing (lib: w3lib, file: w3lib/encoding.py)
From: _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')
To: _HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type')

After this fix, w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;)
What do you think, @Gallaecio?
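
As a rough check of the proposed change, the sketch below compares a strict and a lenient attribute pattern on a valid tag and on the malformed one. The two patterns are simplified stand-ins written for this example, not w3lib's real _TEMPLATE, and the `matches` helper is hypothetical.

```python
import re

# Simplified stand-ins for the attribute check; NOT w3lib's real _TEMPLATE.
STRICT = re.compile(rb'http-equiv\s*=\s*["\']?\s*Content-Type', re.I)
# The optional dashes ("-?") are the only difference from STRICT.
LENIENT = re.compile(rb'http-?equiv\s*=\s*["\']?\s*Content-?Type', re.I)

valid = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
broken = b'<meta httpequiv="ContentType" content="text/html; charset=gbk" />'


def matches(pattern, html):
    """Hypothetical helper: True if the pattern is found anywhere in html."""
    return pattern.search(html) is not None


for name, pattern in (("strict", STRICT), ("lenient", LENIENT)):
    print(name, matches(pattern, valid), matches(pattern, broken))
```

In this sketch the lenient pattern still accepts the valid form, so well-formed pages would be treated as before.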

More details below.

Details

I was able to reproduce the issue with the following setup:

Python 3.7.9
libs:
-- beautifulsoup4==4.9.3
-- html5lib==1.1
-- lxml==4.6.1
-- w3lib==1.22.0

Test Python script:

from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))

Script output:

html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk

BeautifulSoup detects the encoding only with the 'lxml' parser, through its fallback encoding detection.
lib: beautifulsoup4
file: bs4/dammit.py
line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

kostalski commented on August 24, 2024 1

Ok @samuelchen, no problem 👍

samuelchen commented on August 24, 2024

@kostalski Thank you for the feedback. I cannot recall why that HTML had httpequiv="ContentType". I am not sure whether it could have been converted by other parts of Scrapy or whether it was in the original page. I am sorry about this; it was too long ago to remember.
By the way, GB18030 is compatible with GBK.
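
A small sketch of that compatibility, using the title bytes from the test script above and Python's built-in codecs. It only checks this one sample, so it illustrates the point rather than proving it for every byte sequence.

```python
# Title bytes taken from the test script earlier in this thread.
title = b"\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc"

# Decode the same bytes with both codecs shipped with Python.
as_gbk = title.decode("gbk")
as_gb18030 = title.decode("gb18030")

# True when both codecs produce the same text for this sample.
print(as_gbk == as_gb18030)
```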
