
It's not a good idea to parse HTML text using regular expressions (w3lib issue, closed)

scrapy commented on July 22, 2024

It's not a good idea to parse HTML text using regular expressions

from w3lib.

Comments (5)

kmike commented on July 22, 2024

@starrify I believe the goal was indeed speed; also, these regexes may be given only part of the page, e.g. the first 4096 bytes, without the rest. Ideas about a proper solution are welcome! It should:

a) be almost as fast as these regexes;
b) work on arbitrarily truncated HTML files.
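
One possible direction for (b), sketched with Python's standard-library `html.parser` rather than anything w3lib ships: feed the parser whatever prefix of the page is available and never call `close()`, so an unfinished tag at the cut is simply left unparsed. The names `_BaseHrefFinder` and `find_base_href` and the 4096 default are assumptions made for this illustration; whether this gets close enough to the regexes on speed for (a) would need measuring.

```python
from html.parser import HTMLParser
from typing import Optional


class _BaseHrefFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first <base> tag that has one."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; commented-out markup never reaches this handler.
        if tag == "base" and self.href is None:
            self.href = dict(attrs).get("href")


def find_base_href(html: str, limit: int = 4096) -> Optional[str]:
    """Return the first <base href> found in the first `limit` characters, if any."""
    finder = _BaseHrefFinder()
    # feed() without close(): an incomplete tag at the cut stays in the buffer.
    finder.feed(html[:limit])
    return finder.href
```

Because the parser tracks comments itself, a commented-out `<base>` is skipped without any extra pattern.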


openandclose commented on July 22, 2024

Hello, just for your reference.

I recently tested w3lib's prescan against the 500 most popular websites.
I found three bugs (or behaviors that differ from the HTML5 spec).

books.google.com:
<meta http-equiv="content-type"content="text/html; charset=UTF-8">
(no space between attributes)

mega.nz:
<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />
(comma, not semicolon)

stuff.co.nz:
doc.write('<body onload=[...] <meta charset="utf-8"/>
(matching '<body')

The prescan parsers in validator, jsdom, and html5lib-python detect the encoding successfully.

...I don't know whether it is a good idea to fix these and make the prescan regex even more complex.
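
For reference, the three snippets above can be exercised with a deliberately loose pattern. This is only a sketch under stated assumptions: `META_CHARSET_RE` and `sniff_meta_charset` are hypothetical names, the pattern is not w3lib's regex, and it is not the HTML5 prescan algorithm. It accepts any `<meta ...>` tag that contains `charset=`, which tolerates the missing space and the comma but would also over-match in ways the spec does not allow.

```python
import re
from typing import Optional

# Hypothetical loose pattern for illustration only; not w3lib's regex and not
# the HTML5 prescan. Matches any <meta ...> tag containing "charset=".
META_CHARSET_RE = re.compile(
    rb"""<meta[\s/][^>]*?charset\s*=\s*["']?\s*([a-z0-9_:.\-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(head: bytes) -> Optional[str]:
    """Return the lowercased charset label of the first matching <meta>, if any."""
    match = META_CHARSET_RE.search(head)
    return match.group(1).decode("ascii").lower() if match else None


# The three problem cases reported above, copied as byte strings.
samples = [
    b'<meta http-equiv="content-type"content="text/html; charset=UTF-8">',
    b'<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />',
    b"doc.write('<body onload=[...] <meta charset=\"utf-8\"/>",
]
results = [sniff_meta_charset(sample) for sample in samples]
```

Loosening the pattern this way trades missed matches for false positives, which is the same concern about growing the regex raised above.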


devspyrosv commented on July 22, 2024

Just to add my 2 cents and to bump this issue,

Indeed, regex parsing of the HTML seems to miss some things and, as others have said before me, commented-out base tags are one example. In some cases those commented-out base tags point to different websites altogether. So for me the question is speed vs. accuracy. One can fork w3lib, or override scrapy/utils/response.py: get_base_url() and make it call an also-overridden w3lib/html.py:get_base_url() with the addition of @starrify


botzill commented on July 22, 2024

Another issue here is that it does not ignore commented-out tags. For example, we may have a commented-out base tag like:

 <!--<base href="http://127.0.0.1" />-->
 <base href="http://www.example.com/" />

Of course, according to _baseurl_re, it will take the commented-out one.

Any ideas on how we can solve this?
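
If the regex approach is kept for speed, one low-cost option is to strip comments before searching. The sketch below is an illustration under assumptions: `COMMENT_RE` and `BASE_HREF_RE` are simplified stand-ins written for this example, not w3lib's `_baseurl_re`, and the function name and 4096 limit are arbitrary. An unterminated comment at the end of a truncated page is treated as running to the end of the input.

```python
import re
from typing import Optional

# Simplified, hypothetical patterns; not the ones w3lib actually uses.
# A comment runs to "-->" or, if the input was truncated, to the end of the text.
COMMENT_RE = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)
BASE_HREF_RE = re.compile(
    r"""<base\s[^>]*?href\s*=\s*["']?([^"'\s>]+)""",
    re.IGNORECASE,
)


def base_href_ignoring_comments(html: str, limit: int = 4096) -> Optional[str]:
    """Return the first <base href> outside comments in the first `limit` characters."""
    chunk = COMMENT_RE.sub("", html[:limit])
    match = BASE_HREF_RE.search(chunk)
    return match.group(1) if match else None
```

This still is not a parser: a `<!--` that appears inside a script string, for instance, would be treated as the start of a comment.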


fonkwe commented on July 22, 2024

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.

