Comments (5)
@starrify I believe the goal was indeed speed. Also, these regexes may be given only part of the page, e.g. the first 4096 bytes, without the rest. Ideas for a proper solution are welcome! It should:
a) be almost as fast as these regexes;
b) work on arbitrarily truncated HTML files.
from w3lib.
Hello, just for your reference.
I recently tested w3lib's prescan against the 500 most popular websites.
I found three bugs (or behaviors that differ from the HTML5 spec).
books.google.com:
<meta http-equiv="content-type"content="text/html; charset=UTF-8">
(no space between attributes)
mega.nz:
<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />
(comma, not semicolon)
stuff.co.nz:
doc.write('<body onload=[...] <meta charset="utf-8"/>
(matching '<body')
validator's, jsdom's and html5lib-python's prescan parsers detect the encoding successfully.
...I don't know whether it is a good idea to fix these and make the prescan regex even more complex.
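The three snippets above could be kept as regression checks. Below is a minimal sketch of that idea, assuming a deliberately simplified pair of patterns: `sniff_meta_charset`, `_META_TAG_RE` and `_CHARSET_RE` are hypothetical names, not w3lib's actual regexes, and this is not the full HTML5 prescan algorithm.

```python
import re

# Hypothetical, simplified patterns for illustration only. They are not
# w3lib's regexes and do not implement the HTML5 prescan algorithm.
_META_TAG_RE = re.compile(r"<meta[\s/][^>]*>", re.IGNORECASE)
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE)


def sniff_meta_charset(html):
    """Return the first charset named inside a complete <meta> tag, or None."""
    for tag in _META_TAG_RE.finditer(html):
        found = _CHARSET_RE.search(tag.group(0))
        if found:
            return found.group(1).lower()
    return None


# The three snippets reported in this comment, copied as-is.
REPORTED_CASES = {
    "books.google.com": '<meta http-equiv="content-type"content="text/html; charset=UTF-8">',
    "mega.nz": '<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />',
    "stuff.co.nz": 'doc.write(\'<body onload=[...] <meta charset="utf-8"/>',
}

for site, snippet in REPORTED_CASES.items():
    print(site, sniff_meta_charset(snippet))
```

Because the sketch looks inside each `<meta ...>` tag independently, it does not care about missing spaces between attributes, about the separator before `charset`, or about an earlier `<body`. It also skips a `<meta` tag that was cut off by truncation, and it would wrongly accept a `<meta>` tag inside a comment or script, so it trades accuracy for simplicity.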
Just to add my 2 cents and to bump this issue:
Indeed, regex parsing of the HTML seems to miss some things and, as others have said before me, commented-out base tags are one example. In some cases those commented-out base tags point to different websites altogether. So for me the question is speed vs. accuracy. One can fork w3lib, or override get_base_url() in scrapy/utils/response.py and make it call a similarly overridden get_base_url() from w3lib/html.py, with the addition of @starrify
Another issue here is that it does not ignore commented-out tags. For example, we may have a commented-out base tag like:
<!--<base href="http://127.0.0.1" />-->
<base href="http://www.example.com/" />
Of course, _baseurl_re will match the commented-out one.
Any ideas on how we can solve this?
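One possible direction, shown here as a minimal sketch rather than a proposed patch: strip HTML comments before searching for the base tag. The names `find_base_href`, `_COMMENT_RE` and `_BASE_HREF_RE` are hypothetical, and the base pattern is only an approximation written for this example; w3lib's real _baseurl_re may differ.

```python
import re

# Hypothetical patterns for illustration; w3lib's real _baseurl_re may differ.
# An unclosed comment (e.g. in a truncated page) is treated as running to the
# end of the input, hence the \Z alternative.
_COMMENT_RE = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)
_BASE_HREF_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s]+)\s*[\"']", re.IGNORECASE
)


def find_base_href(html):
    """Return the href of the first <base> tag outside HTML comments, or None."""
    uncommented = _COMMENT_RE.sub("", html)
    match = _BASE_HREF_RE.search(uncommented)
    return match.group(1) if match else None


html = (
    '<!--<base href="http://127.0.0.1" />-->\n'
    '<base href="http://www.example.com/" />'
)
print(find_base_href(html))
```

The extra substitution pass costs one more scan of the input, which stays cheap when only the first few kilobytes are inspected. It does not handle `<!--` sequences that appear inside scripts or attribute values, so it is still an approximation of real HTML parsing.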
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions.