
Comments (5)

kmike commented on July 22, 2024

@starrify I believe the goal was indeed speed. Also, these regexes may be given only part of the page, e.g. the first 4096 bytes, without the rest. Ideas for a proper solution are welcome! It should:

a) be almost as fast as these regexes;
b) work on arbitrarily truncated HTML files.

from w3lib.

openandclose commented on July 22, 2024

Hello, just for your reference.

I recently tested w3lib's prescan against the 500 most popular websites.
I found three bugs (or behaviors that differ from the HTML5 spec).

books.google.com:
<meta http-equiv="content-type"content="text/html; charset=UTF-8">
(no space between attributes)

mega.nz:
<meta http-equiv="Content-Type" content="text/html, charset=UTF-8" />
(comma, not semicolon)

stuff.co.nz:
doc.write('<body onload=[...] <meta charset="utf-8"/>
(matching '<body')

The prescan parsers in validator, jsdom and html5lib-python all detect the encoding successfully in these cases.

...I don't know whether it is a good idea to fix these and make the prescan regex even more complex.
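For the mega.nz case specifically, here is a minimal sketch of how the charset could be pulled out of the `content` value without caring about the separator. It follows my reading of the HTML spec's algorithm for extracting an encoding from a meta element: find `charset`, skip whitespace, require `=`, then read a quoted or unquoted value. The function name `charset_from_content` is hypothetical and not part of w3lib, and the sketch returns the raw label rather than mapping it to a known encoding. It does not address the missing space between attributes or the `<body` match inside a script.

```python
import re

# "charset", optional ASCII whitespace, "=", optional ASCII whitespace.
# Searching for the whole pattern skips any "charset" not followed by "=".
_CHARSET_RE = re.compile(r"charset[\t\n\f\r ]*=[\t\n\f\r ]*", re.IGNORECASE)


def charset_from_content(content):
    """Return the charset label found in a meta content value, or None.

    Hypothetical helper for illustration only; not w3lib's implementation.
    """
    match = _CHARSET_RE.search(content)
    if match is None:
        return None
    rest = content[match.end():]
    if not rest:
        return None
    if rest[0] in "\"'":
        # Quoted value: needs a matching closing quote, otherwise give up.
        end = rest.find(rest[0], 1)
        return rest[1:end] if end != -1 else None
    # Unquoted value: runs up to the first ASCII whitespace or ";".
    return re.split(r"[\t\n\f\r ;]", rest, maxsplit=1)[0]


print(charset_from_content("text/html, charset=UTF-8"))
print(charset_from_content("text/html; charset=UTF-8"))
```

Because the separator before `charset` is never inspected, the comma and semicolon variants are treated the same way.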


devspyrosv commented on July 22, 2024

Just to add my 2 cents and to bump this issue,

Indeed, regex parsing of the HTML seems to miss some things and, as others have said before me, commented-out base tags are one example. In some cases those commented-out base tags point to different websites altogether. So for me the question is speed vs. accuracy. One can fork w3lib, or override scrapy/utils/response.py: get_base_url() and make it call an also-overridden w3lib/html.py:get_base_url() with the addition of @starrify


botzill commented on July 22, 2024

Another issue here is that it does not ignore commented-out tags. For example, we may have a commented-out base tag like:

 <!--<base href="http://127.0.0.1" />-->
 <base href="http://www.example.com/" />

Of course, according to _baseurl_re, it will take the commented-out one.

Any ideas on how we can solve this?
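One possible direction, as a rough sketch only: strip comments from the inspected prefix before searching for the base tag, and treat an unclosed comment as running to the end of the input so that truncated pages are handled too. The patterns and the name `find_base_href` below are hypothetical and are not w3lib's actual `_baseurl_re`; the 4096 default is just the example size mentioned earlier in this thread. The sketch does not try to handle `<!--` appearing inside scripts or attribute values.

```python
import re

# A comment ends at "-->" or, if the input was truncated, at end of input.
_COMMENT_RE = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)

# Simplified stand-in for a base-tag regex (illustration only).
_BASE_RE = re.compile(
    r"<base\s[^>]*href\s*=\s*[\"']\s*([^\"'\s]+)\s*[\"']",
    re.IGNORECASE,
)


def find_base_href(html, max_chars=4096):
    """Return the first non-commented base href in the prefix, or None."""
    head = html[:max_chars]
    head = _COMMENT_RE.sub("", head)  # drop comments before matching
    match = _BASE_RE.search(head)
    return match.group(1) if match else None


sample = (
    '<!--<base href="http://127.0.0.1" />-->\n'
    '<base href="http://www.example.com/" />'
)
print(find_base_href(sample))
```

This costs one extra regex pass over the prefix, so it should stay in the same speed class as the current approach, though I have not benchmarked it.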


fonkwe commented on July 22, 2024

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions.
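For comparison, here is a minimal sketch that uses the standard library's `html.parser.HTMLParser` instead of a regex. Comments are tokenized separately, so a commented-out base tag is never reported as a start tag, and `feed()` simply stops at an incomplete construct when the input is truncated. The class and function names are hypothetical, and I have not measured how its speed compares with the regex approach.

```python
from html.parser import HTMLParser


class BaseHrefParser(HTMLParser):
    """Records the href of the first <base> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <base ... />.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def parse_base_href(html):
    """Return the first non-commented base href, or None (illustration only)."""
    parser = BaseHrefParser()
    parser.feed(html)  # close() is not called: the input may be truncated
    return parser.base_href


page = (
    '<!--<base href="http://127.0.0.1" />-->\n'
    '<base href="http://www.example.com/" />'
)
print(parse_base_href(page))
```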

