html-parse

html-parse is an efficient, reasonably robust HTML tokenizer based on the HTML5 tokenization specification. The parser is written using the fast attoparsec parsing library and exposes both a native attoparsec Parser and convenience functions for lazily parsing token streams out of strict and lazy Text values.

For instance,

>>> parseTokens "<div><h1>Hello World</h1><br/><p class=widget>Example!</p></div>"
[TagOpen "div" [],TagOpen "h1" [],ContentText "Hello World",TagClose "h1",TagSelfClose "br" [],TagOpen "p" [Attr "class" "widget"],ContentText "Example!",TagClose "p",TagClose "div"]
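
Once a document has been tokenized, downstream code usually just walks the flat token list. The sketch below shows one way that might look. To keep it dependency-free, it defines its own `Token` and `Attr` types, which are hypothetical stand-ins modelled only on the constructors visible in the output above and which use String where the library works on Text. The helpers `textOf` and `attrValues` are likewise illustrative names, not part of html-parse.

```haskell
module Main where

import Data.Maybe (mapMaybe)

-- Hypothetical stand-ins for the library's token types, modelled only on the
-- constructors visible in the example output. The real library works on
-- Text; String is used here so the sketch needs nothing beyond base.
data Attr = Attr String String
  deriving (Show, Eq)

data Token
  = TagOpen String [Attr]
  | TagSelfClose String [Attr]
  | TagClose String
  | ContentText String
  deriving (Show, Eq)

-- The token stream from the example, written out by hand.
exampleTokens :: [Token]
exampleTokens =
  [ TagOpen "div" []
  , TagOpen "h1" []
  , ContentText "Hello World"
  , TagClose "h1"
  , TagSelfClose "br" []
  , TagOpen "p" [Attr "class" "widget"]
  , ContentText "Example!"
  , TagClose "p"
  , TagClose "div"
  ]

-- Collect the text nodes in document order.
textOf :: [Token] -> [String]
textOf = mapMaybe go
  where
    go (ContentText t) = Just t
    go _               = Nothing

-- Collect the values of one attribute across all opening and
-- self-closing tags.
attrValues :: String -> [Token] -> [String]
attrValues name toks =
  [ v | tok <- toks, Attr k v <- attrsOf tok, k == name ]
  where
    attrsOf (TagOpen _ as)      = as
    attrsOf (TagSelfClose _ as) = as
    attrsOf _                   = []

main :: IO ()
main = do
  print (textOf exampleTokens)              -- ["Hello World","Example!"]
  print (attrValues "class" exampleTokens)  -- ["widget"]
```

Because the tokenizer yields a flat list rather than a tree, queries like these are ordinary list traversals. Anything that depends on nesting, such as "text inside the h1", would need to track open and close tags itself.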

Performance

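Here are some typical performance numbers taken from parsing a fairly long Wikipedia article. In this run html-parse takes about 21 ms, roughly eight times faster than either tagsoup configuration (about 171 ms and 177 ms):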

benchmarking Forced/tagsoup fast Text
time                 171.2 ms   (166.4 ms .. 177.3 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 171.9 ms   (169.4 ms .. 173.2 ms)
std dev              2.516 ms   (1.104 ms .. 3.558 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 176.9 ms   (167.3 ms .. 188.5 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 180.7 ms   (177.5 ms .. 183.7 ms)
std dev              4.246 ms   (2.316 ms .. 5.803 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 20.88 ms   (20.60 ms .. 21.25 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 20.99 ms   (20.81 ms .. 21.20 ms)
std dev              446.1 μs   (336.4 μs .. 596.2 μs)

Contributors

bgamari, fisx, stepcut, ddssff, felixonmars