LtGt

Development of this project is entirely funded by the community. Consider donating to support!

Note: As an alternative, consider using AngleSharp, which is a more performant and feature-complete HTML processing library.

LtGt is a minimalistic library for working with HTML. It can parse any HTML5-compliant code into an object model which you can use to traverse nodes or locate specific elements. The library establishes itself as a foundation that you can build upon, and comes with a lot of extension methods that can help navigate the DOM easily.

Download

  • NuGet: dotnet add package LtGt

Features

  • Parse any HTML5-compliant code
  • Traverse the DOM using LINQ or Seq
  • Use basic element selectors like GetElementById(), GetElementsByTagName(), etc.
  • Use CSS selectors via QueryElements()
  • Convert any HTML node to its equivalent Linq2Xml representation
  • Render any HTML entity to code
  • Targets .NET Framework 4.5+ and .NET Standard 1.6+

Screenshots

[Images: dom, css selectors]

Usage

LtGt is written in F#, but it provides two separate idiomatic APIs: one for C# and one for F#.

Parse a document

C#

using LtGt;

const string html = @"<!doctype html>
<html>
  <head>
    <title>Document</title>
  </head>
  <body>
    <div>Content</div>
  </body>
</html>";

// This throws an exception on parse errors
var document = Html.ParseDocument(html);

// -or-

// This returns a wrapped result instead
var documentResult = Html.TryParseDocument(html);
if (documentResult.IsOk)
{
    // Handle result
    var document = documentResult.ResultValue;
}
else
{
    // Handle error
    var error = documentResult.ErrorValue;
}

F#

open LtGt

let html = "<!doctype html>
<html>
  <head>
    <title>Document</title>
  </head>
  <body>
    <div>Content</div>
  </body>
</html>"

// This throws an exception on parse errors
let document = Html.parseDocument html

// -or-

// This returns a wrapped result instead
match Html.tryParseDocument html with
| Result.Ok document -> () // handle result
| Result.Error error -> () // handle error

Parse a fragment

C#

const string html = "<div id=\"some-element\"><a href=\"https://example.com\">Link</a></div>";

// Parse an element node
var element = Html.ParseElement(html);

// Parse any node
var node = Html.ParseNode(html);

F#

let html = "<div id=\"some-element\"><a href=\"https://example.com\">Link</a></div>"

// Parse an element node
let element = Html.parseElement html

// Parse any node
let node = Html.parseNode html

Find specific element

C#

using System.Linq; // for FirstOrDefault()

var element1 = document.GetElementById("menu-bar");
var element2 = document.GetElementsByTagName("div").FirstOrDefault();
var element3 = document.GetElementsByClassName("floating-button floating-button--enabled").FirstOrDefault();

// FirstOrDefault() returns null when nothing matches, so check for null in real code
var element1Data = element1.GetAttributeValue("data");
var element2Id = element2.GetId();
var element3Text = element3.GetInnerText();

F#

let element1 = document |> Html.tryElementById "menu-bar"
let element2 = document |> Html.elementsByTagName "div" |> Seq.tryHead
let element3 = document |> Html.elementsByClassName "floating-button floating-button--enabled" |> Seq.tryHead

let element1Data = element1 |> Option.bind (Html.tryAttributeValue "data")
let element2Id = element2 |> Option.bind Html.tryId
let element3Text = element3 |> Option.map Html.innerText

You can leverage the full power of CSS selectors as well.

C#

var element = document.QueryElements("div#main > span.container:empty").FirstOrDefault();

F#

let element = document |> CssSelector.queryElements "div#main > span.container:empty" |> Seq.tryHead

Check equality

You can compare two HTML entities by value, including their descendants.

C#

var element1 = new HtmlElement("span",
    new HtmlAttribute("id", "foo"),
    new HtmlText("bar"));

var element2 = new HtmlElement("span",
    new HtmlAttribute("id", "foo"),
    new HtmlText("bar"));

var element3 = new HtmlElement("span",
    new HtmlAttribute("id", "foo"),
    new HtmlText("oof"));

var firstTwoEqual = HtmlEntityEqualityComparer.Instance.Equals(element1, element2); // true
var lastTwoEqual = HtmlEntityEqualityComparer.Instance.Equals(element2, element3); // false

F#

let element1 = HtmlElement("span",
    HtmlAttribute("id", "foo"),
    HtmlText("bar"))

let element2 = HtmlElement("span",
    HtmlAttribute("id", "foo"),
    HtmlText("bar"))

let element3 = HtmlElement("span",
    HtmlAttribute("id", "foo"),
    HtmlText("oof"))

let firstTwoEqual = Html.equal element1 element2 // true
let lastTwoEqual = Html.equal element2 element3 // false

Convert to Linq2Xml

You can convert LtGt's objects to System.Xml.Linq objects (XNode, XElement, etc). This can be useful if you need to convert HTML to XML or if you want to use XPath to select nodes.

C#

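using System.Xml.Linq;  // XDocument
using System.Xml.XPath; // XPathSelectElements() extension method

var htmlDocument = Html.ParseDocument(html);
var xmlDocument = (XDocument) htmlDocument.ToXObject();
var elements = xmlDocument.XPathSelectElements("//input[@type=\"submit\"]");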

F#

open System.Xml.Linq
open System.Xml.XPath

let htmlDocument = Html.parseDocument html
let xmlDocument = htmlDocument |> Html.toXObject :?> XDocument
let elements = xmlDocument.XPathSelectElements("//input[@type=\"submit\"]")

Render nodes

You can turn any entity to its equivalent HTML code.

C#

var element = new HtmlElement("div",
    new HtmlAttribute("id", "main"),
    new HtmlText("Hello world"));

var html = element.ToHtml(); // <div id="main">Hello world</div>

F#

let element = HtmlElement("div",
    HtmlAttribute("id", "main"),
    HtmlText("Hello world"))

let html = element |> Html.toHtml // <div id="main">Hello world</div>

Benchmarks

This is how LtGt compares to popular HTML libraries when parsing a document (in this case, a YouTube video watch page). The results are not in LtGt's favor: it is roughly 3.5 times slower than AngleSharp and about twice as slow as HtmlAgilityPack, so if performance is important for your task, you should probably consider using a different parser. That said, these results are still pretty impressive for a parser built with parser combinators as opposed to a traditional hand-written approach.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.14393.3384 (1607/AnniversaryUpdate/Redstone1)
Intel Core i5-4460 CPU 3.20GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
Frequency=3125000 Hz, Resolution=320.0000 ns, Timer=TSC
.NET Core SDK=3.1.100
[Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT DEBUG
DefaultJob : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
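|          Method |     Mean |    Error |   StdDev | Ratio | Rank |
|---------------- |---------:|---------:|---------:|------:|-----:|
|      AngleSharp | 11.94 ms | 0.104 ms | 0.097 ms |  0.29 |    1 |
| HtmlAgilityPack | 20.51 ms | 0.140 ms | 0.124 ms |  0.49 |    2 |
|            LtGt | 41.59 ms | 0.450 ms | 0.399 ms |  1.00 |    3 |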
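
A comparison like this could be set up with BenchmarkDotNet roughly as follows. This is only a sketch, not the project's actual benchmark code: the class name, the method names, and the input file path watch-page.html are assumptions made for illustration, and it presumes the BenchmarkDotNet, LtGt, AngleSharp, and HtmlAgilityPack packages are all referenced. LtGt is marked as the baseline so that the Ratio column is reported relative to it, as in the table above.

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using LtGt;

[RankColumn] // adds the Rank column to the summary
public class ParseDocumentBenchmarks
{
    private string _html = "";

    [GlobalSetup]
    public void Setup()
    {
        // Hypothetical path: any saved HTML page can serve as the input
        _html = File.ReadAllText("watch-page.html");
    }

    [Benchmark(Description = "AngleSharp")]
    public object ParseWithAngleSharp() =>
        new AngleSharp.Html.Parser.HtmlParser().ParseDocument(_html);

    [Benchmark(Description = "HtmlAgilityPack")]
    public object ParseWithHtmlAgilityPack()
    {
        var document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(_html);
        return document;
    }

    // Baseline = true makes the other results relative to LtGt
    [Benchmark(Description = "LtGt", Baseline = true)]
    public object ParseWithLtGt() =>
        Html.ParseDocument(_html);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ParseDocumentBenchmarks>();
}
```

Each method returns the parsed document so that the work cannot be optimized away. BenchmarkDotNet expects to be run in a Release build, for example with dotnet run -c Release.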

ltgt's People

Contributors

tyrrrz

ltgt's Issues

Wrong render output with valid HTML5

Test HTML:

<!doctype html><meta charset=utf-8><title>shortest html5</title>

Output:

<!doctype html>
<meta charset="utf" -8>
<title>
  shortest html5
</title>

Failing to Parse HTML

Hey. I'm trying to parse the BandCamp website, but parsing fails with the following error:

Error in Ln: 4 Col: 1
<html   xmlns:og="http://opengraphprotocol.org/schema/"
^
Expecting: any char not in ‘<’, end of input, '<!', '<!--', '<![CDATA[' or '<?'

The parser backtracked after:
  Error in Ln: 4 Col: 2
  <html   xmlns:og="http://opengraphprotocol.org/schema/"
   ^
  Expecting: Raw text element tag

The parser backtracked after:
  Error in Ln: 4 Col: 2
  <html   xmlns:og="http://opengraphprotocol.org/schema/"
   ^
  Expecting: Void element tag

The parser backtracked after:
  Error in Ln: 5 Col: 53
          xmlns:fb="http://www.facebook.com/2008/fbml">
                                                      ^
  Expecting: '/>'

The parser backtracked after:
  Error in Ln: 187 Col: 1
  <body class="has-menubar " lang="en">
  ^
  Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</html' or '<?'

  The parser backtracked after:
    Error in Ln: 187 Col: 2
    <body class="has-menubar " lang="en">
     ^
    Expecting: Raw text element tag

  The parser backtracked after:
    Error in Ln: 187 Col: 2
    <body class="has-menubar " lang="en">
     ^
    Expecting: Void element tag

  The parser backtracked after:
    Error in Ln: 187 Col: 37
    <body class="has-menubar " lang="en">
                                        ^
    Expecting: '/>'

  The parser backtracked after:
    Error in Ln: 192 Col: 5
        <div id="menubar-wrapper" class="header-rework-2018 ">
        ^
    Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</body' or '<?'

    The parser backtracked after:
      Error in Ln: 192 Col: 6
          <div id="menubar-wrapper" class="header-rework-2018 ">
           ^
      Expecting: Raw text element tag

    The parser backtracked after:
      Error in Ln: 192 Col: 6
          <div id="menubar-wrapper" class="header-rework-2018 ">
           ^
      Expecting: Void element tag

    The parser backtracked after:
      Error in Ln: 192 Col: 58
          <div id="menubar-wrapper" class="header-rework-2018 ">
                                                               ^
      Expecting: '/>'

    The parser backtracked after:
      Error in Ln: 232 Col: 1
      <div id="menubar-vm" class="menubar-outer loading" data-initial-values="{
      ^
      Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</div' or
      '<?'

      The parser backtracked after:
        Error in Ln: 232 Col: 2
        <div id="menubar-vm" class="menubar-outer loading" data-initial-values=
         ^
        Expecting: Raw text element tag

      The parser backtracked after:
        Error in Ln: 232 Col: 2
        <div id="menubar-vm" class="menubar-outer loading" data-initial-values=
         ^
        Expecting: Void element tag

      The parser backtracked after:
        Error in Ln: 232 Col: 535
        rce_no_control&quot;:false,&quot;page_path&quot;:&quot;/search&quot;}">
                                                                              ^
        Expecting: '/>'

      The parser backtracked after:
        Error in Ln: 233 Col: 1
        <div id="menubar" class="menubar-2018 out  ">
        ^
        Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</div' or
        '<?'

        The parser backtracked after:
          Error in Ln: 233 Col: 2
          <div id="menubar" class="menubar-2018 out  ">
           ^
          Expecting: Raw text element tag

        The parser backtracked after:
          Error in Ln: 233 Col: 2
          <div id="menubar" class="menubar-2018 out  ">
           ^
          Expecting: Void element tag

        The parser backtracked after:
          Error in Ln: 233 Col: 45
          <div id="menubar" class="menubar-2018 out  ">
                                                      ^
          Expecting: '/>'

        The parser backtracked after:
          Error in Ln: 234 Col: 1
          <ul id="site-nav" class="menubar-section horizontal">
          ^
          Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</div' or
          '<?'

          The parser backtracked after:
            Error in Ln: 234 Col: 2
            <ul id="site-nav" class="menubar-section horizontal">
             ^
            Expecting: Raw text element tag

          The parser backtracked after:
            Error in Ln: 234 Col: 2
            <ul id="site-nav" class="menubar-section horizontal">
             ^
            Expecting: Void element tag

          The parser backtracked after:
            Error in Ln: 234 Col: 53
            <ul id="site-nav" class="menubar-section horizontal">
                                                                ^
            Expecting: '/>'

          The parser backtracked after:
            Error in Ln: 237 Col: 5
                <li class="bclogo white"><a href="https://bandcamp.com/?from=me
                ^
            Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</ul'
            or '<?'

            The parser backtracked after:
              Error in Ln: 237 Col: 6
                  <li class="bclogo white"><a href="https://bandcamp.com/?from=
                   ^
              Expecting: Raw text element tag

            The parser backtracked after:
              Error in Ln: 237 Col: 6
                  <li class="bclogo white"><a href="https://bandcamp.com/?from=
                   ^
              Expecting: Void element tag

            The parser backtracked after:
              Error in Ln: 237 Col: 29
                  <li class="bclogo white"><a href="https://bandcamp.com/?from=
                                          ^
              Expecting: '/>'

            The parser backtracked after:
              Error in Ln: 237 Col: 30
                  <li class="bclogo white"><a href="https://bandcamp.com/?from=
                                           ^
              Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[', '</li'
              or '<?'

              The parser backtracked after:
                Error in Ln: 237 Col: 31
                    <li class="bclogo white"><a href="https://bandcamp.com/?fro
                                              ^
                Expecting: Raw text element tag

              The parser backtracked after:
                Error in Ln: 237 Col: 31
                    <li class="bclogo white"><a href="https://bandcamp.com/?fro
                                              ^
                Expecting: Void element tag

              The parser backtracked after:
                Error in Ln: 237 Col: 90
                /?from=menubar_logo_logged_out"><svg width="108px" height="17px
                                               ^
                Expecting: '/>'

              The parser backtracked after:
                Error in Ln: 237 Col: 91
                ?from=menubar_logo_logged_out"><svg width="108px" height="17px"
                                               ^
                Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[',
                '</a' or '<?'

                The parser backtracked after:
                  Error in Ln: 237 Col: 92
                  rom=menubar_logo_logged_out"><svg width="108px" height="17px"
                                                ^
                  Expecting: Raw text element tag

                The parser backtracked after:
                  Error in Ln: 237 Col: 92
                  rom=menubar_logo_logged_out"><svg width="108px" height="17px"
                                                ^
                  Expecting: Void element tag

                The parser backtracked after:
                  Error in Ln: 237 Col: 144
                  ht="17px" viewBox="0 0 127 20"><use xlink:href="#bandcamp-log
                                                ^
                  Expecting: '/>'

                The parser backtracked after:
                  Error in Ln: 237 Col: 145
                  t="17px" viewBox="0 0 127 20"><use xlink:href="#bandcamp-logo
                                                ^
                  Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[',
                  '</svg' or '<?'

                  The parser backtracked after:
                    Error in Ln: 237 Col: 146
                    "17px" viewBox="0 0 127 20"><use xlink:href="#bandcamp-logo
                                                 ^
                    Expecting: Raw text element tag

                  The parser backtracked after:
                    Error in Ln: 237 Col: 146
                    "17px" viewBox="0 0 127 20"><use xlink:href="#bandcamp-logo
                                                 ^
                    Expecting: Void element tag

                  The parser backtracked after:
                    Error in Ln: 237 Col: 189
                    use xlink:href="#bandcamp-logo-color-white"></svg></a></li>
                                                               ^
                    Expecting: '/>'

                  The parser backtracked after:
                    Error in Ln: 237 Col: 190
                    use xlink:href="#bandcamp-logo-color-white"></svg></a></li>
                                                                ^
                    Expecting: any char not in ‘<’, '<!', '<!--', '<![CDATA[',
                    '</use' or '<?'

                    The parser backtracked after:
                      Error in Ln: 237 Col: 191
                      e xlink:href="#bandcamp-logo-color-white"></svg></a></li>
                                                                 ^
                      Expecting: Raw text element tag

                    The parser backtracked after:
                      Error in Ln: 237 Col: 191
                      e xlink:href="#bandcamp-logo-color-white"></svg></a></li>
                                                                 ^
                      Expecting: Void element tag

                    The parser backtracked after:
                      Error in Ln: 237 Col: 191
                      e xlink:href="#bandcamp-logo-color-white"></svg></a></li>
                                                                 ^
                      Expecting: decimal digit, letter or '-'

Source is here: view-source:https://bandcamp.com/search?q=Deep%20House
Code is like so:

        public async ValueTask ScrapeHtmlAsync(string query) {
            using var requestMessage = new HttpRequestMessage(HttpMethod.Get, query);
            using var responseMessage = await _httpClient.SendAsync(requestMessage)
                .ConfigureAwait(false);
            using var content = responseMessage.Content;
            var rawHtml = await content.ReadAsStringAsync()
                .ConfigureAwait(false);

            var documentResult = Html.TryParseDocument(rawHtml);
            if (documentResult.IsError)
                return;

            var document = documentResult.ResultValue;
            using var resultItems = document.GetElementsByClassName("result-items").GetEnumerator();
            while (resultItems.MoveNext()) {
                var item = resultItems.Current;
                
            }
        }

Recursive child selectors not working

var url = "https://finance.yahoo.com/calendar/earnings?day=2019-02-04&offset=0&size=100";
var html = new WebClient().DownloadString(url);
var doc = HtmlParser.Default.ParseDocument(html);
var pagingSpan = doc.GetElementsBySelector(@"#fin-cal-table > h3 > span.Mstart\(15px\).Fw\(500\).Fz\(s\) > span").FirstOrDefault();

The problem is that this selects span #1 shown in the image below instead of span #2.

image
