xeno's Introduction

xeno

A fast event-based XML parser.

Blog post.

Features

  • SAX-style/fold parser which triggers events for open/close tags, attributes, text, etc.
  • Low memory use (see memory benchmarks below).
  • Very fast (see speed benchmarks below).
  • It cheats like Hexml does (it doesn't expand entities or implement most of the XML standard).
  • Written in pure Haskell.
  • CDATA is supported as of version 0.2.

Please see the bottom of this file for guidelines on contributing to this library.

Performance goals

The hexml Haskell library uses an XML parser written in C, so that is the baseline we're trying to beat or match roughly.

The Xeno.SAX module is faster than Hexml for simply walking the document. Hexml actually does more work, allocating a DOM. Xeno.DOM is slightly slower or faster than Hexml depending on the document, although it is 2x slower on a 211KB document.

Memory benchmarks for Xeno:

Case                Bytes  GCs  Check
4kb/xeno/sax        2,376    0  OK
31kb/xeno/sax       1,824    0  OK
211kb/xeno/sax     56,832    0  OK
4kb/xeno/dom       11,360    0  OK
31kb/xeno/dom      10,352    0  OK
211kb/xeno/dom  1,082,816    0  OK

I memory benchmarked Hexml, but most of its allocation happens in C, which GHC doesn't track. So the data wasn't useful to compare.

Speed benchmarks:

benchmarking 4KB/hexml/dom
time                 6.317 μs   (6.279 μs .. 6.354 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.333 μs   (6.307 μs .. 6.362 μs)
std dev              97.15 ns   (77.15 ns .. 125.3 ns)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 4KB/xeno/sax
time                 5.152 μs   (5.131 μs .. 5.179 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.139 μs   (5.128 μs .. 5.161 μs)
std dev              58.02 ns   (41.25 ns .. 85.41 ns)

benchmarking 4KB/xeno/dom
time                 10.93 μs   (10.83 μs .. 11.14 μs)
                     0.994 R²   (0.983 R² .. 0.999 R²)
mean                 11.35 μs   (11.12 μs .. 11.91 μs)
std dev              1.188 μs   (458.7 ns .. 2.148 μs)
variance introduced by outliers: 87% (severely inflated)

benchmarking 31KB/hexml/dom
time                 9.405 μs   (9.348 μs .. 9.480 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 9.745 μs   (9.599 μs .. 10.06 μs)
std dev              745.3 ns   (598.6 ns .. 902.4 ns)
variance introduced by outliers: 78% (severely inflated)

benchmarking 31KB/xeno/sax
time                 2.736 μs   (2.723 μs .. 2.753 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.757 μs   (2.742 μs .. 2.791 μs)
std dev              76.93 ns   (43.62 ns .. 136.1 ns)
variance introduced by outliers: 35% (moderately inflated)

benchmarking 31KB/xeno/dom
time                 5.767 μs   (5.735 μs .. 5.814 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 5.759 μs   (5.728 μs .. 5.810 μs)
std dev              127.3 ns   (79.02 ns .. 177.2 ns)
variance introduced by outliers: 24% (moderately inflated)

benchmarking 211KB/hexml/dom
time                 260.3 μs   (259.8 μs .. 260.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 259.9 μs   (259.7 μs .. 260.3 μs)
std dev              959.9 ns   (821.8 ns .. 1.178 μs)

benchmarking 211KB/xeno/sax
time                 249.2 μs   (248.5 μs .. 250.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 251.5 μs   (250.6 μs .. 253.0 μs)
std dev              3.944 μs   (3.032 μs .. 5.345 μs)

benchmarking 211KB/xeno/dom
time                 543.1 μs   (539.4 μs .. 547.0 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 550.0 μs   (545.3 μs .. 553.6 μs)
std dev              14.39 μs   (12.45 μs .. 17.12 μs)
variance introduced by outliers: 17% (moderately inflated)

DOM Example

Easy as running the parse function from Xeno.DOM (with OverloadedStrings enabled, so that the string literal is read as a ByteString):

> parse "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"
Right
  (Node
     "p"
     [("key", "val"), ("x", "foo"), ("k", "")]
     [ Element (Node "a" [] [Element (Node "hr" [] []), Text "hi"])
     , Element (Node "b" [] [Text "sup"])
     , Text "hi"
     ])

SAX Example

Quickly dumping XML:

> let input = "Text<tag prop='value'>Hello, World!</tag><x><y prop=\"x\">Content!</y></x>Trailing."
> dump input
"Text"
<tag prop="value">
  "Hello, World!"
</tag>
<x>
  <y prop="x">
    "Content!"
  </y>
</x>
"Trailing."

Folding over XML:

> fold const (\m _ _ -> m + 1) const const const const 0 input -- Count attributes.
Right 2
> fold (\m _ -> m + 1) (\m _ _ -> m) const const const const 0 input -- Count elements.
Right 3

Most general XML processor:

process
  :: Monad m
  => (ByteString -> m ())               -- ^ Open tag.
  -> (ByteString -> ByteString -> m ()) -- ^ Tag attribute.
  -> (ByteString -> m ())               -- ^ End open tag.
  -> (ByteString -> m ())               -- ^ Text.
  -> (ByteString -> m ())               -- ^ Close tag.
  -> ByteString                         -- ^ Input string.
  -> m ()

You can use any monad you want: IO, State, etc. For example, fold is implemented like this:

fold openF attrF endOpenF textF closeF s str =
  execState
    (process
       (\name -> modify (\s' -> openF s' name))
       (\key value -> modify (\s' -> attrF s' key value))
       (\name -> modify (\s' -> endOpenF s' name))
       (\text -> modify (\s' -> textF s' text))
       (\name -> modify (\s' -> closeF s' name))
       str)
    s

The process function is marked as INLINE, which means its use sites will be inlined, and your particular monad's type can potentially be erased, which is great for performance.

Contributors

See CONTRIBUTORS.md

Contribution guidelines

All contributions and bug fixes are welcome and will be credited appropriately, as long as they are aligned with the goals of this library: speed and memory efficiency. In practical terms, patches and additional features should not introduce significant performance regressions.

xeno's People

Contributors

0xd34df00d, adamse, avieth, chrisdone, dmalkr, erikd, jappeace, jed-leggett, mgajda, mitchellwrosen, ocramz, pauljohnson, phadej, pkamenarsky, qrilka, rembane, teofilc, thielema, unhammer

xeno's Issues

Use on files too large to fit in memory

From what I can tell from reading the (short!) source, there doesn't seem to be a way to use this on files too large to fit in memory. In particular, the ByteStrings are strict, and if one simply tries splitting the input and sending in the parts, there's a possibility of getting errors like

Left (XenoParseError "Couldn't find the matching quote character.")

I'd rather not have Lazy ByteStrings because reasons, though that would probably be the simplest solution.

An alternative would be to add an eofF :: (Int -> m ()) to process, pass it the index of the last fully processed element, and call it wherever we now call throw. That would allow sending in arbitrarily sized ByteStrings and stitching them together using the eofF function. But most of the throws happen in s_index, and making that return a Maybe seems like it would hurt performance.

Or is it possible to make indexEndOfLastMatch an argument of the XenoExceptions, and catch the throws without losing the monad return value?

Segmentation fault: runtime crash in xeno

  • We have an application that makes a lot of asynchronous HTTP requests to an external system, with a custom response timeout and strict non-functional requirements.
  • The external system responds with XML bytestrings, and we used xeno to parse them.
  • Under a constant 500 rps, the application started to crash unexpectedly with a segmentation fault after 1-5 minutes of uptime.
  • While bisecting, we found that the root cause lies somewhere under the hood of the xeno SAX parser.
  • After switching to tagsoup the problem went away and the application stopped crashing.
  • I do realise that the xeno parser should be used with care and that different trade-offs have to be considered.

Here are some details:

  • GHC: 8.4.3, 8.4.4.
  • xeno: 0.4.2.
  • lldb output (looks like GC is going wild):
    (lldb) thread backtrace 22
    * thread #22, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
      * frame #0: 0x000000010cc2cd91 some-application`LOOKS_LIKE_INFO_PTR_NOT_NULL(p=12297829382473034410) at ClosureMacros.h:260:23
        frame #1: 0x000000010cc2baa6 some-application`LOOKS_LIKE_INFO_PTR(p=12297829382473034410) at ClosureMacros.h:265:42
        frame #2: 0x000000010cc2aa7d some-application`LOOKS_LIKE_CLOSURE_PTR(p=0x00000042095c41e0) at ClosureMacros.h:270:12
        frame #3: 0x000000010cc2b05a some-application`checkClosure(p=0x0000004200209000) at Sanity.c:344:9
        frame #4: 0x000000010cc2b9b7 some-application`checkHeapChain(bd=0x0000004200200240) at Sanity.c:473:33
        frame #5: 0x000000010cc2ce40 some-application`checkGeneration(gen=0x00007fee41704250, after_major_gc=true) at Sanity.c:769:5
        frame #6: 0x000000010cc2c0b3 some-application`checkFullHeap(after_major_gc=true) at Sanity.c:788:9
        frame #7: 0x000000010cc2c03d some-application`checkSanity(after_gc=true, major_gc=true) at Sanity.c:797:5
        frame #8: 0x000000010cc256e4 some-application`GarbageCollect(collect_gen=1, do_heap_census=false, gc_type=2, cap=0x00007fee4180a600, idle_cap=0x00007fee4140bf80) at GC.c:769:3
        frame #9: 0x000000010cc0c77f some-application`scheduleDoGC(pcap=0x00007000053c6f48, task=0x00007fee41712000, force_major=true) at Schedule.c:1804:5
        frame #10: 0x000000010cc0dd59 some-application`scheduleDetectDeadlock(pcap=0x00007000053c6f48, task=0x00007fee41712000) at Schedule.c:924:9
        frame #11: 0x000000010cc0b1d2 some-application`schedule(initialCapability=0x00007fee4180a600, task=0x00007fee41712000) at Schedule.c:281:5
        frame #12: 0x000000010cc0bbdd some-application`scheduleWorker(cap=0x00007fee4180a600, task=0x00007fee41712000) at Schedule.c:2564:11
        frame #13: 0x000000010cc1487a some-application`workerStart(task=0x00007fee41712000) at Task.c:445:5
        frame #14: 0x00007fff70081109 libsystem_pthread.dylib`_pthread_start + 148
        frame #15: 0x00007fff7007cb8b libsystem_pthread.dylib`thread_start + 15
    (lldb) disassemble
    some-application`LOOKS_LIKE_INFO_PTR_NOT_NULL:
        0x10cc2cd70 <+0>:  pushq  %rbp
        0x10cc2cd71 <+1>:  movq   %rsp, %rbp
        0x10cc2cd74 <+4>:  subq   $0x20, %rsp
        0x10cc2cd78 <+8>:  movq   %rdi, -0x8(%rbp)
        0x10cc2cd7c <+12>: movq   -0x8(%rbp), %rdi
        0x10cc2cd80 <+16>: callq  0x10cbf7760               ; INFO_PTR_TO_STRUCT at ClosureMacros.h:59
        0x10cc2cd85 <+21>: xorl   %ecx, %ecx
        0x10cc2cd87 <+23>: movb   %cl, %dl
        0x10cc2cd89 <+25>: movq   %rax, -0x10(%rbp)
        0x10cc2cd8d <+29>: movq   -0x10(%rbp), %rax
    ->  0x10cc2cd91 <+33>: cmpl   $0x0, 0x8(%rax)
        0x10cc2cd95 <+37>: movb   %dl, -0x11(%rbp)
        0x10cc2cd98 <+40>: je     0x10cc2cda8               ; <+56> at ClosureMacros.h
        0x10cc2cd9a <+42>: movq   -0x10(%rbp), %rax
        0x10cc2cd9e <+46>: cmpl   $0x40, 0x8(%rax)
        0x10cc2cda2 <+50>: setb   %cl
        0x10cc2cda5 <+53>: movb   %cl, -0x11(%rbp)
        0x10cc2cda8 <+56>: movb   -0x11(%rbp), %al
        0x10cc2cdab <+59>: andb   $0x1, %al
        0x10cc2cdad <+61>: movzbl %al, %eax
        0x10cc2cdb0 <+64>: addq   $0x20, %rsp
        0x10cc2cdb4 <+68>: popq   %rbp
        0x10cc2cdb5 <+69>: retq
        0x10cc2cdb6 <+70>: nopw   %cs:(%rax,%rax)
    

I guess, some unsafe functions are producing this unexpected effect, e.g.

First example in README doesn't work

DOM Example

Easy as running the parse function:

> stack ghci
λ> parse "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"

<interactive>:1:1: error:
    Variable not in scope: parse :: [Char] -> t
λ>
λ>
λ> import Xeno.DOM
λ> parse "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"

<interactive>:3:7: error:
    • Couldn't match expected type ‘Data.ByteString.Internal.ByteString’
                  with actual type ‘[Char]’
    • In the first argument of ‘parse’, namely
        ‘"<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"’
      In the expression:
        parse
          "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"
      In an equation for ‘it’:
          it
            = parse
                "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"

Dots in attribute names

To my knowledge having dots in attribute names is valid, but it doesn't seem to match xeno's expectations.

> import qualified Xeno.DOM                                                                                                   
> :set -XOverloadedStrings                                                                                           
> Xeno.DOM.parse "<foo bar.baz=\"qux\"></foo>"                                                                       
Left (XenoParseError "Expected =, got: . at character index: 8")

"xeno-bench" should be a benchmark

The "xeno-bench" target in the Cabal file is listed as an executable, but it should be a benchmark.

Amongst other things this means that Stack believes that "bytestring-mmap" is a required dependency, which causes the build to fail on Windows.

test suite failure: data/books-4kb.xml: openBinaryFile: does not exist

As seen on the stackage build server. This is presumably caused by the test xml file not being included in the tarball uploaded to hackage.

Test suite failure for package xeno-0.4
    xeno-test:  exited with: ExitFailure 1
Full log available at /var/stackage/work/unpack-dir/.stack-work/logs/xeno-0.4$
test.log


    Right (Node "a" [("attr","a")] [Element (Node "img" [] [])])

    Xeno.DOM tests
      test 1 FAILED [1]
    Xeno.DOM tests
      DOM from bytestring substring
      Leading whitespace characters are accepted by parse
      children test
      attributes
      xml prologue test
    hexml tests
      "<test id='bob'>here<extra/>there</test>"
      "<test /><close />"
      "<test /><!-- comment > --><close />"
      "<test id=\"bob value\" another-attr=\"test with <\">here </test> more $
ext at the end<close />"
      "<test></more>"
      "<test"
      "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>"
      "<some-example/>"
      "<a numeric1=\"attribute\"/>"
      "<also.a.dot></also.a.dot>"
      "<test><![CDATA[Oneliner CDATA.]]></test>"
      "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></t$
st>"
      "<test><![CDATA[A lonely ], sad isn't it?]]></test>"
      namespaces
    robust XML tests
      DOM from bytestring substring
      Leading whitespace characters are accepted by parse
      children test
      attributes
      xml prologue test
      html doctype test
      hexml tests
        "<test id='bob'>here<extra/>there</test>"
        "<test /><close />"
        "<test /><!-- comment > --><close />"
        "<test id=\"bob value\" another-attr=\"test with <\">here </test> mor$
 text at the end<close />"
        "<test></more>"
        "<test"
        "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>"
        "<some-example/>"
        "<a numeric1=\"attribute\"/>"
        "<also.a.dot></also.a.dot>"
        "<test><![CDATA[Oneliner CDATA.]]></test>"
        "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></
test>"
        "<test><![CDATA[A lonely ], sad isn't it?]]></test>"
        namespaces
      recovers unclosed tag
      ignores too many closing tags
    skipDoctype
      strips initial doctype declaration
      strips doctype after spaces
      does not strip anything after or inside element

    Failures:

      test/Main.hs:25:5:
      1) Xeno.DOM tests test 1
           uncaught exception: IOException of type NoSuchThing
           data/books-4kb.xml: openBinaryFile: does not exist (No such file or
 directory)

      To rerun use: --match "/Xeno.DOM tests/test 1/"

    Randomized with seed 1307985275

    Finished in 0.0129 seconds
    45 examples, 1 failure

Doctype after <?xml> or with internal subsets results in parse errors

Currently, Xeno.DOM.Robust does not handle XML doctypes properly.

Doctypes are removed if they appear at the start of the document; however, doctypes are usually placed after the XML declaration: <?xml ...><!DOCTYPE html>.

For example, this test fails:

describe "skipDoctype" $ do
  it "strips doctype after xml declaration" $ do
    skipDoctype "<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

One thought is that skipDoctype should check whether either of the first two < characters is followed by !DOCTYPE and then remove the matching node.
I don't think supporting a doctype at the end of a document is worth bothering with.

On top of that, skipDoctype also does not handle doctypes with internal subsets such as

<!DOCTYPE html [
  <!-- an internal subset can be embedded here -->
]>

Appropriate test:

describe "skipDoctype" $ do
  it "strips doctype with internal subsets" $ do
    skipDoctype "<!DOCTYPE html [ <!-- --> ]><?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

In this case, skipDoctype will return a ByteString which starts with ]>.
Ideally, skipDoctype should drop until [ or >, and if a [ was matched, then continue to drop until ]>.
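
The two behaviours asked for above (a doctype after the XML declaration, and a doctype with an internal subset) could be sketched roughly as follows. This is not xeno's actual skipDoctype: dropDoctypeSketch and stripDoctypeSketch are hypothetical names, the match on <!DOCTYPE is case-sensitive, and a ]> inside a quoted string or comment within the subset would end the scan too early.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as S8
import Data.Char (isSpace)

-- | Hypothetical helper: if the input starts with "<!DOCTYPE", drop the
-- whole declaration, including an optional internal subset in square
-- brackets. Any other input is returned unchanged.
dropDoctypeSketch :: ByteString -> ByteString
dropDoctypeSketch bs
  | "<!DOCTYPE" `S8.isPrefixOf` bs =
      -- Drop until '[' or '>', whichever comes first.
      case S8.uncons (S8.dropWhile (\c -> c /= '[' && c /= '>') bs) of
        Just ('>', after)  -> after              -- plain doctype
        Just ('[', subset) -> afterSubset subset -- internal subset
        _                  -> bs                 -- unterminated: leave as is
  | otherwise = bs
  where
    -- Drop everything up to and including the closing "]>".
    afterSubset s =
      let (_, end) = S8.breakSubstring "]>" s
      in if S8.null end then bs else S8.drop 2 end

-- | Hypothetical wrapper: keep a leading XML declaration (and any
-- whitespace after it) and strip a doctype that follows it.
stripDoctypeSketch :: ByteString -> ByteString
stripDoctypeSketch bs
  | "<?xml" `S8.isPrefixOf` bs =
      let (decl, rest) = S8.breakSubstring "?>" bs
      in if S8.null rest
           then bs -- declaration never closes: leave as is
           else
             let (ws, afterWs) = S8.span isSpace (S8.drop 2 rest)
             in decl <> "?>" <> ws <> dropDoctypeSketch afterWs
  | otherwise = dropDoctypeSketch bs
```

The sketch rebuilds the prefix with <>, which copies bytes; an implementation that cares about allocation would more likely work with offsets into the original ByteString.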

Invalid parsing/validation

As I understand it, the idea of this lib is to more or less follow hexml, so I was using hexml as a base for comparison.
Some inconsistencies I found:

  1. Unclosed tags are accepted by validate:
λ> isRight $ Text.XML.Hexml.parse "<b>"
False
λ> Xeno.SAX.validate "<b>"
True
  2. Leading whitespace characters are not accepted by Xeno.DOM:
λ> isRight $ Xeno.DOM.parse "\n<a></a>"
False
λ> isRight $ Text.XML.Hexml.parse "\n<a></a>"
True
  3. The XML prologue is accepted by validate but not by Xeno.DOM.parse:
λ> isRight $ Text.XML.Hexml.parse "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>"
True
λ> isRight $ Xeno.DOM.parse "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>"
False

travis-ci.org is deprecated

See this message on travis-ci.org:

Since June 15th, 2021, the building on travis-ci.org is ceased. Please use travis-ci.com from now on.

I suspect only @ocramz can update the CI integration.

Digits in element names

Element names containing digits, e.g. something like <sha1>p9przoiizhx3wz7d46d5erk4wj20j1q</sha1>, cannot be parsed, although they are legal.
@AlexeyRaga has a fix for this (no pull request). Should this be merged?
Update:
The simple fix is to slightly change isNameChar, but that makes the parser too permissive: e.g. the illegal <1ab>...</1ab> would be accepted.
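
One way to accept digits without accepting <1ab> is to test the first character separately from the rest, as the XML specification does with its NameStartChar and NameChar productions. A rough sketch, restricted to the ASCII subset of those productions and written over Char for readability (xeno itself works on bytes); all three function names are hypothetical and not part of xeno:

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, isDigit)

-- | ASCII subset of XML's NameStartChar: letters, '_' and ':'.
isNameStartCharSketch :: Char -> Bool
isNameStartCharSketch c =
  isAsciiUpper c || isAsciiLower c || c == '_' || c == ':'

-- | ASCII subset of XML's NameChar: any start character, plus digits,
-- '-' and '.'.
isNameCharSketch :: Char -> Bool
isNameCharSketch c =
  isNameStartCharSketch c || isDigit c || c == '-' || c == '.'

-- | A name is one start character followed by zero or more name characters.
isValidNameSketch :: String -> Bool
isValidNameSketch []       = False
isValidNameSketch (c : cs) = isNameStartCharSketch c && all isNameCharSketch cs
```

With this split, sha1 is accepted and 1ab is rejected, at the cost of one extra check on the first byte of each name.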

Build failure with mtl-2.3

Building library for xeno-0.5..
[1 of 7] Compiling Control.Spork    ( src/Control/Spork.hs, dist/build/Control/Spork.o, dist/build/Control/Spork.dyn_o )
[2 of 7] Compiling Xeno.DOM.Internal ( src/Xeno/DOM/Internal.hs, dist/build/Xeno/DOM/Internal.o, dist/build/Xeno/DOM/Internal.dyn_o )
[3 of 7] Compiling Xeno.Types       ( src/Xeno/Types.hs, dist/build/Xeno/Types.o, dist/build/Xeno/Types.dyn_o )

src/Xeno/Types.hs:18:1: warning: [-Wunused-imports]
    The import of ‘Control.Monad.Fail’ is redundant
      except perhaps to import instances from ‘Control.Monad.Fail’
    To import instances alone, use: import Control.Monad.Fail()
   |
18 | import Control.Monad.Fail
   | ^^^^^^^^^^^^^^^^^^^^^^^^^
[4 of 7] Compiling Xeno.SAX         ( src/Xeno/SAX.hs, dist/build/Xeno/SAX.o, dist/build/Xeno/SAX.dyn_o )

src/Xeno/SAX.hs:217:20: error:
    Variable not in scope: unless :: Bool -> m () -> m b
    |
217 |         Nothing -> unless (S.null text) (textF text)
    |                    ^^^^^^

src/Xeno/SAX.hs:220:11: error:
    Variable not in scope: unless :: Bool -> m () -> m a0
    |
220 |           unless (S.null text) (textF text)
    |           ^^^^^^
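
The likely cause is that mtl-2.3 no longer re-exports Control.Monad from its modules, so unless has to be imported explicitly wherever it is used. A minimal sketch of the pattern; emitNonEmpty is a hypothetical stand-in for the failing call sites (using String instead of ByteString), not code from Xeno.SAX:

```haskell
-- Import unless directly instead of relying on a re-export from mtl.
import Control.Monad (unless)

-- | Run the text callback only when there is some text to report,
-- mirroring the shape of the two failing expressions above.
emitNonEmpty :: Monad m => (String -> m ()) -> String -> m ()
emitNonEmpty textF text = unless (null text) (textF text)
```

Since Control.Monad is part of base, this import compiles against both older and newer mtl versions.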

Problems with GHC 9.2

I was trying to check my package xlsx with GHC 9.2 and found it failing at https://github.com/qrilka/xlsx/blob/master/test/Main.hs#L57 which uses xeno-based fast parsing. A quick look shows some odd results from xeno.
I tried xeno with GHC 9.2.1 and the following stack.yaml:

packages:
- .
extra-deps:
- git: https://github.com/chrisdone/hexml.git
  commit: 22923b55ca7390f1973ddfd0d0e517f13cf8a8a4
- bytestring-mmap-0.2.2@sha256:d04e6bc5a158dd292757d3b9b032beb8a2c43e768777d64a8289abf89f612f67
- hexpat-0.20.13
- List-0.6.2
- libxml-0.1.1
- mutable-containers-0.3.4@sha256:aad13ec7e9686725fbbae6af1852055cc2d73f00109b5a313bb007ac74f5ecc2,2303
- vector-0.12.3.1@sha256:040210919e5ce454dcee3320f77803da3dbda579c8428dc25ff0155732234808,7946
- mono-traversable-1.0.15.3@sha256:059bf3c05cdbef2d06b765333fe41c2168ced2503a23de674e2a59ceb2548c48,2060
- primitive-0.7.3.0@sha256:6b28a1c0572f5ca50597ba8388aeade21515842969ae192cdc6bfca81367bf56,2951
- hashable-1.4.0.1@sha256:0251ad5228be07909a385152a3ff634fac1f892611f89904aea5c5e7af411e5d,4790
- split-0.2.3.4@sha256:a6df9c3e806ee7cb50bc980a183fc1156f35022a39430dabac0bf9456fe18a4b,2647
- unordered-containers-0.2.15.0@sha256:7e84950317c31e9c33f11e6338c1dcc9be0141ff696cf5249985c4625b9e144a,5217
- vector-algorithms-0.8.0.4@sha256:bf4760b23a0fee09abb8c9e3c952c870f5dc9780876e9d7e38ab2bdd98c8f283,3752
- extra-1.7.10@sha256:e384751317577554f873812358fab022da02aa9a286c9341308fac83f4d766c5,2691
- hspec-2.9.1@sha256:5daa1483240c194fdcf3c6ba446b56462dbf1f52d525339acea24e24f242065a,1709
- QuickCheck-2.14.2@sha256:4ce29211223d5e6620ebceba34a3ca9ccf1c10c0cf387d48aea45599222ee5aa,7736
- clock-0.8.2@sha256:473ffd59765cc67634bdc55b63c699a85addf3a024089073ec2a862881e83e2a,4313
- hspec-core-2.9.1@sha256:8241004297d18ceb44ee8d3a3c33222931bec926f55690ee52cca7a7da331e8b,5575
- hspec-discover-2.9.1@sha256:c16e37a84166bacf853f14e2acef098535cac655693b76523278381ad313f5a7,2157
- hspec-expectations-0.8.2@sha256:e2db24881baadc2d9d23b03cb629e80dcbda89a6b04ace9adb5f4d02ef8b31aa,1594
- HUnit-1.6.2.0@sha256:1a79174e8af616117ad39464cac9de205ca923da6582825e97c10786fda933a4,1588
- ansi-terminal-0.11@sha256:97470250c92aae14c4c810d7f664c532995ba8910e2ad797b29f22ad0d2d0194,3307
- call-stack-0.4.0@sha256:ac44d2c00931dc20b01750da8c92ec443eb63a7231e8550188cb2ac2385f7feb,1200
- quickcheck-io-0.2.0@sha256:7bf0b68fb90873825eb2e5e958c1b76126dcf984debb998e81673e6d837e0b2d,1133
- random-1.2.1@sha256:8bee24dc0c985a90ee78d94c61f8aed21c49633686f0f1c14c5078d818ee43a2,6598
- setenv-0.1.1.3@sha256:c5916ac0d2a828473cd171261328a290afe0abd799db1ac8c310682fe778c45b,1053
- splitmix-0.1.0.4@sha256:714a55fd28d3e2533bd5b49e74f604ef8e5d7b06f249c8816f6c54aed431dcf1,6483
- tf-random-0.5@sha256:14012837d0f0e18fdbbe3d56e67da8622ee5e20b180abce952dd50bd9f36b326,3983
- colour-2.3.6@sha256:ebdcbf15023958838a527e381ab3c3b1e99ed12d1b25efeb7feaa4ad8c37664a,2378
resolver: ghc-9.2.1

I get a failure

Xeno.DOM tests
  test 1 [✘]
Xeno.DOM tests
  DOM from bytestring substring [✘]
  Leading whitespace characters are accepted by parse [✔]
  children test [✘]
  attributes [✘]
  xml prologue test [✘]
hexml tests
  "<test id='bob'>here<extra/>there</test>" [✔]
  "<test /><close />" [✔]
  "<test /><!-- comment > --><close />" [✔]
  "<test id=\"bob value\" another-attr=\"test with <\">here </test> more text at the end<close />" [✔]
  "<test></more>" [✔]
  "<test" [✔]
  "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>" [✔]
  "<some-example/>" [✔]
  "<a numeric1=\"attribute\"/>" [✔]
  "<also.a.dot></also.a.dot>" [✔]
  "<test><![CDATA[Oneliner CDATA.]]></test>" [✘]
  "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>" [✘]
  "<test><![CDATA[A lonely ], sad isn't it?]]></test>" [✘]
  namespaces [✔]
robust XML tests
  DOM from bytestring substring [✘]
  Leading whitespace characters are accepted by parse [✔]
  children test [✘]
  attributes [✘]
  xml prologue test [✘]
  html doctype test [✘]
  hexml tests
    "<test id='bob'>here<extra/>there</test>" [✔]
    "<test /><close />" [✔]
    "<test /><!-- comment > --><close />" [✔]
    "<test id=\"bob value\" another-attr=\"test with <\">here </test> more text at the end<close />" [✔]
    "<test></more>" [✔]
    "<test" [✔]
    "<?xml version=\"1.1\"?>\n<greeting>Hello, world!</greeting>" [✔]
    "<some-example/>" [✔]
    "<a numeric1=\"attribute\"/>" [✔]
    "<also.a.dot></also.a.dot>" [✔]
    "<test><![CDATA[Oneliner CDATA.]]></test>" [✘]
    "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>" [✘]
    "<test><![CDATA[A lonely ], sad isn't it?]]></test>" [✘]
    namespaces [✔]
  recovers unclosed tag [ ]Right (Node "<" [("<a a","<")] [Element (Node "<a " [] [])])
  recovers unclosed tag [✘]
  ignores too many closing tags [✔]
skipDoctype
  strips initial doctype declaration [✔]
  strips doctype after spaces [✔]
  does not strip anything after or inside element [✔]

Failures:

  test/Main.hs:29:18: 
  1) Xeno.DOM tests test 1
       expected: "catalog"
        but got: "<?xml v"

  To rerun use: --match "/Xeno.DOM tests/test 1/"

  test/Main.hs:45:23: 
  2) Xeno.DOM tests DOM from bytestring substring
       expected: "valid"
        but got: "<vali"

  To rerun use: --match "/Xeno.DOM tests/DOM from bytestring substring/"

  test/Main.hs:55:44: 
  3) Xeno.DOM tests children test
       expected: ["test","test","b","test","test"]
        but got: ["<roo","<roo","<","<roo","<roo"]

  To rerun use: --match "/Xeno.DOM tests/children test/"

  test/Main.hs:58:53: 
  4) Xeno.DOM tests attributes
       expected: [("id","1"),("extra","2")]
        but got: [("<r","<"),("<root","<")]

  To rerun use: --match "/Xeno.DOM tests/attributes/"

  test/Main.hs:63:23: 
  5) Xeno.DOM tests xml prologue test
       expected: "greeting"
        but got: "<?xml ve"

  To rerun use: --match "/Xeno.DOM tests/xml prologue test/"

  test/Main.hs:76:36: 
  6) hexml tests "<test><![CDATA[Oneliner CDATA.]]></test>"
       expected: Right [CData "Oneliner CDATA."]
        but got: Right [CData "<test><![CDATA["]

  To rerun use: --match "/hexml tests/\"<test><![CDATA[Oneliner CDATA.]]></test>\"/"

  test/Main.hs:76:36: 
  7) hexml tests "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>"
       expected: Right [CData "<strong>This is strong but not XML tags.</strong>"]
        but got: Right [CData "<test><![CDATA[<strong>This is strong but not XML"]

  To rerun use: --match "/hexml tests/\"<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>\"/"

  test/Main.hs:76:36: 
  8) hexml tests "<test><![CDATA[A lonely ], sad isn't it?]]></test>"
       expected: Right [CData "A lonely ], sad isn't it?"]
        but got: Right [CData "<test><![CDATA[A lonely ]"]

  To rerun use: --match "/hexml tests/\"<test><![CDATA[A lonely ], sad isn't it?]]></test>\"/"

  test/Main.hs:89:25: 
  9) robust XML tests DOM from bytestring substring
       expected: "valid"
        but got: "<vali"

  To rerun use: --match "/robust XML tests/DOM from bytestring substring/"

  test/Main.hs:99:44: 
  10) robust XML tests children test
       expected: ["test","test","b","test","test"]
        but got: ["<roo","<roo","<","<roo","<roo"]

  To rerun use: --match "/robust XML tests/children test/"

  test/Main.hs:102:53: 
  11) robust XML tests attributes
       expected: [("id","1"),("extra","2")]
        but got: [("<r","<"),("<root","<")]

  To rerun use: --match "/robust XML tests/attributes/"

  test/Main.hs:107:23: 
  12) robust XML tests xml prologue test
       expected: "greeting"
        but got: "<?xml ve"

  To rerun use: --match "/robust XML tests/xml prologue test/"

  test/Main.hs:111:23: 
  13) robust XML tests html doctype test
       expected: "greeting"
        but got: "<greetin"

  To rerun use: --match "/robust XML tests/html doctype test/"

  test/Main.hs:119:38: 
  14) robust XML tests, hexml tests, "<test><![CDATA[Oneliner CDATA.]]></test>"
       expected: Right [CData "Oneliner CDATA."]
        but got: Right [CData "<test><![CDATA["]

  To rerun use: --match "/robust XML tests/hexml tests/\"<test><![CDATA[Oneliner CDATA.]]></test>\"/"

  test/Main.hs:119:38: 
  15) robust XML tests, hexml tests, "<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>"
       expected: Right [CData "<strong>This is strong but not XML tags.</strong>"]
        but got: Right [CData "<test><![CDATA[<strong>This is strong but not XML"]

  To rerun use: --match "/robust XML tests/hexml tests/\"<test><![CDATA[<strong>This is strong but not XML tags.</strong>]]></test>\"/"

  test/Main.hs:119:38: 
  16) robust XML tests, hexml tests, "<test><![CDATA[A lonely ], sad isn't it?]]></test>"
       expected: Right [CData "A lonely ], sad isn't it?"]
        but got: Right [CData "<test><![CDATA[A lonely ]"]

  To rerun use: --match "/robust XML tests/hexml tests/\"<test><![CDATA[A lonely ], sad isn't it?]]></test>\"/"

  test/Main.hs:131:34: 
  17) robust XML tests recovers unclosed tag
       expected: "a"
        but got: "<"

  To rerun use: --match "/robust XML tests/recovers unclosed tag/"

Randomized with seed 1893189390

Finished in 0.0063 seconds
45 examples, 17 failures

xeno                > Test suite xeno-test failed

Xeno.DOM: Heap exhausted on a 5.6M file

longlines.xml.zip
Running the file above through xeno-dom exhausts heap memory. I just put the file into the list in SpeedBigFiles.hs as
[ benchFile ["xeno-dom"] "6MB" "longlines.xml.bz2"
and got

benchmarking 6M/xeno-dom
xeno-speed-big-files-bench: Heap exhausted;
xeno-speed-big-files-bench: Current maximum heap size is 26843545600 bytes (25600 MB).

Strangely, even minor changes to the file (e.g. sed 's/x/xx/g', increasing the file size) will let it through with about 800M maxresident (as reported by /usr/bin/time). Inserting newlines after each > also gives 800M maxresident, but the problem doesn't seem to be related to the long lines, as almost any change to the file helps.

(Yes I should be using Xeno.SAX, but why does e.g. https://dumps.wikimedia.org/nowiki/20230520/nowiki-20230520-pages-articles-multistream-index.txt.bz2 at 11M go through fine with <400M maxresident and this one not? Even removing newlines, the wiki works fine. This feels like leakage.)

update stack.yaml

I think the format has changed with stack 2 because now travis complains with

$ stack $ARGS --no-terminal --install-ghc test --haddock

Could not parse '/home/travis/build/ocramz/xeno/stack.yaml':
Aeson exception:
Error in $.packages[1]: failed to parse field 'packages': expected Text, encountered Object

Merging #19 into master - a case study

I had this lingering question: how do we decide whether a PR introduces a significant performance regression? Here are my notes, using #19 as a case study (@unhammer might be interested, too).

On my work laptop, a 2015 MBP 15", 2.2 GHz i7 with 16GB of RAM, I get these figures for the xeno tests with the largest dataset:

master :

benchmarking 211KB/xeno-sax
time                 218.7 μs   (214.9 μs .. 221.8 μs)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 215.8 μs   (213.5 μs .. 218.1 μs)
std dev              7.276 μs   (6.088 μs .. 8.812 μs)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 211KB/xeno-dom
time                 535.4 μs   (525.7 μs .. 544.6 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 539.8 μs   (534.0 μs .. 547.0 μs)
std dev              21.60 μs   (19.07 μs .. 24.03 μs)
variance introduced by outliers: 33% (moderately inflated)

PR #19

benchmarking 211KB/xeno-sax
time                 225.9 μs   (223.6 μs .. 228.2 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 228.0 μs   (225.7 μs .. 231.7 μs)
std dev              9.275 μs   (6.884 μs .. 12.33 μs)
variance introduced by outliers: 38% (moderately inflated)

benchmarking 211KB/xeno-dom
time                 542.3 μs   (533.4 μs .. 551.1 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 551.6 μs   (544.6 μs .. 557.6 μs)
std dev              22.42 μs   (19.93 μs .. 24.23 μs)
variance introduced by outliers: 34% (moderately inflated)

The sample size is the criterion default (since these benchmarks are run with defaultMain): n = 1000.

Using the Z-test (which assumes the samples are approximately Gaussian) to assess whether the timing difference is significant, I get this result:

xeno-dom :

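-- sample size (criterion default, as stated above)
n = 1000

-- mean benchmark time before patch
mu0d = 539.8e-6

-- standard deviation of benchmark time before patch
sig0d = 21.6e-6

-- mean benchmark time after patch #19
mu1d = 551.6e-6

-- Z score: difference in means, in units of the standard error sig0d / sqrt n
z_d = sqrt n * (mu1d - mu0d) / sig0d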

z_d = 17.27, i.e. the mean benchmark time after the patch is more than 17 standard errors larger than before the patch. For the xeno-sax benchmarks I get a Z-score of > 57; the probability of values this large arising by chance (that is, the probability that a standard normal random variable exceeds Z) is extremely small, so we could say with some confidence that the patch introduces a regression.
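The calculation can also be wrapped in a small function so that the same check is easy to repeat for other benchmarks. This is a sketch only: `zScore` is a hypothetical helper name, the inputs are the mean and standard deviation figures quoted above, and n = 1000 is the sample size assumed in the text.

```haskell
module Main (main) where

-- | Z score for a shift in the mean: the difference between the new and the
-- baseline mean, divided by the standard error of the baseline mean
-- (sigma0 / sqrt n). Arguments: n, baseline mean, baseline std dev, new mean.
zScore :: Double -> Double -> Double -> Double -> Double
zScore n mu0 sigma0 mu1 = sqrt n * (mu1 - mu0) / sigma0

main :: IO ()
main = do
  let n = 1000  -- sample size as stated in the text (an assumption here)
  -- 211KB/xeno-dom: master vs. PR #19
  putStrLn ("xeno-dom z = " ++ show (zScore n 539.8e-6 21.60e-6 551.6e-6))
  -- 211KB/xeno-sax: master vs. PR #19
  putStrLn ("xeno-sax z = " ++ show (zScore n 215.8e-6 7.276e-6 228.0e-6))
```

Note that this uses only the baseline standard deviation, as in the calculation above; a two-sample test would also take the spread of the post-patch run into account.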

Any thoughts?

Update documentation to working examples

Hello,

If I run the example from the documentation: parse "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"

I get:

error:
    • Couldn't match expected type ‘Data.ByteString.Internal.ByteString’
                  with actual type ‘[Char]’
    • In the first argument of ‘parse’, namely
        ‘"<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"’
      In the expression:
        parse
          "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"
      In an equation for ‘it’:
          it
            = parse
                "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>"

Evidently the documentation uses String where a ByteString is in fact required.
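One way to make the documented example type-check is to enable the OverloadedStrings extension, so that the literal is read as a ByteString. The following is a minimal sketch, assuming `parse` and `name` as exported by Xeno.DOM; the second call shows the explicit conversion for a value that is already a String.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString.Char8 as S8
import qualified Xeno.DOM as DOM

main :: IO ()
main = do
  -- With OverloadedStrings the literal is typed as a strict ByteString,
  -- which is what parse expects.
  print (DOM.parse "<p key='val' x=\"foo\" k=\"\"><a><hr/>hi</a><b>sup</b>hi</p>")

  -- For an existing String, convert explicitly. S8.pack keeps only the low
  -- 8 bits of each Char, so it is only safe for ASCII/Latin-1 input.
  let doc = "<b>sup</b>" :: String
  print (DOM.name <$> DOM.parse (S8.pack doc))
```

In GHCi the extension can be switched on with `:set -XOverloadedStrings` before running the documented example.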

Can you please update the documentation?

By the way, could you take the opportunity to document the rationale behind using ByteString rather than Text? As far as I know, XML documents are textual, not binary.

Add lens support, e.g. with xeno-lens

Hi,

I currently use hexml, primarily because hexml-lens makes it much simpler to find the data I need to pull out of the XML. hexml (and hexml-lens) is fairly unmaintained and has some other problems for me. Would something like xeno-lens fit in with this project?

Thanks

stack bench on Travis goes haywire

Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml-dom
Progress 45/46: xeno-0.3.4time                 157.0 μs   (85.46 μs .. 227.1 μs)
                     0.510 R²   (0.339 R² .. 0.692 R²)
mean                 272.6 μs   (240.1 μs .. 316.5 μs)
std dev              52.40 μs   (31.67 μs .. 70.18 μs)
variance introduced by outliers: 92% (severely inflated)
Progress 45/46: xeno-0.3.4benchmarking 4KB/xeno-sax
Progress 45/46: xeno-0.3.4time                 14.16 μs   (8.660 μs .. 22.37 μs)
                     0.445 R²   (0.414 R² .. 0.995 R²)
mean                 59.53 μs   (59.53 μs .. 59.53 μs)
std dev              0.0 s      (0.0 s .. 0.0 s)
variance introduced by outliers: -9223372036854775808% (severely inflated)
Progress 45/46: xeno-0.3.4benchmarking 4KB/xeno-dom
Progress 45/46: xeno-0.3.4time                 173.4 μs   (23.20 μs .. 365.4 μs)
                     0.220 R²   (0.131 R² .. 0.996 R²)
xeno-speed-bench: ./Data/Vector/Generic.hs:245 ((!)): index out of bounds (-9223372036854775808,1000)
CallStack (from HasCallStack):
  error, called at ./Data/Vector/Internal/Check.hs:87:5 in vector-0.12.0.1-JlawpRjIcMJIYPJVsWriIA:Data.Vector.Internal.Check
Progress 45/46: xeno-0.3.4xeno-speed-bench: thread blocked indefinitely in an MVar operation
Benchmark xeno-speed-bench: ERROR
Completed 46 action(s).
