Comments (5)

geofflangdale commented on May 23, 2024

I think minifying is useful. I have not removed clmul (you may have come to that conclusion from my huffing and puffing over perhaps normalizing all strings as a new alternate stage 1), and I feel this approach would work well. I believe you already have extensive experience with the trickery involved in stripping/transforming data, so doing this in conjunction with the quote bitmasks that clmul yields should not be horrendously difficult.

There is already a whitespace detection pass so the bit vector comprising "whitespace not in quotes" is already more or less latent in the code.
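To make the "whitespace not in quotes" bit vector concrete, here is a minimal portable sketch for a single block of up to 64 bytes. It is not the simdjson code: the names `prefix_xor` and `whitespace_outside_quotes` are hypothetical, the byte classification is done with a plain loop rather than SIMD, and it assumes the block starts outside a string and contains no backslash-escaped quotes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Inclusive prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// A carry-less multiply by an all-ones word produces the same low 64 bits.
inline uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

// Hypothetical helper: bit i of the result is set when block[i] is JSON
// whitespace that lies outside any string.
// Assumptions: at most 64 bytes, the block starts outside a string, and
// there are no backslash-escaped quotes.
inline uint64_t whitespace_outside_quotes(const std::string& block) {
    assert(block.size() <= 64);
    uint64_t quotes = 0;
    uint64_t whitespace = 0;
    for (std::size_t i = 0; i < block.size(); i++) {
        const char c = block[i];
        if (c == '"') {
            quotes |= uint64_t{1} << i;
        }
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            whitespace |= uint64_t{1} << i;
        }
    }
    // Set from each opening quote up to, but not including, the closing quote.
    const uint64_t in_string = prefix_xor(quotes);
    return whitespace & ~in_string;
}
```

A minifier would then drop exactly the bytes whose bit is set in the returned mask and copy everything else.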

As a side note, a minifying approach does not require stages 3 (the Ape Machine) or 4 (the Shovel Machine), and perhaps can do without much of stage 2. As such, it loses these as substantial bottlenecks. Stage 1, without these bottlenecks in place, might benefit from having its various substages that take place in general purpose registers transformed into SIMD themselves. There is a substitute for the CLMUL approach that could calculate the same thing over 256-bit or 512-bit values via PSHUFB on nibbles and a parallel prefix XOR approach. This might be worth looking at for AVX-512.
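One possible reading of the "PSHUFB on nibbles plus parallel prefix XOR" idea, sketched in portable scalar C++ on a 64-bit word. This is an illustration under assumptions, not code from the thread: all names are hypothetical, the 16-entry table stands in for the lookup a PSHUFB would perform on many nibbles at once, and the cross-nibble carries are resolved with a log-step prefix XOR over the per-nibble parities.

```cpp
#include <cassert>
#include <cstdint>

// Reference: low 64 bits of the carry-less product of x and an all-ones
// word, computed one bit at a time.
inline uint64_t clmul_all_ones_reference(uint64_t x) {
    uint64_t result = 0;
    for (int i = 0; i < 64; i++) {
        if ((x >> i) & 1) {
            result ^= ~uint64_t{0} << i;
        }
    }
    return result;
}

// Hypothetical nibble-based prefix XOR.
inline uint64_t prefix_xor_by_nibbles(uint64_t x) {
    // kNibblePrefix[n] is the prefix XOR of the 4-bit value n.
    static const uint8_t kNibblePrefix[16] = {
        0x0, 0xF, 0xE, 0x1, 0xC, 0x3, 0x2, 0xD,
        0x8, 0x7, 0x6, 0x9, 0x4, 0xB, 0xA, 0x5};
    uint64_t local = 0;     // per-nibble prefix XOR, ignoring lower nibbles
    uint32_t parities = 0;  // bit i: parity of nibble i
    for (int i = 0; i < 16; i++) {
        const uint64_t p = kNibblePrefix[(x >> (4 * i)) & 0xF];
        local |= p << (4 * i);
        parities |= static_cast<uint32_t>(p >> 3) << i;
    }
    // Exclusive prefix XOR: bit i is the parity of nibbles 0..i-1.
    uint32_t carries = parities << 1;
    carries ^= carries << 1;
    carries ^= carries << 2;
    carries ^= carries << 4;
    carries ^= carries << 8;
    // A nibble that receives an odd carry has all four of its bits flipped.
    uint64_t flip = 0;
    for (int i = 0; i < 16; i++) {
        if ((carries >> i) & 1) {
            flip |= uint64_t{0xF} << (4 * i);
        }
    }
    return local ^ flip;
}

// Convenience check that both routes agree on a given input.
inline void check_against_reference(uint64_t x) {
    assert(prefix_xor_by_nibbles(x) == clmul_all_ones_reference(x));
}
```

In a real 256-bit or 512-bit version the two scalar loops would become vector operations; the sketch only shows that the table lookup plus a carry fix-up yields the same bits as the carry-less multiply.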

from simdjson.

lemire commented on May 23, 2024

Ok. I have edited the title of this issue and turned it into a todo task for myself. I think it is a well-defined task that could lead to impressive results.


lemire commented on May 23, 2024

So I did implement it by hacking @geofflangdale's code and adding the necessary glue...

$ cd scalarvssimd
$ make
$ ./bench ../jsonexamples/twitter.json
Input has 616 KB
avx_json_parse(p.first, p.second, pj)   	:  2.305 cycles per input byte (best)  2.521 cycles per input byte (avg)
scalar_json_parse(p.first, p.second, pj)	:  5.914 cycles per input byte (best)  6.013 cycles per input byte (avg)
d.Parse<kParseValidateEncodingFlag>((const char *)buffer).HasParseError()	:  9.302 cycles per input byte (best)  9.360 cycles per input byte (avg)
d.Parse((const char *)buffer).HasParseError()	:  9.695 cycles per input byte (best)  9.741 cycles per input byte (avg)
d.ParseInsitu(buffer).HasParseError()   	:  4.836 cycles per input byte (best)  4.872 cycles per input byte (avg)
input length is 631514 stringified length is 466906
rapidstringme((char*) p.first)          	:  15.052 cycles per input byte (best)  15.085 cycles per input byte (avg)
rapidstringmeInsitu((char*) buffer)     	:  10.526 cycles per input byte (best)  10.585 cycles per input byte (avg)
these should match: 466906 466906
copy_without_useless_spaces(cbuffer, p.second,cbuffer)	:  0.574 cycles per input byte (best)  0.576 cycles per input byte (avg)

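So the minifier takes a bit over half a cycle per input byte (0.574), which is more than an order of magnitude faster than the RapidJSON-based stringification (10.5 to 15 cycles per input byte).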
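For reference, the job being timed can be stated as a short scalar routine. This is a hypothetical sketch, not the `copy_without_useless_spaces` implementation benchmarked above: the name `minify_scalar` is made up, it processes one byte at a time, and it does not validate the JSON.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scalar minifier: copies the input while dropping JSON
// whitespace that appears outside of strings. Backslash escapes inside
// strings are honored so that \" does not end the string.
inline std::string minify_scalar(const std::string& json) {
    std::string out;
    out.reserve(json.size());
    bool in_string = false;
    bool escaped = false;
    for (const char c : json) {
        if (in_string) {
            out.push_back(c);
            if (escaped) {
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                in_string = false;
            }
        } else if (c == '"') {
            in_string = true;
            out.push_back(c);
        } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
            out.push_back(c);
        }
    }
    // Minifying only removes bytes, so the output is never longer.
    assert(out.size() <= json.size());
    return out;
}
```

A SIMD version replaces the per-byte branches with bitmasks computed over a whole block, which is where the sub-cycle-per-byte figures come from.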

I think we might agree that it is probably as good as it is going to get without relying on AVX-512 or something fancier.


lemire commented on May 23, 2024

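It is so fast that applying the minifier and then calling RapidJSON can be faster than calling RapidJSON on the original (roughly 0.57 + 3.52 ≈ 4.09 cycles per input byte versus 4.79 in the runs below, if both figures are counted per byte of the original input):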

d.ParseInsitu(buffer).HasParseError()   	:  4.787 cycles per input byte (best)  4.824 cycles per input byte (avg)

vs

copy_without_useless_spaces(cbuffer, p.second,cbuffer)	:  0.573 cycles per input byte (best)  0.573 cycles per input byte (avg)
d.ParseInsitu(buffer).HasParseError()   	:  3.517 cycles per input byte (best)  3.758 cycles per input byte (avg)


lemire commented on May 23, 2024

@geofflangdale pointed out that I might be overoptimistic in thinking that 0.5 cycles per byte cannot be improved very much.

I have a little toy library to despace text in general...
https://github.com/lemire/despacer

The best I managed to do (with lots of effort) was ~0.33 cycles per input byte... and that's an easier problem.

So 0.5 cycles is quite a bit better than I expected, already. Of course, you can probably get it faster, but it is not going to be easy to make huge gains.
