Comments (5)
I think minifying is useful. I have not removed clmul (you may have come to that conclusion from my huffing and puffing about perhaps normalizing all strings as a new alternate stage 1), and I feel this approach would work well. I believe you already have extensive experience with the trickery involved in stripping/transforming data, so doing this in conjunction with the quote bitmasks that clmul yields should not be horrendously difficult.
There is already a whitespace detection pass, so the bit vector for "whitespace not in quotes" is already more or less latent in the code.
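To make the "whitespace not in quotes" bit vector concrete, here is a minimal scalar sketch for one block of up to 64 bytes. The names `whitespace_mask` and `removable_whitespace` are hypothetical, the bit order (bit i corresponds to byte i) is an assumption, and the in-string mask is taken as an input rather than computed here. It is an illustration, not the code in the repository.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bit i of the result is set when byte i of the block
// is one of the four JSON whitespace characters. Handles at most 64 bytes.
inline uint64_t whitespace_mask(const char *block, size_t len) {
  uint64_t mask = 0;
  for (size_t i = 0; i < len && i < 64; i++) {
    char c = block[i];
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      mask |= uint64_t(1) << i;
    }
  }
  return mask;
}

// Hypothetical helper: keep only whitespace bits that fall outside strings.
// in_string has bit i set when byte i lies inside a quoted string
// (assumed to come from the quote-detection step).
inline uint64_t removable_whitespace(uint64_t whitespace, uint64_t in_string) {
  return whitespace & ~in_string;
}
```

A minifier would then drop exactly the bytes whose bits are set in the result and copy everything else.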
As a side note, a minifying approach does not require stages 3 (the Ape Machine) or 4 (the Shovel Machine), and perhaps can do without much of stage 2, so those stages are no longer substantial bottlenecks. With them out of the way, stage 1 might benefit from having the substages that currently run in general-purpose registers converted to SIMD as well. There is a substitute for the CLMUL approach that could calculate the same thing over 256-bit or 512-bit values via PSHUFB on nibbles and a parallel prefix XOR approach. This might be worth looking at for AVX-512.
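As a rough illustration of the parallel prefix XOR idea, the sketch below computes, on a single 64-bit word, the same value as the low 64 bits of a carry-less multiply by an all-ones constant. It uses a log-step shift-and-XOR cascade in a general-purpose register. It does not show the PSHUFB-on-nibbles variant for 256-bit or 512-bit registers, and `prefix_xor` is a hypothetical name.

```cpp
#include <cstdint>

// Inclusive prefix XOR: bit i of the result is the XOR of bits 0..i of x.
// Applied to a mask of unescaped quote positions, the result has bit i set
// from each opening quote up to (but not including) the matching closing quote.
inline uint64_t prefix_xor(uint64_t x) {
  x ^= x << 1;   // each bit now covers a window of 2 input bits
  x ^= x << 2;   // window of 4
  x ^= x << 4;   // window of 8
  x ^= x << 8;   // window of 16
  x ^= x << 16;  // window of 32
  x ^= x << 32;  // window of 64, i.e. the full prefix
  return x;
}
```

For example, quote bits at positions 2 and 6 (the value 68) yield bits 2 through 5 (the value 60). Across consecutive 64-byte blocks, the result would also need to be inverted whenever the previous block ended inside a string; that carry is omitted here.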
from simdjson.
Ok. I have edited the title of this issue and turned it into a todo task for myself. I think it is a well-defined task that could lead to impressive results.
from simdjson.
So I did implement it by hacking @geofflangdale's code and adding the necessary glue...
$ cd scalarvssimd
$ make
$ ./bench ../jsonexamples/twitter.json
Input has 616 KB
avx_json_parse(p.first, p.second, pj) : 2.305 cycles per input byte (best) 2.521 cycles per input byte (avg)
scalar_json_parse(p.first, p.second, pj) : 5.914 cycles per input byte (best) 6.013 cycles per input byte (avg)
d.Parse<kParseValidateEncodingFlag>((const char *)buffer).HasParseError() : 9.302 cycles per input byte (best) 9.360 cycles per input byte (avg)
d.Parse((const char *)buffer).HasParseError() : 9.695 cycles per input byte (best) 9.741 cycles per input byte (avg)
d.ParseInsitu(buffer).HasParseError() : 4.836 cycles per input byte (best) 4.872 cycles per input byte (avg)
input length is 631514 stringified length is 466906
rapidstringme((char*) p.first) : 15.052 cycles per input byte (best) 15.085 cycles per input byte (avg)
rapidstringmeInsitu((char*) buffer) : 10.526 cycles per input byte (best) 10.585 cycles per input byte (avg)
these should match: 466906 466906
copy_without_useless_spaces(cbuffer, p.second,cbuffer) : 0.574 cycles per input byte (best) 0.576 cycles per input byte (avg)
So the minifier takes a bit over half a cycle per input byte, roughly an order of magnitude faster than RapidJSON.
I think we might agree that it is probably as good as it is going to get without relying on AVX-512 or something fancier.
from simdjson.
It is so fast that applying the minifier and then calling RapidJSON can be faster than calling RapidJSON on the original:
d.ParseInsitu(buffer).HasParseError() : 4.787 cycles per input byte (best) 4.824 cycles per input byte (avg)
vs
copy_without_useless_spaces(cbuffer, p.second,cbuffer) : 0.573 cycles per input byte (best) 0.573 cycles per input byte (avg)
d.ParseInsitu(buffer).HasParseError() : 3.517 cycles per input byte (best) 3.758 cycles per input byte (avg)
from simdjson.
@geofflangdale pointed out that I might be overoptimistic when thinking that 0.5 cycles per byte cannot be improved very much.
I have a little toy library to despace text in general...
https://github.com/lemire/despacer
The best I managed to do (with lots of effort) was ~0.33 cycles per input byte... and that's an easier problem.
So 0.5 cycles is quite a bit better than I expected, already. Of course, you can probably get it faster, but it is not going to be easy to make huge gains.
from simdjson.