simdjson / simdjson

Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

Home Page: https://simdjson.org

License: Apache License 2.0

Makefile 0.01% C++ 97.54% C 1.14% Shell 0.26% JavaScript 0.01% Python 0.40% CMake 0.65% Ruby 0.01% Dockerfile 0.01%
json json-parser simd avx2 sse42 neon aarch64 arm64 arm vs2019

simdjson's Issues

package as a bona fide library

Currently, the library is only usable for running benchmarks: it does not build an actual library file that can be installed and integrated into other software.

Better errors in the high-level interface.

The high-level interface (such as json_parse) only returns true or false, but it can fail under numerous different conditions.

As someone building on top of simdjson, I either need to hijack stderr when calling these methods and search the string to figure out what really went wrong, or I need to re-implement them entirely.

It would be preferable to instead return an error code, using 1 for success and negative values for the various errors.
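For illustration, a minimal sketch of what such an interface could look like (the enumerator names and the signature here are hypothetical, not simdjson's actual API):

#include <cstddef>

struct ParsedJson; // the parser's state, as in the benchmarks

// Hypothetical sketch: 1 for success, negative codes for the failure modes.
enum json_parse_result : int {
  JSON_PARSE_SUCCESS        =  1,
  JSON_PARSE_CAPACITY_ERROR = -1, // document exceeds the allocated capacity
  JSON_PARSE_DEPTH_ERROR    = -2, // nesting deeper than the supported maximum
  JSON_PARSE_UTF8_ERROR     = -3, // invalid UTF-8 in the input
  JSON_PARSE_TAPE_ERROR     = -4  // structurally malformed JSON
};

// json_parse would return one of the codes above instead of a bool.
int json_parse(const char *buf, size_t len, ParsedJson &pj);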

[cmake] add soversion to the resulting shared library

Adding a SOVERSION to the resulting shared object would help indicate API/ABI changes in this library. This would greatly benefit Linux distribution packagers. Please consider adding the following patch to master:

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index ea48953..dd7f2a4 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -32,7 +32,8 @@ if(NOT MSVC)
 ## We output the library at the root of the current directory where cmake is invoked
 ## This is handy but Visual Studio will happily ignore us
 set_target_properties(${SIMDJSON_LIB_NAME} PROPERTIES
-  LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
+  LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}
+  SOVERSION 0)
 MESSAGE( STATUS "Library output directory (does not apply to Visual Studio): " ${CMAKE_BINARY_DIR})
 endif()

For large files, the tapes collide

The current tape is just one giant tape with arbitrary segments, and the segments can overwrite one another.

We can do better.

To reproduce:

./parse -d scripts/javascript/large.json |grep size

Example of output:

 tape section i 32   (START)  from: 4161536 to: 4161538  size: 2
 tape section i 33  (NORMAL)  from: 4291584 to: 5291566  size: 999982
 tape section i 34  (NORMAL)  from: 4421632 to: 13421452  size: 8999820
 tape section i 35  (NORMAL)  from: 4551680 to: 15051470  size: 10499790
 tape section i 36  (NORMAL)  from: 4681728 to: 7681668  size: 2999940

This is scary bad. 👎

Provide a streaming API

Without much effort, we could support streaming processing without materializing JSON documents as in-memory tapes.
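One possible shape for this, sketched purely as an illustration (none of these names are simdjson API), is a SAX-style handler that receives events as the parser scans the input:

#include <cstddef>

// Hypothetical streaming interface: the parser calls into the handler
// instead of writing an in-memory tape.
struct json_event_handler {
  virtual ~json_event_handler() = default;
  virtual void start_object() = 0;
  virtual void end_object() = 0;
  virtual void start_array() = 0;
  virtual void end_array() = 0;
  virtual void key(const char *s, size_t len) = 0;
  virtual void string_value(const char *s, size_t len) = 0;
  virtual void number_value(double d) = 0;
  virtual void bool_value(bool b) = 0;
  virtual void null_value() = 0;
};

bool json_stream_parse(const char *buf, size_t len, json_event_handler &h);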

What are sane and interesting benchmarks?

As far as research prototypes go, this could be rather limited, but if we are going to write benchmarks, we need some stable programming interface so that we can solve problems.

The Mison paper claims to answer queries from a paper by Abadi...

SELECT DISTINCT "user.id" FROM tweets;

SELECT SUM(retweet_count) FROM tweets GROUP BY "user.id";

SELECT "user.id" FROM tweets t1, deletes d1, deletes d2 WHERE t1.id_str = d1."delete.status.id_str" AND d1."delete.status.user_id" = d2."delete.status.user_id" AND t1."user.lang" = 'msa';

SELECT t1."user.screen_name", t2."user.screen_name" FROM tweets t1, tweets t2, tweets t3 WHERE t1."user.screen_name" = t3."user.screen_name" AND t1."user.screen_name" = t2.in_reply_to_screen_name AND t2."user.screen_name" = t3.in_reply_to_screen_name;

I don't think they actually answer these queries. This requires an actual processing engine. I think that they do things like this...

  • Find all user/id values in tweets (where user/id is to be interpreted as a path)
  • Find all pairs user/id, retweet_count in tweets
  • Find all pairs user/id, user/lang in tweets
  • Find all pairs "in_reply_to_screen_name" and user/id

I am not sure I fully understand what the Mison paper tested, http://www.vldb.org/pvldb/vol10/p1118-li.pdf, though I am sure we can figure it out...

My guess as to what is a good and generic benchmark is to start with a JSON document and extract some kind of tabular data out of it. So turn the twitter JSON document into a table.
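Concretely (all types below are hypothetical), the benchmark would measure how quickly a parser can fill a flat table like this from the raw twitter.json bytes:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One row per tweet.
struct tweet_row {
  uint64_t user_id;       // from the "user.id" path
  uint64_t retweet_count; // from "retweet_count"
  std::string user_lang;  // from "user.lang"
};

std::vector<tweet_row> extract_table(const char *json, size_t len);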

simdjson should use a namespace

Right now, all functions and classes are defined in the global namespace. This might potentially lead to name clashes. Instead, simdjson should put everything into its own namespace.
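A sketch of the proposed change:

// Everything the library exports moves into one namespace...
namespace simdjson {
struct ParsedJson { /* ... */ };
// ...
} // namespace simdjson

// ...and callers qualify the names, so nothing can clash:
// simdjson::ParsedJson pj;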

Replace usage of string_view by a custom string class

From the README:

std::string_view p = get_corpus(filename);
...
// You can safely delete the string content
free((void*)p.data());

This is a misuse of string_view's semantics: it runs afoul of rules that static analyzers will flag, and we don't want students seeing it and copying it without understanding that it is not recommended (string_view is not a blanket replacement for char*!).

For the sake of the example, it'd probably be better to just use an idiomatic std::string and illustrate that the parsed document can still be used after that string goes out of scope.
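A sketch of what that idiomatic example could look like (assuming, as the issue implies, that the parser keeps its own copy of whatever it needs):

#include <string>

ParsedJson pj;
{
  std::string json = get_corpus(filename); // the string owns its buffer
  json_parse(json.data(), json.size(), pj);
} // json is destroyed here...
// ...but pj remains usable, because the parsed tape owns its own memory.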

"states" array is written to but never used

I don't understand the purpose of the "states" array, though I am sure it was explained to me. From the code, all we have is this...

  states[depth] = trans[states[depth]][c];

When is this used for anything?

Support Unsigned 64-bit integer

As far as I know, a number field in JSON is not limited in its number of digits, but parsers are limited by their own language's types.

In terms of datatypes, uint64 is frequently used for identifiers, hash values, and so on, and it is natural that many C++-based projects use unsigned long.

My opinion is that it would be great if the parser supported both uint64 and int64.
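A minimal sketch of the digit loop this would require (not simdjson's actual number parser): accumulate into a uint64_t with overflow checking, and let the caller tag the value as unsigned only when it does not fit in an int64_t.

#include <cstdint>

// Returns false on overflow; otherwise 'out' holds the parsed value.
bool parse_decimal_u64(const char *p, const char *end, uint64_t &out) {
  uint64_t v = 0;
  for (; p != end && *p >= '0' && *p <= '9'; p++) {
    uint64_t d = (uint64_t)(*p - '0');
    if (v > (UINT64_MAX - d) / 10) return false; // would overflow uint64
    v = v * 10 + d;
  }
  out = v;
  return true; // tag as uint64 on the tape when v > INT64_MAX
}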

I am really impressed with this piece of work. Thanks a lot =)

Odd warning about freeing through stringview

In file included from /home/geoff/git/simdjson/include/simdjson/common_defs.h:4:0,
                 from /home/geoff/git/simdjson/benchmark/parse.cpp:33:

In function ‘void aligned_free(void*)’,
    inlined from ‘int main(int, char**)’ at /home/geoff/git/simdjson/benchmark/parse.cpp:262:15:

/home/geoff/git/simdjson/include/simdjson/portability.h:123:9: warning: attempt to free a non-heap object [-Wfree-nonheap-object]
   free(memblock);
   ~~~~^~~~~~~~~~
gcc 7.2.0 doesn't like us trying to free our data through a string_view. I don't understand this warning or what the analysis thinks it's doing.

Support AVX-512

This is currently a low priority but it seems worthwhile to ask whether AVX-512 helps, and by how much. Of course, the work should be completed on AVX2 first.

consider cleaning out Geoff's 'personalized types'

I am fine with u8, i64 and so forth as type names, but it is hard to justify using so many typedefs when uint8_t and int64_t are perfectly good standards.

The over-reliance on u8 as opposed to char is also a bit of a problem.

This is mostly a "social" problem. The code looks a bit alien. It could look more "standard".
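For illustration, the cleanup would be a mechanical substitution to the standard <cstdint> names:

#include <cstdint>

// u8  -> uint8_t     u32 -> uint32_t
// i64 -> int64_t     u64 -> uint64_t
// ...and plain char (or unsigned char) where the data really is text.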

numberparsingcheck crashes

Hi,

'make test' on my Fedora 29 machine crashes during 'numberparsingcheck'.
https://gist.github.com/szydell/c2a9c01aadced506b1bfb16445a15bd1

Kernel: 4.20.10-200.fc29.x86_64
gcc-c++: Version 8.2.1, Architecture: x86_64

$ ldd numberparsingcheck
linux-vdso.so.1 (0x00007fff60ff7000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007ff7b42d5000)
libm.so.6 => /lib64/libm.so.6 (0x00007ff7b4151000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ff7b4136000)
libc.so.6 => /lib64/libc.so.6 (0x00007ff7b3f70000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff7b449c000)

CPU:
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
stepping : 10
microcode : 0x96
cpu MHz : 800.005
cache size : 9216 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf

Lower performance on small files

I have checked small messages of 10 KB to 15 KB in size, and the benchmark shows only 0.3 to 0.5 GB per second, while for the original sample files it shows more than 2 GB per second.

Also, when messages grow in size up to 10 MB, the benchmark shows ~1.5 GB per second.

Sources of the small files are here.

Below is the full output:

andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/che-1.geo.json 
number of bytes 11481 number of structural chars 3310 ratio 0.288
mem alloc instructions:       1346 cycles:       3165 (4.38 %) ins/cycles: 0.43 mis. branches:          2 (cycles/mis.branch 1526.95) cache accesses:        136 (failure          0)
 mem alloc runs at 0.28 cycles per input byte.
stage 1 instructions:      47601 cycles:      14322 (19.80 %) ins/cycles: 3.32 mis. branches:          3 (cycles/mis.branch 4227.40) cache accesses:        136 (failure          0)
 stage 1 runs at 1.25 cycles per input byte.
stage 2 instructions:     190120 cycles:      54860 (75.83 %) ins/cycles: 3.47 mis. branches:        126  (cycles/mis.branch 434.83)  cache accesses:        338 (failure          0)
 stage 2 runs at 4.78 cycles per input byte and 16.57 cycles per structural character.
 all stages: 6.30 cycles per input byte.
Min:  3.7839e-05 bytes read: 11481 Gigabytes/second: 0.303417
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/twitter_api_response.json 
number of bytes 15253 number of structural chars 1440 ratio 0.094
mem alloc instructions:       1346 cycles:       3010 (10.41 %) ins/cycles: 0.45 mis. branches:          1 (cycles/mis.branch 2886.06) cache accesses:        207 (failure          0)
 mem alloc runs at 0.20 cycles per input byte.
stage 1 instructions:      41378 cycles:      12579 (43.50 %) ins/cycles: 3.29 mis. branches:          1 (cycles/mis.branch 9256.61) cache accesses:        207 (failure          0)
 stage 1 runs at 0.82 cycles per input byte.
stage 2 instructions:      42454 cycles:      13330 (46.09 %) ins/cycles: 3.18 mis. branches:         14  (cycles/mis.branch 930.40)  cache accesses:        294 (failure          0)
 stage 2 runs at 0.87 cycles per input byte and 9.26 cycles per structural character.
 all stages: 1.90 cycles per input byte.
Min:  2.8586e-05 bytes read: 15253 Gigabytes/second: 0.533583
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/twitter.json
number of bytes 631514 number of structural chars 55264 ratio 0.088
mem alloc instructions:       1191 cycles:        792 (0.07 %) ins/cycles: 1.50 mis. branches:          1 (cycles/mis.branch 616.13) cache accesses:      30067 (failure          0)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:    1825757 cycles:     575452 (51.51 %) ins/cycles: 3.17 mis. branches:       1506 (cycles/mis.branch 382.07) cache accesses:      30067 (failure         27)
 stage 1 runs at 0.91 cycles per input byte.
stage 2 instructions:    1639867 cycles:     540975 (48.42 %) ins/cycles: 3.03 mis. branches:       1406  (cycles/mis.branch 384.52)  cache accesses:      52092 (failure         50)
 stage 2 runs at 0.86 cycles per input byte and 9.79 cycles per structural character.
 all stages: 1.77 cycles per input byte.
Min:  0.000309023 bytes read: 631514 Gigabytes/second: 2.04358
andriy@notebook:~/Projects/com/github/lemire/simdjson$ ./parse jsonexamples/jsoniter-scala/twitter_api_response_10Mb.json 
number of bytes 9606023 number of structural chars 905350 ratio 0.094
mem alloc instructions:       1443 cycles:       7052 (0.04 %) ins/cycles: 0.20 mis. branches:         13 (cycles/mis.branch 526.28) cache accesses:     337396 (failure        250)
 mem alloc runs at 0.00 cycles per input byte.
stage 1 instructions:   25956521 cycles:    8704923 (48.30 %) ins/cycles: 2.98 mis. branches:      10742 (cycles/mis.branch 810.34) cache accesses:     337396 (failure     237093)
 stage 1 runs at 0.91 cycles per input byte.
stage 2 instructions:   26562775 cycles:    9310078 (51.66 %) ins/cycles: 2.85 mis. branches:       7822  (cycles/mis.branch 1190.20)  cache accesses:     653145 (failure     415533)
 stage 2 runs at 0.97 cycles per input byte and 10.28 cycles per structural character.
 all stages: 1.88 cycles per input byte.
Min:  0.00619737 bytes read: 9606023 Gigabytes/second: 1.55002

#warning isn't portable, causing errors on MSVC (and others)

#warning is used to tell the end user that AVX2 isn't available; however, no version of Visual Studio supports #warning, and it will cause a cryptic error when building without AVX2 support. #warning isn't standard and isn't guaranteed to be available.
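A portable sketch: MSVC supports #pragma message, so the directive can be guarded by compiler:

#ifndef __AVX2__
#ifdef _MSC_VER
#pragma message("AVX2 is not available: falling back to a slower code path")
#else
#warning "AVX2 is not available: falling back to a slower code path"
#endif
#endif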

What is "white space"? and is the current approach to detect them best?

@geofflangdale's code seems to "define" white space characters as one of '\t', '\n', '\r', ' ', from my reading of the code.

One issue to contend with is that the definition of "white space" depends on the character encoding, so UTF-8 has other white space characters.

Anyhow, let us look at whether what Geoff does is optimally efficient...

The code to detect white space looks like this...


    __m256i v_lo = _mm256_and_si256(
        _mm256_shuffle_epi8(low_nibble_mask, input_lo),
        _mm256_shuffle_epi8(high_nibble_mask,
                            _mm256_and_si256(_mm256_srli_epi32(input_lo, 4),
                                             _mm256_set1_epi8(0x7f))));
    __m256i tmp_ws_lo = _mm256_cmpeq_epi8(
        _mm256_and_si256(v_lo, whitespace_shufti_mask), _mm256_set1_epi8(0));

So AND, SHUF, SHUF, AND, SHIFT, CMP, AND...
And then I guess you have to negate the result...

It sure is complicated!!!

I think you can do it more cheaply... (OR, CMP, SHUF, ADDS) and detect all of the five ASCII white space characters (tab, line feed, line tabulation, form feed, carriage return, space):

__m128i mask_20 = _mm_set1_epi8(0x20); // c == 32 (space)
__m128i mask_70 = _mm_set1_epi8(0x70); // adding 0x70 leaves the low 4 bits
                                       // alone but moves any value >= 16
                                       // above 128

// lookup table: 0xFF for 9 <= c <= 13:
__m128i lut_cntrl = _mm_setr_epi8(
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00);

__m128i v = ...; // your data

__m128i bytemask = _mm_or_si128(
    _mm_cmpeq_epi8(mask_20, v),
    _mm_shuffle_epi8(lut_cntrl, _mm_adds_epu8(mask_70, v)));
// bytemask has 0xFF on ASCII white space
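For reference, a scalar version of the same predicate (the five ASCII white space characters are 0x09 through 0x0D, plus 0x20):

static inline bool is_ascii_whitespace(unsigned char c) {
  return c == ' ' || (c >= '\t' && c <= '\r'); // 0x20, or 0x09..0x0D
}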

Why not have a "thicker tape"?

The current code uses a 32-bit tape word: 24 bits to index into the JSON document and 8 bits to track the character.

We have 64-bit machines, why use a 32-bit tape? This would allow supporting enormous JSON documents and could even allow us to move up and down the tree instead of just down (or not).

(There might be good reasons for using a 32-bit tape... )
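For illustration, one possible 64-bit layout (purely a sketch): an 8-bit type tag in the high byte and a 56-bit payload, which is enough to index documents far larger than the 16 MB that 24 bits allow.

#include <cstdint>

// Pack an 8-bit tag and a 56-bit payload into one 64-bit tape word.
static inline uint64_t tape_word(uint8_t tag, uint64_t payload) {
  return ((uint64_t)tag << 56) | (payload & 0x00FFFFFFFFFFFFFFULL);
}

static inline uint8_t tape_tag(uint64_t w) { return (uint8_t)(w >> 56); }
static inline uint64_t tape_payload(uint64_t w) {
  return w & 0x00FFFFFFFFFFFFFFULL;
}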

C# Version

Hi, is there a C# version? Thanks.

Add clear error when AVX2 is not supported on the current arch

I tried running the benchmarks on my machine but got an error when running the parser. I cloned the repo, ran cmake ., make, then make test to get this result. See screenfetch at bottom for specs of my machine.

Test project /Users/speleo/Downloads/simdjson
    Start 1: jsoncheck
1/1 Test #1: jsoncheck ........................***Exception: Illegal  0.01 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.01 sec

The following tests FAILED:
	  1 - jsoncheck (ILLEGAL)
Errors while running CTest
make: *** [test] Error 8
screenfetch output (ASCII logo omitted):

OS: macOS 10.14 18A391 x86_64
Host: MacBookPro10,1
Kernel: 18.0.0
Uptime: 7 days, 12 hours, 19 mins
Packages: 273
Shell: zsh 5.3
Resolution: 1440x900
DE: Aqua
WM: Kwm
Terminal: iTerm2
CPU: Intel i7-3635QM (8) @ 2.40GHz
GPU: Intel HD Graphics 4000, NVIDIA GeForce GT 650M
Memory: 3678MiB / 16384MiB

Code is not valgrind clean

Just running benchmark/parse on the twitter examples results in many complaints from valgrind; mostly "Conditional jump or move depends on uninitialised value(s)" errors in the unified_machine code.

Explore lazy materialization, for greater speed

The current code validates values (nulls and so forth) up to a point, but does not materialize (store) them.

This could be done. For example, use 64 bits per value, with some kind of union where the string case is handled as a pointer to the in-situ string that you have unescaped.

You need to have some way to map the tape to the values. Or maybe you want to write the values directly on the tape, somehow.
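A sketch of the 64-bits-per-value idea (the layout is hypothetical):

#include <cstdint>

// One machine word per value; the type tag would live elsewhere,
// for instance on the tape entry that points at this value.
union lazy_value {
  int64_t     as_int;
  double      as_double;
  const char *as_string; // points at the unescaped, in-situ string
};

static_assert(sizeof(lazy_value) == 8, "one 64-bit word per value");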

significant perf regression

The last couple of commits created a huge performance regression, for some quantum reason from hell. Probably some bad code?

Support null characters in strings

There seem to be some strong opinions out there that NUL characters in JSON strings should be supported. This suggests that we track strings by length: possibly keep offset+length on the tape, or start each string with a length field.
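A sketch of the length-prefix option (names hypothetical): write a 32-bit length before the raw bytes, so embedded NUL characters survive.

#include <cstdint>
#include <cstring>

// Append one length-prefixed string to the string buffer and return
// the new write position; the bytes may legitimately contain '\0'.
char *append_string(char *dst, const char *src, uint32_t len) {
  std::memcpy(dst, &len, sizeof(len));      // 32-bit length prefix
  std::memcpy(dst + sizeof(len), src, len); // raw bytes, not NUL-terminated
  return dst + sizeof(len) + len;
}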

shovel_machine should probably not forcibly go through all depths...

The shovel_machine function has code that looks like this...

  for (u32 i = 0; i < MAX_DEPTH; i++) {
    ....
  }

My understanding of the code is that if a level is empty (start_loc == end_loc), then all deeper levels are also going to be empty, so the loop should end once it reaches an empty level.

(I am aware that this is not a performance issue.)
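If that reading is correct, the fix would be a one-line early exit, sketched here under the assumption that start_loc and end_loc are the per-level bounds the loop already computes:

  for (u32 i = 0; i < MAX_DEPTH; i++) {
    if (start_loc == end_loc) {
      break; // this level is empty, so all deeper levels are empty too
    }
    // ... process level i as before ...
  }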

consider fusing the string and main tapes

The intuition behind having two separate tapes is that it keeps the main tape tight when there are many large strings. This may or may not be an important consideration.

Two headers have CRLF line terminators

This is a minor issue: only two files (singleheader/simdjson.h and include/simdjson/common_defs.h) have CRLF (DOS) line terminators.

The first one looks auto-generated, so only include/simdjson/common_defs.h is problematic.

Do we want to allow the user to set limits on the maximum depth of nesting?

The JSON spec states...

An implementation may set limits on the maximum depth of nesting.

De facto, the current code goes up to 256... which is perfectly reasonable (it may even be a feature).

It seems that one benchmark the Mison paper uses is to parse all (but only) the "root-level" fields. Depending on your terminology that might be level 2 or 3.

In one email, Geoff points out that stages 3 and 4 could be greatly accelerated if we limited the depth. This seems like a useful and attractive proposition: suppose that you know, as the user, that you are not going to need to go very deep. Then it seems you could get much of the benefit of the Mison approach simply by specifying a maximal depth.

Of course, you would not validate the lower levels, where there might be junk hiding... but the result would be well-defined, at least. (That is, you could define the result.)

Add fuzz testing

I ran AFL and it almost immediately produced a segmentation fault.

It can be reproduced using the parse benchmark:

#0  find_structural_bits (buf=buf@entry=0x55555558f100 "\n", len=len@entry=1, pj=...) at src/stage1_find_marks.cpp:437
#1  0x000055555555823f in find_structural_bits (pj=..., len=1, buf=0x55555558f100 "\n") at include/simdjson/stage1_find_marks.h:12
#2  main (argc=<optimized out>, argv=<optimized out>) at benchmark/parse.cpp:154

id:000000,sig:11,src:000000,op:havoc,rep:128.zip
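To go beyond one-off AFL runs, a minimal fuzz entry point could look like this (a sketch reusing the ParsedJson and json_parse names from the benchmarks; the allocateCapacity call is an assumption about the allocation API):

#include <cstddef>
#include <cstdint>

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  ParsedJson pj;
  if (pj.allocateCapacity(size)) { // assumed allocation API
    json_parse(reinterpret_cast<const char *>(data), size, pj);
  }
  return 0; // any crash or sanitizer report is a finding
}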

WebAssembly Compile Target

Hi there! I'm very limited in my WebAssembly and C/C++ knowledge; however, if this could be run via WebAssembly and allow JavaScript applications and processes to achieve similar (even if slightly degraded) results, it would be pretty rad!

I'm not sure what work would be involved, so at the least I'd love to request this as a feature, so that tools running on Node.js or the Web can leverage this awesome piece of work.

remove the using namespace std in the header only version

stage1_find_marks.cpp and stage2_build_tape.cpp both start with a using namespace std line, and both are agglomerated into the header-only version.

Since the code is not namespaced, the using directive will be visible in any translation unit that includes them, which can be a problem.

easy fix:

namespace simdjson
{
#include <simdjson.h>
#include <simdjson.cpp>
}

Build a "normalize/stringify/minify" JSON function

A useful task is to trim all useless white space from a JSON document. This is intrinsically useful. There is an argument that this should not be necessary, but I think it is bogus: in many applications, you can't expect the incoming JSON to be minified.

This could be a problem almost orthogonal to JSON parsing.

Am I correct in thinking that your previous approach to structural elements was better suited to this task (the one with clmul in it)?

My current thinking is that it might offer one concrete test for this work, and even if it does not go through all stages, that would still be an interesting test.
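As a correctness baseline for such a function, a naive scalar minifier is easy to write (a sketch: it assumes well-formed input and strips only the four JSON white space characters outside strings):

#include <cstddef>

// Write the minified JSON to 'out' (at least 'len' bytes) and return
// the number of bytes written.
size_t minify(const char *in, size_t len, char *out) {
  size_t j = 0;
  bool in_string = false;
  for (size_t i = 0; i < len; i++) {
    char c = in[i];
    if (in_string) {
      out[j++] = c;
      if (c == '\\' && i + 1 < len) out[j++] = in[++i]; // keep escapes intact
      else if (c == '"') in_string = false;
    } else if (c == '"') {
      in_string = true;
      out[j++] = c;
    } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
      out[j++] = c;
    }
  }
  return j;
}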
