v0ldek / rsonpath

Blazing fast JSONPath query engine written in Rust.

Home Page: https://rsonquery.github.io/rsonpath/

License: MIT License

Rust 98.69% Just 1.15% Nix 0.09% CSS 0.01% Dockerfile 0.03% Shell 0.03%

json jsonpath query rust simd avx2 cli command-line-tool search rq

rsonpath's Introduction

rsonpath – SIMD-powered JSONPath 🚀

Experimental JSONPath engine for querying massive streamed datasets.

The rsonpath crate provides a JSONPath parser and a query execution engine rq, which utilizes SIMD instructions to provide massive throughput improvements over conventional engines.

Benchmarks of rsonpath against a reference no-SIMD engine on the Pison dataset. NOTE: Scale is logarithmic!

Usage

To run a JSONPath query on a file execute:

rq '$..a.b' ./file.json

If the file is omitted, the engine reads standard input. JSON can also be passed inline:

$ rq '$..a.b' --json '{"c":{"a":{"b":42}}}'
42

For details, consult rq --help or the rsonbook.

Results

The result of running a query is a sequence of matched values, delimited by newlines. Alternatively, passing --result count returns only the number of matches, which might be much faster. For other result modes consult the --help usage page.

Installation

See Releases for precompiled binaries for all first-class support targets.

`cargo`

The easiest way to install is via cargo.

$ cargo install rsonpath
...

Native CPU optimizations

If maximum speed is paramount, you should install rsonpath with native CPU instructions support. This will result in a binary that is not portable and might work incorrectly on any other machine, but will squeeze out every last bit of throughput.

To do this, run the following cargo install variant:

$ RUSTFLAGS="-C target-cpu=native" cargo install rsonpath
...

Check out the relevant chapter in the rsonbook.

Query language

The project is actively developed and currently supports only a subset of the JSONPath query language. A query is a sequence of segments, each containing one or more selectors.

Supported segments

Segment	Syntax	Supported	Since
Child segment (single)	`[<selector>]`	✔️	v0.1.0
Child segment (multiple)	`[<selector1>,...,<selectorN>]`	❌
Descendant segment (single)	`..[<selector>]`	✔️	v0.1.0
Descendant segment (multiple)	`..[<selector1>,...,<selectorN>]`	❌

Supported selectors

Selector	Syntax	Supported	Since	Tracking Issue
Root	`$`	✔️	v0.1.0
Name	`.<member>`, `[<member>]`	✔️	v0.1.0
Wildcard	`.*`, `..*`, `[*]`	✔️	v0.4.0
Index (array index)	`[<index>]`	✔️	v0.5.0
Index (array index from end)	`[-<index>]`	❌
Array slice (forward, positive bounds)	`[<start>:<end>:<step>]`	✔️	v0.9.0	#152
Array slice (forward, arbitrary bounds)	`[<start>:<end>:<step>]`	❌
Array slice (backward, arbitrary bounds)	`[<start>:<end>:-<step>]`	❌
Filters – existential tests	`[?<path>]`	❌		#154
Filters – const atom comparisons	`[?<path> <binop> <atom>]`	❌		#156
Filters – logical expressions	`&&`, `||`, `!`	❌
Filters – nesting	`[?<expr>[?<expr>]...]`	❌
Filters – arbitrary comparisons	`[?<path> <binop> <path>]`	❌
Filters – function extensions	`[?func(<path>)]`	❌

Supported platforms

The crate is continuously built for all Tier 1 Rust targets, and tests are continuously run for targets that can be run with GitHub Actions images. SIMD is supported only on x86/x86_64 platforms.

Target triple	nosimd build	SIMD support	Continuous testing	Tracking issues
aarch64-unknown-linux-gnu	✔️	❌	✔️	#21, #115
i686-unknown-linux-gnu	✔️	✔️	✔️
x86_64-unknown-linux-gnu	✔️	✔️	✔️
x86_64-apple-darwin	✔️	✔️	✔️
i686-pc-windows-gnu	✔️	✔️	✔️
i686-pc-windows-msvc	✔️	✔️	✔️
x86_64-pc-windows-gnu	✔️	✔️	✔️
x86_64-pc-windows-msvc	✔️	✔️	✔️

SIMD support

SIMD support is enabled on a module-by-module basis. Generally, any CPU released in the past decade supports AVX2, which enables all available optimizations.

Older CPUs with SSE2 or higher get partial support. You can check what exactly is enabled with rq --version – check the SIMD support field:

$ rq --version
rq 0.9.1

Commit SHA:      c024e1bab89610455537b77aed249d2a05a81ed6
Features:        default,simd
Opt level:       3
Target triple:   x86_64-unknown-linux-gnu
Codegen flags:   link-arg=-fuse-ld=lld
SIMD support:    avx2;fast_quotes;fast_popcnt

The fast_quotes capability depends on the pclmulqdq instruction, and fast_popcnt on the popcnt instruction.

Caveats and limitations

JSONPath

Not all selectors are supported, see the support table above.

Duplicate keys

The engine assumes that every object in the input JSON has no duplicate keys. Behavior on duplicate keys is not guaranteed to be stable, but currently the engine will simply match the first such key.

$ rq '$.key' --json '{"key":"value","key":"other value"}'
"value"

Unicode

The engine does not parse unicode escape sequences in member names. This means that a key "a" is different from a key "\u0061", even though semantically they represent the same string. This is actually as-designed with respect to the current JSONPath spec. Parsing unicode sequences is costly, so support for this was postponed in favour of high performance. This is tracked as #117.
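As a rough illustration of why the two spellings differ, the comparison can be pictured as a plain byte-wise equality check on the raw key. This is a sketch under that assumption, not the engine's actual code; `raw_key_matches` is a hypothetical helper.

```rust
/// Compares a member name from the query with the raw bytes found between
/// the quotes of a JSON key. No escape sequences are decoded.
/// Hypothetical helper, for illustration only.
pub fn raw_key_matches(query_member: &str, raw_json_key: &[u8]) -> bool {
    query_member.as_bytes() == raw_json_key
}
```

With this model, `raw_key_matches("a", b"a")` holds, while the escaped spelling `br"\u0061"` is six bytes long and does not match the one-byte name.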

Contributing

The gist is: fork, implement, make a PR back here. More details are in the CONTRIBUTING doc.

Build & test

The dev workflow utilizes just. Use the included Justfile. It will automatically install Rust for you using the rustup tool if it detects there is no Cargo in your environment.

$ just build
...
$ just test
...

Benchmarks

Benchmarks for rsonpath are located in a separate repository, included as a git submodule in this main repository.

The easiest way to run all the benchmarks is `just bench`. For details, look at the README in the submodule.

Background

We have a paper on rsonpath to be published at ASPLOS '24! You can read it here.

This project was conceived as my thesis. You can read it for details on the theoretical background on the engine and details of its implementation.

Dependencies

Showing direct dependencies, for full graph see below.

cargo tree --package rsonpath --edges normal --depth 1

rsonpath v0.9.1 (/home/mat/src/rsonpath/crates/rsonpath)
├── clap v4.5.4
├── color-eyre v0.6.3
├── eyre v0.6.12
├── log v0.4.21
├── rsonpath-lib v0.9.1 (/home/mat/src/rsonpath/crates/rsonpath-lib)
├── rsonpath-syntax v0.3.1 (/home/mat/src/rsonpath/crates/rsonpath-syntax)
└── simple_logger v4.3.3
[build-dependencies]
├── rustflags v0.1.5
└── vergen v8.3.1
    [build-dependencies]

cargo tree --package rsonpath-lib --edges normal --depth 1

rsonpath-lib v0.9.1 (/home/mat/src/rsonpath/crates/rsonpath-lib)
├── arbitrary v1.3.2
├── cfg-if v1.0.0
├── log v0.4.21
├── memmap2 v0.9.4
├── nom v7.1.3
├── rsonpath-syntax v0.3.1 (/home/mat/src/rsonpath/crates/rsonpath-syntax)
├── smallvec v1.13.2
├── static_assertions v1.1.0
├── thiserror v1.0.58
└── vector-map v1.0.1

Justification

clap – standard crate to provide the CLI.
color-eyre, eyre – more accessible error messages for the parser.
log, simple_logger – diagnostic logs during compilation and execution.
cfg-if – used to support SIMD and no-SIMD versions.
memmap2 – for fast reading of source files via a memory map instead of buffered copies.
nom – for parser implementation.
smallvec – crucial for small-stack performance.
static_assertions – additional reliability by some constant assumptions validated at compile time.
thiserror – idiomatic Error implementations.
vector-map – used in the query compiler for measurably better performance.

Full dependency tree

cargo tree --package rsonpath --edges normal

rsonpath v0.9.1 (/home/mat/src/rsonpath/crates/rsonpath)
├── clap v4.5.4
│   ├── clap_builder v4.5.2
│   │   ├── anstream v0.6.13
│   │   │   ├── anstyle v1.0.6
│   │   │   ├── anstyle-parse v0.2.3
│   │   │   │   └── utf8parse v0.2.1
│   │   │   ├── anstyle-query v1.0.2
│   │   │   │   └── windows-sys v0.52.0
│   │   │   │       └── windows-targets v0.52.4
│   │   │   │           ├── windows_aarch64_gnullvm v0.52.4
│   │   │   │           ├── windows_aarch64_msvc v0.52.4
│   │   │   │           ├── windows_i686_gnu v0.52.4
│   │   │   │           ├── windows_i686_msvc v0.52.4
│   │   │   │           ├── windows_x86_64_gnu v0.52.4
│   │   │   │           ├── windows_x86_64_gnullvm v0.52.4
│   │   │   │           └── windows_x86_64_msvc v0.52.4
│   │   │   ├── anstyle-wincon v3.0.2
│   │   │   │   ├── anstyle v1.0.6
│   │   │   │   └── windows-sys v0.52.0 (*)
│   │   │   ├── colorchoice v1.0.0
│   │   │   └── utf8parse v0.2.1
│   │   ├── anstyle v1.0.6
│   │   ├── clap_lex v0.7.0
│   │   ├── strsim v0.11.1
│   │   └── terminal_size v0.3.0
│   │       ├── rustix v0.38.32
│   │       │   ├── bitflags v2.5.0
│   │       │   ├── errno v0.3.8
│   │       │   │   ├── libc v0.2.153
│   │       │   │   └── windows-sys v0.52.0 (*)
│   │       │   ├── libc v0.2.153
│   │       │   ├── linux-raw-sys v0.4.13
│   │       │   └── windows-sys v0.52.0 (*)
│   │       └── windows-sys v0.48.0
│   │           └── windows-targets v0.48.5
│   │               ├── windows_aarch64_gnullvm v0.48.5
│   │               ├── windows_aarch64_msvc v0.48.5
│   │               ├── windows_i686_gnu v0.48.5
│   │               ├── windows_i686_msvc v0.48.5
│   │               ├── windows_x86_64_gnu v0.48.5
│   │               ├── windows_x86_64_gnullvm v0.48.5
│   │               └── windows_x86_64_msvc v0.48.5
│   └── clap_derive v4.5.4 (proc-macro)
│       ├── heck v0.5.0
│       ├── proc-macro2 v1.0.79
│       │   └── unicode-ident v1.0.12
│       ├── quote v1.0.35
│       │   └── proc-macro2 v1.0.79 (*)
│       └── syn v2.0.58
│           ├── proc-macro2 v1.0.79 (*)
│           ├── quote v1.0.35 (*)
│           └── unicode-ident v1.0.12
├── color-eyre v0.6.3
│   ├── backtrace v0.3.71
│   │   ├── addr2line v0.21.0
│   │   │   └── gimli v0.28.1
│   │   ├── cfg-if v1.0.0
│   │   ├── libc v0.2.153
│   │   ├── miniz_oxide v0.7.2
│   │   │   └── adler v1.0.2
│   │   ├── object v0.32.2
│   │   │   └── memchr v2.7.2
│   │   └── rustc-demangle v0.1.23
│   │   [build-dependencies]
│   │   └── cc v1.0.90
│   ├── eyre v0.6.12
│   │   ├── indenter v0.3.3
│   │   └── once_cell v1.19.0
│   ├── indenter v0.3.3
│   ├── once_cell v1.19.0
│   └── owo-colors v3.5.0
├── eyre v0.6.12 (*)
├── log v0.4.21
├── rsonpath-lib v0.9.1 (/home/mat/src/rsonpath/crates/rsonpath-lib)
│   ├── cfg-if v1.0.0
│   ├── log v0.4.21
│   ├── memmap2 v0.9.4
│   │   └── libc v0.2.153
│   ├── nom v7.1.3
│   │   ├── memchr v2.7.2
│   │   └── minimal-lexical v0.2.1
│   ├── rsonpath-syntax v0.3.1 (/home/mat/src/rsonpath/crates/rsonpath-syntax)
│   │   ├── nom v7.1.3 (*)
│   │   ├── owo-colors v4.0.0
│   │   ├── thiserror v1.0.58
│   │   │   └── thiserror-impl v1.0.58 (proc-macro)
│   │   │       ├── proc-macro2 v1.0.79 (*)
│   │   │       ├── quote v1.0.35 (*)
│   │   │       └── syn v2.0.58 (*)
│   │   └── unicode-width v0.1.11
│   ├── smallvec v1.13.2
│   ├── static_assertions v1.1.0
│   ├── thiserror v1.0.58 (*)
│   └── vector-map v1.0.1
│       ├── contracts v0.4.0 (proc-macro)
│       │   ├── proc-macro2 v1.0.79 (*)
│       │   ├── quote v1.0.35 (*)
│       │   └── syn v1.0.109
│       │       ├── proc-macro2 v1.0.79 (*)
│       │       ├── quote v1.0.35 (*)
│       │       └── unicode-ident v1.0.12
│       └── rand v0.7.3
│           ├── getrandom v0.1.16
│           │   ├── cfg-if v1.0.0
│           │   ├── libc v0.2.153
│           │   └── wasi v0.9.0+wasi-snapshot-preview1
│           ├── libc v0.2.153
│           ├── rand_chacha v0.2.2
│           │   ├── ppv-lite86 v0.2.17
│           │   └── rand_core v0.5.1
│           │       └── getrandom v0.1.16 (*)
│           ├── rand_core v0.5.1 (*)
│           └── rand_hc v0.2.0
│               └── rand_core v0.5.1 (*)
├── rsonpath-syntax v0.3.1 (/home/mat/src/rsonpath/crates/rsonpath-syntax) (*)
└── simple_logger v4.3.3
    ├── colored v2.1.0
    │   ├── lazy_static v1.4.0
    │   └── windows-sys v0.48.0 (*)
    ├── log v0.4.21
    ├── time v0.3.34
    │   ├── deranged v0.3.11
    │   │   └── powerfmt v0.2.0
    │   ├── itoa v1.0.11
    │   ├── libc v0.2.153
    │   ├── num-conv v0.1.0
    │   ├── num_threads v0.1.7
    │   │   └── libc v0.2.153
    │   ├── powerfmt v0.2.0
    │   ├── time-core v0.1.2
    │   └── time-macros v0.2.17 (proc-macro)
    │       ├── num-conv v0.1.0
    │       └── time-core v0.1.2
    └── windows-sys v0.48.0 (*)
[build-dependencies]
├── rustflags v0.1.5
└── vergen v8.3.1
    ├── anyhow v1.0.81
    ├── cargo_metadata v0.18.1
    │   ├── camino v1.1.6
    │   │   └── serde v1.0.197
    │   │       └── serde_derive v1.0.197 (proc-macro)
    │   │           ├── proc-macro2 v1.0.79 (*)
    │   │           ├── quote v1.0.35 (*)
    │   │           └── syn v2.0.58 (*)
    │   ├── cargo-platform v0.1.8
    │   │   └── serde v1.0.197 (*)
    │   ├── semver v1.0.22
    │   │   └── serde v1.0.197 (*)
    │   ├── serde v1.0.197 (*)
    │   ├── serde_json v1.0.115
    │   │   ├── itoa v1.0.11
    │   │   ├── ryu v1.0.17
    │   │   └── serde v1.0.197 (*)
    │   └── thiserror v1.0.58 (*)
    ├── cfg-if v1.0.0
    ├── regex v1.10.4
    │   ├── aho-corasick v1.1.3
    │   │   └── memchr v2.7.2
    │   ├── memchr v2.7.2
    │   ├── regex-automata v0.4.6
    │   │   ├── aho-corasick v1.1.3 (*)
    │   │   ├── memchr v2.7.2
    │   │   └── regex-syntax v0.8.3
    │   └── regex-syntax v0.8.3
    ├── rustc_version v0.4.0
    │   └── semver v1.0.22 (*)
    └── time v0.3.34
        ├── deranged v0.3.11 (*)
        ├── itoa v1.0.11
        ├── libc v0.2.153
        ├── num-conv v0.1.0
        ├── num_threads v0.1.7 (*)
        ├── powerfmt v0.2.0
        └── time-core v0.1.2
    [build-dependencies]
    └── rustversion v1.0.14 (proc-macro)

rsonpath's Issues

Remove `eyre` as a library dependency and replace with `thiserror`

Is your feature request related to a problem? Please describe.
Using eyre inside of the rsonpath library is a design error. It is well accepted within the Rust community that APIs should expose well-defined errors, and there's a crate that makes it easy, thiserror.

Describe the solution you'd like
The eyre crate should not be a dependency inside the library of rsonpath. This is mainly an issue in the parser. All parsing errors should instead be exposed as a first-class type with the use of thiserror, and then converted to eyre errors in the binary (main.rs) for display to the user.
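A minimal sketch of the proposed split, assuming hypothetical error variants and a toy `check_root` function standing in for the parser. The `Display` and `Error` impls are written by hand here to keep the example dependency-free; with thiserror they would be derived. The library returns a concrete error type, and only the binary erases it.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical first-class parser error; variant names are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    EmptyQuery,
    MissingRoot { found: char },
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::EmptyQuery => write!(f, "query is empty"),
            Self::MissingRoot { found } => {
                write!(f, "expected '$' at the start of the query, found '{}'", found)
            }
        }
    }
}

impl Error for ParseError {}

/// Toy stand-in for the library parser: it only checks the root selector.
pub fn check_root(query: &str) -> Result<(), ParseError> {
    match query.chars().next() {
        None => Err(ParseError::EmptyQuery),
        Some('$') => Ok(()),
        Some(found) => Err(ParseError::MissingRoot { found }),
    }
}

/// What the binary could do: convert to a dynamic error only at the edge,
/// where the message is displayed to the user.
pub fn run(query: &str) -> Result<(), Box<dyn Error>> {
    check_root(query)?;
    Ok(())
}
```

Library callers can then match on `ParseError` variants, while the binary remains free to wrap the error in whatever reporting type it prefers.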

Allow the structural classifier to be stopped and resumed

Is your feature request related to a problem? Please describe.
After #17 we should be able to allow switching classifiers. To do that, first allow the structural classifier to be stopped on-demand and then resumed from the same place.

Describe the solution you'd like
There needs to be an object representing the state of the classifier when it was stopped that will then allow the structural classifier (or a different one if we introduce one) to be resumed from the same place.
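One possible shape for such a state object, sketched with a scalar (non-SIMD) classifier. All names here (`ResumeState`, `StructuralClassifier`, `stop`, `resume`) are assumptions made for illustration and are not taken from the codebase.

```rust
/// Hypothetical snapshot of where classification stopped.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResumeState {
    /// Byte offset of the next unread byte.
    pub offset: usize,
    /// Whether that offset lies inside a JSON string.
    pub in_string: bool,
}

/// Scalar toy classifier yielding `(offset, byte)` for structural characters.
pub struct StructuralClassifier<'a> {
    input: &'a [u8],
    offset: usize,
    in_string: bool,
}

impl<'a> StructuralClassifier<'a> {
    pub fn new(input: &'a [u8]) -> Self {
        Self { input, offset: 0, in_string: false }
    }

    /// Continues classification from a previously captured state.
    pub fn resume(input: &'a [u8], state: ResumeState) -> Self {
        Self { input, offset: state.offset, in_string: state.in_string }
    }

    /// Consumes the classifier and returns the state needed to resume.
    pub fn stop(self) -> ResumeState {
        ResumeState { offset: self.offset, in_string: self.in_string }
    }
}

impl Iterator for StructuralClassifier<'_> {
    type Item = (usize, u8);

    fn next(&mut self) -> Option<Self::Item> {
        while self.offset < self.input.len() {
            let idx = self.offset;
            let byte = self.input[idx];
            self.offset += 1;
            match byte {
                // Inside a string, a backslash escapes the next byte: skip it.
                b'\\' if self.in_string => self.offset += 1,
                b'"' => self.in_string = !self.in_string,
                b'{' | b'}' | b'[' | b']' | b':' if !self.in_string => {
                    return Some((idx, byte));
                }
                _ => {}
            }
        }
        None
    }
}
```

Because the state is a plain value, a different classifier could in principle accept the same `ResumeState` and pick up from the same offset.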

Evaluate performance of a "find first possible match heuristic"

Is your feature request related to a problem? Please describe.
Many regular expression engines benefit from a simple heuristic where they attempt to first localize a place where a match could even occur. This can often be done quickly, while producing false positives (but never false negatives).

Our pipeline might benefit from something like this as well. If the first label that the user asks for in a query occurs rarely in the input, then simply searching first for its occurrence without caring about escapes or quotes should be extremely fast and yield a location where the main search engine could start.

Describe the solution you'd like
This is an open design space. Experiments are required to find the best way to implement this idea and then evaluate its performance. It is not obvious that this will actually gain us much, so one should start with a proof-of-concept that this is even worthwhile.
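A proof-of-concept could start from something as small as the sketch below: a naive scan for the quoted label that ignores escapes and string context. `first_candidate` is a hypothetical name, and a real experiment would likely use a vectorized substring search instead of `windows`.

```rust
/// Returns the offset of the first byte sequence that looks like `"label"`,
/// ignoring escapes and string context entirely.
/// The result may be a false positive, e.g. a string value equal to the label.
pub fn first_candidate(input: &[u8], label: &str) -> Option<usize> {
    // Build the needle: the label wrapped in double quotes.
    let mut needle = Vec::with_capacity(label.len() + 2);
    needle.push(b'"');
    needle.extend_from_slice(label.as_bytes());
    needle.push(b'"');
    input
        .windows(needle.len())
        .position(|window| window == needle.as_slice())
}
```

On `{"v":"rare","rare":1}` the function reports the string value at offset 5 rather than the key, which is exactly the kind of false positive the main engine would have to rule out after starting there.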

Proptests for the end-to-end query engine

Is your feature request related to a problem? Please describe.
Our engine integration tests now run on a limited sample of hand-crafted JSONs and queries, as well as the Wikidata JSONs (again with hand-picked queries). A proptest suite generating input JSONs and queries would make the library much more fail-proof.

Describe the solution you'd like
This is a complex design space. An iterative approach might work here -- create very limited inputs (only certain types of JSON trees and queries) to start off and then continue expanding that in follow-up issues/PRs.

Additional context
To get started, see the proptest book.

Proptests for the `query` parser

Is your feature request related to a problem? Please describe.
We should use the proptest crate to comprehensively test the query parser.

Describe the solution you'd like
As usual for proptests, this is a bit of a complex design space.

It's relatively easy to create proptests for correct query strings – just generate an arbitrary sequence of selectors and stringify them. It might be good to start with that as a single PR first.

Query strings that should result in an error are a bit more tricky. There are many things that can cause a parsing error, and just generating random garbage won't give too good a coverage. We can start by generating a sequence of valid selectors interspersed with "error" segments, and then figuring out smart kinds of "error" – things like three-dot sequences, root selectors inside unescaped labels...

As usual for proptests, we need to do it iteratively. Start with simpler tests and then subsequently add more complex cases.

Additional context
I'm not sure what the acceptance criteria for this issue are, so we're just gonna wing it – we'll close it when we feel sufficiently happy with the test suite for something labelled as "1.0.0".

Compile the wildcard descendant selector `..*`/`..[*]`

Is your feature request related to a problem? Please describe.
The automaton module should recognize the wildcard descendant selectors introduced by #69 and correctly compile them.

Describe the solution you'd like
For the NFA, a state representing the recursive descent wildcard is a Recursive state that can match any label. The approach should be similar to #7.

Improve `Display` of a `JsonPathQuery`

Is your feature request related to a problem? Please describe.
The current Display impl of JsonPathQueryNode (and therefore JsonPathQuery) is flawed. At minimum, it should round-trip: calling JsonPathQuery::parse(query.display()) should result in the same query. This is currently not the case due to labels containing the ' character.

Describe the solution you'd like
The code is not too hard, the crucial part is verifying – proptests for round-tripping. These are easy to write, since we just need to generate arbitrary queries. Proptest infrastructure from #51 can be shared with this issue.
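To make the round-trip requirement concrete, here is a small sketch for a single name selector. `display_name` and `parse_name` are hypothetical helpers, and the escaping rule (backslash before `'` and `\`) is a simplifying assumption rather than the full JSONPath string grammar.

```rust
/// Renders a member name as a bracketed, single-quoted name selector,
/// escaping the characters that would otherwise end the literal early.
pub fn display_name(label: &str) -> String {
    let mut out = String::from("['");
    for c in label.chars() {
        if c == '\'' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push_str("']");
    out
}

/// Toy inverse of `display_name`; returns `None` on malformed input.
pub fn parse_name(text: &str) -> Option<String> {
    let inner = text.strip_prefix("['")?.strip_suffix("']")?;
    let mut label = String::new();
    let mut chars = inner.chars();
    while let Some(c) = chars.next() {
        match c {
            // A backslash makes the next character literal.
            '\\' => label.push(chars.next()?),
            // An unescaped quote inside the literal is an error.
            '\'' => return None,
            other => label.push(other),
        }
    }
    Some(label)
}
```

A round-trip proptest would then assert `parse_name(&display_name(label)) == Some(label)` for arbitrary generated labels.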

Better proptests for classification

We have proptests for classifiers in tests/classifier_correctness_test.rs, but they are limited – they are explicitly coded in a way that disallows quoted sequences and escaped quotes.

It's important to test all possible paths, so these tests should be extended to also include quoted and escaped sequences. The best way to do this is unclear, so design work is also needed.

Installing via `cargo install` fails without custom flags with target features

Describe the bug

Installing the crate via cargo install by itself fails with

Target architecture is not supported by SIMD features of this crate. Disable the default `simd` feature.

Apparently the build flags from .cargo are not included in the crate on crates.io.

MRE

Just run cargo install rsonpath.

Expected behavior
The library should detect whether AVX2 is supported and enable/disable the simd feature based on that.

Workarounds (optional)
Running either the command with a custom flag on an AVX2-supporting platform:

RUSTFLAGS="-C target-feature=+avx2" cargo install rsonpath

or disabling the simd feature:

cargo install rsonpath --no-default-features

Proposed solution (optional)

A custom build script would be ideal to detect AVX2 support and emit appropriate compiler flags.
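A sketch of what such a build script could check. Cargo exposes the enabled target features to build scripts through the `CARGO_CFG_TARGET_FEATURE` environment variable as a comma-separated list. The cfg name `simd_avx2` below is hypothetical, and this reflects the features enabled for the compilation target, not the capabilities of the host CPU.

```rust
use std::env;

/// Returns true if `name` appears in a comma-separated target feature list,
/// the format Cargo uses for `CARGO_CFG_TARGET_FEATURE`.
pub fn has_target_feature(feature_list: &str, name: &str) -> bool {
    feature_list.split(',').any(|feature| feature.trim() == name)
}

/// Body that a `build.rs` `main` function could call.
pub fn emit_simd_cfg() {
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
    if has_target_feature(&features, "avx2") {
        // `simd_avx2` is a hypothetical cfg name, not one rsonpath defines.
        println!("cargo:rustc-cfg=simd_avx2");
    }
}
```

Code in the crate could then gate the SIMD path on `#[cfg(simd_avx2)]` and fall back to the no-SIMD implementation otherwise, instead of failing the build.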

Desktop (please complete the following information):

Rust version: any
Target triple: any
Features: [simd]
Version: v0.1.0

Setup release with GitHub Actions

Is your feature request related to a problem? Please describe.
All releases are currently manual.

Describe the solution you'd like
We should have a pipeline that can be invoked to automatically build and publish the crate, as well as provide native binaries for download.

Non-nested subdocuments result mode

Is your feature request related to a problem? Please describe.
In addition to #56, it might be beneficial to have a result mode that automatically filters nested subdocuments. In other words, if we have a match for a big subtree, don't show any results that are nested in that subtree anyway.

Describe the solution you'd like
Consider the query $.a..b[*], and the JSON:

{
    "a": {
        "c": [
            { "b": [1, 2, 3] },
            { "d": { "b": { "x": true } } },
            { "b": { "b": { "b": "value" } } }
        ]
    }
}

Then in the output we should find all the following paths, separated with newlines:

1
2
3
{ "x": true }
{ "b": { "b": "value" } }

Describe alternatives you've considered
Full subdocuments tracked by #56

Additional context
This should be compatible with #54. Example output for the above JSON:

(1, $['a']['c'][0]['b'][0])
(2, $['a']['c'][0]['b'][1])
(3, $['a']['c'][0]['b'][2])
({ "x": true }, $['a']['c'][1]['d']['b'])
({ "b": { "b": "value" } }, $['a']['c'][2]['b'])
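The filtering step itself can be sketched independently of the engine, assuming each match is reported as a half-open byte range. `Span` and `outermost` are hypothetical names. The sketch relies on the fact that two JSON subtrees are always either nested or disjoint.

```rust
/// A match described by the half-open byte range `[start, end)` it occupies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Keeps only the matches that are not contained in another match.
pub fn outermost(mut spans: Vec<Span>) -> Vec<Span> {
    // Sort by start; on ties put the longer span first so that it is kept.
    spans.sort_by(|a, b| a.start.cmp(&b.start).then(b.end.cmp(&a.end)));
    let mut kept: Vec<Span> = Vec::new();
    for span in spans {
        // The last kept span starts no later than `span`, so `span` is
        // nested exactly when it also ends no later.
        let nested = kept.last().map_or(false, |last| span.end <= last.end);
        if !nested {
            kept.push(span);
        }
    }
    kept
}
```

A streaming engine would more likely suppress nested matches on the fly by remembering the end of the current outermost match, but the invariant is the same.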

Bug (panic) parsing incorrect Json

Describe the bug
panic message when reading incorrect value.

The application panicked (crashed).
Message:  index out of bounds: the len is 128 but the index is 18446744073709551615
Location: /home/cha/git/rust/rsonpath/rsonpath/src/stackless.rs:297

MRE
Example of incorrectly formatted document:

{'a': {'b': 'c', 'e': 'f'}, 'g': 'h'}

(Python dict not Jsonified)

Expected behavior
If the formatting error is detectable, throw an incorrect input error.

Introduce the wildcard child selector `.*`/`[*]` into the main execution engine

Is your feature request related to a problem? Please describe.
The main query engine should work correctly with the selector introduced in #6 and #7.

Describe the solution you'd like
#8 should provide a working approach to this, but it will most likely require classifying commas.

New engine correctness tests have to be added for the selector, as well as benchmarks.

Add support for SSSE3 (128-bit wide SIMD for x86) for 32-bit architectures

Is your feature request related to a problem? Please describe.
The classifier currently supports AVX2 only. It should be expanded to support SSSE3 instructions as a fallback for older x86 architectures.

Describe the solution you'd like
Since the vector width of SSSE3 differs from that of AVX2, we need a separate implementation that works on shorter vectors. This probably means that the current implementation, which considers two 32-byte AVX2 vectors at a time, will have to be adapted to instead consider two 16-byte SSSE3 vectors and use 32-bit wide masks.

This is a complex issue and requires careful measurement of performance impact. We will need to update all our benchmarks to accurately compare AVX2 and SSSE3 implementations.

Differentiate brackets and braces in both classifiers

Is your feature request related to a problem? Please describe.
Currently the classifier returns Opening and Closing tokens regardless of whether it is a brace or a bracket. For some selectors it is required to know whether we are in a list or an object, so these should be differentiated.

Describe the solution you'd like
Split the two variants into ObjectOpening/ObjectClosing and ListOpening/ListClosing. This doesn't require changes to any SIMD machinery. Current engines need to be updated to react to both kinds of events.
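The proposed split might look like the following sketch. The four variant names come from the description above; carrying the byte offset as a `usize` payload and the `classify_byte` helper are assumptions for illustration.

```rust
/// Structural events with objects and lists told apart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Structural {
    ObjectOpening(usize),
    ObjectClosing(usize),
    ListOpening(usize),
    ListClosing(usize),
}

/// Maps a byte that is known to lie outside of any string to an event.
pub fn classify_byte(idx: usize, byte: u8) -> Option<Structural> {
    match byte {
        b'{' => Some(Structural::ObjectOpening(idx)),
        b'}' => Some(Structural::ObjectClosing(idx)),
        b'[' => Some(Structural::ListOpening(idx)),
        b']' => Some(Structural::ListClosing(idx)),
        _ => None,
    }
}
```

Engines that do not care about the distinction can collapse both opening variants in a single match arm, so the change stays mechanical for them.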

Tests fail on Windows due to line ending differences

Describe the bug
On Windows, tests that check the byte locations returned as results fail, because line endings in the input files are normalized to \r\n, adding an extra byte per line ending.

MRE
Run cargo test -p rsonpath --test engine_correctness_test and see all indices tests fail.

Expected behavior
The line endings should not be changed between operating systems – test data should be the same.

Workarounds (optional)
Use Linux. So no, there are no reasonable workarounds.

Proposed solution (optional)
Root cause needs to be identified – it's probably git overeagerly normalizing line endings by default. This should be disabled for the test data.

Desktop (please complete the following information):

Rust version: any
Target triple: stable-x86_64-pc-windows-msvc
Features: any
Version: v0.1.1

Additional context
This also forces our pipeline to not test Windows, as the tests automatically fail.

Compile the wildcard child selector `.*`/`[*]`

Is your feature request related to a problem? Please describe.
The automaton module should recognize the wildcard child selector introduced by #6 and correctly compile it.

Describe the solution you'd like
For the NFA, a state representing the child wildcard selector is a Direct state that can match any label. We need a new union type to represent either a Label to match, or Any marker that can match anything.
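A sketch of the union type described above, with a hypothetical name `TransitionLabel` and a toy `accepts` function that walks a chain of direct (child) steps. The real automaton types will differ.

```rust
/// Either a concrete label to match, or a marker that matches anything.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TransitionLabel {
    Label(String),
    Any,
}

impl TransitionLabel {
    /// Returns true if a transition with this label can be taken on `member`.
    pub fn matches(&self, member: &str) -> bool {
        match self {
            Self::Label(expected) => expected.as_str() == member,
            Self::Any => true,
        }
    }
}

/// Toy check for a chain of direct steps, e.g. `$.a.*`:
/// every step must match the member name at the same depth.
pub fn accepts(steps: &[TransitionLabel], members: &[&str]) -> bool {
    steps.len() == members.len()
        && steps.iter().zip(members).all(|(step, member)| step.matches(member))
}
```

For the query `$.a.*`, the steps would be `[Label("a"), Any]`, which accepts the path `a`, `x` but rejects `b`, `x`.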

Introduce the wildcard descendant selector `..*`/`..[*]` into the recursive execution engine

Is your feature request related to a problem? Please describe.
The baseline recursive query engine should correctly work with the selector introduced in #69 and #70.

Describe the solution you'd like
There is a chance that nothing has to be done. #8 should already make the engine recognise "any label" transitions properly, so this might "just work" by default.

However, new engine correctness tests have to be added for those selectors.

Introduce the index selector (non-negative) into the main engine

Is your feature request related to a problem? Please describe.
Extend work from #62, the main execution engines need to handle the new automaton transitions.

Introduce the index selector (non-negative)

Tracking issue for the list Index Selector [n], where n is a non-negative constant. Support for negative indices (selecting from the array end) is tracked at TODO.

Parser #60
Compiler #61
Engine (recursive) #62
Engine (main) #63

Provide better error messages when benchmarks are called in an unexpected way

Is your feature request related to a problem? Please describe.
As seen in #30, the error messages shown when benchmarks fail because they were invoked in an unexpected way are unhelpful. They don't point to the actual reason for the failure or explain how to resolve it.

Describe the solution you'd like
Provide an error message that informs the user of the possibilities (files are missing, bench invoked wrongly) and how to properly call a bench (cargo bench --bench <name> from project root). Additionally, there should be a clear entry on the GitHub wiki on how to properly bench and it should be linked from the error message.

Classify commas

Is your feature request related to a problem? Please describe.
For correct evaluation of queries related to lists (wildcard selectors, index selectors, etc.) we need classifiers to emit Comma structural tokens.

Describe the solution you'd like
This is a complex issue. Adding commas straight up will most likely limit the throughput of the engine massively (this is a hypothesis, need measurements to confirm). While no additional work will be required in the SIMD classifier (commas can be simply added to the classification lookup table as in Fast execution of JSONPath queries, 4.1.1), there will be many more characters returned and processed.

We should explore whether a "switching" approach improves throughput. The idea would be to enable comma classification if and only if we are currently within a list.
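The switching idea can be sketched with a scalar scanner that reports commas only while the innermost open container is a list. `commas_in_lists` is a hypothetical function, and the explicit container stack is a simplification for readability. It says nothing about how a SIMD classifier would toggle the comma lookup.

```rust
/// Returns the offsets of commas that separate list elements,
/// skipping commas inside strings and commas between object members.
pub fn commas_in_lists(input: &[u8]) -> Vec<usize> {
    let mut containers: Vec<u8> = Vec::new(); // stack of opening bytes
    let mut in_string = false;
    let mut escaped = false;
    let mut commas = Vec::new();
    for (idx, &byte) in input.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false;
            } else if byte == b'\\' {
                escaped = true;
            } else if byte == b'"' {
                in_string = false;
            }
            continue;
        }
        match byte {
            b'"' => in_string = true,
            b'{' | b'[' => containers.push(byte),
            b'}' | b']' => {
                containers.pop();
            }
            // Comma classification is "enabled" only directly inside a list.
            b',' if containers.last() == Some(&b'[') => commas.push(idx),
            _ => {}
        }
    }
    commas
}
```

Comparing the number of events emitted with and without the list check on representative inputs would give a first estimate of how much processing the switch saves.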

Allow buffered input streams

Is your feature request related to a problem? Please describe.
Current implementation reads the entire input to a string. This is not production-viable – very large files that we are targeting with all the performance improvements might not fit in memory. A first step would be to enable buffered reading – load a single page worth of input at a time. There are challenges here – it is possible for a single logical query step to span arbitrarily many blocks, e.g. JSON labels can be arbitrarily long.

Describe the solution you'd like
First of all, current implementations heavily rely on raw AlignedSlice data. This should be abstracted behind a buffered input that can yield slices on-demand.

Second, the query engines need to be made aware of this. They currently rely on having all the data available to index into the slice and compare labels. The engines also need to communicate to the classifiers at which point it is safe to stop keeping old input blocks in memory – we always need the entire label before the currently looked-at colon to be buffered, but after we examine it, it can be discarded.

Compile-only CLI flag

Is your feature request related to a problem? Please describe.
When implementing new selectors or just diagnosing bugs with a particular query, we often have to look at the compilation itself and the resulting automaton. It would be nice to have a way to run the binary only to the point of compilation, and not involve any of the engines.

Describe the solution you'd like
A new variant to the --engine switch sounds ideal. Example run:

just r '$..a.b' -e compiler -v

DEBUG [rsonpath_lib::query::parser] Parsed tokens: $(Descendant("a"))(Child("b"))
INFO  [rsonpath] Preparing query: `$..['a']['b']`

DEBUG [rsonpath_lib::query::automaton] NFA: r1 --a-> d1 --b-> acc
DEBUG [rsonpath_lib::query::automaton::minimizer] New superstate created: {NfaStateId(0)} DFA(1)
DEBUG [rsonpath_lib::query::automaton::minimizer] Expanding superstate: {NfaStateId(0)}, last checkpoint is Some(NfaStateId(0))
DEBUG [rsonpath_lib::query::automaton::minimizer] Considering transition NFA(0) --"a"-> NFA(1)
DEBUG [rsonpath_lib::query::automaton::minimizer] Raw transitions: {"a": {NfaStateId(1)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] New superstate created: {NfaStateId(0), NfaStateId(1)} DFA(2)
DEBUG [rsonpath_lib::query::automaton::minimizer] Normalized transitions: {"a": {NfaStateId(0), NfaStateId(1)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] Translated transitions: [("a", State(2))]
DEBUG [rsonpath_lib::query::automaton::minimizer] Expanding superstate: {NfaStateId(0), NfaStateId(1)}, last checkpoint is Some(NfaStateId(0))
DEBUG [rsonpath_lib::query::automaton::minimizer] Considering transition NFA(0) --"a"-> NFA(1)
DEBUG [rsonpath_lib::query::automaton::minimizer] Considering transition NFA(1) --"b"-> NFA(2)
DEBUG [rsonpath_lib::query::automaton::minimizer] Raw transitions: {"a": {NfaStateId(1)}, "b": {NfaStateId(2)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] New superstate created: {NfaStateId(0), NfaStateId(2)} DFA(3)
DEBUG [rsonpath_lib::query::automaton::minimizer] Normalized transitions: {"a": {NfaStateId(0), NfaStateId(1)}, "b": {NfaStateId(0), NfaStateId(2)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] Translated transitions: [("a", State(2)), ("b", State(3))]
DEBUG [rsonpath_lib::query::automaton::minimizer] Expanding superstate: {NfaStateId(0), NfaStateId(2)}, last checkpoint is Some(NfaStateId(0))
DEBUG [rsonpath_lib::query::automaton::minimizer] Considering transition NFA(0) --"a"-> NFA(1)
DEBUG [rsonpath_lib::query::automaton::minimizer] Raw transitions: {"a": {NfaStateId(1)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] Normalized transitions: {"a": {NfaStateId(0), NfaStateId(1)}}
DEBUG [rsonpath_lib::query::automaton::minimizer] Translated transitions: [("a", State(2))]
DEBUG [rsonpath_lib::stackless] DFA:
 digraph {
  0 -> 0 [label="*"]
  1 -> 2 [label="a"]
  1 -> 1 [label="*"]
  2 -> 2 [label="a"]
  2 -> 3 [label="b"]
  2 -> 1 [label="*"]
  3 -> 2 [label="a"]
  3 -> 1 [label="*"]
}
INFO  [rsonpath] Compilation finished.
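
For readers following the log, the printed DFA can be replayed by hand. The sketch below hard-codes the transition table from the `digraph` output and walks a path of labels through it. It assumes state 1 is the initial state and state 3 is the accepting one (the log does not say so explicitly; this is inferred from the query shape `$..a.b`). The names `State`, `step` and `accepts` are illustrative and not part of rsonpath.

```rust
/// One DFA state: explicit labelled edges plus a fallback target
/// taken for any other label (the `*` edges in the digraph).
struct State {
    on_label: &'static [(&'static str, usize)],
    fallback: usize,
}

/// Transition table copied from the digraph in the log.
static DFA: [State; 4] = [
    State { on_label: &[], fallback: 0 },
    State { on_label: &[("a", 2)], fallback: 1 },
    State { on_label: &[("a", 2), ("b", 3)], fallback: 1 },
    State { on_label: &[("a", 2)], fallback: 1 },
];

// Assumption: 1 is initial and 3 is accepting; the log does not state this.
const INITIAL: usize = 1;
const ACCEPTING: usize = 3;

/// Follows the edge for `label`, or the fallback edge if none matches.
fn step(state: usize, label: &str) -> usize {
    DFA[state]
        .on_label
        .iter()
        .find(|(l, _)| *l == label)
        .map_or(DFA[state].fallback, |&(_, target)| target)
}

/// Returns true if the sequence of labels ends in the accepting state.
fn accepts(path: &[&str]) -> bool {
    path.iter().fold(INITIAL, |state, label| step(state, label)) == ACCEPTING
}
```

Under these assumptions, the path `c`, `a`, `b` is accepted, while `a`, `c`, `b` is not, because the `*` edge from state 2 falls back to state 1.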

Describe alternatives you've considered
A dedicated flag could be used, --compile (short -c). It would be mutually exclusive with -e. Since the run would be completely different we could have the binary just output the compiled automaton to stdout in .dot format, which would make inspecting it easier.

Parse the negative index selector

Is your feature request related to a problem? Please describe.
Extend the support for the index selector introduced in #60 to negative values. A negative value selects from the end of the list.

Describe the solution you'd like
The error message introduced in #60 should be replaced with actual parsing code.
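
As a rough illustration of the intended semantics, the sketch below resolves a possibly negative index against a list length. It assumes the common convention that `-1` refers to the last element; the function name `resolve_index` is hypothetical and does not come from the rsonpath codebase.

```rust
/// Resolves a possibly negative index against a list of `len` elements.
/// Assumption: -1 is the last element, -2 the one before it, and so on.
/// Returns `None` when the index falls outside the list.
fn resolve_index(index: i64, len: usize) -> Option<usize> {
    if index >= 0 {
        let i = usize::try_from(index).ok()?;
        (i < len).then_some(i)
    } else {
        // Distance from the end; `checked_sub` rejects out-of-range values.
        let from_end = usize::try_from(index.unsigned_abs()).ok()?;
        len.checked_sub(from_end)
    }
}
```

Note that a streaming engine does not know the list length up front, so this only shows what the selector means, not how an engine would evaluate it.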

Additional context
Syntax formulation in the RFC.

Parse the wildcard descendant selector `..*`/`..[*]`

Is your feature request related to a problem? Please describe.
The wildcard descendant selector should be recognized by the parser and parsed into an appropriate JsonPathQueryNode. Note that this means both the ..* and ..[*] patterns. The index and non-index version have identical semantics.

Describe the solution you'd like
This should follow the same approach as #6, which should be completed first.

Additional context
Find the syntax for the selectors in the RFC:

descendant-wildcard/descendant-index-wildcard: RFC Draft 3.5.7

Compile the index selector (non-negative)

Is your feature request related to a problem? Please describe.
After support is added to the parser in #60, the compiler needs to recognise the new node and compile it into the automaton.

Describe the solution you'd like
The automaton can't do much with the selector, it should just compile it as usual to either a recursive or direct state. The type of transition will be different – we need to introduce an "index" variant of transition, alongside existing Label.
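
One possible shape for such a transition type is sketched below. The names `TransitionLabel`, `Position` and `matches` are assumptions made for illustration; only the idea of an "index" variant alongside the existing label comes from the description above.

```rust
/// A transition in the query automaton (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
enum TransitionLabel {
    /// Matches an object member with the given name.
    Label(String),
    /// Matches the array element at the given non-negative position.
    Index(u64),
}

/// What the engine is currently looking at in the document.
enum Position<'a> {
    Member(&'a str),
    Element(u64),
}

/// Returns true if the transition fires for the current position.
fn matches(transition: &TransitionLabel, position: &Position<'_>) -> bool {
    match (transition, position) {
        (TransitionLabel::Label(label), Position::Member(name)) => label.as_str() == *name,
        (TransitionLabel::Index(index), Position::Element(n)) => index == n,
        // A label never matches a list element and vice versa.
        _ => false,
    }
}
```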

Parse the index selector (non-negative)

Is your feature request related to a problem? Please describe.
As the first step in introducing the index selector (#64, TODO), we need to add support in the parser.

Describe the solution you'd like
This is straightforward, a new JsonPathQueryNode that will hold the index. Additionally, we need good error messages:

If given a negative number, tell the user this is not supported yet, linking to TODO.
If given something other than a valid number, tell the user that they might be looking for the normal member selector but missed the quotation marks.
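
The two error cases above could be sketched as follows. This is a standalone illustration, not the actual parser: `IndexParseError`, `parse_index` and the message wording are all hypothetical, and the input is assumed to be the text found between the brackets.

```rust
use std::fmt;

/// Errors for the index selector (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum IndexParseError {
    NegativeNotSupported,
    NotANumber(String),
}

impl fmt::Display for IndexParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NegativeNotSupported => write!(f, "negative indices are not supported yet"),
            Self::NotANumber(text) => write!(
                f,
                "`{text}` is not a valid index; did you mean the member selector ['{text}']?"
            ),
        }
    }
}

/// Parses the text between `[` and `]` as a non-negative index.
fn parse_index(inner: &str) -> Result<u64, IndexParseError> {
    let text = inner.trim();
    // A minus sign followed only by digits is a well-formed negative index.
    if let Some(rest) = text.strip_prefix('-') {
        if !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit()) {
            return Err(IndexParseError::NegativeNotSupported);
        }
    }
    text.parse::<u64>()
        .map_err(|_| IndexParseError::NotANumber(text.to_string()))
}
```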

Additional context
Syntax formulation in the RFC.

Remove the `len_trait` dependency

Is your feature request related to a problem? Please describe.
This crate is not really useful, and I doubt it is actually used for anything that can't be easily achieved without it. The crate was last updated over 5 years ago and depends on an old version of cfg-if.

Describe the solution you'd like
Remove it from the dependencies entirely.

Result mode – line

Is your feature request related to a problem? Please describe.
The current implementation focuses on quickly returning locations of matches. The output is not human-readable though – we either get an aggregate of the number of hits or byte values. A first attempt at getting something that could be used as a simple CLI query tool (without focusing on massive datasets, or considering sparse results) would be to output the line number and the line at which a match is made.

Describe the solution you'd like
The implementation should be completely outside of the engine itself – we don't want it to influence the throughput of the main solution in any way. This is already achieved with the QueryResult trait.

As an example output, if I try to run the query $.a..c on the document:

{
    "a": {
        "b": {
            "c": [
                42,
                17,
                {
                    "c": "hello"
                }
            ]
        }
    }
}

we expect to get an output resembling

4 |             "c": [
8 |                     "c": "hello"
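
A minimal sketch of how such output could be produced outside the engine is shown below. It assumes the engine reports byte offsets of matches; `line_of` and `render_match` are hypothetical helpers, and the `N | line` layout simply mirrors the example above.

```rust
/// Returns the 1-based line number and the full line containing `offset`.
/// A newline byte is attributed to the line it terminates.
fn line_of(document: &str, offset: usize) -> Option<(usize, &str)> {
    if offset >= document.len() {
        return None;
    }
    let mut start = 0;
    for (number, line) in document.split('\n').enumerate() {
        let end = start + line.len();
        if offset <= end {
            return Some((number + 1, line));
        }
        // Skip past the newline separating this line from the next.
        start = end + 1;
    }
    None
}

/// Formats a single match as `<line number> | <line>`.
fn render_match(document: &str, offset: usize) -> Option<String> {
    line_of(document, offset).map(|(number, line)| format!("{number} | {line}"))
}
```

Scanning from the start for every match is quadratic in the worst case; a real implementation would more likely track line starts incrementally.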

Additional context
This obviously only makes sense for non-compressed JSON documents, but that's not something we can easily influence.

Show path result mode

Is your feature request related to a problem? Please describe.
When we get a match of a query we know the exact path in the JSON at which it occurs – quite naturally, JSONPath is all about that.

There should exist a result mode that includes this information in the output.

Describe the solution you'd like
Consider the query $.a..b[*], and the JSON:

{
    "a": {
        "c": [
            { "b": [1, 2, 3] },
            { "d": { "b": { "x": true } } }
        ]
    }
}

Then in the output we should find all the following paths, separated with newlines:

$['a']['c'][0]['b'][0]
$['a']['c'][0]['b'][1]
$['a']['c'][0]['b'][2]
$['a']['c'][1]['d']['b']['x']
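
The path notation used above can be produced from a list of segments, as in the sketch below. `Segment` and `format_path` are hypothetical names, and the sketch does not escape quotes or backslashes inside member names.

```rust
/// One step of a path from the document root (hypothetical type).
enum Segment {
    Member(String),
    Index(usize),
}

/// Renders segments in the bracket notation shown above.
/// Simplification: member names are emitted verbatim, without escaping.
fn format_path(segments: &[Segment]) -> String {
    let mut out = String::from("$");
    for segment in segments {
        match segment {
            Segment::Member(name) => {
                out.push_str("['");
                out.push_str(name);
                out.push_str("']");
            }
            Segment::Index(index) => {
                out.push('[');
                out.push_str(&index.to_string());
                out.push(']');
            }
        }
    }
    out
}
```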

Additional context
Note that this request can be combined with other result modes, except for count. For example, it's easy to output both byte indices and full paths. The CLI should reflect that and allow combining paths with other result modes, where it makes sense.

"Too complex" compiler error + refactoring

Is your feature request related to a problem? Please describe.
After #7 is implemented, it becomes much easier for a user to make the automaton grow way beyond the supported size of 128. We need an error type for that.

Describe the solution you'd like

1. Add a new variant to CompilerError.
2. Make both engines implement a trait that contains a &JsonPathQuery -> Result<Engine, CompilerError> constructor.
3. Rename Runner to Engine for consistency.
4. Move both engines under the engine submodule.
5. Rename stack_based to recursive and stackless to main.
6. Refactor the binary:
   - Streamline error reporting by extracting it to a library function.
   - Streamline engine running with the new common trait from 2.
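
A rough sketch of the trait and the new error variant might look like this. Everything here is a stand-in: `JsonPathQuery` is reduced to a single field, the variant name `QueryTooComplex` is invented, and only the limit of 128 and the constructor signature come from the description above.

```rust
/// Stand-in for the real query type; only the state count matters here.
struct JsonPathQuery {
    states_needed: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum CompilerError {
    /// Hypothetical variant: the automaton would exceed the supported size.
    QueryTooComplex { states: usize, limit: usize },
}

/// Supported automaton size mentioned in the issue.
const MAX_STATES: usize = 128;

/// Common interface for both engines.
trait Engine: Sized {
    fn compile(query: &JsonPathQuery) -> Result<Self, CompilerError>;
    fn state_count(&self) -> usize;
}

/// Shared size check used by both constructors.
fn check_size(query: &JsonPathQuery) -> Result<usize, CompilerError> {
    if query.states_needed > MAX_STATES {
        Err(CompilerError::QueryTooComplex {
            states: query.states_needed,
            limit: MAX_STATES,
        })
    } else {
        Ok(query.states_needed)
    }
}

struct MainEngine {
    states: usize,
}

struct RecursiveEngine {
    states: usize,
}

impl Engine for MainEngine {
    fn compile(query: &JsonPathQuery) -> Result<Self, CompilerError> {
        check_size(query).map(|states| Self { states })
    }
    fn state_count(&self) -> usize {
        self.states
    }
}

impl Engine for RecursiveEngine {
    fn compile(query: &JsonPathQuery) -> Result<Self, CompilerError> {
        check_size(query).map(|states| Self { states })
    }
    fn state_count(&self) -> usize {
        self.states
    }
}
```

With a shared trait, the binary can be generic over `E: Engine` and report `CompilerError` in one place.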

Facing an error while running Rson vs Jsonski benchmark

Describe the bug

I am trying to run the rson vs. JSONSki benchmark and I am hitting an error that I am unable to debug, since I am not experienced with Rust. I have attached the image.


Desktop (please complete the following information):

MacOS - i9
rustc 1.64.0 (a55dd71d5 2022-09-19)


Add support for NEON (128-bit wide SIMD for ARM) for 32-bit architectures

Is your feature request related to a problem? Please describe.
Currently we have SIMD acceleration for x86 only. ARM has its own standardised SIMD intrinsics set, called NEON. Supporting it (starting with 32-bit ARM) would be very beneficial.

Describe the solution you'd like
This should be coordinated with #14, since both should work on 32-bit architectures. Similar interfaces would be used, only the classifier implementations would be different (since they need different instruction sets).

Additional context
Find NEON intrinsics documentation here.

Install is broken

Describe the bug
Installing rsonpath on Ubuntu on a platform that supports AVX2 fails with an error saying AVX2 is not supported.

MRE
Run cargo install rsonpath.

Expected behavior

rsonpath installs with AVX2 support enabled.
The error should include the link to the bug report issue, not just the repo: https://github.com/V0ldek/rsonpath/issues/new?template=bug_report.md
This should be caught in CI/CD, not a week later randomly by me installing the binary.

Workarounds (optional)
You can pass `RUSTFLAGS="-C target-cpu=native"` and then it installs correctly.

Desktop (please complete the following information):

Rust version: 1.66.1
Target triple: x86_64-unknown-linux-gnu
Features: default
Version: v0.2.0

Introduce the index selector (non-negative) into the recursive engine

Is your feature request related to a problem? Please describe.
After compiling in #61, the execution engines need to handle the new automaton transitions. Start with the recursive engine.

Describe the solution you'd like
When the selector [n] is encountered, the recursive engine should count down elements and mark the n-th as the match.

Librification

Is your feature request related to a problem? Please describe.
Before adding more features we should commit to librification of rsonpath.

Describe the solution you'd like
We need a clear separation between the binary CLI tool and the underlying library. To complete that, we need to:

separate out a library crate, clean the dependencies of the library (#13, #37);
fix error handling (#38, #39);
and improve documentation (#40)

Wildcard child selector `.*`/`[*]`

Tracking issue for the wildcard child selector .*/[*].

Parser #6
Compiler #7
Engine (recursive) #8
Engine (main) #73

Separate quote/escape sequences from main classifier

Is your feature request related to a problem? Please describe.
The current classifier has two separate jobs. It first detects quoted sequences, taking escapes into account, and then it classifies structural characters on top of that. Separating those concerns would unlock some performance improvement opportunities.

For example, the part that classifies characters could be swapped during execution, either to conditionally take commas into account or to do quick skipping passes like JSONSki does.

Describe the solution you'd like
The concerns of "is this within quotes" and "is this an interesting character" should be separated into two modules.
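
The separation can be illustrated with a scalar sketch (no SIMD, so this is not how the real classifier works). The first stage computes a per-byte "inside quotes" mask; the second stage takes that mask and a swappable predicate deciding which bytes are interesting. `quote_mask` and `structural_positions` are hypothetical names.

```rust
/// Stage 1: marks every byte that belongs to a quoted string,
/// including the delimiting quotes, honouring backslash escapes.
fn quote_mask(input: &[u8]) -> Vec<bool> {
    let mut mask = Vec::with_capacity(input.len());
    let mut in_quotes = false;
    let mut escaped = false;
    for &byte in input {
        if escaped {
            // The byte after a backslash is always string content.
            escaped = false;
            mask.push(true);
        } else if in_quotes && byte == b'\\' {
            escaped = true;
            mask.push(true);
        } else if byte == b'"' {
            in_quotes = !in_quotes;
            mask.push(true);
        } else {
            mask.push(in_quotes);
        }
    }
    mask
}

/// Stage 2: positions of unquoted bytes accepted by `is_structural`.
/// The predicate can be swapped, e.g. to include or ignore commas.
fn structural_positions(
    input: &[u8],
    mask: &[bool],
    is_structural: impl Fn(u8) -> bool,
) -> Vec<usize> {
    input
        .iter()
        .zip(mask)
        .enumerate()
        .filter(|(_, (&byte, &quoted))| !quoted && is_structural(byte))
        .map(|(position, _)| position)
        .collect()
}
```

Because the mask is computed once, the second stage can be rerun or replaced with a different predicate without repeating the quote and escape handling.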

Wildcard descendant selector `..*`/`..[*]`

Tracking issue for the wildcard descendant selector ..*/..[*]

Parser #69 (nice)
Compiler #70
Engine (recursive) #71
Engine (main) #72

Full subdocument result mode

Is your feature request related to a problem? Please describe.
We need the actual full result mode – output the entire subdocument that matches the query.

Describe the solution you'd like
Consider the query $.a..b[*], and the JSON:

{
    "a": {
        "c": [
            { "b": [1, 2, 3] },
            { "d": { "b": { "x": true } } },
            { "b": { "b": { "b": "value" } } }
        ]
    }
}

Then in the output we should find all the following values, separated with newlines:

1
2
3
true
"value"
{ "b": "value" }

Describe alternatives you've considered
Non-nested subdocument result mode tracked by #57

Additional context
This should be compatible with #54. Example output for the above JSON:

(1, $['a']['c'][0]['b'][0])
(2, $['a']['c'][0]['b'][1])
(3, $['a']['c'][0]['b'][2])
(true, $['a']['c'][1]['d']['b']['x'])
("value", $['a']['c'][2]['b']['b']['b'])
({ "b": "value" }, $['a']['c'][2]['b']['b'])

Add depth-based tail-skipping

Is your feature request related to a problem? Please describe.
Tail-skipping is the ability for the engine to recognise that a given subtree can be entirely skipped, as it cannot possibly match the query. This happens when the query automaton reaches a rejecting state.

To facilitate such skipping we need a classifier that will quickly determine the depth at which we are in the subtree, so that the normal classifier can be resumed after the depth reaches 0, meaning the end of the subtree.

Describe the solution you'd like
A new classifier that can cooperate with the structural classifier and be used in its place when the automaton rejects. This requires #26 to facilitate switching between them. The depth skipping algorithm is already used in simd-benchmarks and described in the paper in 3.3 (the lazy implementation).
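
The core idea of depth-based skipping can be shown with a scalar sketch. This is not the SIMD algorithm from simd-benchmarks or the paper; it only demonstrates tracking depth until it returns to zero. The function name `skip_subtree` is hypothetical.

```rust
/// Given `start` pointing just past an opening `{` or `[`, returns the
/// index of the byte that closes that subtree, or `None` if the input
/// ends before the subtree is closed.
fn skip_subtree(input: &[u8], start: usize) -> Option<usize> {
    let mut depth: usize = 1;
    let mut in_quotes = false;
    let mut escaped = false;
    for (position, &byte) in input.iter().enumerate().skip(start) {
        if in_quotes {
            // Brackets inside strings must not affect the depth.
            if escaped {
                escaped = false;
            } else if byte == b'\\' {
                escaped = true;
            } else if byte == b'"' {
                in_quotes = false;
            }
            continue;
        }
        match byte {
            b'"' => in_quotes = true,
            b'{' | b'[' => depth += 1,
            b'}' | b']' => {
                depth -= 1;
                if depth == 0 {
                    return Some(position);
                }
            }
            _ => {}
        }
    }
    None
}
```

Once the closing position is known, the structural classifier can resume right after it.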

Proptests for automaton compilation

Is your feature request related to a problem? Please describe.
In the unending quest to proptest all the things, we should have tests for the automata we compile. This is a big ask, since we don't really have automata infrastructure.

Describe the solution you'd like
The only reasonable proptest design for this seems to be fuzzing paths and checking whether they are accepted. We would take a query, compile the automaton, and then generate arbitrary paths that should be accepted by the query. The same for paths that should not be accepted.

This can be done for both the NFA and the DFA.

Describe alternatives you've considered
For our NFA -> minimal DFA code, we can also test parity between the input and output automaton. We can simulate both of them side by side, making sure they're the same. This might be a potential additional issue.

Introduce the wildcard child selector `.*`/`[*]` into the recursive execution engine

Is your feature request related to a problem? Please describe.
The baseline recursive query engine should work correctly with the selector introduced in #6 and #7.

Describe the solution you'd like
#7 will introduce transitions that can match any label, #10 will give us differentiated brackets as part of the event stream, and #11 will provide commas. There are two changes that need to be made after that to the engine to support the new selectors.

When traversing an object, the "any label" transition should be triggered on every colon.
When traversing a list, the "any label" transition should be triggered on every element of the list, which will be separated by commas.

New engine correctness tests have to be added for those selectors.

Include datasets and queries from JSONSki into benchmarks

Is your feature request related to a problem? Please describe.
The datasets described in the JSONSki paper are available on Google Drive, but are not incorporated into our benchmarks.

Describe the solution you'd like
We need the datasets included and experiments configured for those datasets. If wildcard selectors get implemented (#9), we can take the queries as-is. However, it is still beneficial to consider rewrites of these queries using the descendant selector not available to JSONSki for potential performance gains (and easier query formulations). This needs to be investigated.

Make query execution panic-free

Is your feature request related to a problem? Please describe.
There are a couple of places in the classifiers and engines using unwrap. Remove them and create strongly-typed errors for those cases.

Describe the solution you'd like
Remove all panics and enable clippy::unwrap_used and clippy::expect_used lints.
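
As an illustration of the pattern, the sketch below replaces would-be `unwrap` calls with a strongly-typed error. The error type, its variants and both functions are hypothetical; the lint names are the real Clippy lints mentioned above.

```rust
use std::fmt;

/// Hypothetical error type for failures during query execution.
#[derive(Debug, PartialEq, Eq)]
enum EngineError {
    DepthBelowZero,
    MissingClosingQuote { opened_at: usize },
}

impl fmt::Display for EngineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::DepthBelowZero => write!(f, "closing bracket without a matching opening one"),
            Self::MissingClosingQuote { opened_at } => {
                write!(f, "string opened at byte {opened_at} is never closed")
            }
        }
    }
}

impl std::error::Error for EngineError {}

/// Decrements the depth, reporting an error instead of panicking on underflow.
#[deny(clippy::unwrap_used, clippy::expect_used)]
fn pop_depth(depth: usize) -> Result<usize, EngineError> {
    depth.checked_sub(1).ok_or(EngineError::DepthBelowZero)
}

/// Finds the quote closing the string opened at `opened_at`.
/// Simplification: escape sequences are ignored in this sketch.
#[deny(clippy::unwrap_used, clippy::expect_used)]
fn closing_quote(input: &[u8], opened_at: usize) -> Result<usize, EngineError> {
    input
        .get(opened_at + 1..)
        .and_then(|rest| rest.iter().position(|&byte| byte == b'"'))
        .map(|offset| opened_at + 1 + offset)
        .ok_or(EngineError::MissingClosingQuote { opened_at })
}
```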


Add error and panic documentation

Is your feature request related to a problem? Please describe.
We should document all places where an error can occur in library code.

Describe the solution you'd like
Enable clippy::missing_errors_doc and clippy::missing_panics_doc doc lints and fix all occurrences.
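
For reference, these lints expect `# Errors` and `# Panics` sections in the doc comments of public functions. The two functions below are made-up examples showing the expected layout, not code from the library.

```rust
use std::num::ParseIntError;

/// Parses a decimal count from `text`.
///
/// # Errors
///
/// Returns a [`ParseIntError`] if `text` is not a valid non-negative integer.
pub fn parse_count(text: &str) -> Result<u64, ParseIntError> {
    text.parse::<u64>()
}

/// Returns the first element of `items`.
///
/// # Panics
///
/// Panics if `items` is empty.
pub fn first(items: &[u64]) -> u64 {
    assert!(!items.is_empty(), "items must not be empty");
    items[0]
}
```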

Make query parsing and compiling panic-free

Is your feature request related to a problem? Please describe.
The parser should use Error impls to handle errors, not unwrap (or expect).

Describe the solution you'd like
Remove all panics from the query module and enable clippy::unwrap_used and clippy::expect_used as warnings for the module.

Parse the wildcard child selector `.*`/`[*]`

Is your feature request related to a problem? Please describe.
The wildcard child selector should be recognized by the parser and parsed into an appropriate JsonPathQueryNode. Note that this means both the .* and [*] patterns, selecting any child. The index and non-index version have identical semantics.

Describe the solution you'd like
There are two possible solutions.

Turn Label into a variant type and have a variant for the wildcard.
Add a separate node type for AnyChild.

It's not obvious to me which one is better. Approach 2 seems to be a little easier to do, since it won't have to touch the widely used Label type.

The consumers of queries should react to the new variant with an error "not supported yet".
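
The two options can be compared side by side in a small sketch. The type names below (`LabelOrWildcard`, `QueryNode`, `check_supported`) are simplified stand-ins, not the real JsonPathQueryNode definition.

```rust
/// Approach 1: the label itself becomes a variant type.
#[allow(dead_code)]
enum LabelOrWildcard {
    Name(String),
    Wildcard,
}

/// Approach 2: the label type stays untouched...
#[allow(dead_code)]
struct Label(String);

/// ...and the wildcard gets its own node variant.
#[allow(dead_code)]
enum QueryNode {
    Root,
    Child(Label),
    AnyChild,
    Descendant(Label),
}

/// Consumers that cannot handle the new node yet reject it explicitly.
fn check_supported(nodes: &[QueryNode]) -> Result<(), String> {
    if nodes.iter().any(|node| matches!(node, QueryNode::AnyChild)) {
        Err("the wildcard child selector is not supported yet".to_string())
    } else {
        Ok(())
    }
}
```

In approach 2 every existing use of `Label` keeps compiling, and only the `match` expressions over nodes need a new arm.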

Additional context
Find the syntax for the selectors in the RFC:

dot-wildcard: RFC Draft 3.5.3
index-wildcard: RFC Draft 3.5.5

Introduce the wildcard descendant selector `..*`/`..[*]` into the main execution engine

Is your feature request related to a problem? Please describe.
The main query engine should correctly work with the selector introduced in #69 and #70.

Describe the solution you'd like
There is a chance that nothing has to be done. #8 should already make the engine recognise "any label" transitions properly, so this might "just work" by default. This should be cleared up in #71.

However, new engine correctness tests have to be added for those selectors, as well as benchmarks.
