Comments (5)
Thanks for an excellent bug-report! You were right, there was a silly n^2 in the middle. But it was an unusually subtle one :). It's fixed in master now, so if you build from source it'll run much faster.
It turns out the slow bit was the [foo]
array collection syntax, which is used internally by group_by
to collect the key for each element. You get the same n^2 behaviour with just [.[]]
. Here's a bit of a rant about why:
The [foo]
syntax works by creating an empty array, and appending each result of foo
to the array. jq's internal JSON library is functional and referentially transparent - all functions work "by value", and there are no side-effects (this is necessary to implement jq's backtracking behaviour in a relatively sane manner). So, the jv_array_append
function, in the general case, copies the entire array to append the next element.
There's an important optimisation here, though. jq uses refcounting internally (which is valid since it's intentionally impossible to create a cyclic data structure in jq). If jv_array_append
detects that the refcount of the array is 1, and so there are no remaining references to the old array, then it will update the array in-place. This is what was meant to happen for [foo]
.
Unfortunately, the previous code for [foo]
kept an extra reference to the old array hanging around for too long, and forced jv_array_append
into the slow full-copy path every time.
from jq.
That should be reasonably fast now, and I think group_by
is the "correct" way of solving this.
But if you're building from master, you can also use the somewhat inscrutable (and undocumented) fold
syntax, which might (or might not) be slightly faster:
jq -R '.' | jq -s 'fold {} as $wc (.[] as $word | $wc | .[$word] += 1)'
from jq.
Thank you for the fix and the explanation! I can now use group_by
with 1M strings in 21 seconds, and 10M strings in 239 seconds.
I look forward to reading the documentation for fold
. I tried your solution with 1M strings, but now it's the one still going after an hour.... ;)
I'd really like to use jq for some big-data type things, but it's still about 10X slower than awk for my task. Have you considered 1.) implementing group_by
with hashing, yielding unsorted output (since you can always sort yourself if needed) and 2.) an LLVM backend? I see you have experience with that sort of thing.... ;)
from jq.
It occurs to me now that fold
currently has exactly the same N^2 problem that [foo]
used to have :)
A hashing version of group_by
is a good idea.
There are a lot of speedups that could be made in the jq internals. I think the first two I'd do are a simple peephole optimiser for the jq bytecode, and an inliner/constant-propagator (lots of jq guts are implemented as builtins in the jq language itself, this makes everything a bit slower but it means that optimisations speed everything up at once). Going for a machine-code/LLVM backend would be quite fiddly - backtracking is an integral part of jq, which can be implemented quite efficiently when I have control over the stack, and which doesn't fit very neatly into LLVM's C-like stack model.
So far, I have explicitly avoided doing any serious performance work on jq. I could sink a lot of time into it, and with very little performance effort it already seems fast enough for most purposes. I'm definitely going to make it go really fast at some stage, but I think there are other things to do with it first.
If you want to make it go faster right now there's an easy cheat: hack the Makefile and compile jq with -O2 -DNDEBUG
. It's currently compiled at -O1
with assertions turned on (mainly because I can handle an "assertion failed on line foo" bugreport better than a "random segfault" bugreport) but it spends quite a bit of time doing redundant checks. There are also a lot of out-of-line small C functions that are called regularly, you'd get quite a perf win by either a) figuring out GCC or Clang's link-time-optimisation, or b) cat-ing all of the C files into one giant file before compiling. Maybe try adding -flto -fwhole-program
and see what happens.
from jq.
You're right that there are higher priorities than performance, and I imagine most people aren't doing really performance-intensive stuff anyway. I trust you'd know if LLVM wasn't the right tool for the job.
I tried gcc -O3 -DNDEBUG
and it cut the time to about 2/3 what it was. clang -O4 -DNDEBUG
(which implies -flto
) got it down to half what it was. Thanks for the tip! That brings it to around 3X the time of a sorting-based Python solution, and if you say there is plenty of room for optimization in the future, I can live with that. :) Besides, I have an optimization idea to cut memory usage, with the likely side effect of bringing performance up.
Thanks again for a great tool and rapid improvements.
from jq.
Related Issues (20)
- need command line options to print path and key values of JSON file HOT 7
- Please add nullish coalescing operator "??" to JQ? HOT 3
- Cannot select sub-field that has a dash in the name of the field HOT 1
- compile error HOT 10
- Multiplying -1 with 0 must result 0, not -0 HOT 2
- ~/.jq is not sourced on windows
- Unexpected output HOT 1
- Null bytes are handled inconsistently HOT 5
- tonumber doesn't work on `true` or `false` HOT 4
- "color for object keys" from JQ_COLORS doesn't seem to be respected HOT 4
- `range/3` behaviour when $init and $upto arguments are not numbers HOT 4
- Regular expression alternation (|) used with quantifier (* or +) returns inconsistent results when first alternative is able to match an empty string HOT 3
- Create Multi Smaller Files From a Big Json File HOT 5
- Add the Symbolic Binding Operator ("as") to the operator priority table HOT 1
- Including and importing module with importing JSON crashes jq
- The code example for trim ltrim and rtrim from the manual do not work HOT 2
- Incorrect `Invalid path expression` in destructuring expressions
- 1.7.1 dumps core on Cygwin, trying to find the input filename to print an error message HOT 5
- jq --raw modes, --binary and newlines
- Request for snap maintenance HOT 11
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from jq.