Comments (5)

stedolan commented on July 24, 2024

Thanks for an excellent bug report! You were right, there was a silly n^2 in the middle, but it was an unusually subtle one :). It's fixed in master now, so if you build from source it'll run much faster.

It turns out the slow bit was the [foo] array collection syntax, which is used internally by group_by to collect the key for each element. You get the same n^2 behaviour with just [.[]]. Here's a bit of a rant about why:

The [foo] syntax works by creating an empty array, and appending each result of foo to the array. jq's internal JSON library is functional and referentially transparent - all functions work "by value", and there are no side-effects (this is necessary to implement jq's backtracking behaviour in a relatively sane manner). So, the jv_array_append function, in the general case, copies the entire array to append the next element.

There's an important optimisation here, though. jq uses refcounting internally (which is valid since it's intentionally impossible to create a cyclic data structure in jq). If jv_array_append detects that the refcount of the array is 1, and so there are no remaining references to the old array, then it will update the array in-place. This is what was meant to happen for [foo].

Unfortunately, the previous code for [foo] kept an extra reference to the old array hanging around for too long, and forced jv_array_append into the slow full-copy path every time.
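
The mechanism above can be sketched as a toy model in Python (the class, field names, and manual refcount bookkeeping here are simplifications for illustration; jq's real jv_array_append is C and its API differs):

```python
class JvArray:
    """Toy model of jq's refcounted, by-value array (not jq's actual API)."""
    def __init__(self, items=None):
        self.items = list(items or [])
        self.refcount = 1  # one owner to start

def jv_array_append(arr, value):
    # Fast path: we are the sole owner, so mutate in place -- O(1) amortized.
    if arr.refcount == 1:
        arr.items.append(value)
        return arr
    # Slow path: someone else still references arr, so copy everything -- O(n).
    arr.refcount -= 1
    new = JvArray(arr.items)
    new.items.append(value)
    return new

# The [foo] bug in miniature: a lingering extra reference keeps the
# refcount above 1, forcing the O(n) copy path on every append,
# which makes building an n-element array O(n^2) overall.
acc = JvArray()
extra_ref = acc          # the lingering reference [foo] used to hold
extra_ref.refcount += 1  # model taking a second reference
acc = jv_array_append(acc, 42)  # takes the copy path, leaving extra_ref's array untouched
```

With the fix, nothing else holds a reference during collection, so every append hits the in-place fast path.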

from jq.

stedolan commented on July 24, 2024

That should be reasonably fast now, and I think group_by is the "correct" way of solving this.

But if you're building from master, you can also use the somewhat inscrutable (and undocumented) fold syntax, which might (or might not) be slightly faster:

jq -R '.' | jq -s 'fold {} as $wc (.[] as $word | $wc | .[$word] += 1)'
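
For comparison, that fold is just accumulating word counts: start from an empty object and do .[$word] += 1 for each word. The same reduction sketched in Python (my paraphrase of the jq expression, not its output format):

```python
def word_counts(words):
    # $wc starts as {}; each $word bumps its own counter, like .[$word] += 1
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0) + 1
    return wc

print(word_counts(["to", "be", "or", "not", "to", "be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```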

jkleint commented on July 24, 2024

Thank you for the fix and the explanation! I can now use group_by with 1M strings in 21 seconds, and 10M strings in 239 seconds.

I look forward to reading the documentation for fold. I tried your solution with 1M strings, but now it's the one still going after an hour.... ;)

I'd really like to use jq for some big-data type things, but it's still about 10X slower than awk for my task. Have you considered 1.) implementing group_by with hashing, yielding unsorted output (since you can always sort yourself if needed) and 2.) an LLVM backend? I see you have experience with that sort of thing.... ;)

stedolan commented on July 24, 2024

It occurs to me now that fold currently has exactly the same N^2 problem that [foo] used to have :)

A hashing version of group_by is a good idea.

There are a lot of speedups that could be made in the jq internals. I think the first two I'd do are a simple peephole optimiser for the jq bytecode, and an inliner/constant-propagator (lots of jq's guts are implemented as builtins in the jq language itself; this makes everything a bit slower, but it means that optimisations speed everything up at once). Going for a machine-code/LLVM backend would be quite fiddly - backtracking is an integral part of jq, which can be implemented quite efficiently when I have control over the stack, and which doesn't fit very neatly into LLVM's C-like stack model.
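
To make the first idea concrete, a peephole pass is tiny in principle: slide over the instruction stream and collapse locally redundant patterns. A hypothetical sketch over a made-up stack bytecode (the opcodes are invented for illustration and are not jq's real instruction set):

```python
def peephole(code):
    """Collapse trivially redundant instruction pairs (hypothetical opcodes)."""
    out = []
    for op in code:
        # A DUP immediately followed by a POP is a no-op pair: drop both.
        if out and out[-1] == "DUP" and op == "POP":
            out.pop()
            continue
        out.append(op)
    return out

print(peephole(["LOAD", "DUP", "POP", "STORE"]))
# ['LOAD', 'STORE']
```

A real pass would match many such patterns (dead loads, constant folds), but the shape is the same: purely local rewrites that never change observable behaviour.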

So far, I have explicitly avoided doing any serious performance work on jq. I could sink a lot of time into it, and with very little performance effort it already seems fast enough for most purposes. I'm definitely going to make it go really fast at some stage, but I think there are other things to do with it first.

If you want to make it go faster right now there's an easy cheat: hack the Makefile and compile jq with -O2 -DNDEBUG. It's currently compiled at -O1 with assertions turned on (mainly because I can handle an "assertion failed on line foo" bug report better than a "random segfault" bug report), but it spends quite a bit of time doing redundant checks. There are also a lot of out-of-line small C functions that are called regularly; you'd get quite a perf win by either a) figuring out GCC or Clang's link-time optimisation, or b) cat-ing all of the C files into one giant file before compiling. Maybe try adding -flto -fwhole-program and see what happens.

jkleint commented on July 24, 2024

You're right that there are higher priorities than performance, and I imagine most people aren't doing really performance-intensive stuff anyway. I trust you'd know if LLVM wasn't the right tool for the job.

I tried gcc -O3 -DNDEBUG and it cut the time to about 2/3 what it was. clang -O4 -DNDEBUG (which implies -flto) got it down to half what it was. Thanks for the tip! That brings it to around 3X the time of a sorting-based Python solution, and if you say there is plenty of room for optimization in the future, I can live with that. :) Besides, I have an optimization idea to cut memory usage, with the likely side effect of bringing performance up.

Thanks again for a great tool and rapid improvements.
