IndexSearcher.collectionStatistics(field) can do a fa

Optimize IndexSearcher.collectionStatistics [LUCENE-8040] about stargazers-migration-test HOT 16 CLOSED

mikemccand commented on June 3, 2024

Optimize IndexSearcher.collectionStatistics [LUCENE-8040]

from stargazers-migration-test.

Comments (16)

mikemccand commented on June 3, 2024

I considered a few alternatives to this today using JMH:

Cache MultiFields on the IndexSearcher
Compute the CollectionStatics raw, immediately (don't lookup or cache)
Add a ConcurrentHashMap<String,CollectionStatistics> on the IndexSearcher and compute on demand.

Attached is the JMH benchmark program. Between runs I would change line 78 to call out to the impl I wanted to try. JMH Main method is "org.openjdk.jmh.Main" and I used args "-wi 5 -i 5 -f 1"

My annotated results are:

Result "dsmiley.MyBenchmark.bench":    IndexSearcher
  1146.739 ±(99.9%) 280.645 us/op [Average]
  (min, avg, max) = (1034.410, 1146.739, 1238.123), stdev = 72.883
  CI (99.9%): [866.094, 1427.385] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":    cached MultiFields
  29.556 ±(99.9%) 8.929 us/op [Average]
  (min, avg, max) = (27.409, 29.556, 33.424), stdev = 2.319
  CI (99.9%): [20.626, 38.485] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":    raw compute 
  951.494 ±(99.9%) 182.555 us/op [Average]
  (min, avg, max) = (904.328, 951.494, 1024.473), stdev = 47.409
  CI (99.9%): [768.940, 1134.049] (assumes normal distribution)

Result "dsmiley.MyBenchmark.bench":   ConcurrentHashMap
  4.448 ±(99.9%) 1.268 us/op [Average]
  (min, avg, max) = (4.090, 4.448, 4.860), stdev = 0.329
  CI (99.9%): [3.180, 5.717] (assumes normal distribution)


For 5 fields:
raw:               10.716
ConcurrentHashMap:  0.155 us/op

I think the results are pretty clear that we should go with the ConcurrentHashMap.

I'm aware my benchmark implementation of this needs some more work. If an IOException is thrown it should pass through without RuntimeException wrapper. And if the field doesn't exist, we want to return null.

[Legacy Jira: David Smiley (@dsmiley) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

I don't think we should add caching to the indexsearcher, as it is supposed to remain lightweight. Instead of using MultiFields, we should just sum up the stats per segment as a start. This is easier now that the statistics can no longer be -1.

[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Also I think as far as lowering the overhead to getting to a field, the better fix is probably in BlockTreeTermsReader. Today getting to a specific field is log N (TreeMap). Maybe it should be HashMap instead.

Either linkedhashmap or separate sorted array can be used for the "iterator" functionality, but I think it currently optimizes for the wrong case (iterating fields in order, versus getting to a particular field).

[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

RE switching away from TreeMap to HashMap: absolutely! I've been working on a client project where this was already profiled and optimized to a HashMap in a fork. Definitely a separate issue, though I can see how the results of that will impact the results here. I'll file an issue.

[Legacy Jira: David Smiley (@dsmiley) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

I don't really see it as a separate issue. collectionStatistics() is just looking up the stats from blocktree's maps and summing them up. so if its too slow, then blocktree is the place to fix it.

[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

I updated the benchmark to use a custom FilterDirectoryReader that ultimately has a custom FilterLeafReader that caches the Terms impls into a HashMap. Then I reran the benchmark with 150 fields, 30 segments:

IndexSearcher MultiFields (current)
  346.155 ±(99.9%) 57.775 us/op [Average]
  (min, avg, max) = (334.952, 346.155, 371.996), stdev = 15.004
  CI (99.9%): [288.380, 403.930] (assumes normal distribution)

Raw compute on demand each time
  196.271 ±(99.9%) 14.716 us/op [Average]
  (min, avg, max) = (192.012, 196.271, 201.187), stdev = 3.822
  CI (99.9%): [181.555, 210.987] (assumes normal distribution)

ConcurrentHashMap lazy cache of raw compute
  4.553 ±(99.9%) 0.245 us/op [Average]
  (min, avg, max) = (4.465, 4.553, 4.636), stdev = 0.064
  CI (99.9%): [4.308, 4.799] (assumes normal distribution)

Clearly the ConcurrentHashMap is saving us a lot.

You say we shouldn't add caching to IndexSearcher. IndexSearcher contains the QueryCache. Looking at LRUQueryCache, I think I can safely say that a ConcurrentHashMap is comparatively more lightweight. Do you disagree?

[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Its not saving a "lot". We are talking about microseconds here either way.

IndexSearcher does not contain the querycache. The caching is at the segment level. You just configure it by passing it in there.

Big difference: I'm strongly against caching on index searcher. especially for something that takes microseconds.

[Legacy Jira: Robert Muir (@rmuir) on Nov 10 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Ok, I'll commit the raw impl then. It's at least nice to no longer reference MultiFields and hence MultiTerms here.

[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Can you please upload a proper patch so it can be reviewed? Note the code needs to be different for master and 7.x

[Legacy Jira: Robert Muir (@rmuir) on Nov 10 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

LUCENE-8040.patch is the master branch version.

I think the 7x version is simply removing the null return condition when docCount is 0.

[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Updated patch to fix a mistake I discovered while investigating some test failures. I had written =\+ instead of \+= – nasty! I've tweaked my IntelliJ inspection settings to alert me to this.

I'll commit later this week if I don't hear further feedback.

[Legacy Jira: David Smiley (@dsmiley) on Nov 14 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

no for 7.x you need to handle -1 case for stats, just like MultiTerms currently does.

[Legacy Jira: Robert Muir (@rmuir) on Nov 14 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

no for 7.x you need to handle -1 case for stats, just like MultiTerms currently does.

Oh yeah, thanks for the tip. So adding support for -1 stats would be pretty annoying here... like this but for all 3:
Instead of

    docCount += terms.getDocCount()

We have:

   int tmpDC = terms.getDocCount();
   docCount = tmpDC == -1 ? -1 : docCount + tmpDC;

But even then it's not completely equivalent if the stats are -1 in some segments but not all. Do you think that matters @rmuir? I'm tempted to just not backport to 7x.

[Legacy Jira: David Smiley (@dsmiley) on Nov 15 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

+1 to take the conservative approach and just commit to master: the stats can't be -1 there.

[Legacy Jira: Robert Muir (@rmuir) on Nov 15 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Commit af2b903f65e4451838fb3e93511329acec30a2a1 in lucene-solr's branch refs/heads/master from @dsmiley
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=af2b903

LUCENE-8040: optimize IndexSearcher.collectionStatistics

[Legacy Jira: ASF subversion and git services on Nov 15 2017]

from stargazers-migration-test.

mikemccand commented on June 3, 2024

Good. Thanks for the review/feedback!

[Legacy Jira: David Smiley (@dsmiley) on Nov 15 2017]

from stargazers-migration-test.

Optimize IndexSearcher.collectionStatistics [LUCENE-8040] about stargazers-migration-test HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent