Comments (16)
I considered a few alternatives to this today using JMH:
- Cache MultiFields on the IndexSearcher
- Compute the CollectionStatics raw, immediately (don't lookup or cache)
- Add a ConcurrentHashMap<String,CollectionStatistics> on the IndexSearcher and compute on demand.
Attached is the JMH benchmark program. Between runs I would change line 78 to call out to the impl I wanted to try. JMH Main method is "org.openjdk.jmh.Main" and I used args "-wi 5 -i 5 -f 1"
My annotated results are:
Result "dsmiley.MyBenchmark.bench": IndexSearcher
1146.739 ±(99.9%) 280.645 us/op [Average]
(min, avg, max) = (1034.410, 1146.739, 1238.123), stdev = 72.883
CI (99.9%): [866.094, 1427.385] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": cached MultiFields
29.556 ±(99.9%) 8.929 us/op [Average]
(min, avg, max) = (27.409, 29.556, 33.424), stdev = 2.319
CI (99.9%): [20.626, 38.485] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": raw compute
951.494 ±(99.9%) 182.555 us/op [Average]
(min, avg, max) = (904.328, 951.494, 1024.473), stdev = 47.409
CI (99.9%): [768.940, 1134.049] (assumes normal distribution)
Result "dsmiley.MyBenchmark.bench": ConcurrentHashMap
4.448 ±(99.9%) 1.268 us/op [Average]
(min, avg, max) = (4.090, 4.448, 4.860), stdev = 0.329
CI (99.9%): [3.180, 5.717] (assumes normal distribution)
For 5 fields:
raw: 10.716
ConcurrentHashMap: 0.155 us/op
I think the results are pretty clear that we should go with the ConcurrentHashMap.
I'm aware my benchmark implementation of this needs some more work. If an IOException is thrown it should pass through without RuntimeException wrapper. And if the field doesn't exist, we want to return null.
[Legacy Jira: David Smiley (@dsmiley) on Nov 06 2017]
from stargazers-migration-test.
I don't think we should add caching to the indexsearcher, as it is supposed to remain lightweight. Instead of using MultiFields, we should just sum up the stats per segment as a start. This is easier now that the statistics can no longer be -1.
[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]
from stargazers-migration-test.
Also I think as far as lowering the overhead to getting to a field, the better fix is probably in BlockTreeTermsReader. Today getting to a specific field is log N (TreeMap). Maybe it should be HashMap instead.
Either linkedhashmap or separate sorted array can be used for the "iterator" functionality, but I think it currently optimizes for the wrong case (iterating fields in order, versus getting to a particular field).
[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]
from stargazers-migration-test.
RE switching away from TreeMap to HashMap: absolutely! I've been working on a client project where this was already profiled and optimized to a HashMap in a fork. Definitely a separate issue, though I can see how the results of that will impact the results here. I'll file an issue.
[Legacy Jira: David Smiley (@dsmiley) on Nov 06 2017]
from stargazers-migration-test.
I don't really see it as a separate issue. collectionStatistics() is just looking up the stats from blocktree's maps and summing them up. so if its too slow, then blocktree is the place to fix it.
[Legacy Jira: Robert Muir (@rmuir) on Nov 06 2017]
from stargazers-migration-test.
I updated the benchmark to use a custom FilterDirectoryReader that ultimately has a custom FilterLeafReader that caches the Terms impls into a HashMap. Then I reran the benchmark with 150 fields, 30 segments:
IndexSearcher MultiFields (current)
346.155 ±(99.9%) 57.775 us/op [Average]
(min, avg, max) = (334.952, 346.155, 371.996), stdev = 15.004
CI (99.9%): [288.380, 403.930] (assumes normal distribution)
Raw compute on demand each time
196.271 ±(99.9%) 14.716 us/op [Average]
(min, avg, max) = (192.012, 196.271, 201.187), stdev = 3.822
CI (99.9%): [181.555, 210.987] (assumes normal distribution)
ConcurrentHashMap lazy cache of raw compute
4.553 ±(99.9%) 0.245 us/op [Average]
(min, avg, max) = (4.465, 4.553, 4.636), stdev = 0.064
CI (99.9%): [4.308, 4.799] (assumes normal distribution)
Clearly the ConcurrentHashMap is saving us a lot.
You say we shouldn't add caching to IndexSearcher. IndexSearcher contains the QueryCache. Looking at LRUQueryCache, I think I can safely say that a ConcurrentHashMap is comparatively more lightweight. Do you disagree?
[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]
from stargazers-migration-test.
Its not saving a "lot". We are talking about microseconds here either way.
IndexSearcher does not contain the querycache. The caching is at the segment level. You just configure it by passing it in there.
Big difference: I'm strongly against caching on index searcher. especially for something that takes microseconds.
[Legacy Jira: Robert Muir (@rmuir) on Nov 10 2017]
from stargazers-migration-test.
Ok, I'll commit the raw impl then. It's at least nice to no longer reference MultiFields and hence MultiTerms here.
[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]
from stargazers-migration-test.
Can you please upload a proper patch so it can be reviewed? Note the code needs to be different for master and 7.x
[Legacy Jira: Robert Muir (@rmuir) on Nov 10 2017]
from stargazers-migration-test.
LUCENE-8040.patch is the master branch version.
I think the 7x version is simply removing the null return condition when docCount is 0.
[Legacy Jira: David Smiley (@dsmiley) on Nov 10 2017]
from stargazers-migration-test.
Updated patch to fix a mistake I discovered while investigating some test failures. I had written =\+
instead of \+=
– nasty! I've tweaked my IntelliJ inspection settings to alert me to this.
I'll commit later this week if I don't hear further feedback.
[Legacy Jira: David Smiley (@dsmiley) on Nov 14 2017]
from stargazers-migration-test.
no for 7.x you need to handle -1 case for stats, just like MultiTerms currently does.
[Legacy Jira: Robert Muir (@rmuir) on Nov 14 2017]
from stargazers-migration-test.
no for 7.x you need to handle -1 case for stats, just like MultiTerms currently does.
Oh yeah, thanks for the tip. So adding support for -1 stats would be pretty annoying here... like this but for all 3:
Instead of
docCount += terms.getDocCount()
We have:
int tmpDC = terms.getDocCount();
docCount = tmpDC == -1 ? -1 : docCount + tmpDC;
But even then it's not completely equivalent if the stats are -1 in some segments but not all. Do you think that matters @rmuir? I'm tempted to just not backport to 7x.
[Legacy Jira: David Smiley (@dsmiley) on Nov 15 2017]
from stargazers-migration-test.
+1 to take the conservative approach and just commit to master: the stats can't be -1 there.
[Legacy Jira: Robert Muir (@rmuir) on Nov 15 2017]
from stargazers-migration-test.
Commit af2b903f65e4451838fb3e93511329acec30a2a1 in lucene-solr's branch refs/heads/master from @dsmiley
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=af2b903
LUCENE-8040: optimize IndexSearcher.collectionStatistics
[Legacy Jira: ASF subversion and git services on Nov 15 2017]
from stargazers-migration-test.
Good. Thanks for the review/feedback!
[Legacy Jira: David Smiley (@dsmiley) on Nov 15 2017]
from stargazers-migration-test.
Related Issues (20)
- Update javadocs to reflect experimental status of Kuromoji DictionaryBuilder [LUCENE-8981] HOT 3
- Make NativeUnixDirectory pure java now that direct IO is possible [LUCENE-8982] HOT 31
- PhraseWildcardQuery - new query to control and optimize wildcard expansions in phrase [LUCENE-8983] HOT 13
- MoreLikeThis MLT is biased for uncommon fields [LUCENE-8984] HOT 11
- SynonymGraphFilter cannot handle input stream with tokens filtered. [LUCENE-8985] HOT 12
- Add asf.yaml to our git repo [LUCENE-8986] HOT 7
- Move Lucene web site from svn to git [LUCENE-8987] HOT 58
- Maximal -- Minimum Based Early Termination For TopFieldCollector [LUCENE-8988]
- IndexSearcher Should Handle Rejection of Concurrent Task [LUCENE-8989] HOT 10
- IndexOrDocValuesQuery can take a bad decision for range queries if field has many values per document [LUCENE-8990] HOT 8
- disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399 [LUCENE-8991] HOT 13
- Share minimum score across segments in concurrent search [LUCENE-8992] HOT 7
- Change Maven POM repository URLs to https [LUCENE-8993] HOT 15
- Code Cleanup - Pass values to list constructor instead of empty constructor followed by addAll(). [LUCENE-8994] HOT 5
- TopSuggestDocsCollector#collect should be able to signal rejection [LUCENE-8995] HOT 1
- maxScore is sometimes missing from distributed grouped responses [LUCENE-8996] HOT 45
- Add type of triangle info to ShapeField encoding [LUCENE-8997] HOT 4
- OverviewImplTest.testIsOptimized reproducible failure [LUCENE-8998] HOT 5
- expectThrows doesn't play nicely with "assume" failures [LUCENE-8999] HOT 12
- Cannot resolve classes from org.apache.lucene.core plugin and others [LUCENE-9000] HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from stargazers-migration-test.