Comments (11)
I don't think we should disable discountOverlaps:
The reason is that many commonly used token filters add synonyms or similar stacked tokens, and those would bias document lengths. I've done measurements here, and that's why I originally proposed enabling it by default (the option already existed, but was disabled by default).
Average document length will never be exact either (due to deleted documents and many other reasons), and the norm is inexact too, since it's encoded in a single byte. Ultimately this average is just a pivot; it doesn't need to be pedantically correct, and we shouldn't make relevance worse for no good reason.
If you have a different/special use case, you can disable discountOverlaps yourself; that's why the option is there.
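For illustration, here is a minimal, self-contained sketch (not Lucene's actual code; the class and method names are made up) of what the flag changes: tokens emitted with a positionIncrement of 0 (stacked synonyms, WordDelimiterFilter sub-tokens, common-grams) are excluded from the field length used for normalization when discountOverlaps is on. In Lucene itself the counts come from FieldInvertState, and the flag can be toggled on the similarity, e.g. via BM25Similarity's setDiscountOverlaps.

```java
// Sketch of the length computation discountOverlaps affects.
// posIncs holds the positionIncrement of each token in the field;
// an increment of 0 means the token is stacked on the previous position.
final class FieldLengthSketch {
    static int fieldLength(int[] posIncs, boolean discountOverlaps) {
        int numTokens = posIncs.length;
        int numOverlaps = 0;
        for (int inc : posIncs) {
            if (inc == 0) numOverlaps++; // stacked token (synonym, WDF split, ...)
        }
        return discountOverlaps ? numTokens - numOverlaps : numTokens;
    }

    public static void main(String[] args) {
        // "wi-fi" split by WDF into {wi, fi} plus a stacked "wifi" token
        int[] posIncs = {1, 1, 0};
        System.out.println(fieldLength(posIncs, true));  // overlaps discounted
        System.out.println(fieldLength(posIncs, false)); // overlaps counted
    }
}
```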
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
+1 for keeping the existing behavior of true. It definitely struck me as weird too, but for many indexes flipping the default would result in markedly worse behavior. Rather than disabling discountOverlaps, maybe the more ideal behavior would be making the average document length equal to the total number of positions across the collection divided by the number of documents. That way we'd be comparing position length to average position length. However, I haven't looked into the feasibility or expense of doing that. If we were able to do it, discountOverlaps could move to something like countPositions vs countFrequencies.
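The inconsistency behind this proposal can be sketched with plain numbers (a hypothetical model, not Lucene code): today the pivot comes from sumTotalTermFreq / docCount, which counts every token including overlaps, while each document's own normalized length discounts overlaps. A position-based pivot would measure both sides the same way, since positions ≈ tokens minus overlaps.

```java
// Two candidate pivots for the average field length, given per-document
// token counts and overlap (stacked-token) counts. Hypothetical data model.
final class AvgLenSketch {
    // Current behavior: sumTotalTermFreq / docCount, overlaps included.
    static double currentPivot(int[] tokenCounts) {
        long sum = 0;
        for (int c : tokenCounts) sum += c;
        return (double) sum / tokenCounts.length;
    }

    // Proposed: total positions / docCount, i.e. overlaps excluded,
    // matching how per-document lengths are computed with discountOverlaps.
    static double proposedPivot(int[] tokenCounts, int[] overlapCounts) {
        long sum = 0;
        for (int i = 0; i < tokenCounts.length; i++) {
            sum += tokenCounts[i] - overlapCounts[i];
        }
        return (double) sum / tokenCounts.length;
    }
}
```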
[Legacy Jira: Timothy M. Rodriguez on Oct 19 2017]
My point is that defaults are for typical use cases, and the default of discountOverlaps meets that goal. It results in better (measured) relevance for many commonly used token filters such as common-grams, WDF, synonyms, etc. I ran these tests before proposing the default; it was not done flying blind.
You can still turn it off if you have an atypical use case.
I don't think we need to modify the current computation based on sumTotalTermFreq/docCount without relevance measurements (on multiple datasets) indicating that it improves default/common use cases in statistically significant ways. Index statistics are expensive, and we should keep things simple and minimal.
Counting positions would be an entirely different thing, and it mixes in more differences that would all need to be measured. For example, it means that stopwords which were removed would now count against a document's length, where they don't today.
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
Makes sense, agreed on both points.
[Legacy Jira: Timothy M. Rodriguez on Oct 19 2017]
And just to elaborate a bit more on why position counting can be a can of worms: it means Lucene would behave differently/inconsistently depending on the language in many cases (or even on minor encoding differences). Some languages may inflect a word to make it plural, and a stemmer strips the suffix; others might use a postposition that gets removed by the stop filter, etc.
Today this is all consistent either way, since neither suffixes stripped by stemmers, nor stopwords, nor artificial synonyms count towards the length. So we measure length based on the "important content" according to the user's selected analyzer.
The average document length calculation is just an approximation of a pivot value, and that same pivot is used for all documents. Because of that, I don't think there will be huge wins in being pedantic about how its exact value is computed. It will never be exact anyway, since individual documents' lengths are truncated to a single byte and the average doesn't reflect that truncation. Nevertheless, it's a protected method, so you can override the implementation if you don't trust it and want to do something different.
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
My point is that defaults are for typical use cases, and the default of discountOverlaps meets that goal. It results in better (measured) relevance for many commonly used token filters such as common-grams, WDF, synonyms, etc. I ran these tests before proposing the default; it was not done flying blind.
Understood. I have not experienced any problems with the current default, and I have the option to set discountOverlaps to false. Therefore it's OK for me if the ticket gets closed.
I only think about this out of "scientific" curiosity, in the context of relevance tuning.
What benchmarks have you used for measuring performance?
Is your opinion based on tests with Lucene's ClassicSimilarity (it also uses discountOverlaps = true), or also on tests with BM25?
Do you have any idea/explanation why relevance is better with discountOverlaps = true? My naive guess would be that since stopwords or synonyms are applied either to all documents or to none, it should not make much difference whether we count overlaps or not. Is the explanation that for some documents many stopwords/synonyms/WDF splits occur and for others not (in the same field)? Another possible explanation would be that some fields have synonyms and others do not. That would punish fields with synonyms compared to others, since their length is greater (in ClassicSimilarity with discountOverlaps = false), but in BM25 it should not have this effect, since BM25 uses relative length for scoring and not absolute length like ClassicSimilarity.
Sorry for bothering you with these questions. It's only my curiosity and maybe Jira is not the right place for this.
[Legacy Jira: Christoph Goller on Oct 20 2017]
Not sure how intuitive it is; I guess maybe it kind of is if you think about it on a case-by-case basis. Some examples:
- WDF splitting up "wi-fi": if those stacked tokens count towards the doc's length, then we punish the doc because the author wrote a hyphen (vs. writing "wi fi").
- If you have 1000 synonyms for "hamburger" and those count towards the length, then we punish a doc because the author wrote "hamburger" (versus writing "pizza").
Note that punishing a doc unfairly here punishes it for all queries. If I search on "joker", why should one doc get a very low ranking for that term just because it also happens to mention "hamburger" instead of "pizza"? In this case we have skewed length normalization in such a way that it doesn't properly reflect verbosity.
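The effect above can be made concrete with BM25's standard term-frequency normalization (this is the textbook formula with default k1/b; the document lengths are hypothetical, taken from the 1000-synonym example):

```java
// BM25 term-frequency normalization: freq*(k1+1) / (freq + k1*(1 - b + b*dl/avgdl)),
// where dl is the document's length and avgdl the collection pivot.
final class Bm25Sketch {
    static double tfNorm(double freq, double dl, double avgdl) {
        double k1 = 1.2, b = 0.75; // common defaults
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double avgdl = 100;
        // "joker" occurs once in both docs. One doc also mentions "hamburger",
        // which an analyzer expanded into 1000 stacked synonyms; if those count
        // toward length, its dl balloons from 100 to 1100.
        System.out.println(tfNorm(1, 100, avgdl));  // doc at average length
        System.out.println(tfNorm(1, 1100, avgdl)); // same doc, overlaps counted
    }
}
```

The second score is a small fraction of the first, for a query term that has nothing to do with the synonyms, which is exactly the unfair punishment being described.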
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
What benchmarks have you used for measuring performance?
I use TREC-like IR collections in different languages. The Lucene benchmark module has some support for running the queries and creating output that you can feed to trec_eval. I just use its query-running support (QueryDriver); I don't use its indexing/parsing support, although it has that too. Instead I index the test collections myself, because the collections/queries/judgements are always, annoyingly, in slightly different non-standard formats. I only look at the measures which are generally the most stable, like MAP and bpref.
Is your opinion based on tests with Lucene's ClassicSimilarity (it also uses discountOverlaps = true), or also on tests with BM25?
I can't remember which scoring systems I tested at the time we flipped the default, but I think we should keep the same default for all scoring functions. Once you have everything set up, it is fairly easy to test with a ton of similarities at once (or different parameters) by modifying the code to loop across a big list. That's one reason why it's valuable to keep any index-time logic (such as the formula for encoding the norm) consistent across all of them; otherwise it makes testing unnecessarily difficult, and it's already painful enough. This is important for real users too: they shouldn't have to reindex to do parameter tuning.
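The "loop across a big list" setup might look something like the following skeleton (purely illustrative; the labels stand in for real Similarity instances that the harness would install before rerunning the query set):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate a grid of similarity configurations to evaluate in one run.
// In a real harness each entry would be a Similarity object; here they
// are just labels for the parameter combinations.
final class SweepSketch {
    static List<String> configs() {
        List<String> out = new ArrayList<>();
        for (double k1 : new double[] {0.9, 1.2, 2.0}) {
            for (double b : new double[] {0.4, 0.75}) {
                out.add("BM25(k1=" + k1 + ",b=" + b + ")");
            }
        }
        return out;
    }
}
```

Because the norm encoding is shared across similarities, the same index serves every iteration of the loop, which is the point being made about not reindexing for parameter tuning.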
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
@rmuir thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back.
[~[email protected]] As an additional point, advanced use cases often utilize token "stacking" for additional purposes as well, and these would further distort length. For example, some folks use analysis chains that stack variants of URLs, currencies, etc.
[Legacy Jira: Timothy M. Rodriguez on Oct 20 2017]
Robert Muir thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back.
Thanks Timothy! If you get the chance to do the experiment, simply override the method protected float avgFieldLength(CollectionStatistics collectionStats) to return the alternative value. For experiments it can just be a hardcoded number you computed yourself in a different way.
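The shape of that experiment, sketched without Lucene on the classpath (BaseSimilarity and the hardcoded pivot are stand-ins; in Lucene you would extend BM25Similarity and override avgFieldLength(CollectionStatistics)):

```java
// Overriding the pivot for an experiment: the subclass ignores the usual
// computation and returns a value computed offline in some other way.
final class PivotSketch {
    static class BaseSimilarity {
        // Stand-in for BM25Similarity's avgFieldLength(CollectionStatistics).
        protected float avgFieldLength() { return 100f; }
    }

    static class PivotExperimentSimilarity extends BaseSimilarity {
        @Override
        protected float avgFieldLength() {
            return 87.3f; // e.g. total positions / docCount, computed offline
        }
    }

    static float pivotOf(BaseSimilarity s) { return s.avgFieldLength(); }
}
```

Because avgFieldLength is only a pivot, swapping the value this way lets you measure the alternative on a real query set without touching the index.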
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
As an additional point, advanced use cases often utilize token "stacking" for additional uses as well and these would have further distortions on length.
That's exactly what we are doing. Using discountOverlaps = false could therefore punish languages with more distinct word forms. I also prefer discountOverlaps = true. I have an intern (a student) working on relevance tuning and benchmarks. I think we can try overriding
protected float avgFieldLength(CollectionStatistics collectionStats)
and see if it changes anything. We will also have a look at the Lucene benchmark module.
Thanks for your feedback.
[Legacy Jira: Christoph Goller on Oct 23 2017]