Comments (11)
I don't think we should disable discountOverlaps:
The reason is that many commonly used token filters add synonyms or similar stacked tokens, and those would bias document lengths. I've done measurements here, and that's why I originally proposed enabling it by default (the option already existed, but was disabled by default).
Average document length will never be exact either (due to deleted documents and many other reasons), and the norm is inexact too, since it's encoded in a single byte. Ultimately this average is just a pivot; it doesn't need to be pedantically correct, and we shouldn't make relevance worse for no good reason.
If you have a different/special use case, you can disable discountOverlaps yourself; that's why the option is there.
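For illustration, here is a minimal, self-contained sketch (not Lucene's actual code; the class and method names are made up) of what the flag changes: tokens emitted with a positionIncrement of 0 (stacked synonyms, WordDelimiterFilter sub-tokens, common-grams) are excluded from the field length used for normalization when discountOverlaps is on. In Lucene itself the counts come from FieldInvertState, and the flag can be toggled on the similarity, e.g. via BM25Similarity's setDiscountOverlaps.

```java
// Sketch of the length computation discountOverlaps affects.
// posIncs holds the positionIncrement of each token in the field;
// an increment of 0 means the token is stacked on the previous position.
final class FieldLengthSketch {
    static int fieldLength(int[] posIncs, boolean discountOverlaps) {
        int numTokens = posIncs.length;
        int numOverlaps = 0;
        for (int inc : posIncs) {
            if (inc == 0) numOverlaps++; // stacked token (synonym, WDF split, ...)
        }
        return discountOverlaps ? numTokens - numOverlaps : numTokens;
    }

    public static void main(String[] args) {
        // "wi-fi" split by WDF into {wi, fi} plus a stacked "wifi" token
        int[] posIncs = {1, 1, 0};
        System.out.println(fieldLength(posIncs, true));  // overlaps discounted
        System.out.println(fieldLength(posIncs, false)); // overlaps counted
    }
}
```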
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
+1 for keeping the existing behavior of true. It definitely struck me as weird too, but for many indexes flipping the default would result in markedly worse behavior. Rather than disabling discountOverlaps, maybe the more ideal behavior would be making the average document length equal to the total number of positions across the collection divided by the number of documents. That way we'd be comparing position length to average position length. However, I haven't looked into the feasibility or expense of doing that. If we were able to do it, discountOverlaps could move to something like countPositions vs countFrequencies.
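The inconsistency behind this proposal can be sketched with plain numbers (a hypothetical model, not Lucene code): today the pivot comes from sumTotalTermFreq / docCount, which counts every token including overlaps, while each document's own normalized length discounts overlaps. A position-based pivot would measure both sides the same way, since positions ≈ tokens minus overlaps.

```java
// Two candidate pivots for the average field length, given per-document
// token counts and overlap (stacked-token) counts. Hypothetical data model.
final class AvgLenSketch {
    // Current behavior: sumTotalTermFreq / docCount, overlaps included.
    static double currentPivot(int[] tokenCounts) {
        long sum = 0;
        for (int c : tokenCounts) sum += c;
        return (double) sum / tokenCounts.length;
    }

    // Proposed: total positions / docCount, i.e. overlaps excluded,
    // matching how per-document lengths are computed with discountOverlaps.
    static double proposedPivot(int[] tokenCounts, int[] overlapCounts) {
        long sum = 0;
        for (int i = 0; i < tokenCounts.length; i++) {
            sum += tokenCounts[i] - overlapCounts[i];
        }
        return (double) sum / tokenCounts.length;
    }
}
```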
[Legacy Jira: Timothy M. Rodriguez on Oct 19 2017]
My point is that defaults are for typical use cases, and the default of discountOverlaps meets that goal. It results in better (measured) relevance for many commonly used token filters such as common-grams, WDF, synonyms, etc. I ran these tests before proposing the default; it was not done flying blind.
You can still turn it off if you have an atypical use case.
I don't think we need to modify the current computation based on sumTotalTermFreq/docCount without relevance measurements (on multiple datasets) indicating that it improves default/common use cases in statistically significant ways. Index statistics are expensive, and we should keep things simple and minimal.
Counting positions would be an entirely different thing, and it mixes in more differences that would all need to be measured. For example, it means that stopwords which were removed would now count against a document's length, where they don't today.
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
Makes sense, agreed on both points.
[Legacy Jira: Timothy M. Rodriguez on Oct 19 2017]
And just to elaborate a bit more on why position counting can be a can of worms: it means Lucene would behave differently/inconsistently depending on the language in many cases (or even on minor encoding differences). Some languages may inflect a word to make it plural, and a stemmer strips the suffix; others might use a postposition that gets removed by the stop filter, etc.
Today this is all consistent either way, since neither suffixes stripped by stemmers, nor stopwords, nor artificial synonyms count towards the length. So we measure length based on the "important content" according to the user's selected analyzer.
The average document length calculation is just an approximation of a pivot value, and that same pivot is used for all documents. Because of that, I don't think there will be huge wins in being pedantic about how its exact value is computed. It will never be exact anyway, since individual documents' lengths are truncated to a single byte and the average doesn't reflect that truncation. Nevertheless, it's a protected method, so you can override the implementation if you don't trust it and want to do something different.
[Legacy Jira: Robert Muir (@rmuir) on Oct 19 2017]
My point is that defaults are for typical use cases, and the default of discountOverlaps meets that goal. It results in better (measured) relevance for many commonly used token filters such as common-grams, WDF, synonyms, etc. I ran these tests before proposing the default; it was not done flying blind.
Understood. I have not experienced any problems with the current default, and I have the option to set discountOverlaps to false. Therefore it's OK for me if the ticket gets closed.
I only think about this out of "scientific" curiosity, in the context of relevance tuning.
What benchmarks have you used for measuring performance?
Is your opinion based on tests with Lucene's ClassicSimilarity (it also uses discountOverlaps = true), or also on tests with BM25?
Do you have any idea/explanation why relevance is better with discountOverlaps = true? My naive guess would be that since stopwords or synonyms are applied either to all documents or to none, it should not make much difference whether we count overlaps or not. Is the explanation that for some documents many stopwords/synonyms/WDF splits occur and for others not (in the same field)? Another possible explanation would be that some fields have synonyms and others do not. That would punish fields with synonyms compared to others, since their length is greater (in ClassicSimilarity with discountOverlaps = false), but in BM25 it should not have this effect, since BM25 uses relative length for scoring and not absolute length like ClassicSimilarity.
Sorry for bothering you with these questions. It's only my curiosity and maybe Jira is not the right place for this.
[Legacy Jira: Christoph Goller on Oct 20 2017]
Not sure how intuitive it is; I guess maybe it kind of is if you think about it on a case-by-case basis. Some examples:
- WDF splitting up "wi-fi": if those stacked tokens count towards the doc's length, then we punish the doc because the author wrote a hyphen (vs. writing "wi fi").
- If you have 1000 synonyms for "hamburger" and those count towards the length, then we punish a doc because the author wrote "hamburger" (versus writing "pizza").
Note that punishing a doc unfairly here punishes it for all queries. If I search on "joker", why should one doc get a very low ranking for that term just because it also happens to mention "hamburger" instead of "pizza"? In this case we have skewed length normalization in such a way that it doesn't properly reflect verbosity.
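The effect above can be made concrete with BM25's standard term-frequency normalization (this is the textbook formula with default k1/b; the document lengths are hypothetical, taken from the 1000-synonym example):

```java
// BM25 term-frequency normalization: freq*(k1+1) / (freq + k1*(1 - b + b*dl/avgdl)),
// where dl is the document's length and avgdl the collection pivot.
final class Bm25Sketch {
    static double tfNorm(double freq, double dl, double avgdl) {
        double k1 = 1.2, b = 0.75; // common defaults
        return freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double avgdl = 100;
        // "joker" occurs once in both docs. One doc also mentions "hamburger",
        // which an analyzer expanded into 1000 stacked synonyms; if those count
        // toward length, its dl balloons from 100 to 1100.
        System.out.println(tfNorm(1, 100, avgdl));  // doc at average length
        System.out.println(tfNorm(1, 1100, avgdl)); // same doc, overlaps counted
    }
}
```

The second score is a small fraction of the first, for a query term that has nothing to do with the synonyms, which is exactly the unfair punishment being described.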
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
What benchmarks have you used for measuring performance?
I use TREC-like IR collections in different languages. The Lucene benchmark module has some support for running the queries and creating output that you can feed to trec_eval. I just use its query-running support (QueryDriver); I don't use its indexing/parsing support, although it has that too. Instead I index the test collections myself, because the collections/queries/judgements are always, annoyingly, in slightly different non-standard formats. I only look at the measures which are generally the most stable, like MAP and bpref.
Is your opinion based on tests with Lucene's ClassicSimilarity (it also uses discountOverlaps = true), or also on tests with BM25?
I can't remember which scoring systems I tested at the time we flipped the default, but I think we should keep the same default for all scoring functions. Once you have everything set up, it is fairly easy to test with a ton of similarities at once (or different parameters) by modifying the code to loop across a big list. That's one reason why it's valuable to keep any index-time logic (such as the formula for encoding the norm) consistent across all of them; otherwise it makes testing unnecessarily difficult, and it's already painful enough. This is important for real users too: they shouldn't have to reindex to do parameter tuning.
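The "loop across a big list" setup might look something like the following skeleton (purely illustrative; the labels stand in for real Similarity instances that the harness would install before rerunning the query set):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate a grid of similarity configurations to evaluate in one run.
// In a real harness each entry would be a Similarity object; here they
// are just labels for the parameter combinations.
final class SweepSketch {
    static List<String> configs() {
        List<String> out = new ArrayList<>();
        for (double k1 : new double[] {0.9, 1.2, 2.0}) {
            for (double b : new double[] {0.4, 0.75}) {
                out.add("BM25(k1=" + k1 + ",b=" + b + ")");
            }
        }
        return out;
    }
}
```

Because the norm encoding is shared across similarities, the same index serves every iteration of the loop, which is the point being made about not reindexing for parameter tuning.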
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
@rmuir thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back.
[~[email protected]] As an additional point, advanced use cases often utilize token "stacking" for additional purposes as well, and these would further distort length. For example, some folks use analysis chains that stack variants of URLs, currencies, etc.
[Legacy Jira: Timothy M. Rodriguez on Oct 20 2017]
Robert Muir thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back.
Thanks Timothy! If you get the chance to do the experiment, simply override the method protected float avgFieldLength(CollectionStatistics collectionStats) to return the alternative value. For experiments it can just be a hardcoded number you computed yourself in a different way.
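The shape of that experiment, sketched without Lucene on the classpath (BaseSimilarity and the hardcoded pivot are stand-ins; in Lucene you would extend BM25Similarity and override avgFieldLength(CollectionStatistics)):

```java
// Overriding the pivot for an experiment: the subclass ignores the usual
// computation and returns a value computed offline in some other way.
final class PivotSketch {
    static class BaseSimilarity {
        // Stand-in for BM25Similarity's avgFieldLength(CollectionStatistics).
        protected float avgFieldLength() { return 100f; }
    }

    static class PivotExperimentSimilarity extends BaseSimilarity {
        @Override
        protected float avgFieldLength() {
            return 87.3f; // e.g. total positions / docCount, computed offline
        }
    }

    static float pivotOf(BaseSimilarity s) { return s.avgFieldLength(); }
}
```

Because avgFieldLength is only a pivot, swapping the value this way lets you measure the alternative on a real query set without touching the index.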
[Legacy Jira: Robert Muir (@rmuir) on Oct 20 2017]
As an additional point, advanced use cases often utilize token "stacking" for additional uses as well and these would have further distortions on length.
That's exactly what we are doing. Using discountOverlaps = false could therefore punish languages with more distinct word forms. I also prefer discountOverlaps = true. I have an intern (a student) working on relevance tuning and benchmarks. I think we can try overriding
protected float avgFieldLength(CollectionStatistics collectionStats)
and see if it changes anything. We will also have a look at the Lucene benchmark module.
Thanks for your feedback.
[Legacy Jira: Christoph Goller on Oct 23 2017]