mikemccand / stargazers-migration-test
Testing Lucene's Jira -> GitHub issues migration
PayloadScoreQuery is the only place that currently uses SimScorer.computePayloadFactor(), and as discussed on LUCENE-8014, this seems like the wrong place for it. We should instead add a PayloadDecoder abstraction that is passed to PayloadScoreQuery.
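A minimal sketch of what such a decoder abstraction might look like; the interface name, method signature, and example decoder are assumptions for illustration, not the committed Lucene API:

```java
// Hypothetical sketch of a PayloadDecoder abstraction (names are assumptions).
// It turns a raw payload into the float factor previously produced by
// SimScorer.computePayloadFactor(), so PayloadScoreQuery can own the decoding.
public class PayloadDecoderSketch {

  /** Decodes a payload byte span into a scoring factor. */
  interface PayloadDecoder {
    float computePayloadFactor(byte[] payload, int offset, int length);
  }

  /** Example decoder: interprets the first 4 payload bytes as a big-endian float. */
  static final PayloadDecoder FLOAT_DECODER = (payload, offset, length) -> {
    if (payload == null || length < Integer.BYTES) {
      return 1f; // no payload: neutral factor
    }
    int bits = ((payload[offset] & 0xFF) << 24)
        | ((payload[offset + 1] & 0xFF) << 16)
        | ((payload[offset + 2] & 0xFF) << 8)
        | (payload[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  };

  public static void main(String[] args) {
    int bits = Float.floatToIntBits(2.5f);
    byte[] payload = {
        (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
    };
    System.out.println(FLOAT_DECODER.computePayloadFactor(payload, 0, 4)); // 2.5
    System.out.println(FLOAT_DECODER.computePayloadFactor(null, 0, 0));    // 1.0
  }
}
```

A query would then be constructed with a decoder instance instead of relying on the Similarity to interpret payload bytes.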
LUCENE-8038 by Alan Woodward (@romseygeek) on Nov 06 2017, resolved Nov 09 2017
Attachments: LUCENE-8038.patch, LUCENE-8038-master.patch
Linked issues:
Currently, Policeman Jenkins uses --illegal-access=deny
when running tests on Java 9 or later. We should do this by default, so we ensure that nothing uses private APIs of the JDK or tries to do setAccessible() on runtime classes.
LUCENE-8035 by Uwe Schindler (@uschindler) on Nov 04 2017, resolved Nov 04 2017
Attachments: LUCENE-8035.patch
We allow any boosts to be passed down to similarities, e.g. via BoostQuery. This is a bit trappy since it means that you can make scores round down to 0 and/or become slow to compute (because of subnormal floats) by using tiny boosts, or produce infinite scores if you pass boosts that are too close to the float max value.
I would like to restrict boosts to be either +0 or between 2^-10 and 2^10.
Any objections?
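A minimal sketch of the proposed bounds check; the method name and shape are illustrative, not a committed API:

```java
// Sketch of the proposed boost validation: a boost is accepted only if it is
// +0 or lies in [2^-10, 2^10]. Names are illustrative assumptions.
public class BoostBounds {
  static final float MIN_BOOST = (float) Math.pow(2, -10); // 2^-10 ~= 0.000977
  static final float MAX_BOOST = (float) Math.pow(2, 10);  // 2^10  =  1024

  static boolean isValidBoost(float boost) {
    return boost == 0f || (boost >= MIN_BOOST && boost <= MAX_BOOST);
  }

  public static void main(String[] args) {
    System.out.println(isValidBoost(0f));              // true: +0 is allowed
    System.out.println(isValidBoost(1f));              // true
    System.out.println(isValidBoost(1e-8f));           // false: would round scores toward 0
    System.out.println(isValidBoost(Float.MAX_VALUE)); // false: risks infinite scores
  }
}
```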
LUCENE-8016 by Adrien Grand (@jpountz) on Oct 26 2017, updated Dec 08 2021
Spinoff from LUCENE-7976. We help users get themselves into a corner by using forceMerge on an index to rewrite all segments in the current Lucene format. We should rewrite each individual segment instead. This would also help with upgrading X-2->X-1, then X-1->X.
Of course the preferred method is to re-index from scratch.
LUCENE-8004 by Erick Erickson (@ErickErickson) on Oct 23 2017, resolved Aug 18 2018
Linked issues:
We should ensure that computeSlopFactor is always (strictly) positive, equal to 1 when distance is 0, and doesn't increase when the distance goes up.
LUCENE-8013 by Adrien Grand (@jpountz) on Oct 25 2017, resolved Oct 26 2017
Linked issues:
This supersedes LUCENE-8013.
We should hardcode computeSlopFactor to 1/(N+1) in SloppyPhraseScorer and move computePayloadFactor to PayloadFunction so that all the payload scoring logic is in a single place.
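A small standalone check (not Lucene code) that the proposed 1/(N+1) formula satisfies the requirements listed above: strictly positive, equal to 1 at distance 0, and non-increasing as the distance grows:

```java
// Verifies that slopFactor(distance) = 1 / (distance + 1) meets the stated
// requirements. This is a standalone illustration, not SloppyPhraseScorer code.
public class SlopFactorCheck {
  static float slopFactor(int distance) {
    return 1f / (distance + 1);
  }

  public static void main(String[] args) {
    // distance 0 -> 1.0, then monotonically decreasing but always > 0
    for (int d = 0; d <= 4; d++) {
      System.out.println("distance=" + d + " -> " + slopFactor(d));
    }
  }
}
```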
LUCENE-8014 by Adrien Grand (@jpountz) on Oct 26 2017, resolved Nov 10 2017
Attachments: LUCENE-8014.patch, LUCENE-8014-tfidfsim.patch
Linked issues:
This means it cannot interrupt range or geo queries.
LUCENE-8026 by Adrien Grand (@jpountz) on Oct 31 2017, resolved Nov 16 2018
Linked issues:
Pull requests: apache/lucene-solr#497
By profiling an Elasticsearch cluster, I found the private method UsageTrackingQueryCachingPolicy.isPointQuery to be quite expensive due to the clazz.getSimpleName() call.
Here is an excerpt from hot_threads:
java.lang.Class.getEnclosingMethod0(Native Method)
java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
java.lang.Class.getEnclosingClass(Class.java:1272)
java.lang.Class.getSimpleBinaryName(Class.java:1443)
java.lang.Class.getSimpleName(Class.java:1309)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.isPointQuery(UsageTrackingQueryCachingPolicy.java:39)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.isCostly(UsageTrackingQueryCachingPolicy.java:54)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.minFrequencyToCache(UsageTrackingQueryCachingPolicy.java:121)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.shouldCache(UsageTrackingQueryCachingPolicy.java:179)
org.elasticsearch.index.shard.ElasticsearchQueryCachingPolicy.shouldCache(ElasticsearchQueryCachingPolicy.java:53)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:806)
org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:168)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:665)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
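The stack trace shows the cost: every getSimpleName() call walks enclosing-class metadata. A sketch of the cheaper alternative, using direct type checks instead of name strings (the query classes below are stand-ins, not the real Lucene types):

```java
// Sketch: compare concrete types via instanceof instead of calling
// Class.getSimpleName(), which resolves enclosing-class info on every call.
// The query classes here are simplified stand-ins for the real ones.
public class PointQueryCheck {
  static class Query {}
  static class PointRangeQuery extends Query {}
  static class PointInSetQuery extends Query {}
  static class TermQuery extends Query {}

  // Expensive pattern seen in the profile: string check on the simple name.
  static boolean isPointQuerySlow(Query q) {
    return q.getClass().getSimpleName().startsWith("Point");
  }

  // Cheap pattern: direct type checks, no reflective name lookups.
  static boolean isPointQueryFast(Query q) {
    return q instanceof PointRangeQuery || q instanceof PointInSetQuery;
  }

  public static void main(String[] args) {
    System.out.println(isPointQueryFast(new PointRangeQuery())); // true
    System.out.println(isPointQueryFast(new TermQuery()));       // false
  }
}
```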
LUCENE-8005 by Scott Somerville on Oct 23 2017, updated Oct 24 2017
Attachments: LUCENE-8005.patch (versions: 2)
Hello,
let say I index documents with attribute name like: prefixFileName
and that I search with "prefixF*", it is not found.
while searching with "prefix*" it works.
In 6.x (and 5.x) "prefixF*" was finding the value.
I've provided a test case
https://gist.github.com/benoitf/6078a0a8925826d8c89916a78a883cb0
and a pom.xml file
https://gist.github.com/benoitf/fefaf174fa4d96c40318dc4d044495b1
when setting property version in pom.xml to 6.6.2 it works
LUCENE-8022 by Florent BENOIT on Oct 30 2017, resolved Oct 31 2017
The length of an individual document only counts its number of positions, since discountOverlaps defaults to true.
@Override
public final long computeNorm(FieldInvertState state) {
  final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
  int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
  if (indexCreatedVersionMajor >= 7) {
    return SmallFloat.intToByte4(numTerms);
  } else {
    return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
  }
}
Measuring document length this way seems perfectly OK to me. What bothers me is that the average document length is based on sumTotalTermFreq for a field. As far as I understand, that sums up totalTermFreq for all terms of a field, therefore counting positions of terms including those that overlap.
protected float avgFieldLength(CollectionStatistics collectionStats) {
  final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
  if (sumTotalTermFreq <= 0) {
    return 1f; // field does not exist, or stat is unsupported
  } else {
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    return (float) (sumTotalTermFreq / (double) docCount);
  }
}
Are we comparing apples and oranges in the final scoring?
I haven't run any benchmarks and I am not sure whether this has a serious effect. It just means that documents that have synonyms, or in my use case different normal forms of tokens at the same position, appear shorter and therefore get higher scores than they should, and that we do not use the whole spectrum of relative document length in BM25.
I think for BM25 discountOverlaps should default to false.
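A worked illustration of the mismatch described above, with made-up numbers: a document whose 13 indexed tokens include 3 overlapping synonyms contributes 10 to its own length norm but 13 to the statistic that feeds the average length.

```java
// Illustration (made-up numbers): per-document length discounts overlaps, but
// sumTotalTermFreq, which feeds avgFieldLength, does not.
public class OverlapMismatch {
  // What computeNorm encodes with discountOverlaps=true:
  static long normLength(int length, int numOverlap) {
    return length - numOverlap;
  }

  // What each document adds to sumTotalTermFreq: every occurrence, overlaps included.
  static long termFreqContribution(int length) {
    return length;
  }

  public static void main(String[] args) {
    int length = 13;    // total tokens indexed, including 3 overlapping synonyms
    int numOverlap = 3;
    System.out.println("length used in norm:        " + normLength(length, numOverlap)); // 10
    System.out.println("contribution to avg length: " + termFreqContribution(length));   // 13
    // So dl and avgdl are measured on different scales: documents with synonyms
    // look shorter than average even when they are not.
  }
}
```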
LUCENE-8000 by Christoph Goller on Oct 19 2017, updated Oct 23 2017
We want to support scoring optimizations such as LUCENE-4100 and LUCENE-7993, which put very minimal requirements on the similarity impl. Today similarities of various quality are in core and tests.
The ones with problems currently have warnings in the javadocs about their bugs, and if the problems are severe enough, then they are also disabled in randomized testing too.
IMO Lucene core should only have practical functions that won't occasionally return NaN scores or cause relevance to go backwards if the user's stopfilter isn't configured perfectly. It is also important for unit tests not to deal with broken or semi-broken sims, and the ones in core should pass all unit tests.
I propose we move the buggy ones to sandbox and deprecate them. If they can be fixed we can put them back in core, otherwise bye-bye.
FWIW tests developed in LUCENE-7997 document the following requirements:
LUCENE-8010 by Robert Muir (@rmuir) on Oct 25 2017, resolved Apr 06 2018
Attachments: LUCENE-8010.patch
Linked issues:
I noticed that the uax_url_email tokenizer splits URLs into multiple tokens in the presence of digits, ".", and "-".
I opened an issue on the Elasticsearch GitHub repo (elastic/elasticsearch#27309) because I noticed this strange behaviour.
Their answer was
The uax_url_email tokenizer tokenizes URLs and email addresses, but in order to recognize a token as a URL it must include the scheme (usually HTTP:// or https://):
Additionally, this tokenizer belongs to Lucene. Could you open this issue at https://lucene.apache.org/core/ instead?
URLs are defined by RFC1738 and extended by RFC1808.
In RFC1808 Relative URLs are explained, and this allows scheme-less URLs.
I would expect uax_url_email to also tokenize scheme-less and relative URLs correctly.
LUCENE-8044 by Sergio Leoni on Nov 07 2017
Environment:
Elasticsearch 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_144
JVM java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
OS Linux 3.10.0-514.10.2.el7.x86_64 #1 SMP Mon Feb 20 02:37:52 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Term frequencies are discarded in the DOCS_ONLY case from the postings list but they still count against the length normalization, which looks like it may screw stuff up.
I ran some quick experiments on LUCENE-8025, by encoding fieldInvertState.getUniqueTermCount() and it seemed worth fixing (e.g. 20% or 30% improvement potentially). Happy to do testing for real, if we want to fix.
But this seems tricky: today you can downgrade to DOCS_ONLY on the fly, and it's hard for me to think about that case (I think it's generally screwed up besides this, but still).
LUCENE-8031 by Robert Muir (@rmuir) on Nov 01 2017, resolved Feb 24 2018
Attachments: LUCENE-8031.patch
These tolerance deltas are being (ab)used in various ways in tests. I did some experimentation and they can almost be removed entirely without too much pain.
LUCENE-7997 fixes all similarities such that score() == explain(). That makes it possible to actually debug numeric errors, and we need to do that if we have optimizations such as maxscore that care about score values. So I think we should do the same thing elsewhere in scoring (weight/scorer).
We should at the very least fix tests (such as expression tests) that no longer need these deltas and can now assert exact values.
LUCENE-8008 by Robert Muir (@rmuir) on Oct 24 2017, resolved Nov 29 2017
Attachments: LUCENE-8008.patch
Linked issues:
Spinoff from LUCENE-7976
Part of that discussion surfaced the idea that optimize/forceMerge is being discouraged, but the use-cases for forceMerge still need to be supported. Those use cases make sense when an index changes rarely.
This JIRA is to explore what that would look like.
LUCENE-8003 by Erick Erickson (@ErickErickson) on Oct 23 2017
https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20800/
Error Message:
geoAreaShape: GeoExactCircle: {planetmodel=PlanetModel.WGS84, center=[lat=-0.00871130560892533, lon=2.3029626482941588([X=-0.6692047265792528, Y=0.7445316825911176, Z=-0.008720939756154669])], radius=3.038428918538668(174.0891533827647), accuracy=2.111101444186927E-4} shape: GeoRectangle: {planetmodel=PlanetModel.WGS84, toplat=0.18851664435052304(10.801208089253723), bottomlat=-1.4896034997154073(-85.34799368160976), leftlon=-1.4970589804391838(-85.7751612613233), rightlon=1.346321571653886(77.13854392318753)} expected:<0> but was:<2>
Stack Trace:
java.lang.AssertionError: geoAreaShape: GeoExactCircle: {planetmodel=PlanetModel.WGS84, center=[lat=-0.00871130560892533, lon=2.3029626482941588([X=-0.6692047265792528, Y=0.7445316825911176, Z=-0.008720939756154669])], radius=3.038428918538668(174.0891533827647), accuracy=2.111101444186927E-4}
shape: GeoRectangle: {planetmodel=PlanetModel.WGS84, toplat=0.18851664435052304(10.801208089253723), bottomlat=-1.4896034997154073(-85.34799368160976), leftlon=-1.4970589804391838(-85.7751612613233), rightlon=1.346321571653886(77.13854392318753)} expected:<0> but was:<2>
at __randomizedtesting.SeedInfo.seed([87612C9805977C6F:B087E212A0C8DB25]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.apache.lucene.spatial3d.geom.RandomGeoShapeRelationshipTest.testRandomContains(RandomGeoShapeRelationshipTest.java:225)
LUCENE-8032 by David Smiley (@dsmiley) on Nov 02 2017, resolved Nov 02 2017
Attachments: LUCENE-8032.patch
Exposed by this test failure: https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/777/testReport/junit/org.apache.lucene.search/TestLRUQueryCache/testDocValuesUpdatesDontBreakCache/
A reader is randomly wrapped with a ParallelLeafReader, which does not then correctly propagate the dvGen into its own FieldInfo.
LUCENE-8045 by Alan Woodward (@romseygeek) on Nov 09 2017, resolved Nov 09 2017
Attachments: LUCENE-8045.patch
The Explanation class is currently nice and simple, and float matches the scoring API, but this does not work well for debugging numerical errors in internal calculations (it usually makes practical sense to use 64-bit double to avoid issues).
It also makes for nasty formatting of integral values such as the number of tokens in the collection or even a document's length; it's just noise to see 10.0 there instead of 10, and scientific notation for e.g. the number of documents is just annoying.
One idea is to take Number instead of float. Then you could pass in the correct numeric type (int, long, double, float) for internal calculations, parameters, statistics, etc., and the output would look nice.
LUCENE-8012 by Robert Muir (@rmuir) on Oct 25 2017, resolved Jan 02 2018
Attachments: LUCENE-8012.patch (versions: 2)
Currently, when the index is written to disk, the following sequence of events takes place:
This sequence leaves a window of opportunity for the system to crash after 'rename list of segments' but before 'sync index directory'; depending on the exact filesystem implementation, this may leave the 'list of segments' visible in the directory while some of the segments are not.
The solution is to sync the index directory after all segments have been written. The attached patch shows the idea implemented. I'm fairly certain that I didn't find all the places where this may be happening.
LUCENE-8048 by Nikolay Martynov on Nov 09 2017, resolved Dec 05 2017
Attachments: LUCENE-8048.patch (versions: 3), Screen Shot 2017-11-22 at 12.34.51 PM.png
Query caching can have a negative impact on tail latencies as the clause that is cached needs to be entirely consumed. Maybe we could leverage the fact that we can know the lead cost from any scorer now (LUCENE-7897) in order to implement heuristics that would look like "do not cache clause X if its cost is 10x greater than the cost of the entire query". This would make sure that a fast query can not become absurdly slow just because it had to cache a costly filter. The filter will need to wait for a more costly query to be cached, or might never be cached at all.
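The heuristic in the paragraph above can be sketched as follows; the 10x threshold, names, and method shape are illustrative, not an actual QueryCachingPolicy implementation:

```java
// Sketch of the suggested heuristic: skip caching a clause whose cost dwarfs
// the lead cost of the overall query. Threshold and names are assumptions.
public class CostRatioPolicy {
  static final long MAX_COST_RATIO = 10;

  /**
   * Returns false when consuming the clause to cache it would cost far more
   * than the query's lead cost. Dividing the clause cost (rather than
   * multiplying the lead cost) avoids long overflow for huge costs.
   */
  static boolean shouldCache(long clauseCost, long leadCost) {
    return clauseCost / MAX_COST_RATIO <= leadCost;
  }

  public static void main(String[] args) {
    System.out.println(shouldCache(1_000, 500));     // true: only 2x the lead cost
    System.out.println(shouldCache(1_000_000, 500)); // false: 2000x, skip caching
  }
}
```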
LUCENE-8027 by Adrien Grand (@jpountz) on Oct 31 2017, updated Oct 17 2019
Attachments: LUCENE-8027.patch
Linked issues:
ShingleFilterFactory should have an option to ignore filler tokens in the total shingle size.
For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". When we remove stopwords and execute the ShingleFilter (shingle size = 3), it gives us the following result:
We can clearly see that the filler token "_" occupies one token in the shingle.
I suppose the returned shingles should be:
To maintain backward compatibility, I suggest the creation of an option called "skipFillerTokens" to implement this behavior (note that this is different from using fillerTokens="", since the empty string still occupies one token in the shingle).
I've attached a patch for the ShingleFilter class (getNextToken() method), ShingleFilterFactory, and ShingleFilterTest classes.
LUCENE-8036 by Edans Sandes on Nov 04 2017, updated Nov 29 2017
Attachments: SOLR-11604.patch
Linked issues:
Following LUCENE-8017, I tried to add a getCacheHelper(LeafReaderContext) method to DoubleValuesSource so that Weights that use a DVS can delegate to it. This ended up with the same method being added to LongValuesSource and some similar objects in spatial-extras. I think it makes sense to abstract this out into a separate SegmentCachable interface.
LUCENE-8042 by Alan Woodward (@romseygeek) on Nov 07 2017, resolved Nov 10 2017
Attachments: LUCENE-8042.patch (versions: 4)
If you need to analyze the root cause of a query's failure to match some document, you can use the Weight.explain() API. But if you want to do some gross analysis of a whole batch of queries, say scraped from a log, that once matched but no longer do (perhaps after some refactoring or other large-scale change), the Explanation isn't very good for that. You can try parsing its textual output, which is pretty regular, but instead I found it convenient to add some boolean structure to Explanation and use that to find failing leaves in the Explanation tree, reporting only those.
This patch adds a "condition" to each Explanation, which can be REQUIRED, OPTIONAL, PROHIBITED, or NONE. The conditions correspond in obvious ways to the Boolean Occur, except for NONE, which is used to indicate a node which can't be further decomposed. It adds new Explanation construction methods for creating Explanations with conditions (defaulting to NONE with the existing methods).
Finally Explanation.getFailureCauses() returns a list of Strings that are the one-line explanations of the failing queries that, if some of them had succeeded, would have made the original overall query match.
LUCENE-8019 by Michael Sokolov (@msokolov) on Oct 26 2017, resolved Aug 31 2018
Attachments: LUCENE_8019.patch, LUCENE-8019.patch
This issue occurs when using the grouping feature in distributed mode and sorting by score.
Each group's docList in the response is supposed to contain a maxScore entry that holds the maximum score for that group. With the current releases, it sometimes happens that this piece of information is not included:
{
"responseHeader": {
"status": 0,
"QTime": 42,
"params": {
"sort": "score desc",
"fl": "id,score",
"q": "_text_:\"72\"",
"group.limit": "2",
"group.field": "group2",
"group.sort": "score desc",
"group": "true",
"wt": "json",
"fq": "group2:72 OR group2:45"
}
},
"grouped": {
"group2": {
"matches": 567,
"groups": [
{
"groupValue": 72,
"doclist": {
"numFound": 562,
"start": 0,
"maxScore": 2.0378063,
"docs": [
{
"id": "29!26551",
"score": 2.0378063
},
{
"id": "78!11462",
"score": 2.0298104
}
]
}
},
{
"groupValue": 45,
"doclist": {
"numFound": 5,
"start": 0,
"docs": [
{
"id": "72!8569",
"score": 1.8988966
},
{
"id": "72!14075",
"score": 1.5191172
}
]
}
}
]
}
}
}
Looking into the issue, it comes from the fact that if a shard does not contain a document from a group, trying to merge its maxScore with the real maxScore entries from other shards is invalid (it results in NaN).
I'm attaching a patch containing a fix.
LUCENE-8996 by Julien Massenet on Dec 18 2015, resolved Dec 09 2019
Attachments: lucene_6_5-GroupingMaxScore.patch, lucene_solr_5_3-GroupingMaxScore.patch, LUCENE-8996.02.patch, LUCENE-8996.03.patch, LUCENE-8996.04.patch, LUCENE-8996.patch (versions: 2), master-GroupingMaxScore.patch
Linked issues:
In SpanNotQuery, there is an acceptance condition:
if (candidate.endPosition() + post <= excludeSpans.startPosition()) {
return AcceptStatus.YES;
}
This overflows when candidate.endPosition() + post > Integer.MAX_VALUE. I have a fix for this which I am working on; basically, I am flipping the add to a subtract.
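A standalone demonstration of the overflow and the subtract-based fix described above (simplified to plain ints, not actual Spans code):

```java
// Demonstrates the int overflow in `endPosition + post` and the safe
// subtract-based rewrite. Simplified stand-in for the SpanNotQuery check.
public class SpanNotOverflow {
  // Buggy form: endPosition + post can wrap past Integer.MAX_VALUE.
  static boolean acceptBuggy(int endPosition, int post, int excludeStart) {
    return endPosition + post <= excludeStart;
  }

  // Fixed form: move `post` to the other side. excludeStart - post cannot
  // overflow when both values are non-negative ints.
  static boolean acceptFixed(int endPosition, int post, int excludeStart) {
    return endPosition <= excludeStart - post;
  }

  public static void main(String[] args) {
    int end = Integer.MAX_VALUE - 5, post = 10, excludeStart = 100;
    // The sum wraps to a large negative number, so the buggy form wrongly accepts:
    System.out.println(acceptBuggy(end, post, excludeStart)); // true (wrong)
    System.out.println(acceptFixed(end, post, excludeStart)); // false (correct)
  }
}
```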
LUCENE-8034 by Hari Menon on Nov 03 2017, resolved Nov 22 2017
Attachments: LUCENE-8034.patch
Today all sim formulas have to be "hacked" to deal with the fact that they may be passed stats such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there is even a dedicated test for it. All formulas have hacks such as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:
Instead of:
expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
they must do tricks such as:
expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
There is no good reason for this; it is just sloppiness in the Query/Weight/Scorer API. I think formulas should work unmodified; we shouldn't pass terms that don't exist or bogus statistics.
It adds a lot of complexity to the scoring api and makes it difficult to have meaningful/useful explanations, to debug problems, etc. It also makes it really hard to add a new sim.
LUCENE-8020 by Robert Muir (@rmuir) on Oct 28 2017, resolved Oct 31 2017
Attachments: LUCENE-8020.patch (versions: 3)
Currently we only disallow -Infinity and NaN as scores. However, we started some work (e.g. LUCENE-4100) that would be much easier to implement and maintain if scores were guaranteed to always be positive.
LUCENE-8006 by Adrien Grand (@jpountz) on Oct 24 2017, resolved Nov 29 2017
Linked issues:
Hi,
I'd like to share a report on API changes and backward compatibility for the latest snapshot of the Lucene Core library (updated daily): https://abi-laboratory.pro/java/tracker/timeline/lucene-core/
BC — binary compatibility
SC — source compatibility
The report is generated according to the article https://wiki.eclipse.org/Evolving_Java-based_APIs_2 by the https://github.com/lvc/japi-tracker tool for jars from https://repository.apache.org/content/repositories/snapshots/org/apache/lucene/lucene-core/ and http://central.maven.org/maven2/org/apache/lucene/lucene-core/.
I hope it will be helpful for users and maintainers of the library.
Feel free to request more modules of the library to be included in the tracker if you are interested.
Thank you.
LUCENE-8037 by Andrey Ponomarenko on Nov 04 2017, resolved Dec 17 2019
Attachments: lucene-core-1.png, lucene-core-2.png
Hi, this is Ayah, a bidi developer at IBM Egypt, Globalization Team. We are responsible for supporting Arabic in IBM products and services, and as we use Lucene in many of our services, we found that it needs major improvement in the Arabic stemmer. We implemented the following two papers, https://dl.acm.org/citation.cfm?id=1921657 and http://waset.org/publications/10005688/arabic-light-stemmer-for-better-search-accuracy, to improve the Lucene Arabic stemmer function, and we would like to open a pull request to let you integrate it as part of Lucene.
LUCENE-8028 by Ayah Shamandi on Oct 31 2017, updated Dec 06 2017
LUCENE-7997 improves BM25 and Classic explains to better explain:
product of:
2.2 = scaling factor, k1 + 1
9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1.0 = n, number of documents containing term
17927.0 = N, total number of documents with field
0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
979.0 = freq, occurrences of term within document
1.2 = k1, term saturation parameter
0.75 = b, length normalization parameter
1.0 = dl, length of field
1.0 = avgdl, average length of field
Previously it was pretty cryptic and used confusing terminology like docCount/docFreq without explanation:
product of:
0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
449.0 = docFreq
456.0 = docCount
2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
113659.0 = freq=113658
1.2 = parameter k1
0.75 = parameter b
2300.5593 = avgFieldLength
1048600.0 = fieldLength
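As a sanity check, the numbers in the improved BM25 explain above follow directly from the stated formulas. This is a standalone recomputation, not Lucene code:

```java
// Recomputes the idf and tf values shown in the improved BM25 explain output
// from the formulas it states, using the same inputs.
public class Bm25ExplainCheck {
  public static void main(String[] args) {
    double N = 17927, n = 1; // docs with field / docs containing the term
    double idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
    System.out.println(idf); // ~9.388654, matching the explain output

    double freq = 979, k1 = 1.2, b = 0.75, dl = 1, avgdl = 1;
    double tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
    System.out.println(tf); // ~0.9987758, matching the explain output
  }
}
```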
We should fix other similarities too in the same way, they should be more practical.
LUCENE-8011 by Robert Muir (@rmuir) on Oct 25 2017, resolved Dec 13 2017
The 2-arg version of the query() function was designed so that the second argument would specify the value used for any document that does not match the query specified by the first argument. But the "exists" property of the resulting ValueSource only takes into consideration whether or not the document matches the query, and ignores the use of the second argument.
The workaround is to ignore the 2-arg form of the query() function and instead wrap the query function in def(),
for example: def(query($something), $defaultval)
instead of query($something, $defaultval)
LUCENE-8908 by Bill Bell on Jul 28 2015, resolved Mar 18 2020
Attachments: LUCENE-8908.patch (versions: 2), SOLR-7845.patch (versions: 2)
Linked issues:
Spinoff of LUCENE-8007:
If you omit term frequencies, we should score as if all tf values were 1. This is the way it worked for e.g. ClassicSimilarity, and you can understand how it degrades.
However, for sims such as BM25, we currently bail out on computing the average doc length (and just return a bogus value of 1), screwing up length normalization too, which is a separate issue.
Instead of a bogus value, we should substitute sumDocFreq for sumTotalTermFreq (all postings have freq of 1, since you omitted them).
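A sketch of what that substitution might look like; the method shape and the -1 sentinel are illustrative assumptions, not the committed change:

```java
// Sketch of the proposed fallback: with IndexOptions.DOCS every posting has an
// implicit freq of 1, so sumDocFreq can stand in for sumTotalTermFreq when
// computing the average field length. Names and sentinel are assumptions.
public class AvgFieldLengthSketch {
  static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq, long docCount) {
    // -1 stands for "frequencies were omitted, stat unavailable"
    long total = sumTotalTermFreq == -1 ? sumDocFreq : sumTotalTermFreq;
    if (total <= 0 || docCount <= 0) {
      return 1f; // field does not exist, or stats unsupported
    }
    return (float) (total / (double) docCount);
  }

  public static void main(String[] args) {
    // Frequencies present: 5000 occurrences across 100 docs.
    System.out.println(avgFieldLength(5000, 4000, 100)); // 50.0
    // DOCS_ONLY: sumTotalTermFreq unavailable, fall back to sumDocFreq.
    System.out.println(avgFieldLength(-1, 4000, 100));   // 40.0
  }
}
```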
LUCENE-8025 by Robert Muir (@rmuir) on Oct 31 2017, resolved Nov 01 2017
Attachments: LUCENE-8025.patch
This should no longer be needed, since the index-time boost is removed, and since sims are no longer asked to score non-existent terms.
E.g. core tests pass with:
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
@@ -230,11 +230,8 @@ public class BM25Similarity extends Similarity {
if (norms == null) {
norm = k1;
} else {
- if (norms.advanceExact(doc)) {
- norm = cache[((byte) norms.longValue()) & 0xFF];
- } else {
- norm = cache[0];
- }
+ norms.advanceExact(doc);
+ norm = cache[((byte) norms.longValue()) & 0xFF];
}
return weightValue * (float) (freq / (freq + norm));
}
LUCENE-8024 by Robert Muir (@rmuir) on Oct 31 2017, resolved Oct 31 2017
Attachments: LUCENE-8024.patch
There are a couple of issues with UnescapedCharSequence:
There are no tests for UnescapedCharSequence so these issues have gone unnoticed for quite some time.
LUCENE-8001 by Shad Storhaug on Oct 20 2017, updated Oct 21 2017
Linked issues:
We can use this to bounds check the incoming parameters and find issues in tests.
LUCENE-8021 by Robert Muir (@rmuir) on Oct 28 2017, resolved Oct 30 2017
Attachments: LUCENE-8021.patch
Spin-off from LUCENE-8018. The dense vs. sparse encoding logic of FieldInfos introduces complexity. Given that the sparse encoding is only used when less than 1/16th of fields are used, which sounds uncommon to me, maybe we should use a dense encoding all the time?
LUCENE-8033 by Adrien Grand (@jpountz) on Nov 02 2017, resolved Feb 13 2018
Pull requests: apache/lucene-solr#320
The QueryCache assumes that queries will return the same set of documents when run over the same segment, independent of all other segments held by the parent IndexSearcher. However, both FunctionRangeQuery and FunctionMatchQuery can select hits based on score, which depend on term statistics over the whole index, and could therefore theoretically return different result sets on a given segment.
LUCENE-8017 by Alan Woodward (@romseygeek) on Oct 26 2017, resolved Nov 03 2017
Attachments: LUCENE-8017.patch (versions: 3)
Linked issues:
A heap dump revealed a lot of TreeMap.Entry instances (millions of them) for a system with ~1000 active searchers.
LUCENE-8018 by Julian Vassev on Oct 26 2017, resolved Oct 27 2017
Environment:
Lucene 6.5.0, java 8
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
Attachments: LUCENE-8018.patch
IndexSearcher.collectionStatistics(field) can do a fair amount of work because each invocation calls MultiFields.getTerms(...). The effects are aggravated for queries with many fields, since each field will want statistics, and also when there are many segments.
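One way to address this is per-search memoization of the statistics, so the expensive lookup runs once per field rather than once per invocation. A sketch under assumed names (the Stats record and lookup function are stand-ins for the real API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: memoize collection statistics per field for the duration of one
// search, so repeated requests avoid re-running the expensive lookup.
public class CachedStats {
  record Stats(long docCount, long sumTotalTermFreq) {} // stand-in for CollectionStatistics

  private final Map<String, Stats> cache = new HashMap<>();
  private final Function<String, Stats> expensiveLookup; // stand-in for MultiFields.getTerms(...)
  int lookups = 0; // counts how often the expensive path actually runs

  CachedStats(Function<String, Stats> expensiveLookup) {
    this.expensiveLookup = expensiveLookup;
  }

  Stats collectionStatistics(String field) {
    return cache.computeIfAbsent(field, f -> {
      lookups++;
      return expensiveLookup.apply(f);
    });
  }

  public static void main(String[] args) {
    CachedStats stats = new CachedStats(f -> new Stats(100, 5000));
    stats.collectionStatistics("body");
    stats.collectionStatistics("body");  // served from the cache
    stats.collectionStatistics("title");
    System.out.println(stats.lookups);   // 2, not 3
  }
}
```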
LUCENE-8040 by David Smiley (@dsmiley) on Nov 06 2017, resolved Nov 15 2017
Attachments: LUCENE-8040.patch (versions: 2), lucenecollectionStatisticsbench.zip, MyBenchmark.java
Linked issues:
All call sites of Fields.size() recompute it if it is not available. We should make the API easier to use and require that it never returns -1.
LUCENE-8029 by Adrien Grand (@jpountz) on Oct 31 2017
Follow-up to LUCENE-8020
With the scoring api no longer passing e.g. docFreq=0 in corner cases such as SpanOrQuery, we can revert formula hacks that we added to many similarities. It may not be very significant for ranking but it removes confusion and makes it possible to have good explain() etc.
LUCENE-8023 by Robert Muir (@rmuir) on Oct 30 2017
The Lucene test framework randomizes the Locale configuration to test the software in different locale settings.
https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleSetupAndRestoreClassEnv.java#L206-L209
While this is a very good practice from an engineering perspective, it causes issues when the Lucene/Solr test framework is used with third-party components which may have issues working with a subset of locale settings. E.g. for the Solr/Sentry integration (SENTRY-1475), we are using the Solr test framework to test the Sentry authorization plugin for Solr. For unit testing, it uses Apache Derby. We have found multiple cases where Derby fails to initialize for a locale configured by the Solr test framework. This causes tests to fail and creates confusion with respect to the quality of the integration source code. Since the Derby failures are not related to the Solr/Sentry integration, we would like to avoid such nasty surprises by suppressing the locale randomization. This is similar to the way we suppress the Solr SSL configuration (@SolrTestCaseJ4.SuppressSSL).
Please refer to discussion on dev mailing list for more context,
http://lucene.472066.n3.nabble.com/Solr-test-framework-Locale-randomization-td4359671.html
LUCENE-8009 by Hrishikesh Gadre on Oct 24 2017, updated Oct 25 2017
reproduce with: ant test -Dtestcase=TestBasicModelIne -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu -Dtests.asserts=true -Dtests.file.encoding=UTF8
LUCENE-8015 by Adrien Grand (@jpountz) on Oct 26 2017, resolved Dec 06 2017
Attachments: LUCENE-8015_test_fangs.patch, LUCENE-8015.patch, LUCENE-8015-test.patch
With Solr v6.3, when I issue this query:
I get this error in the JSON response:
*************************************************************
{
"responseHeader": {
"zkConnected": true,
"status": 500,
"QTime": 8,
"params": {
"q": "{!complexphrase inOrder=false}text:"maytag~ (refri~ OR refri*) "",
"hl": "true",
"hl.preserveMulti": "false",
"fl": "id",
"hl.fragsize": "60",
"hl.fl": "nameX,shortDescription,longDescription,artistName,type,manufacturer,department",
"rows": "10",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"id": "5411379"
},
{
"id": "5411404"
}
]
},
"error": {
"msg": "Unknown query type:org.apache.lucene.search.MatchNoDocsQuery",
"trace": "java.lang.IllegalArgumentException: Unknown query type:org.apache.lucene.search.MatchNoDocsQuery\n\tat org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.addComplexPhraseClause(ComplexPhraseQueryParser.java:388)\n\tat org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:289)\n\tat org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)\n\tat org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:522)\n\tat org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:218)\n\tat org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)\n\tat org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:195)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:602)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingOfField(DefaultSolrHighlighter.java:448)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:410)\n\tat org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:141)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat java.lang.Thread.run(Thread.java:745)\n",
"code": 500
}
}
*************************************************************
I did NOT have this error in Solr v6.1, so something has changed in v6.3 that is causing this error.
Steve Rowe thinks it may be related to https://issues.apache.org/jira/browse/LUCENE-7337
Hoss' initial thoughts: "i think the root of the issue is that the way those fuzzy and prefix queries are parsed means that they may produce an empty boolean query depending on the contents of the index, and then the new optimization rewrites those empty boolean queries into MatchNoDocsQueries – but the highlighter (which uses heuristics, based on a query's type, to figure out what to ask each query to highlight) doesn't know what to do with that. i'm really surprised the highlighter throws an error in the "unexpected query type" code path instead of just ignoring it."
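The suggested fix above can be sketched as a lenient term extractor that skips query types it does not recognize instead of throwing. This is a hedged illustration only: the `Query` classes below are stand-ins, not Lucene's actual `WeightedSpanTermExtractor` API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "ignore rather than throw" for unknown query types during
// highlighting. TermQuery and MatchNoDocsQuery here are simplified
// stand-ins for the real Lucene classes.
public class LenientExtractor {
  interface Query {}
  record TermQuery(String term) implements Query {}
  record MatchNoDocsQuery() implements Query {}

  // Collect highlightable terms; unknown or empty queries are skipped.
  static List<String> extractTerms(List<Query> clauses) {
    List<String> terms = new ArrayList<>();
    for (Query q : clauses) {
      if (q instanceof TermQuery tq) {
        terms.add(tq.term());
      }
      // else: unrecognized query type (e.g. MatchNoDocsQuery produced by
      // rewrite) -> contribute nothing to highlighting instead of throwing
    }
    return terms;
  }

  public static void main(String[] args) {
    List<Query> clauses = List.of(new TermQuery("foo"), new MatchNoDocsQuery());
    System.out.println(extractTerms(clauses)); // [foo]
  }
}
```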
LUCENE-8305 by Andy Tran on Jan 31 2017, resolved May 10 2018
Pull requests: apache/lucene-solr#327
From https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20789/ - reproduces for me, but only if I first remove -Dtests.method=testBoostingTermQueryXML from the cmdline:
Checking out Revision 39376cd8b5ef03b3338c2e8fa31dce732749bcd7 (refs/remotes/origin/master)
[...]
[junit4] Suite: org.apache.lucene.queryparser.xml.TestCorePlusExtensionsParser
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCorePlusExtensionsParser -Dtests.method=testBoostingTermQueryXML -Dtests.seed=DA0883869B26E8D9 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=th-TH -Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 0.01s J0 | TestCorePlusExtensionsParser.testBoostingTermQueryXML <<<
[junit4] > Throwable #1: java.lang.AssertionError
[junit4] > at __randomizedtesting.SeedInfo.seed([DA0883869B26E8D9:F58B709A872CAF14]:0)
[junit4] > at org.apache.lucene.search.similarities.AssertingSimilarity$1.computePayloadFactor(AssertingSimilarity.java:120)
[junit4] > at org.apache.lucene.queries.payloads.PayloadScoreQuery$PayloadSpans.collectLeaf(PayloadScoreQuery.java:215)
[junit4] > at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
[junit4] > at org.apache.lucene.queries.payloads.PayloadScoreQuery$PayloadSpans.doCurrentSpans(PayloadScoreQuery.java:226)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(SpanScorer.java:110)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.ensureFreq(SpanScorer.java:126)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.score(SpanScorer.java:133)
[junit4] > at org.apache.lucene.search.AssertingScorer.score(AssertingScorer.java:70)
[junit4] > at org.apache.lucene.search.AssertingScorer.score(AssertingScorer.java:70)
[junit4] > at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:64)
[junit4] > at org.apache.lucene.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
[junit4] > at org.apache.lucene.search.AssertingCollector$1.collect(AssertingCollector.java:56)
[junit4] > at org.apache.lucene.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
[junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:241)
[junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:184)
[junit4] > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
[junit4] > at org.apache.lucene.search.AssertingBulkScorer.score(AssertingBulkScorer.java:69)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658)
[junit4] > at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:72)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:462)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
[junit4] > at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:439)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:450)
[junit4] > at org.apache.lucene.queryparser.xml.TestCoreParser.dumpResults(TestCoreParser.java:254)
[junit4] > at org.apache.lucene.queryparser.xml.TestCoreParser.testBoostingTermQueryXML(TestCoreParser.java:127)
[junit4] > at java.lang.Thread.run(Thread.java:748)
[junit4] 2> NOTE: test params are: codec=CheapBastard, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@fcaf28), locale=th-TH, timezone=America/Indiana/Indianapolis
[junit4] 2> NOTE: Linux 4.10.0-37-generic i386/Oracle Corporation 1.8.0_144 (32-bit)/cpus=8,threads=1,free=47589984,total=67108864
LUCENE-8030 by Steven Rowe on Nov 01 2017, resolved Nov 01 2017
I've seen apps that have a good number of fields – hundreds. The O(log(N)) of TreeMap definitely shows up in a profiler; sometimes 20% of search time, if I recall. There are many Field implementations that are impacted... in part because Fields is the base class of FieldsProducer.
As an aside, I hope Fields goes away some day; FieldsProducer should be TermsProducer and not have an iterator of fields. If DocValuesProducer doesn't have this then why should the terms index part of our API have it? If we did this then the issue here would be a simple transition to a HashMap.
Or maybe we can switch to HashMap and relax the definition of Fields.iterator to not necessarily be sorted?
Perhaps the fix can be a relatively simple conversion over to LinkedHashMap in many cases if we can assume when we initialize these internal maps that we consume them in sorted order to begin with.
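The LinkedHashMap idea above can be sketched as follows: if the internal field map is populated in sorted order at construction time, a LinkedHashMap keeps that iteration order while offering O(1) lookups instead of TreeMap's O(log N). The field names and map contents here are illustrative, not Lucene's actual data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// LinkedHashMap preserves insertion order, so copying from an
// already-sorted source keeps iteration sorted with hash-based lookups.
public class SortedInsertionOrder {
  static Map<String, Integer> buildFieldMap(Map<String, Integer> sortedSource) {
    // Insertion order == sorted order, because the source map is sorted.
    return new LinkedHashMap<>(sortedSource);
  }

  public static void main(String[] args) {
    Map<String, Integer> sorted = new TreeMap<>();
    sorted.put("body", 2);
    sorted.put("id", 0);
    sorted.put("title", 1);
    Map<String, Integer> fields = buildFieldMap(sorted);
    // Iteration order remains sorted: body, id, title
    System.out.println(String.join(",", fields.keySet()));
  }
}
```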
LUCENE-8041 by David Smiley (@dsmiley) on Nov 06 2017, resolved Dec 03 2019
Attachments: LUCENE-8041.patch, LUCENE-8041-LinkedHashMap.patch
Linked issues:
The IndexWriter check for too many documents does not always work, so an index can go over the limit. Once this happens, Lucene refuses to open the index and throws a CorruptIndexException: Too many documents.
This appears to affect all versions of Lucene/Solr (the check was first implemented in LUCENE-5843 in v4.9.1/4.10, and we've seen this manifest in 4.10).
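A limit check like this can be exceeded when the check and the add are not atomic: two threads can both pass the check and together push the count over the maximum. The sketch below (illustrative only, not Lucene's actual code) shows one way to make the reservation atomic, rolling back when the limit would be exceeded.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical atomic document-limit reservation. MAX_DOCS and tryReserve
// are illustrative names, not Lucene's API.
public class DocLimit {
  static final long MAX_DOCS = 5;
  private final AtomicLong pendingDocs = new AtomicLong();

  // Atomically reserve room for numDocs; undo and fail if over the limit.
  boolean tryReserve(long numDocs) {
    long count = pendingDocs.addAndGet(numDocs);
    if (count > MAX_DOCS) {
      pendingDocs.addAndGet(-numDocs); // roll back the reservation
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    DocLimit limit = new DocLimit();
    System.out.println(limit.tryReserve(4)); // true: 4 <= 5
    System.out.println(limit.tryReserve(2)); // false: 6 > 5, rolled back
    System.out.println(limit.tryReserve(1)); // true: rollback left room
  }
}
```

Because `addAndGet` both checks and claims the slots in one atomic step, no interleaving of threads can leave the count above MAX_DOCS.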
LUCENE-8043 by Yonik Seeley (@yonik) on Nov 07 2017, resolved Dec 04 2017
Attachments: LUCENE-8043.patch (versions: 5), YCS_IndexTest7a.java
The current "distance" method for a path returns a distance computed along the path and then perpendicular to the path. But, at least in the case of paths, it is often preferable to compute a "delta" distance, which would be the minimum straight-line distance assuming a diversion to reach the provided point.
A similar "distance delta" for a circle would be defined as returning a number exactly the same as is currently returned, with the understanding that the point given would be the destination and not a new waypoint. Similarly, the distance beyond the end of a path to the provided point would be counted only once, while the distance before the beginning of the path would be counted twice (one leg to the point, and the other leg back from that point to the start point (or nearest path point, if closer).
This obviously must be implemented in a backwards-compatible fashion.
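The "delta" idea can be illustrated for a single planar path leg A→B: the extra travel incurred by diverting through a point P is dist(A,P) + dist(P,B) − dist(A,B). This is a hypothetical 2D sketch for intuition only; the real Geo3d implementation would work on the sphere/ellipsoid, and `deltaDistance` is an invented name.

```java
// Planar illustration of the diversion cost of visiting a point P
// between the endpoints A and B of one path leg.
public class PathDelta {
  static double dist(double ax, double ay, double bx, double by) {
    return Math.hypot(bx - ax, by - ay);
  }

  // Extra distance added by diverting through (px,py) on the leg
  // (ax,ay) -> (bx,by).
  static double deltaDistance(double ax, double ay, double bx, double by,
                              double px, double py) {
    return dist(ax, ay, px, py) + dist(px, py, bx, by) - dist(ax, ay, bx, by);
  }

  public static void main(String[] args) {
    // A point already on the segment adds no extra distance...
    System.out.println(deltaDistance(0, 0, 4, 0, 2, 0)); // 0.0
    // ...while a point off the path adds a positive diversion cost.
    System.out.println(deltaDistance(0, 0, 4, 0, 2, 3));
  }
}
```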
LUCENE-8039 by Karl Wright on Nov 06 2017, resolved Nov 08 2017
Javadocs allow codecs to not store some index statistics. Given discussion that occurred on LUCENE-4100, this was mostly done to support pre-flex codecs. We should now require that all codecs store these statistics.
LUCENE-8007 by Adrien Grand (@jpountz) on Oct 24 2017, resolved Nov 03 2017
Attachments: LUCENE-8007.patch (versions: 5)
Hi Team,
We are seeing solr doc id missing error in our production cluster. Below is the snap from solr log.
ERROR - 2017-10-03 08:32:08.715; org.apache.solr.common.SolrException; auto commit error...:java.lang.RuntimeException: java.io.FileNotFoundException: Requested file hdfs:/solr/<path>/core_node5/data/index/_ect.fdt does not exist.
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:258)
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:238)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:234)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:175)
at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:292)
at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2005)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1969)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2999)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3104)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3071)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: Requested file hdfs:/solr/<path>/core_node5/data/index/_ect.fdt does not exist.
at com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1374)
at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:1031)
at org.apache.solr.store.hdfs.HdfsFileReader.getLength(HdfsFileReader.java:94)
at org.apache.solr.store.hdfs.HdfsDirectory.fileLength(HdfsDirectory.java:148)
at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:143)
at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:219)
at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:478)
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:248)
LUCENE-8002 by Jerry Richard on Oct 23 2017