mikemccand / stargazers-migration-test
Testing Lucene's Jira -> GitHub issues migration
PayloadScoreQuery is the only place that currently uses SimScorer.computePayloadFactor(), and as discussed on LUCENE-8014, this seems like the wrong place for it. We should instead add a PayloadDecoder abstraction that is passed to PayloadScoreQuery.
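A minimal sketch of what such a decoder abstraction might look like; the interface name, method signature, and example decoder are assumptions for illustration, not the committed Lucene API:

```java
// Hypothetical sketch of a PayloadDecoder abstraction (names are assumptions).
// It turns a raw payload into the float factor previously produced by
// SimScorer.computePayloadFactor(), so PayloadScoreQuery can own the decoding.
public class PayloadDecoderSketch {

  /** Decodes a payload byte span into a scoring factor. */
  interface PayloadDecoder {
    float computePayloadFactor(byte[] payload, int offset, int length);
  }

  /** Example decoder: interprets the first 4 payload bytes as a big-endian float. */
  static final PayloadDecoder FLOAT_DECODER = (payload, offset, length) -> {
    if (payload == null || length < Integer.BYTES) {
      return 1f; // no payload: neutral factor
    }
    int bits = ((payload[offset] & 0xFF) << 24)
        | ((payload[offset + 1] & 0xFF) << 16)
        | ((payload[offset + 2] & 0xFF) << 8)
        | (payload[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  };

  public static void main(String[] args) {
    int bits = Float.floatToIntBits(2.5f);
    byte[] payload = {
        (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
    };
    System.out.println(FLOAT_DECODER.computePayloadFactor(payload, 0, 4)); // 2.5
    System.out.println(FLOAT_DECODER.computePayloadFactor(null, 0, 0));    // 1.0
  }
}
```

A query would then be constructed with a decoder instance instead of relying on the Similarity to interpret payload bytes.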
LUCENE-8038 by Alan Woodward (@romseygeek) on Nov 06 2017, resolved Nov 09 2017
Attachments: LUCENE-8038.patch, LUCENE-8038-master.patch
Linked issues:
Currently, Policeman Jenkins uses --illegal-access=deny
when running tests on Java 9 or later. We should do this by default, so we ensure that nothing uses private APIs of the JDK or tries to do setAccessible() on runtime classes.
LUCENE-8035 by Uwe Schindler (@uschindler) on Nov 04 2017, resolved Nov 04 2017
Attachments: LUCENE-8035.patch
We allow any boosts to be passed down to similarities, e.g. via BoostQuery. This is a bit trappy since it means that you can make scores round down to 0 and/or become slow to compute (because of subnormal floats) by using tiny boosts, or produce infinite scores if you pass boosts that are too close to the float max value.
I would like to restrict boosts to be either +0 or between 2^-10 and 2^10.
Any objections?
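A minimal sketch of the proposed bounds check; the method name and shape are illustrative, not a committed API:

```java
// Sketch of the proposed boost validation: a boost is accepted only if it is
// +0 or lies in [2^-10, 2^10]. Names are illustrative assumptions.
public class BoostBounds {
  static final float MIN_BOOST = (float) Math.pow(2, -10); // 2^-10 ~= 0.000977
  static final float MAX_BOOST = (float) Math.pow(2, 10);  // 2^10  =  1024

  static boolean isValidBoost(float boost) {
    return boost == 0f || (boost >= MIN_BOOST && boost <= MAX_BOOST);
  }

  public static void main(String[] args) {
    System.out.println(isValidBoost(0f));              // true: +0 is allowed
    System.out.println(isValidBoost(1f));              // true
    System.out.println(isValidBoost(1e-8f));           // false: would round scores toward 0
    System.out.println(isValidBoost(Float.MAX_VALUE)); // false: risks infinite scores
  }
}
```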
LUCENE-8016 by Adrien Grand (@jpountz) on Oct 26 2017, updated Dec 08 2021
Spinoff from LUCENE-7976. We help users get themselves into a corner by using forceMerge on an index to rewrite all segments in the current Lucene format. We should rewrite each individual segment instead. This would also help with upgrading X-2->X-1, then X-1->X.
Of course the preferred method is to re-index from scratch.
LUCENE-8004 by Erick Erickson (@ErickErickson) on Oct 23 2017, resolved Aug 18 2018
Linked issues:
We should ensure that computeSlopFactor is always (strictly) positive, equal to 1 when distance is 0, and doesn't increase when the distance goes up.
LUCENE-8013 by Adrien Grand (@jpountz) on Oct 25 2017, resolved Oct 26 2017
Linked issues:
This supersedes LUCENE-8013.
We should hardcode computeSlopFactor to 1/(N+1) in SloppyPhraseScorer and move computePayloadFactor to PayloadFunction so that all the payload scoring logic is in a single place.
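A small standalone check (not Lucene code) that the proposed 1/(N+1) formula satisfies the requirements listed above: strictly positive, equal to 1 at distance 0, and non-increasing as the distance grows:

```java
// Verifies that slopFactor(distance) = 1 / (distance + 1) meets the stated
// requirements. This is a standalone illustration, not SloppyPhraseScorer code.
public class SlopFactorCheck {
  static float slopFactor(int distance) {
    return 1f / (distance + 1);
  }

  public static void main(String[] args) {
    // distance 0 -> 1.0, then monotonically decreasing but always > 0
    for (int d = 0; d <= 4; d++) {
      System.out.println("distance=" + d + " -> " + slopFactor(d));
    }
  }
}
```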
LUCENE-8014 by Adrien Grand (@jpountz) on Oct 26 2017, resolved Nov 10 2017
Attachments: LUCENE-8014.patch, LUCENE-8014-tfidfsim.patch
Linked issues:
This means it cannot interrupt range or geo queries.
LUCENE-8026 by Adrien Grand (@jpountz) on Oct 31 2017, resolved Nov 16 2018
Linked issues:
Pull requests: apache/lucene-solr#497
By profiling an Elasticsearch cluster, I found the private method UsageTrackingQueryCachingPolicy.isPointQuery to be quite expensive due to the clazz.getSimpleName() call.
Here is an excerpt from hot_threads:
java.lang.Class.getEnclosingMethod0(Native Method)
java.lang.Class.getEnclosingMethodInfo(Class.java:1072)
java.lang.Class.getEnclosingClass(Class.java:1272)
java.lang.Class.getSimpleBinaryName(Class.java:1443)
java.lang.Class.getSimpleName(Class.java:1309)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.isPointQuery(UsageTrackingQueryCachingPolicy.java:39)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.isCostly(UsageTrackingQueryCachingPolicy.java:54)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.minFrequencyToCache(UsageTrackingQueryCachingPolicy.java:121)
org.apache.lucene.search.UsageTrackingQueryCachingPolicy.shouldCache(UsageTrackingQueryCachingPolicy.java:179)
org.elasticsearch.index.shard.ElasticsearchQueryCachingPolicy.shouldCache(ElasticsearchQueryCachingPolicy.java:53)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:806)
org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:168)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:665)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:472)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:388)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:108)
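The stack trace shows the cost: every getSimpleName() call walks enclosing-class metadata. A sketch of the cheaper alternative, using direct type checks instead of name strings (the query classes below are stand-ins, not the real Lucene types):

```java
// Sketch: compare concrete types via instanceof instead of calling
// Class.getSimpleName(), which resolves enclosing-class info on every call.
// The query classes here are simplified stand-ins for the real ones.
public class PointQueryCheck {
  static class Query {}
  static class PointRangeQuery extends Query {}
  static class PointInSetQuery extends Query {}
  static class TermQuery extends Query {}

  // Expensive pattern seen in the profile: string check on the simple name.
  static boolean isPointQuerySlow(Query q) {
    return q.getClass().getSimpleName().startsWith("Point");
  }

  // Cheap pattern: direct type checks, no reflective name lookups.
  static boolean isPointQueryFast(Query q) {
    return q instanceof PointRangeQuery || q instanceof PointInSetQuery;
  }

  public static void main(String[] args) {
    System.out.println(isPointQueryFast(new PointRangeQuery())); // true
    System.out.println(isPointQueryFast(new TermQuery()));       // false
  }
}
```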
LUCENE-8005 by Scott Somerville on Oct 23 2017, updated Oct 24 2017
Attachments: LUCENE-8005.patch (versions: 2)
Hello,
let say I index documents with attribute name like: prefixFileName
and that I search with "prefixF*", it is not found.
while searching with "prefix*" it works.
In 6.x (and 5.x) "prefixF*" was finding the value.
I've provided a test case
https://gist.github.com/benoitf/6078a0a8925826d8c89916a78a883cb0
and a pom.xml file
https://gist.github.com/benoitf/fefaf174fa4d96c40318dc4d044495b1
when setting property version in pom.xml to 6.6.2 it works
LUCENE-8022 by Florent BENOIT on Oct 30 2017, resolved Oct 31 2017
The length of an individual document only counts its number of positions, since discountOverlaps defaults to true.
@Override
public final long computeNorm(FieldInvertState state) {
  final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
  int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
  if (indexCreatedVersionMajor >= 7) {
    return SmallFloat.intToByte4(numTerms);
  } else {
    return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
  }
}
Measuring document length this way seems perfectly OK to me. What bothers me is that the average document length is based on sumTotalTermFreq for a field. As far as I understand, that sums up totalTermFreq for all terms of a field, therefore counting positions of terms including those that overlap.
protected float avgFieldLength(CollectionStatistics collectionStats) {
  final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
  if (sumTotalTermFreq <= 0) {
    return 1f; // field does not exist, or stat is unsupported
  } else {
    final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
    return (float) (sumTotalTermFreq / (double) docCount);
  }
}
Are we comparing apples and oranges in the final scoring?
I haven't run any benchmarks and I am not sure whether this has a serious effect. It just means that documents that have synonyms, or in my use case different normal forms of tokens at the same position, appear shorter and therefore get higher scores than they should, and that we do not use the whole spectrum of relative document length in BM25.
I think for BM25 discountOverlaps should default to false.
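A worked illustration of the mismatch described above, with made-up numbers: a document whose 13 indexed tokens include 3 overlapping synonyms contributes 10 to its own length norm but 13 to the statistic that feeds the average length.

```java
// Illustration (made-up numbers): per-document length discounts overlaps, but
// sumTotalTermFreq, which feeds avgFieldLength, does not.
public class OverlapMismatch {
  // What computeNorm encodes with discountOverlaps=true:
  static long normLength(int length, int numOverlap) {
    return length - numOverlap;
  }

  // What each document adds to sumTotalTermFreq: every occurrence, overlaps included.
  static long termFreqContribution(int length) {
    return length;
  }

  public static void main(String[] args) {
    int length = 13;    // total tokens indexed, including 3 overlapping synonyms
    int numOverlap = 3;
    System.out.println("length used in norm:        " + normLength(length, numOverlap)); // 10
    System.out.println("contribution to avg length: " + termFreqContribution(length));   // 13
    // So dl and avgdl are measured on different scales: documents with synonyms
    // look shorter than average even when they are not.
  }
}
```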
LUCENE-8000 by Christoph Goller on Oct 19 2017, updated Oct 23 2017
We want to support scoring optimizations such as LUCENE-4100 and LUCENE-7993, which put very minimal requirements on the similarity impl. Today similarities of various quality are in core and tests.
The ones with problems currently have warnings in the javadocs about their bugs, and if the problems are severe enough, then they are also disabled in randomized testing too.
IMO Lucene core should only have practical functions that won't occasionally return NaN scores or cause relevance to go backwards if the user's stopfilter isn't configured perfectly. It is also important for unit tests not to deal with broken or semi-broken sims, and the ones in core should pass all unit tests.
I propose we move the buggy ones to sandbox and deprecate them. If they can be fixed we can put them back in core, otherwise bye-bye.
FWIW tests developed in LUCENE-7997 document the following requirements:
LUCENE-8010 by Robert Muir (@rmuir) on Oct 25 2017, resolved Apr 06 2018
Attachments: LUCENE-8010.patch
Linked issues:
I noticed that the uax_url_email tokenizer splits URLs into multiple tokens in the presence of digits, ".", and "-".
I opened an issue on the Elasticsearch GitHub repo (elastic/elasticsearch#27309) because I noticed this strange behaviour.
Their answer was
The uax_url_email tokenizer tokenizes URLs and email addresses, but in order to recognize a token as a URL it must include the scheme (usually HTTP:// or https://):
Additionally, this tokenizer belongs to Lucene. Could you open this issue at https://lucene.apache.org/core/ instead?
URLs are defined by RFC1738 and extended by RFC1808.
In RFC1808 Relative URLs are explained, and this allows scheme-less URLs.
I would expect uax_url_email to also tokenize scheme-less and relative URLs correctly.
LUCENE-8044 by Sergio Leoni on Nov 07 2017
Environment:
Elasticsearch 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_144
JVM java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
OS Linux 3.10.0-514.10.2.el7.x86_64 #1 SMP Mon Feb 20 02:37:52 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
Term frequencies are discarded in the DOCS_ONLY case from the postings list but they still count against the length normalization, which looks like it may screw stuff up.
I ran some quick experiments on LUCENE-8025, by encoding fieldInvertState.getUniqueTermCount() and it seemed worth fixing (e.g. 20% or 30% improvement potentially). Happy to do testing for real, if we want to fix.
But this seems tricky: today you can downgrade to DOCS_ONLY on the fly, and it's hard for me to think about that case (I think it's generally screwed up besides this, but still).
LUCENE-8031 by Robert Muir (@rmuir) on Nov 01 2017, resolved Feb 24 2018
Attachments: LUCENE-8031.patch
These tolerance deltas are being (ab)used in various ways in tests. I did some experimentation and they can almost be removed entirely without too much pain.
LUCENE-7997 fixes all similarities such that score() == explain(). That makes it possible to actually debug numeric errors, and we need to do that if we have optimizations such as maxscore that care about score values. So I think we should do the same thing elsewhere in scoring (weight/scorer).
We should at the very least fix tests (such as expression tests) that no longer need these deltas and can now assert exact values.
LUCENE-8008 by Robert Muir (@rmuir) on Oct 24 2017, resolved Nov 29 2017
Attachments: LUCENE-8008.patch
Linked issues:
Spinoff from LUCENE-7976
Part of that discussion surfaced the idea that optimize/forceMerge is being discouraged, but the use-cases for forceMerge still need to be supported. Those use cases make sense when an index changes rarely.
This JIRA is to explore what that would look like.
LUCENE-8003 by Erick Erickson (@ErickErickson) on Oct 23 2017
https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20800/
Error Message:
geoAreaShape: GeoExactCircle: {planetmodel=PlanetModel.WGS84, center=[lat=-0.00871130560892533, lon=2.3029626482941588([X=-0.6692047265792528, Y=0.7445316825911176, Z=-0.008720939756154669])], radius=3.038428918538668(174.0891533827647), accuracy=2.111101444186927E-4} shape: GeoRectangle: {planetmodel=PlanetModel.WGS84, toplat=0.18851664435052304(10.801208089253723), bottomlat=-1.4896034997154073(-85.34799368160976), leftlon=-1.4970589804391838(-85.7751612613233), rightlon=1.346321571653886(77.13854392318753)} expected:<0> but was:<2>
Stack Trace:
java.lang.AssertionError: geoAreaShape: GeoExactCircle: {planetmodel=PlanetModel.WGS84, center=[lat=-0.00871130560892533, lon=2.3029626482941588([X=-0.6692047265792528, Y=0.7445316825911176, Z=-0.008720939756154669])], radius=3.038428918538668(174.0891533827647), accuracy=2.111101444186927E-4}
shape: GeoRectangle: {planetmodel=PlanetModel.WGS84, toplat=0.18851664435052304(10.801208089253723), bottomlat=-1.4896034997154073(-85.34799368160976), leftlon=-1.4970589804391838(-85.7751612613233), rightlon=1.346321571653886(77.13854392318753)} expected:<0> but was:<2>
at __randomizedtesting.SeedInfo.seed([87612C9805977C6F:B087E212A0C8DB25]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.apache.lucene.spatial3d.geom.RandomGeoShapeRelationshipTest.testRandomContains(RandomGeoShapeRelationshipTest.java:225)
LUCENE-8032 by David Smiley (@dsmiley) on Nov 02 2017, resolved Nov 02 2017
Attachments: LUCENE-8032.patch
Exposed by this test failure: https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/777/testReport/junit/org.apache.lucene.search/TestLRUQueryCache/testDocValuesUpdatesDontBreakCache/
A reader is randomly wrapped with a ParallelLeafReader, which does not then correctly propagate the dvGen into its own FieldInfo.
LUCENE-8045 by Alan Woodward (@romseygeek) on Nov 09 2017, resolved Nov 09 2017
Attachments: LUCENE-8045.patch
The Explanation class is currently nice and simple, and float matches the scoring API, but this does not work well for debugging numerical errors in internal calculations (it usually makes practical sense to use 64-bit double to avoid issues).
It also makes for nasty formatting of integral values such as the number of tokens in the collection or even a document's length; it's just noise to see 10.0 there instead of 10, and scientific notation for e.g. the number of documents is just annoying.
One idea is to take Number instead of float. Then you could pass in the correct numeric type (int, long, double, float) for internal calculations, parameters, statistics, etc., and the output would look nice.
LUCENE-8012 by Robert Muir (@rmuir) on Oct 25 2017, resolved Jan 02 2018
Attachments: LUCENE-8012.patch (versions: 2)
Currently, when the index is written to disk, the following sequence of events takes place:
This sequence leaves a window of opportunity for the system to crash after 'rename list of segments' but before 'sync index directory'; depending on the exact filesystem implementation, this may leave the 'list of segments' visible in the directory while some of the segments are not.
The solution is to sync the index directory after all segments have been written. The attached patch shows the idea implemented. I'm fairly certain that I didn't find all the places where this may be happening.
LUCENE-8048 by Nikolay Martynov on Nov 09 2017, resolved Dec 05 2017
Attachments: LUCENE-8048.patch (versions: 3), Screen Shot 2017-11-22 at 12.34.51 PM.png
Query caching can have a negative impact on tail latencies as the clause that is cached needs to be entirely consumed. Maybe we could leverage the fact that we can know the lead cost from any scorer now (LUCENE-7897) in order to implement heuristics that would look like "do not cache clause X if its cost is 10x greater than the cost of the entire query". This would make sure that a fast query can not become absurdly slow just because it had to cache a costly filter. The filter will need to wait for a more costly query to be cached, or might never be cached at all.
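The heuristic in the paragraph above can be sketched as follows; the 10x threshold, names, and method shape are illustrative, not an actual QueryCachingPolicy implementation:

```java
// Sketch of the suggested heuristic: skip caching a clause whose cost dwarfs
// the lead cost of the overall query. Threshold and names are assumptions.
public class CostRatioPolicy {
  static final long MAX_COST_RATIO = 10;

  /**
   * Returns false when consuming the clause to cache it would cost far more
   * than the query's lead cost. Dividing the clause cost (rather than
   * multiplying the lead cost) avoids long overflow for huge costs.
   */
  static boolean shouldCache(long clauseCost, long leadCost) {
    return clauseCost / MAX_COST_RATIO <= leadCost;
  }

  public static void main(String[] args) {
    System.out.println(shouldCache(1_000, 500));     // true: only 2x the lead cost
    System.out.println(shouldCache(1_000_000, 500)); // false: 2000x, skip caching
  }
}
```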
LUCENE-8027 by Adrien Grand (@jpountz) on Oct 31 2017, updated Oct 17 2019
Attachments: LUCENE-8027.patch
Linked issues:
ShingleFilterFactory should have an option to ignore filler tokens in the total shingle size.
For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". When we remove stopwords and execute the ShingleFilter (shingle size = 3), it gives us the following result:
We can clearly see that the filler token "_" occupies one token in the shingle.
I suppose the returned shingles should be:
To maintain backward compatibility, I suggest the creation of an option called "skipFillerTokens" to implement this behavior (note that this is different from using fillerTokens="", since the empty string still occupies one token in the shingle).
I've attached a patch for the ShingleFilter class (getNextToken() method), ShingleFilterFactory, and ShingleFilterTest classes.
LUCENE-8036 by Edans Sandes on Nov 04 2017, updated Nov 29 2017
Attachments: SOLR-11604.patch
Linked issues:
Following LUCENE-8017, I tried to add a getCacheHelper(LeafReaderContext) method to DoubleValuesSource so that Weights that use a DVS can delegate to it. This ended up with the same method being added to LongValuesSource and some similar objects in spatial-extras. I think it makes sense to abstract this out into a separate SegmentCachable interface.
LUCENE-8042 by Alan Woodward (@romseygeek) on Nov 07 2017, resolved Nov 10 2017
Attachments: LUCENE-8042.patch (versions: 4)
If you need to analyze the root cause of a query's failure to match some document, you can use the Weight.explain() API. But if you want to do some gross analysis of a whole batch of queries, say scraped from a log, that once matched but no longer do (perhaps after some refactoring or other large-scale change), the Explanation isn't very good for that. You can try parsing its textual output, which is pretty regular, but instead I found it convenient to add some boolean structure to Explanation and use that to find failing leaves in the Explanation tree, reporting only those.
This patch adds a "condition" to each Explanation, which can be REQUIRED, OPTIONAL, PROHIBITED, or NONE. The conditions correspond in obvious ways to the Boolean Occur, except for NONE, which is used to indicate a node which can't be further decomposed. It adds new Explanation construction methods for creating Explanations with conditions (defaulting to NONE with the existing methods).
Finally Explanation.getFailureCauses() returns a list of Strings that are the one-line explanations of the failing queries that, if some of them had succeeded, would have made the original overall query match.
LUCENE-8019 by Michael Sokolov (@msokolov) on Oct 26 2017, resolved Aug 31 2018
Attachments: LUCENE_8019.patch, LUCENE-8019.patch
This issue occurs when using the grouping feature in distributed mode and sorting by score.
Each group's docList in the response is supposed to contain a maxScore entry that holds the maximum score for that group. With the current releases, it sometimes happens that this piece of information is not included:
{
"responseHeader": {
"status": 0,
"QTime": 42,
"params": {
"sort": "score desc",
"fl": "id,score",
"q": "_text_:\"72\"",
"group.limit": "2",
"group.field": "group2",
"group.sort": "score desc",
"group": "true",
"wt": "json",
"fq": "group2:72 OR group2:45"
}
},
"grouped": {
"group2": {
"matches": 567,
"groups": [
{
"groupValue": 72,
"doclist": {
"numFound": 562,
"start": 0,
"maxScore": 2.0378063,
"docs": [
{
"id": "29!26551",
"score": 2.0378063
},
{
"id": "78!11462",
"score": 2.0298104
}
]
}
},
{
"groupValue": 45,
"doclist": {
"numFound": 5,
"start": 0,
"docs": [
{
"id": "72!8569",
"score": 1.8988966
},
{
"id": "72!14075",
"score": 1.5191172
}
]
}
}
]
}
}
}
Looking into the issue, it comes from the fact that if a shard does not contain a document from a group, trying to merge its maxScore with the real maxScore entries from other shards is invalid (it results in NaN).
I'm attaching a patch containing a fix.
LUCENE-8996 by Julien Massenet on Dec 18 2015, resolved Dec 09 2019
Attachments: lucene_6_5-GroupingMaxScore.patch, lucene_solr_5_3-GroupingMaxScore.patch, LUCENE-8996.02.patch, LUCENE-8996.03.patch, LUCENE-8996.04.patch, LUCENE-8996.patch (versions: 2), master-GroupingMaxScore.patch
Linked issues:
In SpanNotQuery, there is an acceptance condition:
if (candidate.endPosition() + post <= excludeSpans.startPosition()) {
return AcceptStatus.YES;
}
This overflows when candidate.endPosition() + post > Integer.MAX_VALUE. I have a fix for this which I am working on; basically, I am flipping the add to a subtract.
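A standalone demonstration of the overflow and the subtract-based fix described above (simplified to plain ints, not actual Spans code):

```java
// Demonstrates the int overflow in `endPosition + post` and the safe
// subtract-based rewrite. Simplified stand-in for the SpanNotQuery check.
public class SpanNotOverflow {
  // Buggy form: endPosition + post can wrap past Integer.MAX_VALUE.
  static boolean acceptBuggy(int endPosition, int post, int excludeStart) {
    return endPosition + post <= excludeStart;
  }

  // Fixed form: move `post` to the other side. excludeStart - post cannot
  // overflow when both values are non-negative ints.
  static boolean acceptFixed(int endPosition, int post, int excludeStart) {
    return endPosition <= excludeStart - post;
  }

  public static void main(String[] args) {
    int end = Integer.MAX_VALUE - 5, post = 10, excludeStart = 100;
    // The sum wraps to a large negative number, so the buggy form wrongly accepts:
    System.out.println(acceptBuggy(end, post, excludeStart)); // true (wrong)
    System.out.println(acceptFixed(end, post, excludeStart)); // false (correct)
  }
}
```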
LUCENE-8034 by Hari Menon on Nov 03 2017, resolved Nov 22 2017
Attachments: LUCENE-8034.patch
Today all sim formulas have to be "hacked" to deal with the fact that they may be passed stats such as docFreq=0 or totalTermFreq=0. This happens easily with spans, and there is even a dedicated test for it. All formulas have hacks such as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:
Instead of:
expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
they must do tricks such as:
expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
There is no good reason for this; it is just sloppiness in the Query/Weight/Scorer API. I think formulas should work unmodified; we shouldn't pass terms that don't exist or bogus statistics.
It adds a lot of complexity to the scoring api and makes it difficult to have meaningful/useful explanations, to debug problems, etc. It also makes it really hard to add a new sim.
LUCENE-8020 by Robert Muir (@rmuir) on Oct 28 2017, resolved Oct 31 2017
Attachments: LUCENE-8020.patch (versions: 3)
Currently we only disallow -Infinity and NaN as scores. However, we started some work (e.g. LUCENE-4100) that would be much easier to implement and maintain if scores were guaranteed to always be positive.
LUCENE-8006 by Adrien Grand (@jpountz) on Oct 24 2017, resolved Nov 29 2017
Linked issues:
Hi,
I'd like to share a report on API changes and backward compatibility for the latest snapshot of the Lucene Core library (updated daily): https://abi-laboratory.pro/java/tracker/timeline/lucene-core/
BC — binary compatibility
SC — source compatibility
The report is generated according to the article https://wiki.eclipse.org/Evolving_Java-based_APIs_2 by the https://github.com/lvc/japi-tracker tool for jars from https://repository.apache.org/content/repositories/snapshots/org/apache/lucene/lucene-core/ and http://central.maven.org/maven2/org/apache/lucene/lucene-core/.
I hope it will be helpful for users and maintainers of the library.
Feel free to request more modules of the library to be included in the tracker if you are interested.
Thank you.
LUCENE-8037 by Andrey Ponomarenko on Nov 04 2017, resolved Dec 17 2019
Attachments: lucene-core-1.png, lucene-core-2.png
Hi, this is Ayah, a bidi developer at IBM Egypt, Globalization Team. We are responsible for supporting Arabic in IBM products and services, and as we use Lucene in many of our services, we found that it needs major improvement in the Arabic stemmer. We implemented the following two papers, https://dl.acm.org/citation.cfm?id=1921657 and http://waset.org/publications/10005688/arabic-light-stemmer-for-better-search-accuracy, to improve the Lucene Arabic stemmer function, and we would like to open a pull request to let you integrate it as part of Lucene.
LUCENE-8028 by Ayah Shamandi on Oct 31 2017, updated Dec 06 2017
LUCENE-7997 improves BM25 and Classic explains to better explain:
product of:
2.2 = scaling factor, k1 + 1
9.388654 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
1.0 = n, number of documents containing term
17927.0 = N, total number of documents with field
0.9987758 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
979.0 = freq, occurrences of term within document
1.2 = k1, term saturation parameter
0.75 = b, length normalization parameter
1.0 = dl, length of field
1.0 = avgdl, average length of field
Previously it was pretty cryptic and used confusing terminology like docCount/docFreq without explanation:
product of:
0.016547536 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
449.0 = docFreq
456.0 = docCount
2.1920826 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
113659.0 = freq=113658
1.2 = parameter k1
0.75 = parameter b
2300.5593 = avgFieldLength
1048600.0 = fieldLength
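As a sanity check, the numbers in the improved BM25 explain above follow directly from the stated formulas. This is a standalone recomputation, not Lucene code:

```java
// Recomputes the idf and tf values shown in the improved BM25 explain output
// from the formulas it states, using the same inputs.
public class Bm25ExplainCheck {
  public static void main(String[] args) {
    double N = 17927, n = 1; // docs with field / docs containing the term
    double idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
    System.out.println(idf); // ~9.388654, matching the explain output

    double freq = 979, k1 = 1.2, b = 0.75, dl = 1, avgdl = 1;
    double tf = freq / (freq + k1 * (1 - b + b * dl / avgdl));
    System.out.println(tf); // ~0.9987758, matching the explain output
  }
}
```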
We should fix other similarities too in the same way, they should be more practical.
LUCENE-8011 by Robert Muir (@rmuir) on Oct 25 2017, resolved Dec 13 2017
The 2-arg version of the query() function was designed so that the second argument would specify the value used for any document that does not match the query specified by the first argument. But the "exists" property of the resulting ValueSource only takes into consideration whether or not the document matches the query, and ignores the use of the second argument.
The workaround is to ignore the 2-arg form of the query() function and instead wrap the query function in def(),
for example: def(query($something), $defaultval)
instead of query($something, $defaultval)
LUCENE-8908 by Bill Bell on Jul 28 2015, resolved Mar 18 2020
Attachments: LUCENE-8908.patch (versions: 2), SOLR-7845.patch (versions: 2)
Linked issues:
Spinoff of LUCENE-8007:
If you omit term frequencies, we should score as if all tf values were 1. This is the way it worked for e.g. ClassicSimilarity, and you can understand how it degrades.
However, for sims such as BM25, we currently bail out on computing the average doc length (and just return a bogus value of 1), screwing up length normalization too, which is a separate issue.
Instead of a bogus value, we should substitute sumDocFreq for sumTotalTermFreq (all postings have freq of 1, since you omitted them).
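A sketch of what that substitution might look like; the method shape and the -1 sentinel are illustrative assumptions, not the committed change:

```java
// Sketch of the proposed fallback: with IndexOptions.DOCS every posting has an
// implicit freq of 1, so sumDocFreq can stand in for sumTotalTermFreq when
// computing the average field length. Names and sentinel are assumptions.
public class AvgFieldLengthSketch {
  static float avgFieldLength(long sumTotalTermFreq, long sumDocFreq, long docCount) {
    // -1 stands for "frequencies were omitted, stat unavailable"
    long total = sumTotalTermFreq == -1 ? sumDocFreq : sumTotalTermFreq;
    if (total <= 0 || docCount <= 0) {
      return 1f; // field does not exist, or stats unsupported
    }
    return (float) (total / (double) docCount);
  }

  public static void main(String[] args) {
    // Frequencies present: 5000 occurrences across 100 docs.
    System.out.println(avgFieldLength(5000, 4000, 100)); // 50.0
    // DOCS_ONLY: sumTotalTermFreq unavailable, fall back to sumDocFreq.
    System.out.println(avgFieldLength(-1, 4000, 100));   // 40.0
  }
}
```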
LUCENE-8025 by Robert Muir (@rmuir) on Oct 31 2017, resolved Nov 01 2017
Attachments: LUCENE-8025.patch
This should no longer be needed, since the index-time boost is removed, and since sims are no longer asked to score non-existent terms.
E.g. core tests pass with:
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java
@@ -230,11 +230,8 @@ public class BM25Similarity extends Similarity {
if (norms == null) {
norm = k1;
} else {
- if (norms.advanceExact(doc)) {
- norm = cache[((byte) norms.longValue()) & 0xFF];
- } else {
- norm = cache[0];
- }
+ norms.advanceExact(doc);
+ norm = cache[((byte) norms.longValue()) & 0xFF];
}
return weightValue * (float) (freq / (freq + norm));
}
LUCENE-8024 by Robert Muir (@rmuir) on Oct 31 2017, resolved Oct 31 2017
Attachments: LUCENE-8024.patch
There are a couple of issues with UnescapedCharSequence:
There are no tests for UnescapedCharSequence so these issues have gone unnoticed for quite some time.
LUCENE-8001 by Shad Storhaug on Oct 20 2017, updated Oct 21 2017
Linked issues:
We can use this to bounds check the incoming parameters and find issues in tests.
LUCENE-8021 by Robert Muir (@rmuir) on Oct 28 2017, resolved Oct 30 2017
Attachments: LUCENE-8021.patch
Spin-off from LUCENE-8018. The dense vs. sparse encoding logic of FieldInfos introduces complexity. Given that the sparse encoding is only used when less than 1/16th of fields are used, which sounds uncommon to me, maybe we should use a dense encoding all the time?
LUCENE-8033 by Adrien Grand (@jpountz) on Nov 02 2017, resolved Feb 13 2018
Pull requests: apache/lucene-solr#320
The QueryCache assumes that queries will return the same set of documents when run over the same segment, independent of all other segments held by the parent IndexSearcher. However, both FunctionRangeQuery and FunctionMatchQuery can select hits based on score, which depend on term statistics over the whole index, and could therefore theoretically return different result sets on a given segment.
LUCENE-8017 by Alan Woodward (@romseygeek) on Oct 26 2017, resolved Nov 03 2017
Attachments: LUCENE-8017.patch (versions: 3)
Linked issues:
A heap dump revealed a lot of TreeMap.Entry instances (millions of them) for a system with ~1000 active searchers.
LUCENE-8018 by Julian Vassev on Oct 26 2017, resolved Oct 27 2017
Environment:
Lucene 6.5.0, java 8
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
Attachments: LUCENE-8018.patch
IndexSearcher.collectionStatistics(field) can do a fair amount of work because each invocation calls MultiFields.getTerms(...). The effects are aggravated for queries with many fields, since each field will want statistics, and also when there are many segments.
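One way to address this is per-search memoization of the statistics, so the expensive lookup runs once per field rather than once per invocation. A sketch under assumed names (the Stats record and lookup function are stand-ins for the real API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: memoize collection statistics per field for the duration of one
// search, so repeated requests avoid re-running the expensive lookup.
public class CachedStats {
  record Stats(long docCount, long sumTotalTermFreq) {} // stand-in for CollectionStatistics

  private final Map<String, Stats> cache = new HashMap<>();
  private final Function<String, Stats> expensiveLookup; // stand-in for MultiFields.getTerms(...)
  int lookups = 0; // counts how often the expensive path actually runs

  CachedStats(Function<String, Stats> expensiveLookup) {
    this.expensiveLookup = expensiveLookup;
  }

  Stats collectionStatistics(String field) {
    return cache.computeIfAbsent(field, f -> {
      lookups++;
      return expensiveLookup.apply(f);
    });
  }

  public static void main(String[] args) {
    CachedStats stats = new CachedStats(f -> new Stats(100, 5000));
    stats.collectionStatistics("body");
    stats.collectionStatistics("body");  // served from the cache
    stats.collectionStatistics("title");
    System.out.println(stats.lookups);   // 2, not 3
  }
}
```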
LUCENE-8040 by David Smiley (@dsmiley) on Nov 06 2017, resolved Nov 15 2017
Attachments: LUCENE-8040.patch (versions: 2), lucenecollectionStatisticsbench.zip, MyBenchmark.java
Linked issues:
All call sites of Fields.size() recompute it if it is not available. We should make the API easier to use and require that it never returns -1.
LUCENE-8029 by Adrien Grand (@jpountz) on Oct 31 2017
Follow-up to LUCENE-8020
With the scoring api no longer passing e.g. docFreq=0 in corner cases such as SpanOrQuery, we can revert formula hacks that we added to many similarities. It may not be very significant for ranking but it removes confusion and makes it possible to have good explain() etc.
LUCENE-8023 by Robert Muir (@rmuir) on Oct 30 2017
The Lucene test framework randomizes the Locale configuration to test the software in different locale settings.
https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleSetupAndRestoreClassEnv.java#L206-L209
While this is a very good practice from an engineering perspective, it causes issues when the Lucene/Solr test framework is used with third-party components which may have issues working with a subset of locale settings. E.g. for the Solr/Sentry integration (SENTRY-1475), we are using the Solr test framework to test the Sentry authorization plugin for Solr. For unit testing, it uses Apache Derby. We have found multiple cases where Derby fails to initialize for a locale configured by the Solr test framework. This causes tests to fail and creates confusion with respect to the quality of the integration source code. Since the Derby failures are not related to the Solr/Sentry integration, we would like to avoid such nasty surprises by suppressing the locale randomization. This is similar to the way we suppress the Solr SSL configuration (@SolrTestCaseJ4.SuppressSSL).
Please refer to discussion on dev mailing list for more context,
http://lucene.472066.n3.nabble.com/Solr-test-framework-Locale-randomization-td4359671.html
LUCENE-8009 by Hrishikesh Gadre on Oct 24 2017, updated Oct 25 2017
reproduce with: ant test -Dtestcase=TestBasicModelIne -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu -Dtests.asserts=true -Dtests.file.encoding=UTF8
LUCENE-8015 by Adrien Grand (@jpountz) on Oct 26 2017, resolved Dec 06 2017
Attachments: LUCENE-8015_test_fangs.patch, LUCENE-8015.patch, LUCENE-8015-test.patch
With Solr v6.3, when I issue this query:
I get this error in the JSON response:
*************************************************************
{
"responseHeader": {
"zkConnected": true,
"status": 500,
"QTime": 8,
"params": {
"q": "{!complexphrase inOrder=false}text:"maytag~ (refri~ OR refri*) "",
"hl": "true",
"hl.preserveMulti": "false",
"fl": "id",
"hl.fragsize": "60",
"hl.fl": "nameX,shortDescription,longDescription,artistName,type,manufacturer,department",
"rows": "10",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"id": "5411379"
},
{
"id": "5411404"
}
]
},
"error": {
"msg": "Unknown query type:org.apache.lucene.search.MatchNoDocsQuery",
"trace": "java.lang.IllegalArgumentException: Unknown query type:org.apache.lucene.search.MatchNoDocsQuery\n\tat org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.addComplexPhraseClause(ComplexPhraseQueryParser.java:388)\n\tat org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:289)\n\tat org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230)\n\tat org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:522)\n\tat org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:218)\n\tat org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)\n\tat org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:195)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:602)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingOfField(DefaultSolrHighlighter.java:448)\n\tat org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:410)\n\tat org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:141)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat java.lang.Thread.run(Thread.java:745)\n",
"code": 500
}
}
*************************************************************
I did NOT have this error in Solr v6.1, so something has changed in v6.3 that is causing this error.
Steve Rowe thinks it may be related to https://issues.apache.org/jira/browse/LUCENE-7337
Hoss' initial thoughts: "i think the root of the issue is that the way those fuzzy and prefix queries are parsed means that they may produce an empty boolean query depending on the contents of the index, and then the new optimization rewrites those empty boolean queries into MatchNoDocsQueries – but the highlighter (which uses heuristics, based on a query's type, to figure out what to ask each query to highlight) doesn't know what to do with that. i'm really surprised the highlighter throws an error in the "unexpected query type" code path instead of just ignoring it."
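The suggested fix above can be sketched as a lenient term extractor that skips query types it does not recognize instead of throwing. This is a hedged illustration only: the `Query` classes below are stand-ins, not Lucene's actual `WeightedSpanTermExtractor` API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "ignore rather than throw" for unknown query types during
// highlighting. TermQuery and MatchNoDocsQuery here are simplified
// stand-ins for the real Lucene classes.
public class LenientExtractor {
  interface Query {}
  record TermQuery(String term) implements Query {}
  record MatchNoDocsQuery() implements Query {}

  // Collect highlightable terms; unknown or empty queries are skipped.
  static List<String> extractTerms(List<Query> clauses) {
    List<String> terms = new ArrayList<>();
    for (Query q : clauses) {
      if (q instanceof TermQuery tq) {
        terms.add(tq.term());
      }
      // else: unrecognized query type (e.g. MatchNoDocsQuery produced by
      // rewrite) -> contribute nothing to highlighting instead of throwing
    }
    return terms;
  }

  public static void main(String[] args) {
    List<Query> clauses = List.of(new TermQuery("foo"), new MatchNoDocsQuery());
    System.out.println(extractTerms(clauses)); // [foo]
  }
}
```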
LUCENE-8305 by Andy Tran on Jan 31 2017, resolved May 10 2018
Pull requests: apache/lucene-solr#327
From https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20789/ - reproduces for me, but only if I first remove -Dtests.method=testBoostingTermQueryXML from the cmdline:
Checking out Revision 39376cd8b5ef03b3338c2e8fa31dce732749bcd7 (refs/remotes/origin/master)
[...]
[junit4] Suite: org.apache.lucene.queryparser.xml.TestCorePlusExtensionsParser
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCorePlusExtensionsParser -Dtests.method=testBoostingTermQueryXML -Dtests.seed=DA0883869B26E8D9 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=th-TH -Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
[junit4] FAILURE 0.01s J0 | TestCorePlusExtensionsParser.testBoostingTermQueryXML <<<
[junit4] > Throwable #1: java.lang.AssertionError
[junit4] > at __randomizedtesting.SeedInfo.seed([DA0883869B26E8D9:F58B709A872CAF14]:0)
[junit4] > at org.apache.lucene.search.similarities.AssertingSimilarity$1.computePayloadFactor(AssertingSimilarity.java:120)
[junit4] > at org.apache.lucene.queries.payloads.PayloadScoreQuery$PayloadSpans.collectLeaf(PayloadScoreQuery.java:215)
[junit4] > at org.apache.lucene.search.spans.TermSpans.collect(TermSpans.java:121)
[junit4] > at org.apache.lucene.queries.payloads.PayloadScoreQuery$PayloadSpans.doCurrentSpans(PayloadScoreQuery.java:226)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.setFreqCurrentDoc(SpanScorer.java:110)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.ensureFreq(SpanScorer.java:126)
[junit4] > at org.apache.lucene.search.spans.SpanScorer.score(SpanScorer.java:133)
[junit4] > at org.apache.lucene.search.AssertingScorer.score(AssertingScorer.java:70)
[junit4] > at org.apache.lucene.search.AssertingScorer.score(AssertingScorer.java:70)
[junit4] > at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:64)
[junit4] > at org.apache.lucene.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
[junit4] > at org.apache.lucene.search.AssertingCollector$1.collect(AssertingCollector.java:56)
[junit4] > at org.apache.lucene.search.AssertingLeafCollector.collect(AssertingLeafCollector.java:52)
[junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:241)
[junit4] > at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:184)
[junit4] > at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
[junit4] > at org.apache.lucene.search.AssertingBulkScorer.score(AssertingBulkScorer.java:69)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658)
[junit4] > at org.apache.lucene.search.AssertingIndexSearcher.search(AssertingIndexSearcher.java:72)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:462)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:581)
[junit4] > at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:439)
[junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:450)
[junit4] > at org.apache.lucene.queryparser.xml.TestCoreParser.dumpResults(TestCoreParser.java:254)
[junit4] > at org.apache.lucene.queryparser.xml.TestCoreParser.testBoostingTermQueryXML(TestCoreParser.java:127)
[junit4] > at java.lang.Thread.run(Thread.java:748)
[junit4] 2> NOTE: test params are: codec=CheapBastard, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@fcaf28), locale=th-TH, timezone=America/Indiana/Indianapolis
[junit4] 2> NOTE: Linux 4.10.0-37-generic i386/Oracle Corporation 1.8.0_144 (32-bit)/cpus=8,threads=1,free=47589984,total=67108864
LUCENE-8030 by Steven Rowe on Nov 01 2017, resolved Nov 01 2017
I've seen apps that have a good number of fields – hundreds. The O(log(N)) of TreeMap definitely shows up in a profiler; sometimes 20% of search time, if I recall. There are many Field implementations that are impacted... in part because Fields is the base class of FieldsProducer.
As an aside, I hope Fields goes away some day; FieldsProducer should be TermsProducer and not have an iterator of fields. If DocValuesProducer doesn't have this then why should the terms index part of our API have it? If we did this then the issue here would be a simple transition to a HashMap.
Or maybe we can switch to HashMap and relax the definition of Fields.iterator to not necessarily be sorted?
Perhaps the fix can be a relatively simple conversion over to LinkedHashMap in many cases if we can assume when we initialize these internal maps that we consume them in sorted order to begin with.
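The LinkedHashMap idea above can be sketched as follows: if the internal field map is populated in sorted order at construction time, a LinkedHashMap keeps that iteration order while offering O(1) lookups instead of TreeMap's O(log N). The field names and map contents here are illustrative, not Lucene's actual data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// LinkedHashMap preserves insertion order, so copying from an
// already-sorted source keeps iteration sorted with hash-based lookups.
public class SortedInsertionOrder {
  static Map<String, Integer> buildFieldMap(Map<String, Integer> sortedSource) {
    // Insertion order == sorted order, because the source map is sorted.
    return new LinkedHashMap<>(sortedSource);
  }

  public static void main(String[] args) {
    Map<String, Integer> sorted = new TreeMap<>();
    sorted.put("body", 2);
    sorted.put("id", 0);
    sorted.put("title", 1);
    Map<String, Integer> fields = buildFieldMap(sorted);
    // Iteration order remains sorted: body, id, title
    System.out.println(String.join(",", fields.keySet()));
  }
}
```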
LUCENE-8041 by David Smiley (@dsmiley) on Nov 06 2017, resolved Dec 03 2019
Attachments: LUCENE-8041.patch, LUCENE-8041-LinkedHashMap.patch
Linked issues:
The IndexWriter check for too many documents does not always work, so an index can go over the limit. Once this happens, Lucene refuses to open the index and throws a CorruptIndexException: Too many documents.
This appears to affect all versions of Lucene/Solr (the check was first implemented in LUCENE-5843 in v4.9.1/4.10, and we've seen this manifest in 4.10).
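A limit check like this can be exceeded when the check and the add are not atomic: two threads can both pass the check and together push the count over the maximum. The sketch below (illustrative only, not Lucene's actual code) shows one way to make the reservation atomic, rolling back when the limit would be exceeded.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical atomic document-limit reservation. MAX_DOCS and tryReserve
// are illustrative names, not Lucene's API.
public class DocLimit {
  static final long MAX_DOCS = 5;
  private final AtomicLong pendingDocs = new AtomicLong();

  // Atomically reserve room for numDocs; undo and fail if over the limit.
  boolean tryReserve(long numDocs) {
    long count = pendingDocs.addAndGet(numDocs);
    if (count > MAX_DOCS) {
      pendingDocs.addAndGet(-numDocs); // roll back the reservation
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    DocLimit limit = new DocLimit();
    System.out.println(limit.tryReserve(4)); // true: 4 <= 5
    System.out.println(limit.tryReserve(2)); // false: 6 > 5, rolled back
    System.out.println(limit.tryReserve(1)); // true: rollback left room
  }
}
```

Because `addAndGet` both checks and claims the slots in one atomic step, no interleaving of threads can leave the count above MAX_DOCS.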
LUCENE-8043 by Yonik Seeley (@yonik) on Nov 07 2017, resolved Dec 04 2017
Attachments: LUCENE-8043.patch (versions: 5), YCS_IndexTest7a.java
The current "distance" method for a path returns a distance computed along the path and then perpendicular to the path. But, at least in the case of paths, it is often preferable to compute a "delta" distance, which would be the minimum straight-line distance assuming a diversion to reach the provided point.
A similar "distance delta" for a circle would be defined as returning a number exactly the same as is currently returned, with the understanding that the point given would be the destination and not a new waypoint. Similarly, the distance beyond the end of a path to the provided point would be counted only once, while the distance before the beginning of the path would be counted twice (one leg to the point, and the other leg back from that point to the start point (or nearest path point, if closer).
This obviously must be implemented in a backwards-compatible fashion.
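The "delta" idea can be illustrated for a single planar path leg A→B: the extra travel incurred by diverting through a point P is dist(A,P) + dist(P,B) − dist(A,B). This is a hypothetical 2D sketch for intuition only; the real Geo3d implementation would work on the sphere/ellipsoid, and `deltaDistance` is an invented name.

```java
// Planar illustration of the diversion cost of visiting a point P
// between the endpoints A and B of one path leg.
public class PathDelta {
  static double dist(double ax, double ay, double bx, double by) {
    return Math.hypot(bx - ax, by - ay);
  }

  // Extra distance added by diverting through (px,py) on the leg
  // (ax,ay) -> (bx,by).
  static double deltaDistance(double ax, double ay, double bx, double by,
                              double px, double py) {
    return dist(ax, ay, px, py) + dist(px, py, bx, by) - dist(ax, ay, bx, by);
  }

  public static void main(String[] args) {
    // A point already on the segment adds no extra distance...
    System.out.println(deltaDistance(0, 0, 4, 0, 2, 0)); // 0.0
    // ...while a point off the path adds a positive diversion cost.
    System.out.println(deltaDistance(0, 0, 4, 0, 2, 3));
  }
}
```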
LUCENE-8039 by Karl Wright on Nov 06 2017, resolved Nov 08 2017
Javadocs allow codecs to not store some index statistics. Given discussion that occurred on LUCENE-4100, this was mostly done to support pre-flex codecs. We should now require that all codecs store these statistics.
LUCENE-8007 by Adrien Grand (@jpountz) on Oct 24 2017, resolved Nov 03 2017
Attachments: LUCENE-8007.patch (versions: 5)
Hi Team,
We are seeing solr doc id missing error in our production cluster. Below is the snap from solr log.
ERROR - 2017-10-03 08:32:08.715; org.apache.solr.common.SolrException; auto commit error...:java.lang.RuntimeException: java.io.FileNotFoundException: Requested file hdfs:/solr/<path>/core_node5/data/index/_ect.fdt does not exist.
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:258)
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:238)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:234)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:175)
at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:292)
at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2005)
at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1969)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2999)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3104)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3071)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: Requested file hdfs:/solr/<path>/core_node5/data/index/_ect.fdt does not exist.
at com.mapr.fs.MapRFileSystem.getMapRFileStatus(MapRFileSystem.java:1374)
at com.mapr.fs.MapRFileSystem.getFileStatus(MapRFileSystem.java:1031)
at org.apache.solr.store.hdfs.HdfsFileReader.getLength(HdfsFileReader.java:94)
at org.apache.solr.store.hdfs.HdfsDirectory.fileLength(HdfsDirectory.java:148)
at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:143)
at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:219)
at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:478)
at org.apache.lucene.index.TieredMergePolicy$SegmentByteSizeDescending.compare(TieredMergePolicy.java:248)
LUCENE-8002 by Jerry Richard on Oct 23 2017