Comments (8)
I just pushed the change, thanks @mikemccand for putting me on the right track.
from lucene.
OK I think the issue here may be that Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.
And the codecs (default and Direct) clearly don't do a good job throwing a clear exception when that is violated :)
In addition to the default Codec, DirectPostingsFormat is also angry, using this repro:
./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestTerms.testTermMinMaxRandom" -Ptests.jvms=4 -Ptests.jvmargs= -Ptests.seed=C8D1EBB5035DA9F -Ptests.multiplier=2 -Ptests.badapples=false -Ptests.gui=true -Ptests.file.encoding=US-ASCII -Ptests.vectorsize=128
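The precondition described above (startTerm must be accepted by the automaton) can be illustrated with a small fail-fast check. This is only a sketch in plain Java, not Lucene code: `StartTermGuard` and `requireAccepted` are hypothetical names, and a `Predicate<byte[]>` stands in for the automaton's acceptance test (in Lucene, something like `ByteRunAutomaton.run` would presumably play that role).

```java
import java.util.Arrays;
import java.util.function.Predicate;

// Hypothetical helper, not part of Lucene: rejects a startTerm that the
// automaton does not accept, with a descriptive message.
class StartTermGuard {

    // `accepts` is a stand-in for the automaton's acceptance test.
    static void requireAccepted(Predicate<byte[]> accepts, byte[] startTerm) {
        // A null startTerm means "start from the beginning", so nothing to check.
        if (startTerm != null && !accepts.test(startTerm)) {
            throw new IllegalArgumentException(
                "startTerm " + Arrays.toString(startTerm)
                    + " is not accepted by the automaton");
        }
    }

    public static void main(String[] args) {
        // Stand-in for an automaton that accepts every non-empty byte sequence.
        Predicate<byte[]> nonEmpty = bytes -> bytes.length > 0;

        requireAccepted(nonEmpty, new byte[] {'l'}); // accepted, no exception

        try {
            requireAccepted(nonEmpty, new byte[0]); // empty term is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A check like this at the call site would turn a confusing failure deep inside a codec into an immediate, readable error.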
from lucene.
I'll try to fix CheckIndex so that it only uses startTerm that is accepted by the automaton.
from lucene.
> Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.
I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?
from lucene.
Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. Something like this should fix the problem?
diff --git a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
index a555ce40001..f899b331b92 100644
--- a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
+++ b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
@@ -2318,7 +2318,7 @@ public final class CheckIndex implements Closeable {
startTerm = new BytesRef();
checkTermsIntersect(terms, automaton, startTerm);
- automaton = Automata.makeAnyBinary();
+ automaton = Automata.makeNonEmptyBinary();
startTerm = new BytesRef(new byte[] {'l'});
checkTermsIntersect(terms, automaton, startTerm);
@@ -2369,8 +2369,8 @@ public final class CheckIndex implements Closeable {
throws IOException {
TermsEnum allTerms = terms.iterator();
automaton = Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
- CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton);
- ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton);
+ CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton, false, true, true);
+ ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton, true);
TermsEnum filteredTerms = terms.intersect(compiledAutomaton, startTerm);
BytesRef term;
if (startTerm != null) {
(I had to change the automaton so that it's still considered of type "normal" and not "all")
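To make the binary versus UTF32 distinction concrete, here is a minimal plain-Java sketch (no Lucene dependency) of how the same transition label maps to different byte sequences under the two interpretations. `LabelInterpretation`, `asUtf32Label` and `asBinaryLabel` are hypothetical names for illustration only; Lucene's actual conversion operates on whole automata rather than single labels.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: one automaton label, two possible meanings.
class LabelInterpretation {

    // UTF32 view: the label is a Unicode code point, matched against its
    // UTF-8 encoding in the term bytes.
    static byte[] asUtf32Label(int label) {
        return new String(Character.toChars(label)).getBytes(StandardCharsets.UTF_8);
    }

    // Binary view: the label is a single raw byte value in [0, 255].
    static byte[] asBinaryLabel(int label) {
        if (label < 0 || label > 0xFF) {
            throw new IllegalArgumentException("not a byte value: " + label);
        }
        return new byte[] {(byte) label};
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // For ASCII labels such as 'l' (0x6C) both views agree on one byte.
        System.out.println(hex(asUtf32Label('l')) + " vs " + hex(asBinaryLabel('l')));
        // For 0xE9 they differ: the UTF32 view yields two bytes (C3 A9),
        // the binary view yields the single byte E9.
        System.out.println(hex(asUtf32Label(0xE9)) + " vs " + hex(asBinaryLabel(0xE9)));
    }
}
```

The two views only coincide for labels below 0x80, which is why an automaton built over raw bytes has to be flagged as binary explicitly.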
from lucene.
> > Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.
> I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?
Oh, you're right! I missed that Automata.makeAnyBinary() there!
> Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?
Oh, you are also right! Specifically CompiledAutomaton assumes it's UTF32 and needs conversion to UTF8, unless you pass isBinary=true. OK I like your fix! I'll confirm it fixes the DirectPostingsFormat failure too.
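One consequence of the UTF32-to-UTF8 conversion mentioned here is that the converted automaton can only ever match valid UTF-8 byte sequences, while index terms may be arbitrary bytes. The sketch below uses only the JDK to check which byte sequences are valid UTF-8; `Utf8Validity` and `isValidUtf8` are hypothetical names, and this is an illustration of the mismatch rather than anything Lucene itself runs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustration only: which binary terms fall outside valid UTF-8?
class Utf8Validity {

    // Returns true if the bytes decode as UTF-8 without any malformed input.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ASCII terms are valid UTF-8.
        System.out.println(isValidUtf8(new byte[] {'l'}));
        // A lone 0xE9 is a lead byte with no continuation: not valid UTF-8.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE9}));
        // 0xFF never appears in UTF-8 at all.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xFF}));
    }
}
```

Terms for which this returns false can be matched by a binary automaton but not by one that went through the UTF-8 conversion, so leaving out the binary flag silently narrows what an "any binary" automaton accepts.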
from lucene.
OK the DirectPostingsFormat failure is also happy with this fix. +1 to merge. Thanks @jpountz!
from lucene.
Not sure I did so much "putting on the right path" :) More like "getting randomly confused around the right area" thus inspiring @jpountz to look more closely :)
from lucene.