Comments (8)

jpountz commented on September 28, 2024

I just pushed the change, thanks @mikemccand for putting me on the right track.

from lucene.

mikemccand commented on September 28, 2024

OK I think the issue here may be that Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.

And the codecs (default and Direct) clearly don't do a good job throwing a clear exception when that is violated :)

In addition to the default Codec, DirectPostingsFormat is also angry, using this repro:

./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestTerms.testTermMinMaxRandom" -Ptests.jvms=4 -Ptests.jvmargs= -Ptests.seed=C8D1EBB5035DA9F -Ptests.multiplier=2 -Ptests.badapples=false -Ptests.gui=true -Ptests.file.encoding=US-ASCII -Ptests.vectorsize=128

from lucene.

mikemccand commented on September 28, 2024

I'll try to fix CheckIndex so that it only uses startTerm that is accepted by the automaton.

from lucene.

jpountz commented on September 28, 2024

> Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.

I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?

from lucene.

jpountz commented on September 28, 2024

Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?

diff --git a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
index a555ce40001..f899b331b92 100644
--- a/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
+++ b/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
@@ -2318,7 +2318,7 @@ public final class CheckIndex implements Closeable {
         startTerm = new BytesRef();
         checkTermsIntersect(terms, automaton, startTerm);
 
-        automaton = Automata.makeAnyBinary();
+        automaton = Automata.makeNonEmptyBinary();
         startTerm = new BytesRef(new byte[] {'l'});
         checkTermsIntersect(terms, automaton, startTerm);
 
@@ -2369,8 +2369,8 @@ public final class CheckIndex implements Closeable {
       throws IOException {
     TermsEnum allTerms = terms.iterator();
     automaton = Operations.determinize(automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
-    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton);
-    ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton);
+    CompiledAutomaton compiledAutomaton = new CompiledAutomaton(automaton, false, true, true);
+    ByteRunAutomaton runAutomaton = new ByteRunAutomaton(automaton, true);
     TermsEnum filteredTerms = terms.intersect(compiledAutomaton, startTerm);
     BytesRef term;
     if (startTerm != null) {

(I had to change the automaton so that it's still considered of type "normal" and not "all")

from lucene.

mikemccand commented on September 28, 2024

> > Terms.intersect(Automaton a, BytesRef startTerm) requires that startTerm is accepted by the incoming automaton, yet the way CheckIndex is calling it can clearly violate that.
>
> I wondered about that, but the automaton is Automata.makeAnyBinary(), shouldn't it accept any term?

Oh, you're right! I missed that Automata.makeAnyBinary() there!

> Oh I see, I created binary automata, but the API implicitly treats automata as UTF32 automata, so you need to tell it explicitly that it's a binary automaton. And something like that should fix the problem?

Oh, you are also right! Specifically, CompiledAutomaton assumes it's UTF32 and needs conversion to UTF8, unless you pass isBinary=true. OK I like your fix! I'll confirm it fixes the DirectPostingsFormat failure too.
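
To make the mismatch concrete, here is a rough, Lucene-free sketch of why a binary automaton misread as UTF32 stops matching arbitrary terms. It assumes that labels 0..255 read as code points get expanded to their UTF-8 encodings, so code points 128..255 become two-byte sequences. The class and method names (`BinaryVsUtf32Demo`, `acceptsAsBinary`, `acceptsAsUtf32`) are hypothetical and only model the behavior; they are not Lucene APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryVsUtf32Demo {

  // Hypothetical model of an "any binary" automaton used as intended:
  // each label 0..255 is a raw byte, so every byte sequence is accepted.
  static boolean acceptsAsBinary(byte[] term) {
    return true;
  }

  // Hypothetical model of the same automaton when its labels 0..255 are
  // read as Unicode code points and converted to UTF-8 (an assumption about
  // the conversion step): only byte sequences that are the UTF-8 encoding
  // of code points <= 0xFF are accepted.
  static boolean acceptsAsUtf32(byte[] term) {
    String decoded = new String(term, StandardCharsets.UTF_8);
    // Malformed input decodes to U+FFFD, so re-encoding no longer matches.
    if (!Arrays.equals(decoded.getBytes(StandardCharsets.UTF_8), term)) {
      return false;
    }
    return decoded.codePoints().allMatch(cp -> cp <= 0xFF);
  }

  public static void main(String[] args) {
    byte[] ascii = {'l'};                           // ASCII term
    byte[] rawHighByte = {(byte) 0xFF};             // arbitrary binary term
    byte[] encodedFF = {(byte) 0xC3, (byte) 0xBF};  // UTF-8 for U+00FF

    for (byte[] term : new byte[][] {ascii, rawHighByte, encodedFF}) {
      System.out.println(Arrays.toString(term)
          + " binary=" + acceptsAsBinary(term)
          + " utf32=" + acceptsAsUtf32(term));
    }
  }
}
```

Under this model the two readings agree on ASCII terms and diverge as soon as a term contains a byte that is not part of a valid UTF-8 encoding of a code point up to 0xFF, which is the kind of input random binary terms can produce.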

from lucene.

mikemccand commented on September 28, 2024

OK the DirectPostingsFormat failure is also happy with this fix. +1 to merge. Thanks @jpountz!

from lucene.

mikemccand commented on September 28, 2024

Not sure I did so much "putting on the right path" :) More like "getting randomly confused around the right area" thus inspiring @jpountz to look more closely :)

from lucene.
