Comments (31)
I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?
Yes, this one is very important actually, you get 3 states with Operations.repeat(Automata.makeAnyChar())
and one is unreachable.
I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.
lemme try adding your test, that is helpful as I was having a tough time coming up with "varieties" to test. I will take a stab at it. It would be great to fix the javadoc to not require minimization to call this function.
from lucene.
In a similar vein, I wonder if RegExp
could create more efficient automata out of the box. For instance, it looks like Operations#repeat
could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*")
create an automaton that returns true
for Operations#isTotal
.
from lucene.
@ChrisHegarty implementation of isTotal()
method requires a minimal DFA. If the automaton is not minimal, it may return false but it should not create a problem.
This is the only place that isTotal()
is called in lucene (see the comment): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java#L181
If you really need to minimize here, can you use something like this as a workaround? https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java#L338-L345
Sorry, I havent thought about this isTotal
much to see if there is a more reasonable implementation, just need to think it over.
If we need to improve isTotal
, it is definitely not necessary to minimize, e.g. the following only requires determinization + removal of dead states
boolean isTotal(Automaton a) {
return sameLanguage(a, Automata.makeAnyString());
}
from lucene.
Here's a round two, to prevent any error on NFA or having transitions to dead states:
diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..10103628fad 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,14 +864,25 @@ public final class Operations {
/**
* Returns true if the given automaton accepts all strings for the specified min/max range of the
- * alphabet. The automaton must be minimized.
+ * alphabet. The automaton must be deterministic with no transitions to dead states.
*/
public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
- if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+ // minimal case
+ if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
Transition t = new Transition();
a.getTransition(0, 0, t);
return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
}
+ // deterministic case
+ if (a.isDeterministic() && hasDeadStatesFromInitial(a) == false) {
+ Automaton a2 = new Automaton();
+ int s = a2.createState();
+ a2.setAccept(s, true);
+ a2.addTransition(s, s, minAlphabet, maxAlphabet);
+ a2.finishState();
+ return sameLanguage(a, a2);
+ }
+ // NFA, or has transitions to dead states, return false
return false;
}
from lucene.
I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.
case REGEXP_REPEAT: + if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) { + return Automata.makeAnyString(); + } else { a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider)); + } break;
let's fix the regexp parser first? It is easier to reason about and less scary than stuff like isTotal
and subsetOf
.
Previously, regexp parser was calling minimize()
on every...parsing...step. I removed this because it is obviously not good. But if we have a simple fix to make it emit better automaton for practical uses, let's do it?
from lucene.
I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states), but we have some helpers for that already in Operations?
from lucene.
Like so:
Automaton c =
Operations.repeat(
Operations.union(
List.of(
Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
Automata.makeChar(101),
Automata.makeChar(102),
Automata.makeChar(103),
Automata.makeCharRange(104, Character.MAX_CODE_POINT))));
from lucene.
I don't think it is too bad because transitions are already sorted and collapsed for each state when you call finishState()
. For such an automaton you paid the price at construction time :)
But when you "iterate transitions" in order (0..numTransitions) to resolve a state, you are walking them in sorted order.
from lucene.
@dweiss i coded your idea up like this:
/**
* Returns true if the given automaton accepts all strings for the specified min/max range of the
* alphabet.
*/
public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
BitSet states = getLiveStates(a);
Transition spare = new Transition();
for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
// all reachable states must be accept states
if (a.isAccept(state) == false) return false;
// all reachable states must contain transitions covering minAlphabet-maxAlphabet
int previousLabel = minAlphabet - 1;
for (int transition = 0; transition < a.getNumTransitions(state); transition++) {
a.getTransition(state, transition, spare);
// no gaps are allowed
if (spare.min > previousLabel + 1) return false;
previousLabel = spare.max;
}
if (previousLabel < maxAlphabet) return false;
if (state == Integer.MAX_VALUE) {
break; // or (state+1) would overflow
}
}
// we've checked all the states, if its non-empty, its total
return a.getNumStates() > 0;
}
Only surprise was the last line, so the logic is:
- all reachable states must be accept states
- all reachable states must contain transitions covering the min..max range
- automaton must not be empty
from lucene.
@mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code.
from lucene.
Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting from the root and a full character range and see if everything in that range is always accepted... Could be fun, implementation-wise.
Looking at the examples, I wonder if this has to be a strict optimization - maybe early checking for common regexp values (.*) would be sufficient and everything else would just run as an automaton (optimized or not)?
If this isn't sufficient then I think you'll have to restore the minimization algorithm on ES side.
from lucene.
This is just what i'm mulling over now, relaxing isTotal
to no longer require a minimal DFA:
diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..0de4ac013ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,15 +864,22 @@ public final class Operations {
/**
* Returns true if the given automaton accepts all strings for the specified min/max range of the
- * alphabet. The automaton must be minimized.
+ * alphabet. The automaton must be deterministic with no transitions to dead states.
*/
public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
- if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+ // minimal case
+ if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
Transition t = new Transition();
a.getTransition(0, 0, t);
return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
}
- return false;
+ // deterministic case
+ Automaton a2 = new Automaton();
+ int s = a2.createState();
+ a2.setAccept(s, true);
+ a2.addTransition(s, s, minAlphabet, maxAlphabet);
+ a2.finishState();
+ return sameLanguage(a, a2);
}
/**
edit: struggles :)
from lucene.
In a similar vein, I wonder if
RegExp
could create more efficient automata out of the box. For instance, it looks likeOperations#repeat
could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn makeRegExp(".*")
create an automaton that returnstrue
forOperations#isTotal
.
I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.
case REGEXP_REPEAT:
+ if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+ return Automata.makeAnyString();
+ } else {
a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+ }
break;
from lucene.
@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.
Separately, I might also make sense to improve RegExp, as suggested earlier in this issue.
from lucene.
@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.
need to stare at it some more. I don't like that it uses some stuff such as sameLanguage
which has a TODO to move to tests/test-framework and is currently only used by tests.
And since we are checking for "total", we definitely don't need to do subsetOf twice: subsetOf(a2, a1) && subsetOf(a1, a2)
.
and I don't like that subsetOf
is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.
from lucene.
@ChrisHegarty I created draft PR, but I am still not happy with it yet.
from lucene.
See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff.
public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
+ // allows some "fuzziness" to detect non-minimal forms such as those from RegExp
if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
Transition t = new Transition();
a.getTransition(0, 0, t);
- return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
+ int state = t.dest;
+ if (t.min == minAlphabet
+ && t.max == maxAlphabet
+ && a.isAccept(state)
+ && a.getNumTransitions(state) == 1) {
+ a.getTransition(state, 0, t);
+ return t.dest == state && t.min == minAlphabet && t.max == maxAlphabet;
+ }
}
return false;
}
from lucene.
I like the brevity of using sameLanguage! :)
I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each state reachable from root (there may be more than one transition but they must cover the full minAlphabet..maxAlphabet range, always ending in an accepting state somewhere.
If so, it should be possible to implement isTotal as a full traversal of the automaton in O(num states)? So something like this would also return true:
Automaton c =
Operations.repeat(
Operations.union(
Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
Automata.makeCharRange(101, Character.MAX_CODE_POINT)));
System.out.println(c.toDot());
from lucene.
I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?
As far as your counterexample, it is actually difficult to create such an automaton, you managed it with union! e,g, if you just create a state and add several ranges instead of one big range, they will be collapsed into one single range when you finishState
: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L232
That's why I thought, there is something to be said for a very simple, constant-time check that will be practical as opposed to perfect: it will work for the stuff coming out of regex parser, or for "normal" stuff coming from the api (e.g. repeat). for that it needs to check two states (or same state twice) instead of one.
But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.
from lucene.
I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?
I agree - I don't think it's a good practical replacement solution, but it's a very elegant theoretical one. :)
But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.
I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.
The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states),
I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?
from lucene.
You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a state actually covers the full min..max range, with no gaps, could be costly. The only thing I can think of is sorting transition ranges and checking for continuity (and min/max)... this may get expensive.
Whether such unrealistic automata can ever occur in reality (as a product of the set of operations we make available) is another question...
from lucene.
from lucene.
- automaton must not be empty
where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :)
from lucene.
Yes, I like it!
I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence for deterministic automata - just what you mentioned at the beginning. Two equivalent (deterministic) automata must accept the same set of symbols from any state reachable for any input starting at the initial state. The automaton we compare against just happens to be repeat(anyCharacter()), so in any reachable state of automaton A we compare against the only state in automaton B - a self-connected state accepting all symbols. Consistent with the conditions you mentioned.
I'm glad this worked out to be so elegant and thank you for the implementation.
from lucene.
💙
from lucene.
and I don't like that
subsetOf
is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.
I am trying to review this PR, but got distracted / rat-holed into this statement lol:
Is subsetOf
really quadratic? I know its javadoc says so, but I'm not convinced. Yes it has embedded for loops (one for number of transitions leaving n1
, the other for number of transitions leaving n2
), but that is basically doing a merge sort so its cost is linear O(max(n1, n2))
.
In total I think its cost is actually O(totalTransitionCountInA1
)? Anyway, this is a (fun) distraction ... I'll try to actually review the change ;)
from lucene.
I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say).
+1, heh.
from lucene.
- automaton must not be empty
Ahh that last line was sneaky :)
from lucene.
Is
subsetOf
really quadratic? I know its javadoc says so, but I'm not convinced.
I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException
, or in some cases AssertionError
(if asserts are enabled, if not... UB?).
For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708
from lucene.
Is
subsetOf
really quadratic? I know its javadoc says so, but I'm not convinced.I think this is only one of the scarier parts about it. The other scary part is that it may throw
IllegalArgumentException
, or in some casesAssertionError
(if asserts are enabled, if not... UB?).For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708
Yeah +1 to move it to test only!
from lucene.
Closed by #13707
from lucene.
Related Issues (20)
- TestTieredMergePolicy.testSimulateUpdates fails intermittently asserting segment infos
- Find a way to remove IndexSearcher#search(Query query, CollectorOwner collectorOwner) before 10.0 HOT 10
- Link and Code ANN algorithm from FaceBook in Lucene?
- [DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? HOT 3
- Eclipse - one or more cycles were detected HOT 4
- ComplexPhraseQueryParser parses wrongly when special characters found
- Get field number from the SegmentReadState when get norms
- Measure whether graph is strongly connected HOT 3
- Gradle build sometimes gives spurious "unreferenced license file" warnings HOT 21
- Upgrade to gradle 8.10 HOT 4
- Allow MultiLeafKnnCollector.greediness to be configurable HOT 4
- Narrow polygons close to latitude 90 do not match any points
- Experiment with branchless binary-search for numeric range faceting HOT 1
- Operations#sameLanguage has true negatives? HOT 3
- Move anonymous Weight implementation in PointRangeQuery to named class
- CompiledAutomaton not equal when NFARunAutomaton is non-null HOT 9
- `count()` in LRUQueryCache HOT 2
- simplify checkWorkingCopyClean to make backporting easier? HOT 11
- Reproducible test failure in TestTaxonomyFacetAssociations.testFloatSumAssociation -- ULP float issue? HOT 6
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from lucene.