There are a number of optimization in Elasticsearch that depend upon the automaton fro

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Like so: <div class="snippet-clipboard-content notranslate position-relative overf

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

RegExp::toAutomaton no longer minimizes,about apache/lucene

Comments (31)

rmuir commented on September 13, 2024 2

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

Yes, this one is very important actually, you get 3 states with Operations.repeat(Automata.makeAnyChar()) and one is unreachable.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

lemme try adding your test, that is helpful as I was having a tough time coming up with "varieties" to test. I will take a stab at it. It would be great to fix the javadoc to not require minimization to call this function.

from lucene.

jpountz commented on September 13, 2024 1

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

from lucene.

rmuir commented on September 13, 2024 1

@ChrisHegarty implementation of isTotal() method requires a minimal DFA. If the automaton is not minimal, it may return false but it should not create a problem.

This is the only place that isTotal() is called in lucene (see the comment): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java#L181

If you really need to minimize here, can you use something like this as a workaround? https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/automaton/AutomatonTestUtil.java#L338-L345

Sorry, I havent thought about this isTotal much to see if there is a more reasonable implementation, just need to think it over.

If we need to improve isTotal, it is definitely not necessary to minimize, e.g. the following only requires determinization + removal of dead states

boolean isTotal(Automaton a) {
  return sameLanguage(a, Automata.makeAnyString());
}

from lucene.

rmuir commented on September 13, 2024 1

Here's a round two, to prevent any error on NFA or having transitions to dead states:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..10103628fad 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,14 +864,25 @@ public final class Operations {
 
   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
+    // deterministic case
+    if (a.isDeterministic() && hasDeadStatesFromInitial(a) == false) {
+      Automaton a2 = new Automaton();
+      int s = a2.createState();
+      a2.setAccept(s, true);
+      a2.addTransition(s, s, minAlphabet, maxAlphabet);
+      a2.finishState();
+      return sameLanguage(a, a2);
+    }
+    // NFA, or has transitions to dead states, return false
     return false;
   }

from lucene.

rmuir commented on September 13, 2024 1

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;

let's fix the regexp parser first? It is easier to reason about and less scary than stuff like isTotal and subsetOf.

Previously, regexp parser was calling minimize() on every...parsing...step. I removed this because it is obviously not good. But if we have a simple fix to make it emit better automaton for practical uses, let's do it?

from lucene.

rmuir commented on September 13, 2024 1

I think the "full traversal" suggested by Dawid here would be very fast. The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states), but we have some helpers for that already in Operations?

from lucene.

dweiss commented on September 13, 2024 1

Like so:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  List.of(
                      Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                      Automata.makeChar(101),
                      Automata.makeChar(102),
                      Automata.makeChar(103),
                      Automata.makeCharRange(104, Character.MAX_CODE_POINT))));

from lucene.

rmuir commented on September 13, 2024 1

I don't think it is too bad because transitions are already sorted and collapsed for each state when you call finishState(). For such an automaton you paid the price at construction time :)

But when you "iterate transitions" in order (0..numTransitions) to resolve a state, you are walking them in sorted order.

from lucene.

rmuir commented on September 13, 2024 1

@dweiss i coded your idea up like this:

  /** 
   * Returns true if the given automaton accepts all strings for the specified min/max range of the
   * alphabet. 
   */   
  public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
    BitSet states = getLiveStates(a);
    Transition spare = new Transition();
    for (int state = states.nextSetBit(0); state >= 0; state = states.nextSetBit(state + 1)) {
      // all reachable states must be accept states
      if (a.isAccept(state) == false) return false;
      // all reachable states must contain transitions covering minAlphabet-maxAlphabet
      int previousLabel = minAlphabet - 1;
      for (int transition = 0; transition < a.getNumTransitions(state); transition++) {
        a.getTransition(state, transition, spare);
        // no gaps are allowed
        if (spare.min > previousLabel + 1) return false;
        previousLabel = spare.max;
      }
      if (previousLabel < maxAlphabet) return false;
      if (state == Integer.MAX_VALUE) {
        break; // or (state+1) would overflow
      }
    }
    // we've checked all the states, if its non-empty, its total
    return a.getNumStates() > 0;
  }

Only surprise was the last line, so the logic is:

all reachable states must be accept states
all reachable states must contain transitions covering the min..max range
automaton must not be empty

from lucene.

ChrisHegarty commented on September 13, 2024

@mikemccand @rmuir - your thoughts here would be helpful, since I'm less familiar with this area of code.

from lucene.

dweiss commented on September 13, 2024

Minimization is a sure way to prove an automaton accepts all input strings because then the isTotal check is trivial [1]. You could try to trace all possible transitions, starting from the root and a full character range and see if everything in that range is always accepted... Could be fun, implementation-wise.

Looking at the examples, I wonder if this has to be a strict optimization - maybe early checking for common regexp values (.*) would be sufficient and everything else would just run as an automaton (optimized or not)?

If this isn't sufficient then I think you'll have to restore the minimization algorithm on ES side.

[1] https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/BasicOperations.java#L583-L590

from lucene.

rmuir commented on September 13, 2024

This is just what i'm mulling over now, relaxing isTotal to no longer require a minimal DFA:

diff --git a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
index 2052b1c50bf..0de4ac013ee 100644
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java
@@ -864,15 +864,22 @@ public final class Operations {
 
   /**
    * Returns true if the given automaton accepts all strings for the specified min/max range of the
-   * alphabet. The automaton must be minimized.
+   * alphabet. The automaton must be deterministic with no transitions to dead states.
    */
   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
-    if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
+    // minimal case
+    if (a.getNumStates() == 1 && a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
       return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
     }
-    return false;
+    // deterministic case
+    Automaton a2 = new Automaton();
+    int s = a2.createState();
+    a2.setAccept(s, true);
+    a2.addTransition(s, s, minAlphabet, maxAlphabet);
+    a2.finishState();
+    return sameLanguage(a, a2);
   }
 
   /**

edit: struggles :)

from lucene.

ChrisHegarty commented on September 13, 2024

In a similar vein, I wonder if RegExp could create more efficient automata out of the box. For instance, it looks like Operations#repeat could be optimized in the case when there is a single transition to directly create a minimal automaton, and this would in-turn make RegExp(".*") create an automaton that returns true for Operations#isTotal.

I had a similar thought. Looking at the code it kinda looks a little tacky, but also kinda makes sone sense, e.g.

      case REGEXP_REPEAT:
+        if (exp1.kind == Kind.REGEXP_ANYCHAR && automaton_provider == null) {
+          return Automata.makeAnyString();
+        } else {
          a = Operations.repeat(exp1.toAutomaton(automata, automaton_provider));
+       }
        break;

from lucene.

ChrisHegarty commented on September 13, 2024

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

Separately, I might also make sense to improve RegExp, as suggested earlier in this issue.

from lucene.

rmuir commented on September 13, 2024

@rmuir Just skimming your update to isTotal, it looks good. I think that it will be more generally useful, given that we minimize less now.

need to stare at it some more. I don't like that it uses some stuff such as sameLanguage which has a TODO to move to tests/test-framework and is currently only used by tests.

And since we are checking for "total", we definitely don't need to do subsetOf twice: subsetOf(a2, a1) && subsetOf(a1, a2).

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

from lucene.

rmuir commented on September 13, 2024

@ChrisHegarty I created draft PR, but I am still not happy with it yet.

from lucene.

rmuir commented on September 13, 2024

See PR: #13707, I took a different approach which solves the practical problem without doing scary stuff.

   public static boolean isTotal(Automaton a, int minAlphabet, int maxAlphabet) {
+    // allows some "fuzziness" to detect non-minimal forms such as those from RegExp
     if (a.isAccept(0) && a.getNumTransitions(0) == 1) {
       Transition t = new Transition();
       a.getTransition(0, 0, t);
-      return t.dest == 0 && t.min == minAlphabet && t.max == maxAlphabet;
+      int state = t.dest;
+      if (t.min == minAlphabet
+          && t.max == maxAlphabet
+          && a.isAccept(state)
+          && a.getNumTransitions(state) == 1) {
+        a.getTransition(state, 0, t);
+        return t.dest == state && t.min == minAlphabet && t.max == maxAlphabet;
+      }
     }
     return false;
   }

from lucene.

dweiss commented on September 13, 2024

I like the brevity of using sameLanguage! :)

I keep trying to find a counterexample to the assertion that a deterministic, total automaton must accept full language in each state reachable from root (there may be more than one transition but they must cover the full minAlphabet..maxAlphabet range, always ending in an accepting state somewhere.

If so, it should be possible to implement isTotal as a full traversal of the automaton in O(num states)? So something like this would also return true:

      Automaton c =
          Operations.repeat(
              Operations.union(
                  Automata.makeCharRange(Character.MIN_CODE_POINT, 100),
                  Automata.makeCharRange(101, Character.MAX_CODE_POINT)));

      System.out.println(c.toDot());

from lucene.

rmuir commented on September 13, 2024

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

As far as your counterexample, it is actually difficult to create such an automaton, you managed it with union! e,g, if you just create a state and add several ranges instead of one big range, they will be collapsed into one single range when you finishState: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Automaton.java#L232

That's why I thought, there is something to be said for a very simple, constant-time check that will be practical as opposed to perfect: it will work for the stuff coming out of regex parser, or for "normal" stuff coming from the api (e.g. repeat). for that it needs to check two states (or same state twice) instead of one.

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

from lucene.

dweiss commented on September 13, 2024

I like the sameLanguage too, but I don't like the potential quadratic cost, considering we currently expect the calculation to be fast, and it is called on every automaton. I think it should be avoided in production code?

I agree - I don't think it's a good practical replacement solution, but it's a very elegant theoretical one. :)

But if you are able to implement it in linear time that solves all the cases, that would be great, let's do that instead.

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say). I'll add it to my todo list, it seems like a fun little project, although finding the time is difficult.

The annoying part is probably just the reachability (e.g. regex parser produces automatons with some unreachable states),

I don't think all states need to be considered - only those reachable from the initial state. Tracking which states have been checked already may add some overhead but even with this, it should be fast (enough)?

from lucene.

dweiss commented on September 13, 2024

You could probably create an automaton containing states with an insane number of outgoing transitions, for example one transition for each character... then resolving that such a state actually covers the full min..max range, with no gaps, could be costly. The only thing I can think of is sorting transition ranges and checking for continuity (and min/max)... this may get expensive.

Whether such unrealistic automata can ever occur in reality (as a product of the set of operations we make available) is another question...

from lucene.

rmuir commented on September 13, 2024

ce4cce0

from lucene.

rmuir commented on September 13, 2024

automaton must not be empty

where "empty" means, at least one reachable state. Automaton can have all dead states, and that doesn't make it total :)

from lucene.

dweiss commented on September 13, 2024

Yes, I like it!

I had some time to think about it before I went to bed and this implementation is actually a direct rollout of the definition of accepted language equivalence for deterministic automata - just what you mentioned at the beginning. Two equivalent (deterministic) automata must accept the same set of symbols from any state reachable for any input starting at the initial state. The automaton we compare against just happens to be repeat(anyCharacter()), so in any reachable state of automaton A we compare against the only state in automaton B - a self-connected state accepting all symbols. Consistent with the conditions you mentioned.

I'm glad this worked out to be so elegant and thank you for the implementation.

from lucene.

ChrisHegarty commented on September 13, 2024

💙

from lucene.

mikemccand commented on September 13, 2024

and I don't like that subsetOf is quadratic to the number of states: it is heavy stuff. There is probably a simpler algorithm.

I am trying to review this PR, but got distracted / rat-holed into this statement lol:

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced. Yes it has embedded for loops (one for number of transitions leaving n1, the other for number of transitions leaving n2), but that is basically doing a merge sort so its cost is linear O(max(n1, n2)).

In total I think its cost is actually O(totalTransitionCountInA1)? Anyway, this is a (fun) distraction ... I'll try to actually review the change ;)

from lucene.

mikemccand commented on September 13, 2024

I think the relaxation patch is fine as a short first step - it doesn't claim to be optimal (PnP, as Mike loves to say).

+1, heh.

from lucene.

mikemccand commented on September 13, 2024

automaton must not be empty

Ahh that last line was sneaky :)

from lucene.

rmuir commented on September 13, 2024

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708

from lucene.

mikemccand commented on September 13, 2024

Is subsetOf really quadratic? I know its javadoc says so, but I'm not convinced.

I think this is only one of the scarier parts about it. The other scary part is that it may throw IllegalArgumentException, or in some cases AssertionError (if asserts are enabled, if not... UB?).

For these reasons too, I wanted to avoid its usage in something that gets called e.g. by CompiledAutomaton and proposed moving it to AutomatonTestUtil for test purposes only: #13708

Yeah +1 to move it to test only!

from lucene.

rmuir commented on September 13, 2024

Closed by #13707

from lucene.

RegExp::toAutomaton no longer minimizes about lucene HOT 31 CLOSED

Comments (31)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent