
Comments (26)

alecthomas commented on August 26, 2024

BTW, thanks for writing this library. The built-in regexp library is great, but its limitations make converting regular expressions from other languages extremely difficult, or impossible. I was contemplating writing my own engine, but you've saved me a boatload of time :)

dlclark commented on August 26, 2024

I included the Rune methods for performance reasons, but I'm willing to optimize even more.

Can you give me an example pattern and content to match against for a Benchmark?

Decoding runes as we go could be great, but I'll have to consider right-to-left matching and backtracking.

alecthomas commented on August 26, 2024

I included the Rune methods for performance reasons, but I'm willing to optimize even more.

Understood, and I was very happy to find them once I figured out what was going on :)

Can you give me an example pattern and content to match against for a Benchmark?

Sure, I added a couple of benchmarks to a fork.

Here's a sample run on my machine:

BenchmarkTokenisationString-8   	       1	4039214647 ns/op
BenchmarkTokenisationRunes-8    	     100	  11870202 ns/op
PASS
ok  	github.com/dlclark/regexp2	5.278s

Interestingly (but I guess unsurprisingly), FindStringMatchStartingAt is actually slower than manually managing string slices myself.

dlclark commented on August 26, 2024

Thanks for the example, I've poked around at it.

In general, for "follow-on" matches the intended use is something like m, err := r.FindNextMatch(m), since that feels natural and reuses a cached version of the runes. In my tests this version gives very similar performance to your BenchmarkTokenisationRunes:

func BenchmarkTokenizationMatch(b *testing.B) {
	text := strings.Repeat("one two three four five six seven eight nine ten", 1000)

	r := MustCompile(`([a-z]+|\s+)`, 0)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		m, err := r.FindStringMatch(text)
		if err != nil {
			b.Fatalf("Error: %v", err)
		}

		for m != nil {
			m, err = r.FindNextMatch(m)
			if err != nil {
				b.Fatalf("Error: %v", err)
			}
		}
	}
}

In the case where you eventually iterate the full string to get every match then regexp2 is only about 15% slower than the built-in regexp library. I'm not against optimizing this, but I'm not sure I'll be able to close that gap because of all the additional features in regexp2.

However, the original issue still stands for those just wanting to get the first match. On my machine getting the first match from your example benchmark takes 450 ns/op with the regexp package and 61048 ns/op with regexp2 using strings or 584 ns/op using runes. Clearly pre-converting the entire large input to runes is a waste if you just want the first handful of characters as a match.

I'll investigate switching the engine to a "calculate runes as we go" design. The present design is a hold-over from the original .NET engine: strings in C# are stored as arrays of 16-bit characters, so they get the "rune" conversion for free (though the .NET regex engine doesn't always properly handle 3-byte or larger runes because of this).
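
As a rough illustration of the "calculate runes as we go" idea (a minimal sketch using only the standard library, not regexp2's actual engine code; forEachRune is a made-up name):

package main

import (
	"fmt"
	"unicode/utf8"
)

// forEachRune decodes one rune at a time instead of pre-converting the whole
// string to a []rune up front. visit can stop the scan early, which is where
// the savings for "first match only" callers would come from.
func forEachRune(s string, visit func(r rune) (stop bool)) {
	for pos := 0; pos < len(s); {
		r, size := utf8.DecodeRuneInString(s[pos:])
		if visit(r) {
			return
		}
		pos += size
	}
}

func main() {
	forEachRune("héllo", func(r rune) bool {
		fmt.Printf("%c ", r)
		return false // keep scanning
	})
}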

Once again, thanks for the issue report.

alecthomas commented on August 26, 2024

I did find FindNextMatch(), but it was not faster in my code; I'm not sure why. In the real code I'm matching multiple regexes against the same string. Could that be a factor?

dlclark commented on August 26, 2024

If you're using the same input text and matching against multiple regexes, it'll be faster to convert to runes in your code once and then use FindRunesMatch. That way you only do the rune conversion once instead of once per pattern. From there it's best practice to use FindNextMatch to iterate your way through the string for each match, as sketched below.

Unfortunately if you're iterating through all the matches in the input text then you'll end up paying the cost of the rune conversion no matter how you do it. You just want to make sure you don't pay that cost more than once.
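
A minimal sketch of that usage pattern (findAll is an illustrative helper, not part of regexp2): the input is converted to runes exactly once, and every pattern then works off that one slice via FindRunesMatch and FindNextMatch.

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

// findAll collects every match of every pattern while paying the rune
// conversion cost only once.
func findAll(text string, patterns []*regexp2.Regexp) ([]string, error) {
	input := []rune(text) // single rune conversion, shared by all patterns
	var out []string
	for _, re := range patterns {
		m, err := re.FindRunesMatch(input)
		for err == nil && m != nil {
			out = append(out, m.String())
			m, err = re.FindNextMatch(m) // reuses the cached runes
		}
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	patterns := []*regexp2.Regexp{
		regexp2.MustCompile(`\w+`, 0),
		regexp2.MustCompile(`\s+`, 0),
	}
	matches, _ := findAll("one two three", patterns)
	fmt.Printf("%q\n", matches)
}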

dlclark commented on August 26, 2024

Ok, so this would pretty much be a re-write of runner.go to switch to a RuneSeeker interface that wraps all access of the rune slice.

The annoying and more subtle change is that there are a lot of baked-in assumptions that we know the length of the input text. Some of these are early-exit optimizations that could be switched to a cheap "max string length" (just assuming ASCII text), and others are for hard boundary checks or for right-to-left support.

I think it's possible to get the performance gains mentioned, but it's a big change and there's a chance I'm missing some common case that will require scanning the string fully anyway. I'm also a bit worried that by putting the fundamental rune access behind an interface we'd lose the function-inlining optimization that the current implementation gets.

All that said, I'll poke around at a re-write of runner.go and see if I can get a sense of the size of the change and the benefits involved.
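
For concreteness, the interface being considered might look roughly like this (a hypothetical sketch; neither RuneSeeker nor these method names exist in regexp2):

// RuneSeeker wraps all input access: if the runner only ever reads input
// through it, the backing store is free to decode runes lazily instead of
// requiring a fully converted []rune up front.
type RuneSeeker interface {
	// RuneAt returns the rune at the given rune index, decoding on demand.
	RuneAt(i int) rune
	// MaxLength returns a cheap upper bound on the rune count (for example
	// len(string), which is exact for ASCII), usable for early-exit checks.
	MaxLength() int
}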

dlclark commented on August 26, 2024

I did a proof-of-concept implementation that represents the best-case scenario (as far as I can tell):

  • I cheated on the "length of the runes" and used the length of the string (which is wrong for multi-byte scenarios, but fine for a proof of concept)
  • I put all rune access behind a function that tries to hit a cache and, on a miss, calculates the rune and adds it to the cache
  • I pre-allocated the cache to the max size

The benchmark I added (long input string, single match) is ~80% faster with these changes. Unfortunately all the other existing benchmarks became anywhere from 10% to 115% slower. The extra checks and function-call overhead are just too large compared to a simple indexed array access.

If I fix the first issue (the length of the runes) by using utf8.RuneCountInString, then even the single improved benchmark is only ~30% faster. To avoid that function I'd have to totally redo the runner and prefix scanner so they don't need to know the last rune index, and it doesn't seem like that would be worth it.
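
In sketch form, the caching accessor described above looks something like this (illustrative only; the actual experiment is on the runecache branch and the names here are made up):

package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCache routes all rune access through runeAt, which decodes lazily on a
// cache miss and remembers everything it has decoded so far.
type runeCache struct {
	src   string
	runes []rune // decoded so far; capacity pre-allocated to the max size
	pos   int    // byte offset in src of the next undecoded rune
}

func newRuneCache(src string) *runeCache {
	// len(src) is an upper bound on the rune count (exact for ASCII).
	return &runeCache{src: src, runes: make([]rune, 0, len(src))}
}

// runeAt returns the rune at rune-index i, decoding forward on a cache miss.
func (c *runeCache) runeAt(i int) rune {
	for len(c.runes) <= i && c.pos < len(c.src) {
		r, size := utf8.DecodeRuneInString(c.src[c.pos:])
		c.runes = append(c.runes, r)
		c.pos += size
	}
	if i >= len(c.runes) {
		return -1 // past the end of the input
	}
	return c.runes[i]
}

func main() {
	c := newRuneCache("héllo")
	fmt.Println(string(c.runeAt(1))) // decodes only the first two runes
}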

You can check out the runecache branch to see if I'm missing something or if you have ideas to optimize further.

alecthomas commented on August 26, 2024

Hey Doug, thanks for taking the time to have a look, I appreciate it.

alecthomas commented on August 26, 2024

I know it's been a while, but I've tracked down exactly where the performance issue is with Chroma's approach of matching multiple regular expressions against the "next" token using FindRunesMatchStartingAt(re, cursor).

Given the text some text to tokenise and the tokenising regexes \s+ and \w+, if all patterns are being matched against the cursor position, this is what will occur:

  1. input: some text... re: \s+ (will scan the first 4 characters and match the 5th, but will be rejected as it's not at the beginning of the input)
  2. input: some text... re: \w+ will scan and match some

Step 1 is the problem. The obvious solution is to anchor the regex with ^, but this means that multi-line regexes will no longer match the start of a line. This is what I'm currently doing, as without it performance degrades by several orders of magnitude.

Benchmark-8   	       1	55196400363 ns/op	68033096 B/op	 1034663 allocs/op

vs.

Benchmark-8   	      20	  90075923 ns/op	10604870 B/op	  211505 allocs/op

My first thought is that a new set of functions for explicitly matching at the "beginning" of the string or offset would solve this problem, but maybe there's some other solution I'm not seeing?

dlclark commented on August 26, 2024

Can you drop in a simple benchmark here for me to look at?

dlclark commented on August 26, 2024

Also, check out the Multiline option: it changes the anchors ^ and $ to match at the start/end of each line instead of only at the start/end of the string.
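
A small illustration (the inline (?m) flag is equivalent to passing the Multiline option):

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	// With Multiline, ^ anchors at the start of every line, not just the
	// start of the input; compiling `(?m)^a` with no options is the same.
	re := regexp2.MustCompile(`^a`, regexp2.Multiline)
	m, err := re.FindStringMatch("foo\nalpha")
	if err == nil && m != nil {
		fmt.Println(m.Index) // 4: the "a" right after the newline
	}
}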

alecthomas commented on August 26, 2024

For sure, here you go.

package re

import (
	"strings"
	"testing"

	"github.com/dlclark/regexp2"
	"github.com/stretchr/testify/require"
)

type token struct {
	typ  int
	text string
}

var (
	str = `
-module(repl).

-export([run/0]).

run() ->
    read_eval_process().
`

	largeStr = []rune(strings.Repeat(str, 100))

	fastRegexes = []*regexp2.Regexp{
		regexp2.MustCompile(`^[ ]+`, regexp2.Multiline),
		regexp2.MustCompile(`^-`, regexp2.Multiline),
		regexp2.MustCompile(`^(->|-|[<>()[\]./])`, regexp2.Multiline),
		regexp2.MustCompile(`^module`, regexp2.Multiline),
		regexp2.MustCompile(`^export`, regexp2.Multiline),
		regexp2.MustCompile(`^\w+`, regexp2.Multiline),
		regexp2.MustCompile(`^\n`, regexp2.Multiline),
	}
	regexes = []*regexp2.Regexp{
		regexp2.MustCompile(`[ ]+`, regexp2.Multiline),
		regexp2.MustCompile(`^-`, regexp2.Multiline),
		regexp2.MustCompile(`->|-|[()[\]./]`, regexp2.Multiline),
		regexp2.MustCompile(`module`, regexp2.Multiline),
		regexp2.MustCompile(`export`, regexp2.Multiline),
		regexp2.MustCompile(`\w+`, regexp2.Multiline),
		regexp2.MustCompile(`\n`, regexp2.Multiline),
	}

	expected = []token{
		{6, "\n"}, {1, "-"}, {3, "module"}, {2, "("}, {5, "repl"}, {2, ")"}, {2, "."},
		{6, "\n"}, {6, "\n"}, {1, "-"}, {4, "export"}, {2, "("}, {2, "["}, {5, "run"},
		{2, "/"}, {5, "0"}, {2, "]"}, {2, ")"}, {2, "."}, {6, "\n"}, {6, "\n"}, {5, "run"},
		{2, "("}, {2, ")"}, {0, " "}, {2, "->"}, {6, "\n"}, {0, "    "},
		{5, "read_eval_process"}, {2, "("}, {2, ")"}, {2, "."}, {6, "\n"},
	}
)

func TestTokenise(t *testing.T) {
	tokens := tokenise(t, []rune(str))
	require.Equal(t, expected, tokens)
}

func TestFastTokenise(t *testing.T) {
	tokens := tokeniseFast(t, []rune(str))
	require.Equal(t, expected, tokens)
}

func BenchmarkSlowTokenise(b *testing.B) {
	for i := 0; i < b.N; i++ {
		tokenise(b, largeStr)
	}
}

func BenchmarkFastTokenise(b *testing.B) {
	for i := 0; i < b.N; i++ {
		tokeniseFast(b, largeStr)
	}
}

func tokenise(t require.TestingT, str []rune) []token {
	cursor := 0
	tokens := []token{}
	for {
		matched := false
		for typ, re := range regexes {
			match, err := re.FindRunesMatchStartingAt(str, cursor)
			require.NoError(t, err)
			if match == nil || match.Index > cursor || match.Length == 0 {
				continue
			}
			text := string(str[cursor : cursor+match.Length])
			tokens = append(tokens, token{typ, text})
			cursor += match.Length
			matched = true
			break
		}
		if cursor >= len(str) {
			break
		}
		require.True(t, matched, string(str[cursor:]))
	}
	return tokens
}

func tokeniseFast(t require.TestingT, str []rune) []token {
	cursor := 0
	tokens := []token{}
	for {
		matched := false
		for typ, re := range fastRegexes {
			match, err := re.FindRunesMatch(str[cursor:])
			require.NoError(t, err)
			if match == nil {
				continue
			}
			text := string(str[cursor : cursor+match.Length])
			tokens = append(tokens, token{typ, text})
			cursor += match.Length
			matched = true
			break
		}
		if cursor >= len(str) {
			break
		}
		require.True(t, matched, string(str[cursor:]))
	}
	return tokens
}

dlclark commented on August 26, 2024

To fix the #define regex you should use multiline mode, like this:

regexp2.MustCompile(`^#define`, regexp2.Multiline),

Multiline mode makes the anchors ^ and $ work on the start and end of lines instead of just on the start and end of the string. That will trip the panic in your example, so I assume it causes the behavior you want. Removing the panic and switching both Benchmarks to use FindRunesMatchStartingAt shows the "fast" benchmark is about 20x faster.

Hope this helps!

alecthomas commented on August 26, 2024

All the regexes in Chroma are already multiline (via (?m)).

alecthomas commented on August 26, 2024

Okay, I've updated the code with something that more closely reflects the real issue. TestTokenise() passes, as it produces the expected tokens. TestFastTokenise() does not pass (as expected).

The benchmark results are as follows:

$ go test -bench='.*' -run=skip ./re
goos: darwin
goarch: amd64
pkg: github.com/alecthomas/chroma/re
BenchmarkSlowTokenise-8   	      10	 119757161 ns/op
BenchmarkFastTokenise-8   	      20	  57588113 ns/op
PASS
ok  	github.com/alecthomas/chroma/re	2.540s

This is 2x faster, but just adding a few extra patterns quickly blows this out.

dlclark commented on August 26, 2024

I've looked into the example you gave and figured out a few things:

  1. The regexes were working; it was the test harness that was broken.
    In tokeniseFast you checked match == nil for non-matches, but didn't check what happens if the regex matches much further into the string. You guard against this in tokenise with match.Index > cursor and need to add something similar to the "fast" version:

if match == nil || match.Index > 0 || match.Length == 0 {
	continue
}

  2. Your regexes overlap. Everything that matches the second pattern (^-) also matches the third pattern (^(->|-|[<>()[\]./])), AND the input -> matches the second pattern first and never gets to check the third pattern. This means your second pattern really needs to be ^-(?!>), or the order needs to change.

  3. With these fixes the "fast" version is now slower than the "slow" version by ~10%. I think the speedup you were seeing was because, due to issue #1 above, large amounts of the input text were skipped and you didn't have to execute as many regexes overall. Unfortunately this produced incorrect output.

  4. Tokenizing in this manner is just going to be slow. I highly recommend you build a basic scanner for this. Using the example tokens I built a simple scanner that matches the behavior (which has a few oddities, like numbers being the same kinds of tokens as string idents, string idents allowed to start with numbers, all special characters under a single token, etc.). I'm including the code here as a starting point. The scanner for the Go language itself is really simple and easy to follow; I'd recommend using it as a starting point (https://golang.org/src/go/scanner/scanner.go).

Here's the code I slapped together as an example scanner for the grammar in your examples. Even with the abbreviated grammar it performs roughly 1000x faster than using regexes, and as the number of regexes grows that performance gap will only get wider.

package token_test

import (
	"strings"
	"testing"
	"unicode"
	"unicode/utf8"

	"github.com/stretchr/testify/require"
)

var str = `
-module(repl).

-export([run/0]).

run() ->
    read_eval_process().
`

// token and largeStr are the same definitions as in the earlier benchmark
// file, repeated here so this file compiles on its own.
type token struct {
	typ  int
	text string
}

var largeStr = []rune(strings.Repeat(str, 100))

var expected = []token{
	{6, "\n"}, {1, "-"}, {3, "module"}, {2, "("}, {5, "repl"}, {2, ")"}, {2, "."},
	{6, "\n"}, {6, "\n"}, {1, "-"}, {4, "export"}, {2, "("}, {2, "["}, {5, "run"},
	{2, "/"}, {5, "0"}, {2, "]"}, {2, ")"}, {2, "."}, {6, "\n"}, {6, "\n"}, {5, "run"},
	{2, "("}, {2, ")"}, {0, " "}, {2, "->"}, {6, "\n"}, {0, "    "},
	{5, "read_eval_process"}, {2, "("}, {2, ")"}, {2, "."}, {6, "\n"},
}

func TestRealFastTokenise(t *testing.T) {
	tokens := tokeniseRealFast(t, []rune(str))
	require.Equal(t, expected, tokens)
}

func BenchmarkRealFastTokenise(b *testing.B) {
	for i := 0; i < b.N; i++ {
		tokeniseRealFast(b, largeStr)
	}
}

func tokeniseRealFast(t require.TestingT, str []rune) []token {
	s := NewScanner(str)
	tokens := []token{}

	for {
		_, tok, lit := s.Scan()

		if tok == EOF {
			break
		}

		tokens = append(tokens, token{int(tok - 1), lit})
	}

	return tokens
}

type Token int

const (
	ILLEGAL Token = iota
	SPACES
	DASH    // -
	SPECIAL // -> <>()[\]./
	MODULE  // module
	EXPORT  // export
	IDENT   // non-keyword identifier of some kind
	NEWLINE // \n

	EOF
)

var keywords = map[string]Token{
	"module": MODULE,
	"export": EXPORT,
}

// Lookup maps an identifier to its keyword token or IDENT (if not a keyword).
//
func Lookup(ident string) Token {
	if tok, isKeyword := keywords[ident]; isKeyword {
		return tok
	}
	return IDENT
}

type Scanner struct {
	src      []rune
	ch       rune // current char
	offset   int  // char offset
	rdOffset int  // reading offset (pos after current ch)
}

const bom = 0xFEFF // byte order mark, only permitted as very first character

// Read the next Unicode char into s.ch.
// s.ch < 0 means end-of-file.
//
func (s *Scanner) next() {
	if s.rdOffset < len(s.src) {
		s.offset = s.rdOffset
		s.ch = s.src[s.rdOffset]
		s.rdOffset++
	} else {
		s.offset = len(s.src)
		s.ch = -1 // eof
	}
}

// peek returns the rune following the most recently read character without
// advancing the scanner. If the scanner is at EOF, peek returns 0.
func (s *Scanner) peek() rune {
	if s.rdOffset < len(s.src) {
		return s.src[s.rdOffset]
	}
	return 0
}

func NewScanner(src []rune) *Scanner {
	s := &Scanner{
		src: src,
	}
	s.next()
	if s.ch == bom {
		s.next() // ignore BOM if we have one
	}
	return s
}

func (s *Scanner) Scan() (pos int, tok Token, lit string) {
	pos = s.offset
	ch := s.ch

	if isDigit(ch) || isLetter(ch) {
		// uncommon, typically idents only start with letters
		// to easily separate from numeric constants during lexing
		lit = s.scanIdentifier()
		tok = Lookup(lit)
		return
	}

	// always make progress
	s.next()

	switch ch {
	case -1:
		//eof
		tok = EOF
	case '\n':
		tok = NEWLINE
	case '-':
		if s.ch == '>' {
			tok = SPECIAL
			lit = "->"
			s.next()
		} else {
			tok = DASH
		}
	case ' ':
		lit = s.scanSpaces()
		tok = SPACES
	case '<', '>', '(', ')', '[', '\\', ']', '.', '/':
		tok = SPECIAL
	}

	if lit == "" {
		lit = string(ch)
	}
	return
}

func (s *Scanner) scanSpaces() string {
	offs := s.offset - 1
	for s.ch == ' ' {
		s.next()
	}
	return string(s.src[offs:s.offset])
}

func (s *Scanner) scanIdentifier() string {
	offs := s.offset
	for isLetter(s.ch) || isDigit(s.ch) {
		s.next()
	}
	return string(s.src[offs:s.offset])
}

func isDigit(ch rune) bool {
	return '0' <= ch && ch <= '9' || ch >= utf8.RuneSelf && unicode.IsDigit(ch)
}

func isLetter(ch rune) bool {
	return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_' || ch >= utf8.RuneSelf && unicode.IsLetter(ch)
}

alecthomas commented on August 26, 2024

I know how to build a scanner...I've built many. It's not practical in this case as there are literally hundreds of languages supported by Chroma and they are largely auto-generated from Pygments.

I think you're still misunderstanding the root of the issue though. Possibly I'm not conveying it correctly. I think I'll just try and add the API I need in my fork, and send a PR.

alecthomas commented on August 26, 2024

Actually one last try... there's no efficient way to correctly match and differentiate between ^a and a starting at a tokenisation cursor offset without the addition of a MatchPrefixRunes(rns) and MatchPrefixRunesStartingAt(rns, offset).

E.g., all of the following would scan at most 2 characters (2 in the case of matching ^a against \nalpha).

  • "^a".MatchPrefixRunesStartingAt("foo\nalpha", 0) - fail after scanning 1 character
  • "^a".MatchPrefixRunesStartingAt("foo\nalpha", 4) - succeed after scanning 2 characters (\na)
  • "a".MatchPrefixRunesStartingAt("foo\nalpha", 0) - fail after scanning 1 character
  • "a".MatchPrefixRunesStartingAt("foo\nalpha", 8) - succeed after scanning 1 character

With the current API, a find is issued and, if it succeeds, the match index must then be checked against the starting offset.

  • "^a".FindRunesMatchStartingAt("foo\nalpha", 0) - fail after scanning 4 characters
  • "^a".FindRunesMatchStartingAt("foo\nalpha", 4) - succeed after scanning 2 characters (\na)
  • "a".FindRunesMatchStartingAt("foo\nalpha", 0) - fail after scanning 4 characters
  • "a".FindRunesMatchStartingAt("foo\nalpha", 8) - succeed after scanning 1 characters

I think this shows the issue with unnecessary over-scanning for this use-case.
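
In other words, the best the current API allows is a find followed by an index check after the fact (matchAt is an illustrative helper, not a regexp2 function):

package re

import "github.com/dlclark/regexp2"

// matchAt runs a find and rejects any match that starts anywhere other than
// the requested offset. The engine may still have over-scanned to find it,
// which is exactly the waste described above.
func matchAt(re *regexp2.Regexp, rns []rune, offset int) (*regexp2.Match, error) {
	m, err := re.FindRunesMatchStartingAt(rns, offset)
	if err != nil || m == nil || m.Index != offset {
		return nil, err
	}
	return m, nil
}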

In fact, now that I look... these two APIs are identical to Python's Pattern.match and Pattern.search, respectively. regexp2 has the latter but not the former.

dlclark commented on August 26, 2024

I appreciate the shorter examples to discuss. In hindsight, suggesting a token scanner wasn't very productive, sorry about that.

I think the examples are missing a key element, because this one only scans the first character before failing: "^a".FindRunesMatchStartingAt("foo\nalpha", 0)

Based on your previous examples I'm guessing you're using the Multiline option, so the actual regex is "(?m)^a" which does scan more because you're explicitly telling it to keep scanning and look for newlines. If you want ^ to only mean "start of string" then you need to stop using the Multiline option.

alecthomas commented on August 26, 2024

Based on your previous examples I'm guessing you're using the Multiline option, so the actual regex is "(?m)^a" which does scan more because you're explicitly telling it to keep scanning and look for newlines. If you want ^ to only mean "start of string" then you need to stop using the Multiline option.

Right, which is exactly what FindRunesMatchStartingAt fixes... it will not keep scanning as (?m)^a will not match at the cursor position.

dlclark commented on August 26, 2024

Assuming you meant your suggested new function (MatchPrefixRunesStartingAt), it doesn't address my main issue: why add a new function to the library so that (?m)^a and ^a behave the same? The regex itself is telling the engine to behave that way. Why not fix the regex and remove the (?m)?

alecthomas commented on August 26, 2024

Because I want the behaviour of (?m)^a where it matches at the start of each line, not just the beginning of the string. I'm assuming that MatchPrefixRunesStartingAt(rns, offset) would treat the start of the rune slice as the only beginning of the string, not the offset.

dlclark commented on August 26, 2024

Ah, now I get it. You're correct: the start of the given slice is the anchor point if you use ^; to anchor at the given startAt index you can use \G instead. It fails fast when the index != 4, but does match when the index == 4 in this example:

package re

import (
	"testing"

	"github.com/dlclark/regexp2"
)

func TestFindStart(t *testing.T) {
	re := regexp2.MustCompile(`\Ga`, regexp2.Debug)
	m, err := re.FindRunesMatchStartingAt([]rune("foo\nalpha"), 4)

	if err != nil {
		t.Fatal(err)
	}
	if m == nil {
		t.Fatal("expected match")
	}
}

Honestly I had forgotten about \G until just now digging through the anchoring logic. If that doesn't work for you let me know.

alecthomas commented on August 26, 2024

Oooh, that is potentially promising. I'll give it a go.

alecthomas commented on August 26, 2024

Yes!! Thank you.

I've never seen a character escape that does that before, handy.
