
regexp2's Introduction

regexp2 - full featured regular expressions for Go

Regexp2 is a feature-rich RegExp engine for Go. It doesn't have constant time guarantees like the built-in regexp package, but it allows backtracking and is compatible with Perl5 and .NET. You'll likely be better off with the RE2 engine from the regexp package and should only use this if you need to write very complex patterns or require compatibility with .NET.

Basis of the engine

The engine is ported from the .NET framework's System.Text.RegularExpressions.Regex engine. That engine was open sourced in 2015 under the MIT license. There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go framework regex engine as well. I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, code emitted, and therefore patterns matched should be identical.

New Code Generation

For extra performance use regexp2 with regexp2cg. It is a code generation utility for regexp2 and you can likely improve your regexp runtime performance by 3-10x in hot code paths. As always you should benchmark your specifics to confirm the results. Give it a try!

Installing

This is a go-gettable library, so install is easy:

go get github.com/dlclark/regexp2

To use the new Code Generation (while it's in beta) you'll need to use the code_gen branch:

go get github.com/dlclark/regexp2@code_gen

Usage

Usage is similar to the Go regexp package. Just like in regexp, you start by converting a regex into a state machine via the Compile or MustCompile methods. They ultimately do the same thing, but MustCompile will panic if the regex is invalid. You can then use the provided Regexp struct to find matches repeatedly. A Regexp struct is safe to use across goroutines.

re := regexp2.MustCompile(`Your pattern`, 0)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

The only error that the *Match* methods should return is a Timeout if you set the re.MatchTimeout field. Any other error is a bug in the regexp2 package. If you need more details about capture groups in a match then use the FindStringMatch method, like so:

if m, _ := re.FindStringMatch(`Something to match`); m != nil {
    // the whole match is always group 0
    fmt.Printf("Group 0: %v\n", m.String())

    // you can get all the groups too
    gps := m.Groups()

    // a group can be captured multiple times, so each cap is separately addressable
    fmt.Printf("Group 1, first capture: %v\n", gps[1].Captures[0].String())
    fmt.Printf("Group 1, second capture: %v\n", gps[1].Captures[1].String())
}

Group 0 is embedded in the Match. Group 0 is an automatically-assigned group that encompasses the whole pattern. This means that m.String() is the same as m.Group.String() and m.Groups()[0].String()

The last capture is embedded in each group, so g.String() will return the same thing as g.Capture.String() and g.Captures[len(g.Captures)-1].String().

If you want to find multiple matches from a single input string you should use the FindNextMatch method. For example, to implement a function similar to regexp.FindAllString:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	m, _ := re.FindStringMatch(s)
	for m != nil {
		matches = append(matches, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return matches
}

FindNextMatch is optimized so that it re-uses the underlying string/rune slice.

The internals of regexp2 always operate on []rune so Index and Length data in a Match always reference a position in runes rather than bytes (even if the input was given as a string). This is a dramatic difference between regexp and regexp2. It's advisable to use the provided String() methods to avoid having to work with indices.
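If you do need to use a rune-based Index to slice the original string, you have to translate it into a byte offset first. The following is a minimal sketch using only the standard library; runeToByteOffset is a hypothetical helper written for illustration and is not part of regexp2.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToByteOffset converts a rune index into s to the matching byte offset.
// It returns -1 if runeIndex is negative or lies past the end of s.
// (Hypothetical helper for illustration; not part of regexp2.)
func runeToByteOffset(s string, runeIndex int) int {
	if runeIndex < 0 {
		return -1
	}
	seen := 0
	// Ranging over a string yields the byte offset at which each rune starts.
	for byteOffset := range s {
		if seen == runeIndex {
			return byteOffset
		}
		seen++
	}
	if seen == runeIndex {
		return len(s) // position just past the last rune
	}
	return -1
}

func main() {
	s := "日本語 text"
	// Each of the three leading characters takes 3 bytes in UTF-8,
	// so rune positions and byte positions diverge quickly.
	fmt.Println("runes:", utf8.RuneCountInString(s), "bytes:", len(s))
	fmt.Println("rune index 4 is byte offset", runeToByteOffset(s, 4))
}
```

The walk is linear in the length of the string, so for repeated lookups on the same input it is usually cheaper to convert to []rune once and work in rune space throughout.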

Compare regexp and regexp2

Category | regexp | regexp2
Catastrophic backtracking possible | no, constant execution time guarantees | yes, if your pattern is at risk you can use the re.MatchTimeout field
Python-style capture groups (?P<name>re) | yes | no (yes in RE2 compat mode)
.NET-style capture groups (?<name>re) or (?'name're) | no | yes
comments (?#comment) | no | yes
branch numbering reset (?|a|b) | no | no
possessive match (?>re) | no | yes
positive lookahead (?=re) | no | yes
negative lookahead (?!re) | no | yes
positive lookbehind (?<=re) | no | yes
negative lookbehind (?<!re) | no | yes
back reference \1 | no | yes
named back reference \k'name' | no | yes
named ascii character class [[:foo:]] | yes | no (yes in RE2 compat mode)
conditionals (?(expr)yes|no) | no | yes
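The regexp column can be spot-checked with the standard library alone. The sketch below only asks whether the standard regexp package compiles a pattern; stdlibAccepts is a hypothetical helper name, and the patterns are illustrative samples of a few rows from the table rather than an exhaustive test.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibAccepts reports whether Go's standard regexp package can compile
// the pattern. (Hypothetical helper for illustration.)
func stdlibAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<year>\d{4})`, // Python-style named capture group
		`[[:alpha:]]+`,    // named ASCII character class
		`foo(?=bar)`,      // positive lookahead
		`foo(?!bar)`,      // negative lookahead
		`(a)\1`,           // back reference
		`(?>a+)b`,         // possessive match
	}
	for _, p := range patterns {
		fmt.Printf("%-18s accepted by regexp: %v\n", p, stdlibAccepts(p))
	}
}
```

The first two patterns compile with the standard package, while the lookaround, back reference, and possessive patterns are rejected at compile time, which is where regexp2 becomes the option to reach for.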

RE2 compatibility mode

The default behavior of regexp2 is to match the .NET regexp engine, however the RE2 option is provided to change the parsing to increase compatibility with RE2. Using the RE2 option when compiling a regexp will not take away any features, but will change the following behaviors:

  • add support for named ascii character classes (e.g. [[:foo:]])
  • add support for python-style capture groups (e.g. (?P<name>re))
  • change singleline behavior for $ to only match end of string (like RE2) (see #24)
  • change the character classes \d \s and \w to match the same characters as RE2. NOTE: if you also use the ECMAScript option then this will change the \s character class to match ECMAScript instead of RE2. ECMAScript allows more whitespace characters in \s than RE2 (but still fewer than the default behavior).
  • allow character escape sequences to have defaults. For example, by default \_ isn't a known character escape and will fail to compile, but in RE2 mode it will match the literal character _

re := regexp2.MustCompile(`Your RE2-compatible pattern`, regexp2.RE2)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    //do something
}

This feature is a work in progress and I'm open to ideas for more things to put here (maybe more relaxed character escaping rules?).

Catastrophic Backtracking and Timeouts

regexp2 supports features that can lead to catastrophic backtracking. Regexp.MatchTimeout can be set to limit the impact of such behavior; the match will fail with an error after approximately MatchTimeout. No timeout checks are done by default.

Timeout checking is not free. The current timeout checking implementation starts a background worker that updates a clock value approximately once every 100 milliseconds. The matching code compares this value against the precomputed deadline for the match. The performance impact is as follows.

  1. A match with a timeout runs almost as fast as a match without a timeout.
  2. If any live matches have a timeout, there will be a background CPU load (~0.15% currently on a modern machine). This load will remain constant regardless of the number of matches done including matches done in parallel.
  3. If no live matches are using a timeout, the background load will remain until the longest deadline (match timeout + the time when the match started) is reached. E.g., if you set a timeout of one minute the load will persist for approximately a minute even if the match finishes quickly.

See PR #58 for more details and alternatives considered.
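The background-clock idea described above can be sketched with the standard library. This is an illustration of the general technique under stated assumptions, not regexp2's actual implementation: the coarseClock type and its set, run, and reached methods are hypothetical names, and the sleep in main stands in for matching work.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// coarseClock holds a timestamp that a background goroutine refreshes
// periodically, so a hot loop can read it with a single atomic load.
// (Hypothetical type for illustration; not regexp2's implementation.)
type coarseClock struct {
	nowNanos int64 // accessed atomically
}

// set stores the clock's current value in nanoseconds.
func (c *coarseClock) set(nanos int64) {
	atomic.StoreInt64(&c.nowNanos, nanos)
}

// run refreshes the clock once per period until stop is closed.
func (c *coarseClock) run(period time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case t := <-ticker.C:
			c.set(t.UnixNano())
		case <-stop:
			return
		}
	}
}

// reached reports whether the coarse clock has passed the deadline.
func (c *coarseClock) reached(deadlineNanos int64) bool {
	return atomic.LoadInt64(&c.nowNanos) > deadlineNanos
}

func main() {
	clock := &coarseClock{}
	clock.set(time.Now().UnixNano())

	stop := make(chan struct{})
	go clock.run(100*time.Millisecond, stop)
	defer close(stop)

	// Precompute the deadline once, then poll the cheap clock while "working".
	deadline := time.Now().Add(250 * time.Millisecond).UnixNano()
	for !clock.reached(deadline) {
		time.Sleep(time.Millisecond) // stand-in for a slice of matching work
	}
	fmt.Println("deadline reached, giving up on the match")
}
```

Because the clock only advances once per period, the loop notices the deadline up to one period late. That is the trade-off behind "approximately MatchTimeout": the check itself is cheap, and the resolution is bounded by the refresh period.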

Goroutine leak error

If you're using a library during unit tests (e.g. https://github.com/uber-go/goleak) that validates all goroutines are exited then you'll likely get an error if you or any of your dependencies use regexes with a MatchTimeout. To remedy the problem you'll need to tell the unit test to wait until the background timeout goroutine is exited.

func TestSomething(t *testing.T) {
    defer goleak.VerifyNone(t)
    defer regexp2.StopTimeoutClock()

    // ... test
}

//or

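func TestMain(m *testing.M) {
    // setup
    // ...

    // run
    code := m.Run()

    // tear down: stop the timeout clock first, then check for leaked goroutines.
    // goleak.VerifyNone needs a *testing.T, which TestMain doesn't have,
    // so use goleak.Find instead (requires the fmt and os imports).
    regexp2.StopTimeoutClock()
    if err := goleak.Find(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    os.Exit(code)
}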

This will add ~100ms runtime to each test (or TestMain). If that's too much time you can set the clock cycle rate of the timeout goroutine in an init function in a test file. regexp2.SetTimeoutCheckPeriod isn't threadsafe so it must be set up before starting any regexes with timeouts.

func init() {
	//speed up testing by making the timeout clock 1ms
	regexp2.SetTimeoutCheckPeriod(time.Millisecond)
}

ECMAScript compatibility mode

In this mode the engine provides compatibility with the regex engine described in the ECMAScript specification.

Additionally, a Unicode mode is provided which allows parsing of \u{CodePoint} syntax; it is only available when both the ECMAScript and Unicode options are provided.

Library features that I'm still working on

  • Regex split

Potential bugs

I've run a battery of tests against regexp2 from various sources and found the debug output matches the .NET engine, but .NET and Go handle strings very differently. I've attempted to handle these differences, but most of my testing deals with basic ASCII with a little bit of multi-byte Unicode. There's a chance that there are bugs in the string handling related to character sets with supplementary Unicode chars. Right-to-Left support is coded, but not well tested either.

Find a bug?

I'm open to new issues and pull requests with tests if you find something odd!

regexp2's People

Contributors

chinaykc, dlclark, dop251, dthadi3, eclipseo, ghemawat, hunshcn, mstoykov, u1735067, vassudanagunta


regexp2's Issues

Regexp is not working for the following code. Could you please correct it if my usage of the "regexp2" library is wrong?

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	re, _ := regexp2.Compile("Deployment", 0)
	fmt.Println(re.MatchString("D.*"))        // ExpectedOutput: true , ActualOutput: false
	fmt.Println(re.MatchString("D*"))         // ExpectedOutput: true , ActualOutput: false
	fmt.Println(re.MatchString("Dep"))        // ExpectedOutput: true , ActualOutput: false
	fmt.Println(re.MatchString("Deployment")) // ExpectedOutput: true , ActualOutput: true
}

Error while trying to match a string with a specific unicode against a RegExp that contains a space and a group

When trying to match (phrase.MatchString(X)) messages like gg 󠀀 󠀀 (notice that these are not the regular spaces) against a phrase like regexp2.MustCompile("\\bcool (house)\\b", 0), the following error will be thrown:

panic: runtime error: index out of range [917504] with length 128

goroutine 1 [running]:
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000180540, {0xc000b70948, 0x6, 0x0?}, 0x0?, 0x0, 0x6)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/syntax/prefix.go:716 +0x3bb
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000623a00)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:1305 +0x366
github.com/dlclark/regexp2.(*runner).scan(0xc000623a00, {0xc000b70948?, 0x6, 0xc000b70948?}, 0x6?, 0x1, 0xc00008f8e8?)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:130 +0x1e5
github.com/dlclark/regexp2.(*Regexp).run(0xc0000f6200, 0xf4?, 0xffffffffffffffff, {0xc000b70948, 0x6, 0x6})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:91 +0xfa
github.com/dlclark/regexp2.(*Regexp).MatchString(0x10f9c40?, {0x108f0f4?, 0xc00008fb48?})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/regexp.go:213 +0x45
main.main()
        C:/Users/X/Desktop/GoRegExTests/test.go:127 +0xbdc

The error is only being thrown when:
a. The message contains those unicode characters
b. The RegExp contains a space and a group like (house)

The RegExp above is just a very basic example to demonstrate this problem.

Bulk replace

Hello,

I'd just like to ask you if you have any plans to implement bulk replace functions to your regexp2 as the Go standard regex?
https://golang.org/pkg/regexp/#Regexp.ReplaceAll

  • func (re *Regexp) ReplaceAll(src, repl []byte) []byte
    
  • func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
    
  • func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
    
  • func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
    
  • func (re *Regexp) ReplaceAllString(src, repl string) string
    
  • func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
    

Thank you,

Panic when trying to match `(?:){40}`

r := MustCompile(`(?:){40}`, RE2)
m, err := r.FindStringMatch("12")

will panic with

panic: runtime error: index out of range [-1] [recovered]
        panic: runtime error: index out of range [-1]

goroutine 6 [running]:
testing.tRunner.func1.1(0x590420, 0xc000016320)
        testing/testing.go:988 +0x30d
testing.tRunner.func1(0xc000134120)
        testing/testing.go:991 +0x3f9
panic(0x590420, 0xc000016320)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).trackPush1(...)
        github.com/dlclark/regexp2/runner.go:992
github.com/dlclark/regexp2.(*runner).execute(0xc000146000, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:387 +0x4511
github.com/dlclark/regexp2.(*runner).scan(0xc000146000, 0xc000014230, 0x2, 0x2, 0x0, 0x0, 0x7fffffffffffffff, 0x0, 0x8, 0x8)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000132100, 0x5a3d00, 0x0, 0xc000014230, 0x2, 0x2, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0x21a
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
github.com/dlclark/regexp2.TestRE2ECMA(0xc000134120)
        github.com/dlclark/regexp2/regexp_re2_test.go:125 +0x8b
testing.tRunner(0xc000134120, 0x5b45d8)
        testing/testing.go:1039 +0xdc
created by testing.(*T).Run
        testing/testing.go:1090 +0x372
exit status 2
FAIL    github.com/dlclark/regexp2      0.006s

Things that I know don't matter:

  • whether RE2 or ECMAScript (the issue was found first in goja)
  • the input isn't important, the original one("\xe90000000") had Unicode
  • the 40 needs to be 17 or bigger

This was found through fuzzing goja with the go-fuzz corpus for regexp which is why the example is such :). I may rewrite it to fuzz regexp2 as well and post it if there is interest.

Continuous 4byte emoji would crash when ReplaceFunc()

Hello, it's been a long time.

Today I found an issue regarding some special "4byte" emojis on ReplaceFunc().

  • sample 4byte emojis: 📍😏️📣🍣🍺
  • sample 3byte emoji: ✔️⚾️

You can inspect the characters above with http://r12a.github.io/apps/conversion/.

Sample1: causes panic

Please take a look at the following: You can reproduce the issue by uncommenting the str assignment lines one by one.

As far as I can tell, ReplaceFunc() panics under the following conditions:

  • target contains some continuous 4byte emojis, and
  • regex contains 3bytes UTF-8 characters and contains NO 4byte emojis

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "高" // panic: Japanese Kanji
	// str := "は" // panic: Japanese Hiragana
	// str := "パ" // panic: Japanese Katakana
	// str := "[a-zA-Z0-9]{,2}" // works fine: Japanese Hiragana
	// str := "峰起|烽起" // works fine: longer Japanese Hiragana (I wonder why)
	// str := "フトレス" // panic: longer Japanese Katakana
	// str := "ALLWAYS|Allways|allways|AllWays" // works fine: Alphabet
	// str := "📍" // works fine: 4byte emoji
	// str := "📍📍" // works fine: continuous 4byte emoji
	// str := "✔️" // panic: 3byte emoji
	// str := "✔️✔️" // panic: coutinuous 3byte emoji
	// str := "📍️✔️" // works fine: 4 and 3byte emoji
	// str := "️✔📍️" // works fine: 3 and 4byte emoji
	// str := "📍️は️" // works fine: 4byte emoji and Hiragana
	// str := "️は📍️" // works fine: Hiragana and 4byte emoji

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc("📍✔️😏⚾️📣🍣🍺🍺 <- continuous 4byte emoji 寿司ビール文字あり", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

Sample2: all works fine

The following is a kind of control group that works fine. The key is that the target contains no "continuous 4byte emojis".

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
        // All of the following patterns work fine perhaps because ""✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし" contains no continuous 4byte emojis. You can check them by uncommenting them one by one.
	str := "高"
	// str := "は"
	// str := "パ"
	// str := "[a-zA-Z0-9]{,2}"
	// str := "峰起|烽起"
	// str := "フトレス"
	// str := "ALLWAYS|Allways|allways|AllWays"
	// str := "📍" 
	// str := "📍📍" 
	// str := "✔️" 
	// str := "✔️✔️" 
	// str := "📍️✔️" 
	// str := "️✔📍️" 
	// str := "📍️は️" 
	// str := "️は📍️" 

	re := regexp2.MustCompile(str, 0)
       // The following target works fine: there's no continuous 4byte emojis
	result, _ := re.ReplaceFunc("✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

FYI

The issue looks a little bit similar to "sushi-beer" issue: https://gist.github.com/kamipo/37576ce436c564d8cc28

I hope you'd check and fix it.

Best regards, 🙇

Performance issue matching against beginning of very large string

I am tokenizing some text by matching a set of regexes against the beginning of a string holding the contents of a file. I noticed that regexp2 was extremely slow for this use-case, and after running the profiler found that the time was dominated by getRunes().

This is occurring because, before every match, regexp2 converts the entire 22 KB string to a slice of runes. I've worked around the issue by pre-converting the string to a slice of runes myself, then using FindRunesMatch(), but it was quite surprising and non-obvious.

A solution would be to convert runes on the fly (as most matches are under 10 characters, converting the whole string each time is redundant). Looking at the code, it doesn't seem like it would be super painful to achieve. The runner would need to be modified to use DecodeRuneInString to advance the index into the string, rather than a direct index into a slice of runes.
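To make the proposal concrete, here is a minimal standard-library sketch of decoding runes on demand with utf8.DecodeRuneInString. It is not regexp2 code: runeCursor and hasPrefixRunes are hypothetical names, and a literal prefix check stands in for the real matching loop.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCursor walks a string one rune at a time without first converting
// the whole string to []rune. (Hypothetical type for illustration.)
type runeCursor struct {
	s   string
	pos int // byte offset of the next rune to decode
}

// next returns the next rune and advances the cursor; ok is false at end of input.
func (c *runeCursor) next() (r rune, ok bool) {
	if c.pos >= len(c.s) {
		return 0, false
	}
	r, size := utf8.DecodeRuneInString(c.s[c.pos:])
	c.pos += size
	return r, true
}

// hasPrefixRunes reports whether s starts with the given runes, decoding
// only as many runes from s as the prefix requires.
func hasPrefixRunes(s string, prefix []rune) bool {
	c := runeCursor{s: s}
	for _, want := range prefix {
		got, ok := c.next()
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	input := "高い山 followed by a very large amount of text"
	fmt.Println(hasPrefixRunes(input, []rune("高い")))
	fmt.Println(hasPrefixRunes(input, []rune("低い")))
}
```

The work done is proportional to the length of the prefix rather than the length of the input, which is the property the issue asks for. Backtracking and right-to-left matching would also need a way to step backwards (for example with utf8.DecodeLastRuneInString), which this sketch leaves out.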

\Z not work on regexp2.RE2 mode

s1  := `^Google\nApple$`
s2  := `^Google\nApple\Z`
data := "Google\nApple\n"
// will get result
re, err := regexp2.Compile(s, regexp2.Singleline)
// will not get result
re, err := regexp2.Compile(s, regexp2.Singleline|regexp2.RE2)

Why?

Seems to fail a positive lookahead

Hello, I was checking it out and it seems to fail a regular expression. For a given text like this one, the expression ((Art\.\s\d+)[\S\s]*?(?=Art\.\s\d+)) fails to match every Art. block in the text. I've tested the expression on this website and there it gives me the correct count of 12 matches.

Am I missing something? Maybe a multiline flag?

Regex Multiline

I have the regex ^(ac|bb)$\n and I am not using the Multiline option. I expected MustCompile to report an error, but it does not, and the regex matches the string "ac\n". How can I make it throw an error?

Is it possible to get the name of the currently matched group?

Say I have a regex to tokenize some language..

# in python.
regex = re.compile(
    "(?P<comment>#.*?$)|"
    "(?P<newline>\n)|"     # has to go ahead of the whitespace
    "(?P<comma>,)|"       
    "(?P<double_quote_string>\".*?\")|" 
    "(?P<single_quote_string>'.*?')|"   
    "(?P<whitespace>[ \t\r\f\v]+)|"    ... etc

Here you expect to get multiple matches for each group name when tokenizing a file and you want to keep the ordering of the tokens.

If I use the same approach using regexp2 can I go from match to group name? E.g. how do I get the last matched group name for a match? Is that possible?

Leaking go routines using `fastclock`

With the introduction of fastclock, it spawns a go routine with a given timeout.

https://github.com/dlclark/regexp2/blob/master/fastclock.go#L75

This timeout is defaulted to "forever".

https://github.com/dlclark/regexp2/blob/master/regexp.go#L22-L32

If you are using any unit tests, this can leak if using uber-go/goleak.

I am using Chroma which sets the timeout to 250ms, which is better than never, but it still leaks a routine on my quicker tests.


I do not know the solution, but can a way be implemented to make sure this go routine is killed when it is no longer needed? Could we store the number of Matches that is using the clock, and when the matches all go away, the go routine stops as soon as it can?
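The reference-counting idea in the question above could look roughly like the sketch below. This only illustrates the proposal using the standard library; it is not how regexp2 is implemented, and sharedTicker with its acquire, release, and running methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedTicker keeps a background goroutine alive only while at least one
// user has acquired it. (Hypothetical type for illustration.)
type sharedTicker struct {
	mu    sync.Mutex
	users int
	stop  chan struct{}
	done  chan struct{}
}

// acquire registers a user and starts the goroutine for the first one.
func (s *sharedTicker) acquire() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.users++
	if s.users == 1 {
		s.stop = make(chan struct{})
		s.done = make(chan struct{})
		go s.loop(s.stop, s.done)
	}
}

// release unregisters a user; the last one stops the goroutine and waits
// until it has exited, so a leak checker run afterwards sees nothing.
func (s *sharedTicker) release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.users == 0 {
		return // unbalanced release; ignore
	}
	s.users--
	if s.users == 0 {
		close(s.stop)
		<-s.done
	}
}

// loop ticks until stop is closed, then signals done.
func (s *sharedTicker) loop(stop <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// a real clock would publish the current time here
		case <-stop:
			return
		}
	}
}

// running reports whether the background goroutine is currently alive.
func (s *sharedTicker) running() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.users > 0
}

func main() {
	var t sharedTicker
	t.acquire()
	fmt.Println("running with one user:", t.running())
	t.release()
	fmt.Println("running after release:", t.running())
}
```

The cost of this design is a mutex acquisition at the start and end of every match that uses a timeout, which is presumably why it has to be weighed against simply letting the goroutine run until the longest deadline passes.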

As someone who is new to this repo, I am not 100% sure. It is just a problem we are hitting now in our unit tests.

CPU is too high, how to reduce CPU, Pprof shows as follows

1.31mins 54.98% 54.98% 1.31mins 55.00% github.com/dlclark/regexp2/syntax.CharSet.CharIn
0.25mins 10.34% 65.32% 0.25mins 10.36% github.com/dlclark/regexp2.(*runner).forwardcharnext
0.19mins 8.09% 73.41% 1.75mins 73.45% github.com/dlclark/regexp2.(*runner).findFirstChar

Panics found through fuzzing

Here is a small script reproducing a panic that I found while fuzzing:
Notes:

  • those were the more readable examples (without strange non printable characters)
  • I use []byte mostly because it makes the copying between output and program easier

package main

import (
	"fmt"
	"runtime/debug"

	"github.com/dlclark/regexp2"
)

var testCases = []struct {
	r, s []byte
}{
	{
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
		r: []byte{0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29},
	},
	{
		r: []byte{0x28, 0x5c, 0x32, 0x28, 0x3f, 0x28, 0x30, 0x29, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},
	{
		r: []byte{0x28, 0x3f, 0x28, 0x29, 0x29, 0x5c, 0x31, 0x30, 0x28, 0x3f, 0x28, 0x30, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},

	{
		r: []byte{0x28, 0x29, 0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},
}

func test(r, s []byte) (b bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
			debug.PrintStack()
			b = true
		}
	}()

	re, err := regexp2.Compile(string(r), regexp2.ECMAScript|regexp2.Multiline)
	if err != nil {
		return false
	}
	_, _ = re.FindStringMatch(string(s))
	return false
}

func main() {
	for _, c := range testCases {
		fmt.Println("#############################################################################")
		if test(c.r, c.s) {
			fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
				c.r, c.s, string(c.r), string(c.s),
			)
		} else {
			fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
				c.r, c.s, string(c.r), string(c.s),
			)
		}
	}
}

Output is

#############################################################################
runtime error: index out of range [3] with length 3
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162a0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d6000, 0x3, 0x1, 0x0)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4000, 0x3, 0x1, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4000, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4000, 0xc000018150, 0x9, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0x9, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2080, 0xc000083d00, 0xffffffffffffffff, 0xc000018150, 0x9, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b91b8, 0xa, 0xa, 0x5b9188, 0x9, 0x9, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(()\7(?())', '000000000'
#############################################################################
runtime error: index out of range [2] with length 2
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162c0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d60e0, 0x2, 0x0, 0x1)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4100, 0x2, 0x0, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4100, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4100, 0xc0000181b0, 0x9, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0x9, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2180, 0xc000083d00, 0xffffffffffffffff, 0xc0000181b0, 0x9, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9198, 0x9, 0x9, 0x5b91a8, 0x9, 0x9, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x5c, 0x32, 0x28, 0x3f, 0x28, 0x30, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(\2(?(0))', '000000000'
#############################################################################
runtime error: index out of range [1] with length 1
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162e0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d61c0, 0x1, 0x0, 0x1)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4200, 0x1, 0x0, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4200, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4200, 0xc0000281c0, 0xd, 0x10, 0x0, 0x0, 0x7fffffffffffffff, 0xd, 0x10, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2280, 0xc000083d00, 0xffffffffffffffff, 0xc0000281c0, 0xd, 0x10, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9588, 0xd, 0xd, 0x5b9598, 0xd, 0xd, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x3f, 0x28, 0x29, 0x29, 0x5c, 0x31, 0x30, 0x28, 0x3f, 0x28, 0x30, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(?())\10(?(0)', '0000000000000'
#############################################################################
runtime error: index out of range [4] with length 4
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc000016320)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d62a0, 0x4, 0x1, 0x0)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4300, 0x4, 0x1, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4300, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4300, 0xc000018210, 0xc, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0xc, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2380, 0xc000083d00, 0xffffffffffffffff, 0xc000018210, 0xc, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b91c8, 0xc, 0xc, 0x5b91d8, 0xc, 0xc, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x29, 0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '()(()\7(?())', '000000000000'

FYI: a new "absent operator" on Ruby 2.4.1

This is NOT an issue and just to let you know that a new "absent operator" has been implemented on Ruby's regexp lib named Onigmo. Sorry for this if this'd disturb you.

Note that the implementation of the operator has a rigid background theory: https://staff.aist.go.jp/tanaka-akira/pub/prosym49-akr-paper.pdf

I recognize that your Regexp2 is based upon .NET Framework and extending your lib like that might not be good in some cases.
Note that I don't mean I need the operator right now.
I just wrote that for the case you'd have any interests in the new operator.

Cheers,

Running MatchString is slow

Run the following example (https://go.dev/play/p/BDU6yN5NvEZ):

package main

import (
	"log"
	"regexp"
	"time"

	"github.com/dlclark/regexp2"
)

func main() {
	url := "https://www.dhgate.com/product/magnetic-liquid-eyeliner-magnetic-false-eyelashes/481362313.html"

	reg1 := regexp.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`)
	log.Println("start regexp match string...")
	begin := time.Now()
	reg1.MatchString(url)
	log.Println("time taken:", time.Since(begin))

	reg2 := regexp2.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`, regexp2.IgnoreCase)
	log.Println("start regexp2 match string...")
	begin = time.Now()
	reg2.MatchString(url)
	log.Println("time taken:", time.Since(begin))
}

output:

2021/12/08 14:16:30 start regexp match string...
2021/12/08 14:16:30 time taken: 21.583µs
2021/12/08 14:16:30 start regexp2 match string...

regexp2 version is v1.4.0
Hope it helps to improve performance.

The best way to get all named captured groups

I'm trying to use this library to get all the named captured groups to a map[string]string.
This is my code:

caps := make(map[string]string)
re, err := regexp2.Compile(pattern, regexp2.RE2)
if err != nil {
	panic(err)
}
names := re.GetGroupNames()
mat, err := re.FindStringMatch(text)
if err != nil {
	panic(err)
}
if mat != nil {
	gps := mat.Groups()
	for i, value := range names {
		if value != strconv.Itoa(i) {
			if len(gps[i].Captures) > 0 {
				caps[value] = gps[i].Captures[0].String()
			}
		}
	}

	fmt.Println(caps)
}

Is this the best way to do it in terms of performance?
First it calls FindStringMatch(), then Groups(), and finally runs a for loop. That seems like a lot of work. :D

xeger functionality

Does this library support the xeger functionality? For example I have the following regex that is not supported by standard regexp library.

(?!((?!a(b|c)z)|(?!a(c|d)z)))

I need to do something like

r, _ := regexp2.Compile(`(?!((?!a(b|c)z)|(?!a(c|d)z)))`, 0)
s, _ := r.GenerateMatchingString()

I need something like this that gives me a string that matches the regex, if any exists, for example:

acz

Is this functionality already implemented? I believe Fare has this feature. (https://github.com/moodmosaic/Fare/blob/master/Src/Fare/Xeger.cs)

Could we use that code as a basis for adding this feature? I am willing to contribute it if it is welcome.

bugs in scenarios of Chinese characters or incorrect using of match.Index

The following code fails:

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main()  {
	regex := regexp2.MustCompile("<style", regexp2.IgnoreCase|regexp2.Singleline)
	match, err := regex.FindStringMatch(sample)
	if err != nil {
		panic(err)
	}
	if match != nil {
		t, err := regex.Replace(sample, "xxx", match.Index, -1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", t)
	}
}

var sample = "<title>错<style"

If I search for a word or regex successfully and then call Replace starting from match.Index instead of -1, the code fails.

However, if I remove the Chinese character, the code succeeds.

So in this scenario, what should the starting index be if I want to replace everything from the match onward rather than from -1 (the beginning)?
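
One reading of this report, stated as an assumption: Match.Index counts runes, while the startAt argument is interpreted as a byte offset into the string. In the sample, 错 occupies three bytes, so rune index 8 and byte offset 10 both refer to the same `<`. Below is a small stdlib-only sketch of the conversion; `runeIndexToByteOffset` is a hypothetical helper, not part of regexp2.

```go
package main

import "fmt"

// runeIndexToByteOffset converts a rune index into s to the byte offset of
// that rune. It returns len(s) for the index just past the last rune and -1
// when runeIdx is out of range.
func runeIndexToByteOffset(s string, runeIdx int) int {
	if runeIdx < 0 {
		return -1
	}
	n := 0
	for byteOff := range s { // range yields the byte offset of each rune
		if n == runeIdx {
			return byteOff
		}
		n++
	}
	if n == runeIdx {
		return len(s)
	}
	return -1
}

func main() {
	sample := "<title>错<style"
	// "<style" starts at rune 8 but at byte 10, because 错 takes 3 bytes.
	fmt.Println(runeIndexToByteOffset(sample, 8)) // 10
}
```

If the assumption holds, passing runeIndexToByteOffset(sample, match.Index) as startAt in place of match.Index would start the replacement at the match.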

Capture.Length undefined, making FindAllString hard to implement

Hello! First, thanks for this great library - this is an impressive feat!

I needed an equivalent function for https://golang.org/pkg/regexp/#Regexp.FindAllString which ideally would be a part of this library, but unfortunately doesn't exist today. I took a stab at implementing it (without the n parameter):

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
			s = s[match.Index+match.Length:]
		}
	}
	return matches
}

At first glance, this seemed correct and appeared to work - however I realized that it in fact is incompatible with unicode because match.Length appears to report length in runes not bytes. I'm not sure whether or not Capture.Index reports bytes or runes either, and the docs don't define this:

    // the position in the original string where the first character of
    // captured substring was found.
    Index int
    // the length of the captured substring.
    Length int

From testing, it appears that Capture.Index oddly is in bytes and not runes. A corrected implementation is:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
-			s = s[match.Index+match.Length:]
+			s = s[match.Index+len(match.String()):]
		}
	}
	return matches
}

This brings me to my points of feedback:

  1. Index in bytes and Length in runes is an odd inconsistency, I imagine they should be the same.
  2. The docstrings should ideally clarify this.
  3. It would be great if the library exposed a FindAllString implementation

Thanks again for the great library!
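
For comparison, here is a stdlib-only sketch that treats both numbers as rune counts. That is an assumption about the library, not documented behavior quoted above. `restAfterRunes` is a hypothetical helper; it shows how the remainder of the input could be taken without mixing byte and rune arithmetic.

```go
package main

import "fmt"

// restAfterRunes returns the part of s that follows the window starting at
// rune index idx and spanning length runes. Both arguments count runes.
// It returns "" when the window does not fit inside s.
func restAfterRunes(s string, idx, length int) string {
	runes := []rune(s)
	end := idx + length
	if idx < 0 || length < 0 || end > len(runes) {
		return ""
	}
	return string(runes[end:])
}

func main() {
	s := "错ab错cd"
	// Suppose a match for "ab" is reported at rune index 1 with length 2.
	fmt.Println(restAfterRunes(s, 1, 2)) // 错cd
	// Byte slicing with the same numbers lands before the match, not after it.
	fmt.Println(s[1+2:]) // ab错cd
}
```

FindNextMatch, which appears in other reports on this page, may avoid the index arithmetic entirely.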

Valid regex doesn't compile

Hi!

If pattern contains \_, regexp2 fails to compile it. Example:

_, err := regexp2.Compile(`^/legacy/([\w|\d|\-\_]+)/([\w|\d|\-\_]+)/.*`, 0)
if err != nil {
	fmt.Println(err)
}

The error is: error parsing regexp: unrecognized escape sequence \_ in `^/legacy/([\w|\d|\-\_]+)/([\w|\d|\-\_]+)/.*`

This pattern works in regexp package.

thanks!
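
Since `_` has no special meaning in a regular expression, the escape can be dropped without changing what the pattern matches. Here is a minimal workaround sketch using only the standard library; `dropUnderscoreEscapes` is a hypothetical helper that is not part of regexp2.

```go
package main

import (
	"fmt"
	"strings"
)

// dropUnderscoreEscapes rewrites `\_` to `_` and copies every other escape
// unchanged, including an escaped backslash that happens to precede "_".
func dropUnderscoreEscapes(pattern string) string {
	var b strings.Builder
	for i := 0; i < len(pattern); i++ {
		c := pattern[i]
		if c == '\\' && i+1 < len(pattern) {
			next := pattern[i+1]
			if next != '_' {
				b.WriteByte(c) // keep the backslash for all other escapes
			}
			b.WriteByte(next)
			i++ // the escaped byte has been consumed
			continue
		}
		b.WriteByte(c)
	}
	return b.String()
}

func main() {
	fmt.Println(dropUnderscoreEscapes(`^/legacy/([\w|\d|\-\_]+)/([\w|\d|\-\_]+)/.*`))
}
```

The cleaned pattern can then be passed to regexp2.Compile as in the report.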

runtime error: index out of range [#number] with length 128

This is again from fuzzing:

package main

import (
        "fmt"
        "runtime/debug"

        "github.com/dlclark/regexp2"
)

var testCases = []struct {
        r, s []byte
}{
        {
                r: []byte{0x30, 0xbf, 0x30, 0x2a, 0x30, 0x30},
                s: []byte{0xf0, 0xb0, 0x80, 0x91, 0xf7},
        },
        {
                s: []byte{0xf3, 0x80, 0x80, 0x87, 0x80, 0x89},
                r: []byte{0x30, 0xaf, 0xf3, 0x30, 0x2a},
        },
}

func test(r, s []byte) (b bool) {
        defer func() {
                if r := recover(); r != nil {
                        fmt.Println(r)
                        debug.PrintStack()
                        b = true
                }
        }()

        re, err := regexp2.Compile(string(r), regexp2.ECMAScript)
        if err != nil {
                return false
        }
        _, _ = re.FindStringMatch(string(s))
        return false
}

func main() {
        for _, c := range testCases {
                fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
                        c.r, c.s, string(c.r), string(c.s),
                )
                fmt.Println("#############################################################################")
                if test(c.r, c.s) {
                } else {
                        fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
                                c.r, c.s, string(c.r), string(c.s),
                        )
                }
        }
}

will get you

Test case regex='[]byte{0x30, 0xbf, 0x30, 0x2a, 0x30, 0x30}', string='[]byte{0xf0, 0xb0, 0x80, 0x91, 0xf7}' panics
string values '00*00', '𰀑'
#############################################################################
runtime error: index out of range [196625] with length 128
goroutine 1 [running]:
runtime/debug.Stack(0x3b, 0x0, 0x0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000113e38)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:27 +0x97
panic(0x4f0ac0, 0xc0001420a0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/panic.go:969 +0x166
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc0001602a0, 0xc000136078, 0x2, 0x2, 0x0, 0x0, 0x2, 0x7f9a4a2befb8)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/syntax/prefix.go:716 +0x3be
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000170000, 0xc000170000)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:1305 +0x4d3
github.com/dlclark/regexp2.(*runner).scan(0xc000170000, 0xc000136078, 0x2, 0x2, 0x0, 0xc000113d00, 0x7fffffffffffffff, 0x4, 0xfffd, 0x5)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:130 +0x128
github.com/dlclark/regexp2.(*Regexp).run(0xc00016e080, 0xc000113d00, 0xffffffffffffffff, 0xc000136078, 0x2, 0x2, 0x0, 0x0, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/regexp.go:159
main.test(0x5ba04c, 0x6, 0x6, 0x5ba034, 0x5, 0x5, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:36 +0x168
main.main()
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:46 +0x355
Test case regex='[]byte{0x30, 0xaf, 0xf3, 0x30, 0x2a}', string='[]byte{0xf3, 0x80, 0x80, 0x87, 0x80, 0x89}' panics
string values '00*', '󀀇'
#############################################################################
runtime error: index out of range [786439] with length 128
goroutine 1 [running]:
runtime/debug.Stack(0x3b, 0x0, 0x0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000113e38)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:27 +0x97
panic(0x4f0ac0, 0xc000142100)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/panic.go:969 +0x166
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000160540, 0xc0001360d0, 0x3, 0x4, 0x0, 0x0, 0x3, 0xc00016e100)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/syntax/prefix.go:716 +0x3be
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000170100, 0xc000170100)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:1305 +0x4d3
github.com/dlclark/regexp2.(*runner).scan(0xc000170100, 0xc0001360d0, 0x3, 0x4, 0x0, 0xc000113d00, 0x7fffffffffffffff, 0x5, 0xfffd, 0x6)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:130 +0x128
github.com/dlclark/regexp2.(*Regexp).run(0xc00016e180, 0xc000113d00, 0xffffffffffffffff, 0xc0001360d0, 0x3, 0x4, 0x0, 0x0, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/regexp.go:159
main.test(0x5ba03c, 0x5, 0x5, 0x5ba054, 0x6, 0x6, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:36 +0x168
main.main()
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:46 +0x355

I have more test cases but these ones were the shortest and just as readable :(

Support ASCII Character Classes

Hi,

Thank you for the library. I needed negative lookbehinds and was disappointed to find them not supported in the standard Go regexp package.

In the course of converting some code over to use your package, I had to modify some of the regexes to use Perl character classes instead of the ASCII classes defined here: https://github.com/google/re2/wiki/Syntax

Example: https://play.golang.org/p/MlCaJtyvQ7q

Copied below as well:

	re := regexp.MustCompile(`^[[:digit:]]+$`)
	if isMatch := re.MatchString(`12345667890`); isMatch {
		fmt.Println("Matched regexp")
	} else {
		fmt.Println("No Match regexp")
	}
	
	re2 := regexp2.MustCompile(`^[[:digit:]]+$`, 0)
	if isMatch, _ := re2.MatchString(`12345667890`); isMatch {
		fmt.Println("Matched regexp2")
	} else {
		fmt.Println("No Match regexp2")
	}

Output:

Matched regexp
No Match regexp2

It'd be nice to support these larger character classes as well to keep compatibility with the standard library's regexp package.
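
Until such classes are supported, one workaround is to rewrite the POSIX names into explicit ASCII ranges before compiling. The sketch below uses only the standard library; `expandPOSIXClasses` and its table are hypothetical and cover only a handful of classes. It does not handle negated forms such as [[:^digit:]].

```go
package main

import (
	"fmt"
	"strings"
)

// posixToRanges maps a few POSIX class names to explicit ASCII ranges.
// The table is illustrative and incomplete.
var posixToRanges = strings.NewReplacer(
	"[:digit:]", "0-9",
	"[:alpha:]", "A-Za-z",
	"[:alnum:]", "0-9A-Za-z",
	"[:upper:]", "A-Z",
	"[:lower:]", "a-z",
)

// expandPOSIXClasses replaces the supported POSIX class names in pattern,
// so `[[:digit:]]` becomes `[0-9]`.
func expandPOSIXClasses(pattern string) string {
	return posixToRanges.Replace(pattern)
}

func main() {
	fmt.Println(expandPOSIXClasses(`^[[:digit:]]+$`)) // ^[0-9]+$
}
```

The expanded pattern uses only plain ranges, so it should mean the same thing to both regexp and regexp2.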

Request: unicode character class implementations

Thank you very much for the porting!

I checked your library and found that most unicode character classes have not been implemented yet.

Reference: http://www.fileformat.info/info/unicode/category/index.htm

It looks like the basic character categories, such as [\p{P}] (any punctuation), are available:

package main

import (
    "fmt"

    "github.com/dlclark/regexp2"
)

func main() {
    re := regexp2.MustCompile(`(?<=[カキケコ\p{Po}])ん+`, 0) // works
    isMatch, err := re.FindStringMatch(`ブック。んん`)
    if err == nil {
        fmt.Println(isMatch)
    }
}

But most advanced character classes (block) such as [\p{Katakana}] have not been implemented:

package main

import (
    "fmt"

    "github.com/dlclark/regexp2"
)

func main() {
    re := regexp2.MustCompile(`(?<=[カキケコ\p{Katakana}])ん+`, 0) // panic with [\p{Katakana}]
    isMatch, err := re.FindStringMatch(`ブック。んん`)
    if err == nil {
        fmt.Println(isMatch)
    }
}

The sample code above causes a panic: not implemented.

I hope you'll implement them in the future.

A bug when .* in the content to match

The code that caused the error:

[image: screenshot not preserved]

Why is the result nil?

This should be the correct result:

[image: screenshot not preserved]

Sample code:

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	r, err := regexp2.Compile(`(?<=1234\.\*56).*(?=890)`, regexp2.Compiled)
	if err != nil {
		panic(err)
	}

	m, err := r.FindStringMatch(`1234.*567890`)
	if err != nil {
		panic(err)
	}

	fmt.Println(m)
}

Force timeout for testing?

I'm trying to force a timeout as part of my unit testing. Unfortunately, the expression gets evaluated too quickly and never times out. Roughly, my code looks like:

https://play.golang.com/p/fuXQh3RdyuO

Example

package main

import (
	"github.com/dlclark/regexp2"
	"testing"
	"time"
)

var regex = regexp2.MustCompile(`\d{4}-\d{2}-\d{2}`, regexp2.None)

func init() {
	regex.MatchTimeout = 1 * time.Second
}

func StringMatches(input string) (bool, error) {
	return regex.MatchString(input)
}

func TestLastIndex(t *testing.T) {
	originalTimeout := regex.MatchTimeout
	regex.MatchTimeout = -1 * time.Nanosecond

	result, err := StringMatches("2023-03-28")
	if result == true {
		t.Error("expected match false due to timeout")
	}
	if err == nil {
		t.Error("expected timeout error")
	}

	regex.MatchTimeout = originalTimeout
}

The only major difference being that my regular expression is a more complicated date time string matcher with named groups.

Is there a way to force the evaluator to timeout for testing? Otherwise, I'm not sure how I can cover the error case of MatchString/FindStringMatch.

Add support backslash reference in Replace()

Test case:

func TestReplaceRef(t *testing.T) {
	re := MustCompile("(123)hello(789)", None)
	res, err := re.Replace("123hello789", "\\1456\\2", -1, -1)
	if err != nil {
		t.Fatal(err)
	}
	if res != "123456789" {
		t.Fatalf("Wrong result: %s", res)
	}
}

Result:

--- FAIL: TestReplaceRef (0.00s)
	regexp_test.go:775: Wrong result: \1456\2

not work with `(\d)\1{3}`

Hi,

I want to find repeated digits in a string.

string: 3331112233
regex: (\d)\1{3}

The result is nil.
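
A note on what the pattern asks for: `(\d)\1{3}` is one digit followed by three more copies of it, which means four identical digits in a row. The sample string has runs of at most three, so a nil result looks expected, and `(\d)\1{2}` would be the pattern for runs of three. The stdlib-only sketch below checks run lengths directly; `hasDigitRun` is a hypothetical helper and only considers ASCII digits.

```go
package main

import "fmt"

// hasDigitRun reports whether s contains at least n identical ASCII digits
// in a row.
func hasDigitRun(s string, n int) bool {
	run := 0
	var prev rune
	for _, r := range s {
		switch {
		case r < '0' || r > '9':
			run = 0 // a non-digit ends any run
		case run > 0 && r == prev:
			run++ // same digit again
		default:
			run = 1 // a new digit starts a new run
		}
		prev = r
		if run >= n {
			return true
		}
	}
	return false
}

func main() {
	s := "3331112233"
	fmt.Println(hasDigitRun(s, 4)) // false: what (\d)\1{3} requires
	fmt.Println(hasDigitRun(s, 3)) // true: what (\d)\1{2} requires
}
```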

error parsing regexp: unrecognized grouping construct: (?-1

package parse

import (
	"fmt"
	"github.com/dlclark/regexp2"
	"testing"
)

func TestJsonRe2(t *testing.T) {
	text := `{
  "code" : "0",
  "message" : "success",
  "responseTime" : 2,
  "traceId" : "a469b12c7d7aaca5",
  "returnCode" : null,
  "result" : {
    "total" : 0,
    "list" : [ ]
}
}`
	reg := `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`
	r, err := regexp2.Compile(reg, regexp2.RE2|regexp2.Multiline|regexp2.ECMAScript)
	if err != nil {
		fmt.Println(err)
		return
	}

	matchedStrings, err := r.FindStringMatch(text)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(matchedStrings)
}

output:

error parsing regexp: unrecognized grouping construct: (?-1 in `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`

But on https://regex101.com/ it works:
[image: screenshot not preserved]

compile failed

s := 	`[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`

regexp2.MustCompile(s, regexp2.None)

panic: regexp2: Compile(`[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`): error parsing regexp: unrecognized escape sequence \_ in `[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`

It panics, but the same pattern compiles in Python:

In [47]: s = r"""[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;
    ...: \/\*]+"""

In [48]: re.compile(s)
Out[48]:
re.compile(r'[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*[\'"][^\n\'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))[\'"][\s\)]*[\r\n;\/\*]+',
re.UNICODE)

Licensing and specific ATTRIB details

As part of an effort that includes packaging your library for Debian, I'm wondering if it would be possible to have more details about which particular files are covered by each original license.

In particular, could you provide some more details regarding these comments on ATTRIB:

Some of this code is ported from dotnet/corefx, which was released under this license:
...

Small pieces of code are copied from the Go framework under this license:
...

I am aware it might be a bit difficult to retrieve that history, but any insight would be much appreciated in the hopes of making sure licenses and copyright are attributed as faithfully as possible. Thanks in advance!

Problems with Negative Lookahead

re := regexp2.MustCompile(`(?m)^.*(?!/bin/bash)$`,0)
match,_ := re.FindStringMatch(string(passwd))

I'm trying to match every line except the ones containing /bin/bash, but the result is just the first line of /etc/passwd that contains /bin/bash.
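
A likely reason, hedged: in `^.*(?!/bin/bash)$` the lookahead is checked at the end of the line, where /bin/bash can never follow, so it always succeeds. The usual rewrite puts the lookahead at the start, for example `(?m)^(?!.*/bin/bash).*$` (not tested here). When a regex is not strictly needed, plain filtering does the same job. The sketch below swaps in strings.Contains from the standard library; `linesWithout` is a hypothetical helper and the passwd text is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// linesWithout returns the non-empty lines of text that do not contain substr.
func linesWithout(text, substr string) []string {
	var kept []string
	for _, line := range strings.Split(text, "\n") {
		if line != "" && !strings.Contains(line, substr) {
			kept = append(kept, line)
		}
	}
	return kept
}

func main() {
	// Made-up passwd-style sample, not real data.
	passwd := "root:x:0:0:root:/root:/bin/bash\n" +
		"daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
	for _, line := range linesWithout(passwd, "/bin/bash") {
		fmt.Println(line)
	}
}
```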

runtime error: index out of range [<number>] with length <samenumber>

One more that was fuzzed during the night ;)

package main

import (
        "fmt"
        "runtime/debug"

        "github.com/dlclark/regexp2"
)

var testCases = []struct {
        r, s []byte
}{
        {
                r: []byte{0x30, 0x28, 0x3f, 0x3e, 0x28, 0x29, 0x2b, 0x3f, 0x30, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x77},
                s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
        },
        {
                r: []byte{0x28, 0x3f, 0x3e, 0x28, 0x3f, 0x3e, 0x29, 0x2b, 0x3f, 0x3e, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
                s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x3e, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
        },
}

func test(r, s []byte) (b bool) {
        defer func() {
                if r := recover(); r != nil {
                        fmt.Println(r)
                        debug.PrintStack()
                        b = true
                }
        }()

        re, err := regexp2.Compile(string(r), regexp2.ECMAScript)
        if err != nil {
                return false
        }
        _, _ = re.FindStringMatch(string(s))
        return false
}

func main() {
        for _, c := range testCases {
                fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
                        c.r, c.s, string(c.r), string(c.s),
                )
                fmt.Println("#############################################################################")
                if test(c.r, c.s) {
                } else {
                        fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
                                c.r, c.s, string(c.r), string(c.s),
                        )
                }
        }
}

panics with

Test case regex='[]byte{0x30, 0x28, 0x3f, 0x3e, 0x28, 0x29, 0x2b, 0x3f, 0x30, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x77}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '0(?>()+?0)00000000w', '0000000000000000000'
#############################################################################
runtime error: index out of range [72] with length 72
goroutine 1 [running]:
runtime/debug.Stack(0x36, 0x0, 0x0)
        runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        runtime/debug/stack.go:16 +0x22
main.test.func1(0xc00015be38)
        command-line-arguments/test.go:27 +0x97
panic(0x4f0b40, 0xc0001320e0)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).backtrack(0xc000162000)
        github.com/dlclark/regexp2/runner.go:1033 +0x246
github.com/dlclark/regexp2.(*runner).execute(0xc000162000, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:904 +0x9b
github.com/dlclark/regexp2.(*runner).scan(0xc000162000, 0xc0001340a0, 0x13, 0x14, 0x0, 0x0, 0x7fffffffffffffff, 0x13, 0x14, 0x4490be)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000160080, 0xc00015bd00, 0xffffffffffffffff, 0xc0001340a0, 0x13, 0x14, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9710, 0x13, 0x13, 0x5b9730, 0x13, 0x13, 0x0)
        command-line-arguments/test.go:36 +0x168
main.main()
        command-line-arguments/test.go:46 +0x355
Test case regex='[]byte{0x28, 0x3f, 0x3e, 0x28, 0x3f, 0x3e, 0x29, 0x2b, 0x3f, 0x3e, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x3e, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(?>(?>)+?>)0000000000', '00000000000000>000000'
#############################################################################
runtime error: index out of range [32] with length 32
goroutine 1 [running]:
runtime/debug.Stack(0x36, 0x0, 0x0)
        runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        runtime/debug/stack.go:16 +0x22
main.test.func1(0xc00015be38)
        command-line-arguments/test.go:27 +0x97
panic(0x4f0b40, 0xc000132160)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).popcrawl(...)
        github.com/dlclark/regexp2/runner.go:938
github.com/dlclark/regexp2.(*runner).uncapture(...)
        github.com/dlclark/regexp2/runner.go:1467
github.com/dlclark/regexp2.(*runner).execute(0xc000162100, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:507 +0x408c
github.com/dlclark/regexp2.(*runner).scan(0xc000162100, 0xc000100120, 0x15, 0x18, 0x0, 0x0, 0x7fffffffffffffff, 0x15, 0x18, 0x4490be)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000160180, 0xc00015bd00, 0xffffffffffffffff, 0xc000100120, 0x15, 0x18, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9750, 0x15, 0x15, 0x5b9770, 0x15, 0x15, 0x0)
        command-line-arguments/test.go:36 +0x168
main.main()
        command-line-arguments/test.go:46 +0x355

Add more examples to README

Could you add more examples to the README? There's not a single runnable example of FindStringMatch, or FindNextMatch. I'm trying to use FindStringMatch to capture two capture groups in the below regexp, but the second one doesn't exist. Some more complex examples (find all matches for regexp, extract several capture groups from a match, regexps with lookaheads) on the README would be helpful for debugging. It looks like a really useful library (since it has support for lookahead expressions!) but I'm having a lot of trouble using it due to the lack of documentation.

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main() {
	re := regexp2.MustCompile(`(\b\w+)=(.*?(?=\s\w+=|$))`, 0)
	s := `timestamp=05/Dec/2018:14:39:41 -0500 foo=bar`
	if matches, _ := re.FindStringMatch(s); matches != nil {
		fmt.Printf("Group 0: %v\n", matches.String())
		gps := matches.Groups()
		fmt.Println(gps[1].Captures[0].String()) 
		fmt.Println(gps[0].Captures[1].String()) //why is this capture group nil?
	}
}

The matching results of strings containing Chinese characters are incorrect

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
	"regexp"
)

func main() {
	str := `我的邮箱是[email protected][email protected]`
	reg := `\b(((([*+\-=?^_{|}~\w])|([*+\-=?^_{|}~\w][*+\-=?^_{|}~\.\w]{0,}[*+\-=?^_{|}~\w]))[@]\w+([-.]\w+)*\.[A-Za-z]{2,8}))\b`
	re, _ := regexp.Compile(reg)
	re2, _ := regexp2.Compile(reg, 0)
	fmt.Println(re.FindAllString(str, -1))
	result, _ := re2.FindStringMatch(str)
	fmt.Println(result.String())
}

The result of regexp is [email protected] [email protected], but the regexp2 result is 163.com和[email protected].

Wrong first reference in Replace()

Test case:

func TestReplaceRef(t *testing.T) {
	re := MustCompile("(123)hello(789)", None)
	res, err := re.Replace("123hello789", "$1456$2", -1, -1)
	if err != nil {
		t.Fatal(err)
	}
	if res != "123456789" {
		t.Fatalf("Wrong result: %s", res)
	}
}

Result:

--- FAIL: TestReplaceRef (0.00s)
	regexp_test.go:775: Wrong result: $1456789
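
The replacement `$1456$2` is ambiguous: the digits after `$1` can be read as part of the group number, giving group 1456 instead of group 1 followed by the literal 456. The standard library has the same ambiguity and resolves it with braces, as the stdlib-only sketch below shows. regexp2 follows .NET substitution syntax, where `${1}456${2}` is the analogous form, but that is an assumption not tested here. `refRE`, `replaceAmbiguous` and `replaceBraced` are names made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var refRE = regexp.MustCompile(`(123)hello(789)`)

// replaceAmbiguous uses "$1456", which the standard library reads as
// group 1456. That group does not exist, so it expands to "".
func replaceAmbiguous(s string) string {
	return refRE.ReplaceAllString(s, "$1456$2")
}

// replaceBraced uses braces so that group 1 is followed by the literal "456".
func replaceBraced(s string) string {
	return refRE.ReplaceAllString(s, "${1}456${2}")
}

func main() {
	fmt.Println(replaceAmbiguous("123hello789")) // 789
	fmt.Println(replaceBraced("123hello789"))    // 123456789
}
```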

Support for Python-style named backreference

In RE2 compatibility mode, regexp2 supports Python-style named capture groups (eg. (?P<name>re)). But there doesn't appear to be support for Python-style named backreferences (eg. (?P=name)).

Do you have any plans to support those? More info here. Thanks!

Infinite match loop

The following test results in an infinite loop.

func TestOverlappingMatch(t *testing.T) {
	re := MustCompile(`((?:0*)+?(?:.*)+?)?`, 0)
	match, err := re.FindStringMatch("0\xfd")
	if err != nil {
		t.Fatal(err)
	}
	for match != nil {
		t.Logf("start: %d, length: %d", match.Index, match.Length)
		match, err = re.FindNextMatch(match)
		if err != nil {
			t.Fatal(err)
		}
	}
}

$ go test -v -run TestOverlappingMatch
=== RUN TestOverlappingMatch
TestOverlappingMatch: regexp_test.go:802: start: 0, length: 2
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
....

Compatibility issue with NKo Digits

It looks like \d matches ߀ (\u07c0) with regexp2, but not with the standard library regexp.

See the following example:

package main

import (
	"fmt"
	"regexp"

	"github.com/dlclark/regexp2"
)

func main() {
	re := regexp.MustCompile(`^\d$`)
	re2 := regexp2.MustCompile(`^\d$`, regexp2.RE2)

	notZero := "߀" // \u07c0

	match := re.MatchString(notZero)
	fmt.Printf("regexp: %v\n", match)

	match2, _ := re2.MatchString(notZero)
	fmt.Printf("regexp2: %v\n", match2)
}

Perhaps this is a known issue, but I'm wondering if there is a way to get additional compatibility with the standard library.
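
The difference probably comes from the two definitions of `\d`: the standard library limits it to ASCII 0-9, while .NET-style engines generally treat it as any Unicode decimal digit, and U+07C0 is one. An explicit `[0-9]` class sidesteps the question. The stdlib-only sketch below shows the two notions side by side; `isASCIIDigit` and `isUnicodeDigit` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// asciiDigit spells out the range, so it does not depend on how an engine
// defines \d.
var asciiDigit = regexp.MustCompile(`^[0-9]$`)

// isASCIIDigit reports whether s is exactly one ASCII digit.
func isASCIIDigit(s string) bool {
	return asciiDigit.MatchString(s)
}

// isUnicodeDigit reports whether s is exactly one Unicode decimal digit.
func isUnicodeDigit(s string) bool {
	runes := []rune(s)
	return len(runes) == 1 && unicode.IsDigit(runes[0])
}

func main() {
	nko := "\u07c0" // NKo digit zero
	fmt.Println(isUnicodeDigit(nko)) // true
	fmt.Println(isASCIIDigit(nko))   // false
	fmt.Println(isASCIIDigit("7"))   // true
}
```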

One more 4byte emoji issue

Hello dlclark,

I found one more issue, which is relevant to #5 .

I checked this with the brand-new Go 1.8, but I guess the version of the Go does not affect the issue.

1. Sample that panics

Condition:

  • the regex pattern contains "猟" (U+731F), followed by a Japanese character such as Hiragana
  • the target string contains 4-byte emojis, followed by the same Japanese character as above (Hiragana in this case)
  • the 4-byte emoji in the target is not adjacent to the same Japanese character behind it

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "猟な" // factor 1: the kanji + the Hiragana

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc(
		"🍺な" + // factor 2: 4byte emoji + the same Hiragana as above
                "なあ🍺な", // factor 3: the same Hiragana does not surround the 4byte emoji; if you remove "あ" from this, it works fine
		func(m regexp2.Match) string {
			return "࿗" + "1" + "࿘" + string(m.Capture.Runes()) + "࿌"
		}, -1, -1)

	pp.Println(result)
}

2. Sample that works fine

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "猟な" // works fine with the kanji + trailing Japanese char ("な" in the case)

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc(
		"な📍な"+ // works fine if the same "な" surrounds the 4byte emoji
			"な✔️な"+
			"な😏な"+
			"な⚾️な"+
			"な📣な"+
			"な🍣な"+
			"な🍺な"+
			"な📍✔️😏⚾️📣🍣🍺な", func(m regexp2.Match) string {
			return "࿗" + "1" + "࿘" + string(m.Capture.Runes()) + "࿌"
		}, -1, -1)

	pp.Println(result)
}

Best regards, 🙇
