regexp2's People

Contributors

dlclark, dop251, dthadi3, eclipseo, ghemawat, hunshcn, mstoykov, u1735067, vassudanagunta

regexp2's Issues

Wrong first reference in Replace()

Test case:

func TestReplaceRef(t *testing.T) {
	re := MustCompile("(123)hello(789)", None)
	res, err := re.Replace("123hello789", "$1456$2", -1, -1)
	if err != nil {
		t.Fatal(err)
	}
	if res != "123456789" {
		t.Fatalf("Wrong result: %s", res)
	}
}

Result:

--- FAIL: TestReplaceRef (0.00s)
	regexp_test.go:775: Wrong result: $1456789
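A possible workaround, sketched under assumptions: the replacement `$1456` appears to be read as a reference to group 1456 rather than group 1 followed by the literal `456`. The standard library `regexp` package has the same ambiguity and resolves it with braces, as in `${1}456`. regexp2 follows .NET replacement syntax, which should accept the same braced form, but that has not been verified against this issue. The sketch below uses only the standard library so that it runs without extra dependencies; the helper names are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceBraced disambiguates the group reference with braces so that
// "456" is treated as literal text following group 1.
func replaceBraced(input string) string {
	re := regexp.MustCompile(`(123)hello(789)`)
	return re.ReplaceAllString(input, "${1}456$2")
}

// replaceAmbiguous uses "$1456", which the standard library reads as a
// reference to group 1456. That group does not exist, so it expands to
// the empty string.
func replaceAmbiguous(input string) string {
	re := regexp.MustCompile(`(123)hello(789)`)
	return re.ReplaceAllString(input, "$1456$2")
}

func main() {
	fmt.Println(replaceBraced("123hello789"))    // 123456789
	fmt.Println(replaceAmbiguous("123hello789")) // 789
}
```

Note that the two engines treat the unknown group differently: the standard library drops it, while the output above shows regexp2 leaving `$1456` in place as literal text.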

xeger functionality

Does this library support xeger functionality? For example, I have the following regex, which is not supported by the standard regexp library.

(?!((?!a(b|c)z)|(?!a(c|d)z)))

I need to do something like

r, _ := regexp2.Compile(`(?!((?!a(b|c)z)|(?!a(c|d)z)))`, 0)
s, _ := r.GenerateMatchingString()

This would give me a string that matches the regex, if any exists, for example:

acz

Is this functionality already implemented? I believe Fare has this feature. (https://github.com/moodmosaic/Fare/blob/master/Src/Fare/Xeger.cs)

Could that code be used as a basis for adding this feature? I am willing to contribute it if it is welcome.

The matching results of strings containing Chinese characters are incorrect

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
	"regexp"
)

func main() {
	str := `我的邮箱是[email protected][email protected]`
	reg := `\b(((([*+\-=?^_{|}~\w])|([*+\-=?^_{|}~\w][*+\-=?^_{|}~\.\w]{0,}[*+\-=?^_{|}~\w]))[@]\w+([-.]\w+)*\.[A-Za-z]{2,8}))\b`
	re, _ := regexp.Compile(reg)
	re2, _ := regexp2.Compile(reg, 0)
	fmt.Println(re.FindAllString(str, -1))
	result, _ := re2.FindStringMatch(str)
	fmt.Println(result.String())
}

The result of regexp is [email protected] [email protected], but the regexp2 result is 163.com和[email protected].

Performance issue matching against beginning of very large string

I am tokenizing some text by matching a set of regexes against the beginning of a string holding the contents of a file. I noticed that regexp2 was extremely slow for this use-case, and after running the profiler found that the time was dominated by getRunes().

This is occurring because, before every match, regexp2 converts the entire 22kb string to a slice of runes. I've worked around the issue by pre-converting the string to a slice of runes myself, then using FindRunesMatch(), but it was quite surprising and non-obvious.

A solution would be to convert runes on the fly (as most matches are under 10 characters, converting the whole string each time is redundant). Looking at the code, it doesn't seem like it would be super painful to achieve. The runner would need to be modified to use DecodeRuneInString to advance the index into the string, rather than a direct index into a slice of runes.
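To make the proposal concrete, here is a minimal sketch of decoding runes on demand with `utf8.DecodeRuneInString` instead of converting the whole input up front. The `runeCursor` type and `firstRunes` function are hypothetical helpers written for illustration; they are not part of regexp2 and say nothing about how its runner is actually structured.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCursor walks a string one rune at a time without converting the
// whole string to []rune first. Hypothetical helper, not regexp2 code.
type runeCursor struct {
	s   string
	pos int // byte offset of the next rune
}

// next returns the rune at the current position and advances past it.
// ok is false once the end of the string has been reached.
func (c *runeCursor) next() (r rune, ok bool) {
	if c.pos >= len(c.s) {
		return 0, false
	}
	r, size := utf8.DecodeRuneInString(c.s[c.pos:])
	c.pos += size
	return r, true
}

// firstRunes decodes at most n runes from the start of s, so the cost
// depends on n rather than on len(s).
func firstRunes(s string, n int) []rune {
	c := runeCursor{s: s}
	out := make([]rune, 0, n)
	for len(out) < n {
		r, ok := c.next()
		if !ok {
			break
		}
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(string(firstRunes("héllo, wörld", 5)))
}
```

The trade-off is that random access by rune index is no longer constant time, which is presumably why a backtracking runner would prefer a rune slice.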

Panic when trying to match `(?:){40}`

r := MustCompile(`(?:){40}`, RE2)
m, err := r.FindStringMatch("12")

will panic with

panic: runtime error: index out of range [-1] [recovered]
        panic: runtime error: index out of range [-1]

goroutine 6 [running]:
testing.tRunner.func1.1(0x590420, 0xc000016320)
        testing/testing.go:988 +0x30d
testing.tRunner.func1(0xc000134120)
        testing/testing.go:991 +0x3f9
panic(0x590420, 0xc000016320)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).trackPush1(...)
        github.com/dlclark/regexp2/runner.go:992
github.com/dlclark/regexp2.(*runner).execute(0xc000146000, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:387 +0x4511
github.com/dlclark/regexp2.(*runner).scan(0xc000146000, 0xc000014230, 0x2, 0x2, 0x0, 0x0, 0x7fffffffffffffff, 0x0, 0x8, 0x8)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000132100, 0x5a3d00, 0x0, 0xc000014230, 0x2, 0x2, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0x21a
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
github.com/dlclark/regexp2.TestRE2ECMA(0xc000134120)
        github.com/dlclark/regexp2/regexp_re2_test.go:125 +0x8b
testing.tRunner(0xc000134120, 0x5b45d8)
        testing/testing.go:1039 +0xdc
created by testing.(*T).Run
        testing/testing.go:1090 +0x372
exit status 2
FAIL    github.com/dlclark/regexp2      0.006s

Things that I know don't matter:

  • whether RE2 or ECMAScript (the issue was found first in goja)
  • the input isn't important; the original one ("\xe90000000") had Unicode
  • the 40 needs to be 17 or bigger

This was found through fuzzing goja with the go-fuzz corpus for regexp which is why the example is such :). I may rewrite it to fuzz regexp2 as well and post it if there is interest.

Valid regex doesn't compile

Hi!

If a pattern contains \_, regexp2 fails to compile it. Example:

_, err := regexp2.Compile(`^/legacy/([\w|\d|\-\_]+)/([\w|\d|\-\_]+)/.*`, 0)
if err != nil {
	fmt.Println(err)
}

The error is: error parsing regexp: unrecognized escape sequence \_ in ^/legacy/([\w|\d|\-\_]+)/([\w|\d|\-\_]+)/.*)

This pattern works in the standard regexp package.

thanks!

Running MatchString is slow

Run the following example (https://go.dev/play/p/BDU6yN5NvEZ):

package main

import (
	"log"
	"regexp"
	"time"

	"github.com/dlclark/regexp2"
)

func main() {
	url := "https://www.dhgate.com/product/magnetic-liquid-eyeliner-magnetic-false-eyelashes/481362313.html"

	reg1 := regexp.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`)
	log.Println("start regexp match string...")
	begin := time.Now()
	reg1.MatchString(url)
	log.Println("time taken:", time.Since(begin))

	reg2 := regexp2.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`, regexp2.IgnoreCase)
	log.Println("start regexp2 match string...")
	begin = time.Now()
	reg2.MatchString(url)
	log.Println("time taken:", time.Since(begin))
}

output:

2021/12/08 14:16:30 start regexp match string...
2021/12/08 14:16:30 time taken: 21.583µs
2021/12/08 14:16:30 start regexp2 match string...

regexp2 version is v1.4.0
Hope it helps to improve performance.
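One likely contributor, offered as an assumption since the issue does not diagnose the cause: the unescaped `.` in `(?:.[a-z]+)+` matches letters as well as dots, so a backtracking engine has many ways to split the long path before it gives up and retries. If a literal dot was intended, escaping it leaves only one way to consume the host name. The sketch below uses the standard library, which is not a backtracking engine, so it does not reproduce the timing. It only shows that the tightened pattern still matches the URL from the report.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesProductURL reports whether url contains "dhgate", one or more
// ".label" segments, and then "/product/". The dot is escaped so each
// repetition must start at a literal '.'.
func matchesProductURL(url string) bool {
	re := regexp.MustCompile(`dhgate(?:\.[a-z]+)+/product/`)
	return re.MatchString(url)
}

func main() {
	url := "https://www.dhgate.com/product/magnetic-liquid-eyeliner-magnetic-false-eyelashes/481362313.html"
	fmt.Println(matchesProductURL(url)) // true
}
```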

Error while trying to match a string with a specific unicode against a RegExp that contains a space and a group

When trying to match (phrase.MatchString(X)) messages like gg 󠀀 󠀀 (notice that these are not the regular spaces) against a phrase like regexp2.MustCompile("\\bcool (house)\\b", 0), the following error will be thrown:

panic: runtime error: index out of range [917504] with length 128

goroutine 1 [running]:
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000180540, {0xc000b70948, 0x6, 0x0?}, 0x0?, 0x0, 0x6)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/syntax/prefix.go:716 +0x3bb
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000623a00)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:1305 +0x366
github.com/dlclark/regexp2.(*runner).scan(0xc000623a00, {0xc000b70948?, 0x6, 0xc000b70948?}, 0x6?, 0x1, 0xc00008f8e8?)
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:130 +0x1e5
github.com/dlclark/regexp2.(*Regexp).run(0xc0000f6200, 0xf4?, 0xffffffffffffffff, {0xc000b70948, 0x6, 0x6})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/runner.go:91 +0xfa
github.com/dlclark/regexp2.(*Regexp).MatchString(0x10f9c40?, {0x108f0f4?, 0xc00008fb48?})
        C:/Users/X/go/pkg/mod/github.com/dlclark/[email protected]/regexp.go:213 +0x45
main.main()
        C:/Users/X/Desktop/GoRegExTests/test.go:127 +0xbdc

The error is only being thrown when:
a. The message contains those unicode characters
b. The RegExp contains a space and a group like (house)

The RegExp above is just a very basic example to demonstrate this problem.

runtime error: index out of range [#number] with length 128

This is again from fuzzing:

package main

import (
        "fmt"
        "runtime/debug"

        "github.com/dlclark/regexp2"
)

var testCases = []struct {
        r, s []byte
}{
        {
                r: []byte{0x30, 0xbf, 0x30, 0x2a, 0x30, 0x30},
                s: []byte{0xf0, 0xb0, 0x80, 0x91, 0xf7},
        },
        {
                s: []byte{0xf3, 0x80, 0x80, 0x87, 0x80, 0x89},
                r: []byte{0x30, 0xaf, 0xf3, 0x30, 0x2a},
        },
}

func test(r, s []byte) (b bool) {
        defer func() {
                if r := recover(); r != nil {
                        fmt.Println(r)
                        debug.PrintStack()
                        b = true
                }
        }()

        re, err := regexp2.Compile(string(r), regexp2.ECMAScript)
        if err != nil {
                return false
        }
        _, _ = re.FindStringMatch(string(s))
        return false
}

func main() {
        for _, c := range testCases {
                fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
                        c.r, c.s, string(c.r), string(c.s),
                )
                fmt.Println("#############################################################################")
                if test(c.r, c.s) {
                } else {
                        fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
                                c.r, c.s, string(c.r), string(c.s),
                        )
                }
        }
}

will get you

Test case regex='[]byte{0x30, 0xbf, 0x30, 0x2a, 0x30, 0x30}', string='[]byte{0xf0, 0xb0, 0x80, 0x91, 0xf7}' panics
string values '00*00', '𰀑'
#############################################################################
runtime error: index out of range [196625] with length 128
goroutine 1 [running]:
runtime/debug.Stack(0x3b, 0x0, 0x0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000113e38)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:27 +0x97
panic(0x4f0ac0, 0xc0001420a0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/panic.go:969 +0x166
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc0001602a0, 0xc000136078, 0x2, 0x2, 0x0, 0x0, 0x2, 0x7f9a4a2befb8)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/syntax/prefix.go:716 +0x3be
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000170000, 0xc000170000)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:1305 +0x4d3
github.com/dlclark/regexp2.(*runner).scan(0xc000170000, 0xc000136078, 0x2, 0x2, 0x0, 0xc000113d00, 0x7fffffffffffffff, 0x4, 0xfffd, 0x5)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:130 +0x128
github.com/dlclark/regexp2.(*Regexp).run(0xc00016e080, 0xc000113d00, 0xffffffffffffffff, 0xc000136078, 0x2, 0x2, 0x0, 0x0, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/regexp.go:159
main.test(0x5ba04c, 0x6, 0x6, 0x5ba034, 0x5, 0x5, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:36 +0x168
main.main()
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:46 +0x355
Test case regex='[]byte{0x30, 0xaf, 0xf3, 0x30, 0x2a}', string='[]byte{0xf3, 0x80, 0x80, 0x87, 0x80, 0x89}' panics
string values '00*', '󀀇'
#############################################################################
runtime error: index out of range [786439] with length 128
goroutine 1 [running]:
runtime/debug.Stack(0x3b, 0x0, 0x0)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000113e38)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:27 +0x97
panic(0x4f0ac0, 0xc000142100)
        /home/mstoykov/.gvm/gos/go1.14.9/src/runtime/panic.go:969 +0x166
github.com/dlclark/regexp2/syntax.(*BmPrefix).Scan(0xc000160540, 0xc0001360d0, 0x3, 0x4, 0x0, 0x0, 0x3, 0xc00016e100)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/syntax/prefix.go:716 +0x3be
github.com/dlclark/regexp2.(*runner).findFirstChar(0xc000170100, 0xc000170100)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:1305 +0x4d3
github.com/dlclark/regexp2.(*runner).scan(0xc000170100, 0xc0001360d0, 0x3, 0x4, 0x0, 0xc000113d00, 0x7fffffffffffffff, 0x5, 0xfffd, 0x6)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:130 +0x128
github.com/dlclark/regexp2.(*Regexp).run(0xc00016e180, 0xc000113d00, 0xffffffffffffffff, 0xc0001360d0, 0x3, 0x4, 0x0, 0x0, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/dlclark/regexp2/regexp.go:159
main.test(0x5ba03c, 0x5, 0x5, 0x5ba054, 0x6, 0x6, 0x0)
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:36 +0x168
main.main()
        /home/mstoykov/.gvm/pkgsets/go1.14.9/global/src/github.com/mstoykov/goja-regexp2-fuzzing/crashers/test.go:46 +0x355

I have more test cases but these ones were the shortest and just as readable :(

Bulk replace

Hello,

I'd like to ask whether you have any plans to add bulk replace functions to regexp2, like those in the Go standard regexp package:
https://golang.org/pkg/regexp/#Regexp.ReplaceAll

  • func (re *Regexp) ReplaceAll(src, repl []byte) []byte
    
  • func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
    
  • func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
    
  • func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
    
  • func (re *Regexp) ReplaceAllString(src, repl string) string
    
  • func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string
    

Thank you,

Panics found through fuzzing

Here is a small script reproducing a panic that I found while fuzzing:
Notes:

  • those were the more readable examples (without strange non printable characters)
  • I use []byte mostly because it makes the copying between output and program easier

package main

import (
	"fmt"
	"runtime/debug"

	"github.com/dlclark/regexp2"
)

var testCases = []struct {
	r, s []byte
}{
	{
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
		r: []byte{0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29},
	},
	{
		r: []byte{0x28, 0x5c, 0x32, 0x28, 0x3f, 0x28, 0x30, 0x29, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},
	{
		r: []byte{0x28, 0x3f, 0x28, 0x29, 0x29, 0x5c, 0x31, 0x30, 0x28, 0x3f, 0x28, 0x30, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},

	{
		r: []byte{0x28, 0x29, 0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29},
		s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
	},
}

func test(r, s []byte) (b bool) {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
			debug.PrintStack()
			b = true
		}
	}()

	re, err := regexp2.Compile(string(r), regexp2.ECMAScript|regexp2.Multiline)
	if err != nil {
		return false
	}
	_, _ = re.FindStringMatch(string(s))
	return false
}

func main() {
	for _, c := range testCases {
		fmt.Println("#############################################################################")
		if test(c.r, c.s) {
			fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
				c.r, c.s, string(c.r), string(c.s),
			)
		} else {
			fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
				c.r, c.s, string(c.r), string(c.s),
			)
		}
	}
}

Output is

#############################################################################
runtime error: index out of range [3] with length 3
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162a0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d6000, 0x3, 0x1, 0x0)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4000, 0x3, 0x1, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4000, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4000, 0xc000018150, 0x9, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0x9, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2080, 0xc000083d00, 0xffffffffffffffff, 0xc000018150, 0x9, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b91b8, 0xa, 0xa, 0x5b9188, 0x9, 0x9, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(()\7(?())', '000000000'
#############################################################################
runtime error: index out of range [2] with length 2
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162c0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d60e0, 0x2, 0x0, 0x1)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4100, 0x2, 0x0, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4100, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4100, 0xc0000181b0, 0x9, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0x9, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2180, 0xc000083d00, 0xffffffffffffffff, 0xc0000181b0, 0x9, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9198, 0x9, 0x9, 0x5b91a8, 0x9, 0x9, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x5c, 0x32, 0x28, 0x3f, 0x28, 0x30, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(\2(?(0))', '000000000'
#############################################################################
runtime error: index out of range [1] with length 1
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc0000162e0)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d61c0, 0x1, 0x0, 0x1)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4200, 0x1, 0x0, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4200, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4200, 0xc0000281c0, 0xd, 0x10, 0x0, 0x0, 0x7fffffffffffffff, 0xd, 0x10, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2280, 0xc000083d00, 0xffffffffffffffff, 0xc0000281c0, 0xd, 0x10, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9588, 0xd, 0xd, 0x5b9598, 0xd, 0xd, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x3f, 0x28, 0x29, 0x29, 0x5c, 0x31, 0x30, 0x28, 0x3f, 0x28, 0x30, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(?())\10(?(0)', '0000000000000'
#############################################################################
runtime error: index out of range [4] with length 4
goroutine 1 [running]:
runtime/debug.Stack(0x34, 0x0, 0x0)
	runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
	runtime/debug/stack.go:16 +0x22
main.test.func1(0xc000083e38)
	command-line-arguments/test.go:36 +0x97
panic(0x4f0c20, 0xc000016320)
	runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*Match).addMatch(0xc0000d62a0, 0x4, 0x1, 0x0)
	github.com/dlclark/regexp2/match.go:170 +0x31c
github.com/dlclark/regexp2.(*runner).capture(0xc0000d4300, 0x4, 0x1, 0x1)
	github.com/dlclark/regexp2/runner.go:1420 +0x9e
github.com/dlclark/regexp2.(*runner).execute(0xc0000d4300, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:254 +0x276e
github.com/dlclark/regexp2.(*runner).scan(0xc0000d4300, 0xc000018210, 0xc, 0xc, 0x0, 0x0, 0x7fffffffffffffff, 0xc, 0xc, 0x4490be)
	github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc0000d2380, 0xc000083d00, 0xffffffffffffffff, 0xc000018210, 0xc, 0xc, 0x0, 0x0, 0x0)
	github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
	github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b91c8, 0xc, 0xc, 0x5b91d8, 0xc, 0xc, 0x0)
	command-line-arguments/test.go:45 +0x168
main.main()
	command-line-arguments/test.go:52 +0x174
Test case regex='[]byte{0x28, 0x29, 0x28, 0x28, 0x29, 0x5c, 0x37, 0x28, 0x3f, 0x28, 0x29, 0x29}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '()(()\7(?())', '000000000000'

Infinite match loop

The following test results in an infinite loop.

func TestOverlappingMatch(t *testing.T) {
	re := MustCompile(`((?:0*)+?(?:.*)+?)?`, 0)
	match, err := re.FindStringMatch("0\xfd")
	if err != nil {
		t.Fatal(err)
	}
	for match != nil {
		t.Logf("start: %d, length: %d", match.Index, match.Length)
		match, err = re.FindNextMatch(match)
		if err != nil {
			t.Fatal(err)
		}
	}
}

$ go test -v -run TestOverlappingMatch
=== RUN TestOverlappingMatch
TestOverlappingMatch: regexp_test.go:802: start: 0, length: 2
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
TestOverlappingMatch: regexp_test.go:802: start: 1, length: 1
....

Regex Multiline

I have the regex ^(ac|bb)$\n and I do not set the Multiline option. I expected MustCompile to fail, but it compiles and the pattern matches the string "ac\n". How can I make this case produce an error?

One more 4byte emoji issue

Hello dlclark,

I found one more issue, which is related to #5.

I checked this with the brand-new Go 1.8, but I guess the Go version does not affect the issue.

1. Sample that panics

Condition:

  • the regex pattern contains "猟" (U+731F), followed by a Japanese character such as Hiragana
  • the target string contains 4byte emojis, followed by the same Japanese character as above (Hiragana in this case)
  • the 4byte emoji in the target is not adjacent to the same Japanese character behind it

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "猟な" // factor 1: the kanji + the Hiragana

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc(
		"🍺な" + // factor 2: 4byte emoji + the same Hiragana as above
                "なあ🍺な", // factor 3: the same Hiragana does not surround the 4byte emoji; if you remove "あ" from this, it works fine
		func(m regexp2.Match) string {
			return "࿗" + "1" + "࿘" + string(m.Capture.Runes()) + "࿌"
		}, -1, -1)

	pp.Println(result)
}

2. Sample that works fine

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "猟な" // works fine with the kanji + trailing Japanese char ("な" in the case)

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc(
		"な📍な"+ // works fine if the same "な" surrounds the 4byte emoji
			"な✔️な"+
			"な😏な"+
			"な⚾️な"+
			"な📣な"+
			"な🍣な"+
			"な🍺な"+
			"な📍✔️😏⚾️📣🍣🍺な", func(m regexp2.Match) string {
			return "࿗" + "1" + "࿘" + string(m.Capture.Runes()) + "࿌"
		}, -1, -1)

	pp.Println(result)
}

Best regards, 🙇

Is it possible to get the name of the currently matched group?

Say I have a regex to tokenize some language..

# in python.
regex = re.compile(
    "(?P<comment>#.*?$)|"
    "(?P<newline>\n)|"     # has to go ahead of the whitespace
    "(?P<comma>,)|"       
    "(?P<double_quote_string>\".*?\")|" 
    "(?P<single_quote_string>'.*?')|"   
    "(?P<whitespace>[ \t\r\f\v]+)|"    ... etc

Here you expect to get multiple matches for each group name when tokenizing a file and you want to keep the ordering of the tokens.

If I use the same approach using regexp2 can I go from match to group name? E.g. how do I get the last matched group name for a match? Is that possible?
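For comparison, here is a minimal sketch of this tokenizing approach using the standard library `regexp` package, where the name of the group that participated in each match is recovered through `SubexpNames`. The pattern is a cut-down, hypothetical version of the Python one above. regexp2's `Match` exposes `Groups()`, and each group carries a name and its captures, so a similar loop looks feasible there, but treat that as an assumption to verify rather than a confirmed answer.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRe is a small hypothetical tokenizer pattern; each alternative is
// a named group so the group name can serve as the token type.
var tokenRe = regexp.MustCompile(
	`(?P<comment>#[^\n]*)|` +
		`(?P<newline>\n)|` +
		`(?P<comma>,)|` +
		`(?P<whitespace>[ \t\r\f\v]+)|` +
		`(?P<word>\w+)`)

type token struct {
	kind, text string
}

// tokenize returns tokens in source order, each labelled with the name
// of the group that participated in the match.
func tokenize(src string) []token {
	names := tokenRe.SubexpNames()
	var out []token
	for _, loc := range tokenRe.FindAllStringSubmatchIndex(src, -1) {
		// loc[2*i] and loc[2*i+1] bound group i; -1 means it did not match.
		for i := 1; i < len(names); i++ {
			if loc[2*i] >= 0 {
				out = append(out, token{names[i], src[loc[2*i]:loc[2*i+1]]})
				break
			}
		}
	}
	return out
}

func main() {
	for _, tok := range tokenize("a, b # note\n") {
		fmt.Printf("%-10s %q\n", tok.kind, tok.text)
	}
}
```

Because the alternatives are mutually exclusive at any given position, exactly one named group participates in each match, which is what makes the first non-negative index a reliable label.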

Seems to fail a positive lookahead

Hello, I was checking it out and it seems to fail a regular expression. For a given text like this one, the expression ((Art\.\s\d+)[\S\s]*?(?=Art\.\s\d+)) fails to match every Art. block in the text. I've tested the expression on this website and there it gives me the correct count of 12 matches.

Am I missing something? Maybe a multiline flag?

FYI: a new "absent operator" on Ruby 2.4.1

This is NOT an issue, just a note that a new "absent operator" has been implemented in Onigmo, Ruby's regexp library. Sorry if this is a bother.

Note that the implementation of the operator has a rigid background theory: https://staff.aist.go.jp/tanaka-akira/pub/prosym49-akr-paper.pdf

I recognize that regexp2 is based on the .NET Framework engine, so extending the library like that might not be appropriate in some cases.
I don't need the operator right now; I'm only mentioning it in case you are interested in it.

Cheers,

Continuous 4byte emoji would crash when ReplaceFunc()

Hello, it's been a long time.

Today I found an issue with ReplaceFunc() involving some special "4byte" emojis.

  • sample 4byte emojis: 📍😏️📣🍣🍺
  • sample 3byte emoji: ✔️⚾️

You can inspect the above with http://r12a.github.io/apps/conversion/.
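The byte lengths can also be checked locally. The following is a small sketch using only the standard library; `inspect` and `runeInfo` are names made up for this example. It lists each code point of a string with the length of its UTF-8 encoding, which shows that the "3byte emoji" ✔️ is really two 3-byte code points (U+2714 plus the variation selector U+FE0F).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeInfo describes one code point of a string.
type runeInfo struct {
	r     rune
	bytes int // length of the UTF-8 encoding
}

// inspect lists every code point in s with its UTF-8 byte length.
func inspect(s string) []runeInfo {
	var out []runeInfo
	for _, r := range s {
		out = append(out, runeInfo{r, utf8.RuneLen(r)})
	}
	return out
}

func main() {
	// U+1F4CD (round pushpin) followed by U+2714 U+FE0F (check mark plus
	// variation selector), written as escapes to avoid encoding surprises.
	for _, info := range inspect("\U0001F4CD\u2714\uFE0F") {
		fmt.Printf("U+%04X  %d bytes\n", info.r, info.bytes)
	}
}
```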

Sample1: causes panic

Please take a look at the following: You can reproduce the issue by uncommenting the str assignment lines one by one.

As far as I checked, ReplaceFunc() panics under the following conditions:

  • the target contains some continuous 4byte emojis, and
  • the regex contains 3-byte UTF-8 characters and contains NO 4byte emojis

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	str := "高" // panic: Japanese Kanji
	// str := "は" // panic: Japanese Hiragana
	// str := "パ" // panic: Japanese Katakana
	// str := "[a-zA-Z0-9]{,2}" // works fine: Japanese Hiragana
	// str := "峰起|烽起" // works fine: longer Japanese Hiragana (I wonder why)
	// str := "フトレス" // panic: longer Japanese Katakana
	// str := "ALLWAYS|Allways|allways|AllWays" // works fine: Alphabet
	// str := "📍" // works fine: 4byte emoji
	// str := "📍📍" // works fine: continuous 4byte emoji
	// str := "✔️" // panic: 3byte emoji
	// str := "✔️✔️" // panic: continuous 3byte emoji
	// str := "📍️✔️" // works fine: 4 and 3byte emoji
	// str := "️✔📍️" // works fine: 3 and 4byte emoji
	// str := "📍️は️" // works fine: 4byte emoji and Hiragana
	// str := "️は📍️" // works fine: Hiragana and 4byte emoji

	re := regexp2.MustCompile(str, 0)
	result, _ := re.ReplaceFunc("📍✔️😏⚾️📣🍣🍺🍺 <- continuous 4byte emoji 寿司ビール文字あり", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

Sample2: all works fine

The following is a kind of control group that works fine. The key is that the target contains no "continuous 4byte emojis".

package main

import (
	"github.com/dlclark/regexp2"
	"github.com/k0kubun/pp"
)

func main() {
	// All of the following patterns work fine, perhaps because "✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし" contains no continuous 4byte emojis. You can check them by uncommenting them one by one.
	str := "高"
	// str := "は"
	// str := "パ"
	// str := "[a-zA-Z0-9]{,2}"
	// str := "峰起|烽起"
	// str := "フトレス"
	// str := "ALLWAYS|Allways|allways|AllWays"
	// str := "📍" 
	// str := "📍📍" 
	// str := "✔️" 
	// str := "✔️✔️" 
	// str := "📍️✔️" 
	// str := "️✔📍️" 
	// str := "📍️は️" 
	// str := "️は📍️" 

	re := regexp2.MustCompile(str, 0)
	// The following target works fine: there are no continuous 4byte emojis
	result, _ := re.ReplaceFunc("✔✔⚾⚾️ <- 3byte emoji 寿司ビール文字なし", func(m regexp2.Match) string {
		return "࿗" + "࿘" + string(m.Capture.Runes()) + "࿌"
	}, -1, -1)

	pp.Println(result)
}

FYI

The issue looks a little bit similar to "sushi-beer" issue: https://gist.github.com/kamipo/37576ce436c564d8cc28

I hope you can check and fix it.

Best regards, 🙇

Support for Python-style named backreference

In RE2 compatibility mode, regexp2 supports Python-style named capture groups (eg. (?P<name>re)). But there doesn't appear to be support for Python-style named backreferences (eg. (?P=name)).

Do you have any plans to support those? More info here. Thanks!

error parsing regexp: unrecognized grouping construct: (?-1

package parse

import (
	"fmt"
	"github.com/dlclark/regexp2"
	"testing"
)

func TestJsonRe2(t *testing.T) {
	text := `{
  "code" : "0",
  "message" : "success",
  "responseTime" : 2,
  "traceId" : "a469b12c7d7aaca5",
  "returnCode" : null,
  "result" : {
    "total" : 0,
    "list" : [ ]
}
}`
	reg := `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`
	r, err := regexp2.Compile(reg, regexp2.RE2|regexp2.Multiline|regexp2.ECMAScript)
	if err != nil {
		fmt.Println(err)
		return
	}

	matchedStrings, err := r.FindStringMatch(text)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(matchedStrings)
}

output:

error parsing regexp: unrecognized grouping construct: (?-1 in `/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/`

However, the same pattern works on https://regex101.com/.

The regexp is not working in the following code. Could you please correct it if my usage of the "regexp2" library is wrong?

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	re, _ := regexp2.Compile("Deployment", 0)
	fmt.Println(re.MatchString("D.*"))        // ExpectedOutput: true, ActualOutput: false
	fmt.Println(re.MatchString("D*"))         // ExpectedOutput: true, ActualOutput: false
	fmt.Println(re.MatchString("Dep"))        // ExpectedOutput: true, ActualOutput: false
	fmt.Println(re.MatchString("Deployment")) // ExpectedOutput: true, ActualOutput: true
}
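
The calls above appear to pass the pattern where the input belongs: the compiled expression is the literal Deployment, and D.*, D* and Dep are then searched as plain text, none of which contains "Deployment". Below is a minimal sketch with the arguments the other way round. It uses the standard library regexp package so that it runs without dependencies, and the helper name matchesName is made up for illustration. regexp2's MatchString also takes the input text as its argument, but returns (bool, error).

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesName reports whether name matches pattern.
// The pattern is what gets compiled; name is the text being searched.
func matchesName(pattern, name string) bool {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false
	}
	return re.MatchString(name)
}

func main() {
	for _, p := range []string{"D.*", "D*", "Dep", "Deployment"} {
		fmt.Println(p, matchesName(p, "Deployment"))
	}
}
```

With this argument order all four patterns report a match against "Deployment".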

Add support for backslash references in Replace()

Test case:

func TestReplaceRef(t *testing.T) {
	re := MustCompile("(123)hello(789)", None)
	res, err := re.Replace("123hello789", "\\1456\\2", -1, -1)
	if err != nil {
		t.Fatal(err)
	}
	if res != "123456789" {
		t.Fatalf("Wrong result: %s", res)
	}
}

Result:

--- FAIL: TestReplaceRef (0.00s)
	regexp_test.go:775: Wrong result: \1456\2
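
Until backslash references are supported, one possible workaround is to translate them before calling Replace(). The sketch below is an assumption-laden illustration, not part of the library: the helper toDollarRefs is hypothetical, it treats a backslash followed by a single digit as a group reference (matching the expectation in the test case, where \1456 means group 1 followed by "456"), and it assumes the replacement syntax accepts the braced form ${1}. Escaped backslashes such as \\1 are not handled.

```go
package main

import (
	"fmt"
	"regexp"
)

// backslashRef matches a backslash followed by exactly one ASCII digit, e.g. \1.
var backslashRef = regexp.MustCompile(`\\[0-9]`)

// toDollarRefs rewrites every \N reference into the braced form ${N}.
// Hypothetical helper: it does not recognise escaped backslashes (\\1).
func toDollarRefs(replacement string) string {
	return backslashRef.ReplaceAllStringFunc(replacement, func(ref string) string {
		// ref is two bytes long: the backslash and the digit.
		return "${" + ref[1:] + "}"
	})
}

func main() {
	fmt.Println(toDollarRefs(`\1456\2`)) // prints ${1}456${2}
}
```

The braces keep the group number separate from the digits that follow it, so the same rewrite also avoids the ambiguity of $1456.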

Problems with Negative Lookahead

re := regexp2.MustCompile(`(?m)^.*(?!/bin/bash)$`,0)
match,_ := re.FindStringMatch(string(passwd))

I'm trying to match all the lines except the ones containing /bin/bash, but the result is just the first line of /etc/passwd, which contains /bin/bash.
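
In the pattern above the lookahead is evaluated at the position just before $, where /bin/bash can never follow, so it always succeeds. The usual form puts the lookahead at the start of the line, as in (?m)^(?!.*/bin/bash).*$. Note also that FindStringMatch returns only the first match. If a regular expression is not a requirement, a plain Go filter gives the same result; the sketch below uses only the standard library, and the helper name linesWithout is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// linesWithout returns the non-empty lines of text that do not contain substr.
func linesWithout(text, substr string) []string {
	var kept []string
	for _, line := range strings.Split(text, "\n") {
		if line != "" && !strings.Contains(line, substr) {
			kept = append(kept, line)
		}
	}
	return kept
}

func main() {
	// Two made-up passwd-style lines, used only as sample input.
	passwd := "root:x:0:0:root:/root:/bin/bash\n" +
		"daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
	for _, line := range linesWithout(passwd, "/bin/bash") {
		fmt.Println(line)
	}
}
```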

Support ASCII Character Classes

Hi,

Thank you for the library. I needed negative lookbehinds and was disappointed to find them not supported in the standard Go regexp package.

In the course of converting some code over to use your package, I had to modify some of the regexes to use Perl character classes instead of the ASCII classes defined here: https://github.com/google/re2/wiki/Syntax

Example: https://play.golang.org/p/MlCaJtyvQ7q

Copied below as well:

	re := regexp.MustCompile(`^[[:digit:]]+$`)
	if isMatch := re.MatchString(`12345667890`); isMatch {
		fmt.Println("Matched regexp")
	} else {
		fmt.Println("No Match regexp")
	}
	
	re2 := regexp2.MustCompile(`^[[:digit:]]+$`, 0)
	if isMatch, _ := re2.MatchString(`12345667890`); isMatch {
		fmt.Println("Matched regexp2")
	} else {
		fmt.Println("No Match regexp2")
	}

Output:

Matched regexp
No Match regexp2

It'd be nice to support these larger character classes as well to keep compatibility with the standard library's regexp package.
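
As a stopgap while these classes are unsupported, the class names can be expanded to explicit ranges before compiling. The following is only a sketch under stated assumptions: expandPOSIXClasses is a hypothetical helper, it covers a handful of ASCII classes using the equivalences listed in the RE2 syntax page, it does a purely textual replacement (so it would also rewrite a class name that appears outside a bracket expression), and it does not handle negated forms such as [:^digit:].

```go
package main

import (
	"fmt"
	"strings"
)

// posixToRanges maps a few POSIX class names to the equivalent ASCII ranges.
var posixToRanges = strings.NewReplacer(
	"[:digit:]", "0-9",
	"[:alpha:]", "A-Za-z",
	"[:alnum:]", "0-9A-Za-z",
	"[:upper:]", "A-Z",
	"[:lower:]", "a-z",
	"[:xdigit:]", "0-9A-Fa-f",
)

// expandPOSIXClasses rewrites POSIX class names inside a pattern into ranges.
// Hypothetical helper; the replacement is textual and not syntax-aware.
func expandPOSIXClasses(pattern string) string {
	return posixToRanges.Replace(pattern)
}

func main() {
	fmt.Println(expandPOSIXClasses(`^[[:digit:]]+$`)) // prints ^[0-9]+$
}
```

The expanded pattern can then be passed to either engine unchanged.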

compile failed

s := 	`[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`

regexp.MustCompile(s, regexp.None)

panic: regexp2: Compile(`[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`): error parsing regexp: unrecognized escape sequence \_ in `[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;\/\*]+`

It panics, but the same pattern compiles successfully in Python.

In [47]: s = r"""[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*['"][^\n'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))['"][\s\)]*[\r\n;
    ...: \/\*]+"""

In [48]: re.compile(s)
Out[48]:
re.compile(r'[\r\n;\/\*]+\s*\b(include|require)(_once)?\b[\s\(]*[\'"][^\n\'"]{1,100}((\.(jpg|png|txt|jpeg|log|tmp|db|cache)|\_(tmp|log))|((http|https|file|php|data|ftp)\:\/\/\[.{0,25}))[\'"][\s\)]*[\r\n;\/\*]+',
re.UNICODE)
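
The error message points at the \_ escape. Since an underscore has no special meaning in a regular expression, dropping the backslash should not change what the pattern matches. A minimal sketch of that workaround follows; unescapeUnderscore is a hypothetical helper, and it is deliberately narrow: it only touches \_ and leaves every other escape, including an escaped backslash followed by an underscore (\\_), as written.

```go
package main

import (
	"fmt"
	"strings"
)

// unescapeUnderscore rewrites \_ to _ and copies everything else verbatim.
func unescapeUnderscore(pattern string) string {
	var b strings.Builder
	for i := 0; i < len(pattern); i++ {
		if pattern[i] == '\\' && i+1 < len(pattern) {
			next := pattern[i+1]
			if next != '_' {
				b.WriteByte('\\') // keep any other escape as written
			}
			b.WriteByte(next)
			i++ // the escaped byte has been consumed
			continue
		}
		b.WriteByte(pattern[i])
	}
	return b.String()
}

func main() {
	fmt.Println(unescapeUnderscore(`\.(jpg|png)|\_(tmp|log)`))
}
```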

Does not work with `(\d)\1{3}`

Hi,

I want to find repeated digits in a string.

string: 3331112233
regex: (\d)\1{3}

The result is nil.
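
For reference, (\d)\1{3} means one captured digit followed by three more copies of it, that is, four identical digits in a row. The sample string only contains runs of at most three ("333" and "111"), so a nil result is expected; (\d)\1{2} would match "333". The sketch below expresses the same check in plain Go without a regex engine, which makes the run length explicit. The helper firstDigitRun is hypothetical.

```go
package main

import "fmt"

// firstDigitRun returns the first n identical consecutive ASCII digits
// found in s, or "" if no digit repeats n times in a row.
func firstDigitRun(s string, n int) string {
	run := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			run = 0
			continue
		}
		if i > 0 && s[i-1] == c {
			run++
		} else {
			run = 1
		}
		if run == n {
			return s[i-n+1 : i+1]
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", firstDigitRun("3331112233", 4)) // prints ""
	fmt.Printf("%q\n", firstDigitRun("3331112233", 3)) // prints "333"
}
```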

Compatibility issue with NKo Digits

It looks like \d matches ߀ (\u07c0) with regexp2, but not with the standard library regexp.

See the following example:

package main

import (
	"fmt"
	"regexp"

	"github.com/dlclark/regexp2"
)

func main() {
	re := regexp.MustCompile(`^\d$`)
	re2 := regexp2.MustCompile(`^\d$`, regexp2.RE2)

	notZero := "߀" // \u07c0

	match := re.MatchString(notZero)
	fmt.Printf("regexp: %v\n", match)

	match2, _ := re2.MatchString(notZero)
	fmt.Printf("regexp2: %v\n", match2)
}

Perhaps this is a known issue, but I'm wondering if there is a way to get additional compatibility with the standard library.
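
Some background that may explain the difference: U+07C0 belongs to the Unicode decimal digit category (Nd), so an engine whose \d is Unicode-aware accepts it, whereas the standard library regexp defines \d as [0-9]. Writing the class explicitly as [0-9] sidesteps the difference regardless of engine. The sketch below uses only the standard library to show the two definitions side by side; the helper isUnicodeDigit is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

var (
	perlDigit  = regexp.MustCompile(`^\d$`)    // ASCII-only in the standard library
	asciiDigit = regexp.MustCompile(`^[0-9]$`) // explicit ASCII range
)

// isUnicodeDigit reports whether s is exactly one rune in Unicode category Nd.
func isUnicodeDigit(s string) bool {
	r := []rune(s)
	return len(r) == 1 && unicode.IsDigit(r[0])
}

func main() {
	for _, s := range []string{"7", "\u07c0"} {
		fmt.Printf("%q unicode=%v \\d=%v [0-9]=%v\n",
			s, isUnicodeDigit(s), perlDigit.MatchString(s), asciiDigit.MatchString(s))
	}
}
```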

The best way to get all named captured groups

I'm trying to use this library to get all the named captured groups to a map[string]string.
This is my code:

caps := make(map[string]string)
re, err := regexp2.Compile(pattern, regexp2.RE2)
if err != nil {
	panic(err)
}
names := re.GetGroupNames()
mat, err := re.FindStringMatch(text)
if err != nil {
	panic(err)
}
if mat != nil {
	gps := mat.Groups()
	for i, value := range names {
		if value != strconv.Itoa(i) {
			if len(gps[i].Captures) > 0 {
				caps[value] = gps[i].Captures[0].String()
			}
		}
	}

	fmt.Println(caps)
}

Is this the best way to do it in terms of performance?
First it calls FindStringMatch(), then it calls Groups(), and finally it runs a for loop. That seems like a lot of work. :D

A bug when .* appears in the content to match

The code that caused the error was attached as a screenshot; it returns nil.

Why nil?

The expected result was attached as a second screenshot.

Sample code:

package main

import (
	"fmt"

	"github.com/dlclark/regexp2"
)

func main() {
	r, err := regexp2.Compile(`(?<=1234\.\*56).*(?=890)`, regexp2.Compiled)
	if err != nil {
		panic(err)
	}

	m, err := r.FindStringMatch(`1234.*567890`)
	if err != nil {
		panic(err)
	}

	fmt.Println(m)
}

runtime error: index out of range [<number>] with length <samenumber>

One more that was fuzzed during the night ;)

package main

import (
        "fmt"
        "runtime/debug"

        "github.com/dlclark/regexp2"
)

var testCases = []struct {
        r, s []byte
}{
        {
                r: []byte{0x30, 0x28, 0x3f, 0x3e, 0x28, 0x29, 0x2b, 0x3f, 0x30, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x77},
                s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
        },
        {
                r: []byte{0x28, 0x3f, 0x3e, 0x28, 0x3f, 0x3e, 0x29, 0x2b, 0x3f, 0x3e, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
                s: []byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x3e, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30},
        },
}

func test(r, s []byte) (b bool) {
        defer func() {
                if r := recover(); r != nil {
                        fmt.Println(r)
                        debug.PrintStack()
                        b = true
                }
        }()

        re, err := regexp2.Compile(string(r), regexp2.ECMAScript)
        if err != nil {
                return false
        }
        _, _ = re.FindStringMatch(string(s))
        return false
}

func main() {
        for _, c := range testCases {
                fmt.Printf("Test case regex='%#v', string='%#v' panics\nstring values '%s', '%s'\n",
                        c.r, c.s, string(c.r), string(c.s),
                )
                fmt.Println("#############################################################################")
                if test(c.r, c.s) {
                } else {
                        fmt.Printf("Test case regex='%#v', string='%#v' DOES NOT panic\nstring values '%s', '%s'\n",
                                c.r, c.s, string(c.r), string(c.s),
                        )
                }
        }
}

panics with

Test case regex='[]byte{0x30, 0x28, 0x3f, 0x3e, 0x28, 0x29, 0x2b, 0x3f, 0x30, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x77}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '0(?>()+?0)00000000w', '0000000000000000000'
#############################################################################
runtime error: index out of range [72] with length 72
goroutine 1 [running]:
runtime/debug.Stack(0x36, 0x0, 0x0)
        runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        runtime/debug/stack.go:16 +0x22
main.test.func1(0xc00015be38)
        command-line-arguments/test.go:27 +0x97
panic(0x4f0b40, 0xc0001320e0)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).backtrack(0xc000162000)
        github.com/dlclark/regexp2/runner.go:1033 +0x246
github.com/dlclark/regexp2.(*runner).execute(0xc000162000, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:904 +0x9b
github.com/dlclark/regexp2.(*runner).scan(0xc000162000, 0xc0001340a0, 0x13, 0x14, 0x0, 0x0, 0x7fffffffffffffff, 0x13, 0x14, 0x4490be)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000160080, 0xc00015bd00, 0xffffffffffffffff, 0xc0001340a0, 0x13, 0x14, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9710, 0x13, 0x13, 0x5b9730, 0x13, 0x13, 0x0)
        command-line-arguments/test.go:36 +0x168
main.main()
        command-line-arguments/test.go:46 +0x355
Test case regex='[]byte{0x28, 0x3f, 0x3e, 0x28, 0x3f, 0x3e, 0x29, 0x2b, 0x3f, 0x3e, 0x29, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}', string='[]byte{0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x3e, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30}' panics
string values '(?>(?>)+?>)0000000000', '00000000000000>000000'
#############################################################################
runtime error: index out of range [32] with length 32
goroutine 1 [running]:
runtime/debug.Stack(0x36, 0x0, 0x0)
        runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        runtime/debug/stack.go:16 +0x22
main.test.func1(0xc00015be38)
        command-line-arguments/test.go:27 +0x97
panic(0x4f0b40, 0xc000132160)
        runtime/panic.go:969 +0x166
github.com/dlclark/regexp2.(*runner).popcrawl(...)
        github.com/dlclark/regexp2/runner.go:938
github.com/dlclark/regexp2.(*runner).uncapture(...)
        github.com/dlclark/regexp2/runner.go:1467
github.com/dlclark/regexp2.(*runner).execute(0xc000162100, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:507 +0x408c
github.com/dlclark/regexp2.(*runner).scan(0xc000162100, 0xc000100120, 0x15, 0x18, 0x0, 0x0, 0x7fffffffffffffff, 0x15, 0x18, 0x4490be)
        github.com/dlclark/regexp2/runner.go:144 +0x1c3
github.com/dlclark/regexp2.(*Regexp).run(0xc000160180, 0xc00015bd00, 0xffffffffffffffff, 0xc000100120, 0x15, 0x18, 0x0, 0x0, 0x0)
        github.com/dlclark/regexp2/runner.go:91 +0xf0
github.com/dlclark/regexp2.(*Regexp).FindStringMatch(...)
        github.com/dlclark/regexp2/regexp.go:159
main.test(0x5b9750, 0x15, 0x15, 0x5b9770, 0x15, 0x15, 0x0)
        command-line-arguments/test.go:36 +0x168
main.main()
        command-line-arguments/test.go:46 +0x355

Licensing and specific ATTRIB details

As part of an effort that includes packaging your library for Debian, I'm wondering if it would be possible to have more details about which particular files are covered by each original license.

In particular, could you provide some more details regarding these comments on ATTRIB:

Some of this code is ported from dotnet/corefx, which was released under this license:
...

Small pieces of code are copied from the Go framework under this license:
...

I am aware it might be a bit difficult to retrieve that history, but any insight would be much appreciated in the hopes of making sure licenses and copyright are attributed as faithfully as possible. Thanks in advance!

Add more examples to README

Could you add more examples to the README? There's not a single runnable example of FindStringMatch, or FindNextMatch. I'm trying to use FindStringMatch to capture two capture groups in the below regexp, but the second one doesn't exist. Some more complex examples (find all matches for regexp, extract several capture groups from a match, regexps with lookaheads) on the README would be helpful for debugging. It looks like a really useful library (since it has support for lookahead expressions!) but I'm having a lot of trouble using it due to the lack of documentation.

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main() {
	re := regexp2.MustCompile(`(\b\w+)=(.*?(?=\s\w+=|$))`, 0)
	s := `timestamp=05/Dec/2018:14:39:41 -0500 foo=bar`
	if matches, _ := re.FindStringMatch(s); matches != nil {
		fmt.Printf("Group 0: %v\n", matches.String())
		gps := matches.Groups()
		fmt.Println(gps[1].Captures[0].String()) 
		fmt.Println(gps[0].Captures[1].String()) //why is this capture group nil?
	}
}

Bug with Chinese characters, or incorrect use of match.Index

The following code fails:

package main

import (
	"fmt"
	"github.com/dlclark/regexp2"
)

func main()  {
	regex := regexp2.MustCompile("<style", regexp2.IgnoreCase|regexp2.Singleline)
	match, err := regex.FindStringMatch(sample)
	if err != nil {
		panic(err)
	}
	if match != nil {
		t, err := regex.Replace(sample, "xxx", match.Index, -1)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s", t)
	}
}

var sample = "<title>错<style"

If I successfully search for some words/regex and then replace starting from match.Index instead of -1, the code fails.

However, if the Chinese character is removed, the code succeeds.

So, in this scenario, what should the starting index be if I want to replace all matches without starting from -1 (the beginning)?
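
One possible explanation, stated here as an assumption rather than confirmed behaviour: match.Index counts runes, while the startAt argument of Replace is interpreted as a byte offset into the string. In the sample, 错 occupies three bytes in UTF-8, so rune index 8 (the start of "<style") corresponds to byte offset 10, and byte offset 8 falls in the middle of the character. If that assumption holds, converting the rune index before passing it on would avoid the failure. The helper runeIndexToByteOffset below is hypothetical and uses only the standard library.

```go
package main

import "fmt"

// runeIndexToByteOffset returns the byte offset in s at which the rune with
// the given index starts. An index equal to the number of runes maps to
// len(s); any other out-of-range index yields -1.
func runeIndexToByteOffset(s string, runeIndex int) int {
	if runeIndex < 0 {
		return -1
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune.
	for byteOffset := range s {
		if count == runeIndex {
			return byteOffset
		}
		count++
	}
	if count == runeIndex {
		return len(s)
	}
	return -1
}

func main() {
	sample := "<title>错<style"
	fmt.Println(runeIndexToByteOffset(sample, 8)) // prints 10
}
```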

Force timeout for testing?

I'm trying to force a timeout as part of my unit testing. Unfortunately, the expression gets evaluated too quickly and never times out. Roughly, my code looks like:

https://play.golang.com/p/fuXQh3RdyuO

Example

package main

import (
	"github.com/dlclark/regexp2"
	"testing"
	"time"
)

var regex = regexp2.MustCompile(`\d{4}-\d{2}-\d{2}`, regexp2.None)

func init() {
	regex.MatchTimeout = 1 * time.Second
}

func StringMatches(input string) (bool, error) {
	return regex.MatchString(input)
}

func TestLastIndex(t *testing.T) {
	originalTimeout := regex.MatchTimeout
	regex.MatchTimeout = -1 * time.Nanosecond

	result, err := StringMatches("2023-03-28")
	if result == true {
		t.Error("expected match false due to timeout")
	}
	if err == nil {
		t.Error("expected timeout error")
	}

	regex.MatchTimeout = originalTimeout
}

The only major difference being that my regular expression is a more complicated date time string matcher with named groups.

Is there a way to force the evaluator to timeout for testing? Otherwise, I'm not sure how I can cover the error case of MatchString/FindStringMatch.

\Z does not work in regexp2.RE2 mode

s1  := `^Google\nApple$`
s2  := `^Google\nApple\Z`
data := "Google\nApple\n"
// will get result
re, err := regexp2.Compile(s, regexp2.Singleline)
// will not get result
re, err := regexp2.Compile(s, regexp2.Singleline|regexp2.RE2)

Why?

Request: unicode character class implementations

Thank you very much for the porting!

I checked your library and found that most Unicode character classes have not been implemented yet.

Reference: http://www.fileformat.info/info/unicode/category/index.htm

It looks like fundamental character categories, such as [\p{P}] (= any punctuation), are available:

package main

import (
    "fmt"

    "github.com/dlclark/regexp2"
)

func main() {
    re := regexp2.MustCompile(`(?<=[カキケコ\p{Po}])ん+`, 0) // works
    isMatch, err := re.FindStringMatch(`ブック。んん`)
    if err == nil {
        fmt.Println(isMatch)
    }
}

But most advanced character classes (blocks), such as [\p{Katakana}], have not been implemented:

package main

import (
    "fmt"

    "github.com/dlclark/regexp2"
)

func main() {
    re := regexp2.MustCompile(`(?<=[カキケコ\p{Katakana}])ん+`, 0) // panic with [\p{Katakana}]
    isMatch, err := re.FindStringMatch(`ブック。んん`)
    if err == nil {
        fmt.Println(isMatch)
    }
}

The sample code above causes panic: not implemented.

I hope you will implement them in the future.

Capture.Length undefined, making FindAllString hard to implement

Hello! First, thanks for this great library - this is an impressive feat!

I needed an equivalent function for https://golang.org/pkg/regexp/#Regexp.FindAllString which ideally would be a part of this library, but unfortunately doesn't exist today. I took a stab at implementing it (without the n parameter):

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
			s = s[match.Index+match.Length:]
		}
	}
	return matches
}

At first glance, this seemed correct and appeared to work - however I realized that it in fact is incompatible with unicode because match.Length appears to report length in runes not bytes. I'm not sure whether or not Capture.Index reports bytes or runes either, and the docs don't define this:

    // the position in the original string where the first character of
    // captured substring was found.
    Index int
    // the length of the captured substring.
    Length int

From testing, it appears that Capture.Index oddly is in bytes and not runes. A corrected implementation is:

func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	for {
		match, _ := re.FindStringMatch(s)
		if match == nil {
			break
		} else {
			matches = append(matches, match.String())
-			s = s[match.Index+match.Length:]
+			s = s[match.Index+len(match.String()):]
		}
	}
	return matches
}

This brings me to my points of feedback:

  1. Index in bytes and Length in runes is an odd inconsistency, I imagine they should be the same.
  2. The docstrings should ideally clarify this.
  3. It would be great if the library exposed a FindAllString implementation

Thanks again for the great library!

Leaking go routines using `fastclock`

With the introduction of fastclock, the library spawns a goroutine with a given timeout.

https://github.com/dlclark/regexp2/blob/master/fastclock.go#L75

This timeout is defaulted to "forever".

https://github.com/dlclark/regexp2/blob/master/regexp.go#L22-L32

In unit tests that use uber-go/goleak, this goroutine is reported as a leak.

I am using Chroma, which sets the timeout to 250ms. That is better than never, but the goroutine still shows up as leaked in my quicker tests.

I do not know the solution, but could a way be implemented to make sure this goroutine is killed when it is no longer needed? For example, could we track the number of matches that are using the clock and stop the goroutine as soon as possible once they have all finished?

As someone who is new to this repo, I am not 100% sure. It is just a problem we are hitting now in our unit tests.

CPU usage is too high; how can it be reduced? pprof shows the following

1.31mins 54.98% 54.98% 1.31mins 55.00% github.com/dlclark/regexp2/syntax.CharSet.CharIn
0.25mins 10.34% 65.32% 0.25mins 10.36% github.com/dlclark/regexp2.(*runner).forwardcharnext
0.19mins 8.09% 73.41% 1.75mins 73.45% github.com/dlclark/regexp2.(*runner).findFirstChar
