
joni's Introduction

JRuby - an implementation of the Ruby language on the JVM


About

JRuby is an implementation of the Ruby language using the JVM.

It aims to be a complete, correct and fast implementation of Ruby, while also providing powerful new features such as concurrency without a global interpreter lock, true parallelism, and tight integration with the Java language, so that you can use Java classes in your Ruby program and embed JRuby in a Java application.

You can use JRuby simply as a faster version of Ruby; you can use it to run Ruby on the JVM and access powerful JVM libraries such as highly tuned concurrency primitives; or you can use it to embed Ruby as a scripting language in your Java program, among many other possibilities.

We're a welcoming community - you can talk to us on #jruby on Libera. There are core team members in the EU and US time zones.

Visit the JRuby website and the JRuby wiki for more information.

Getting JRuby

To run JRuby you will need a JRE (the Java VM runtime environment) version 8 or higher.

Your operating system may provide a JRE and JRuby in a package manager, but you may find that this version is very old.

An alternative is to use one of the Ruby version managers.

For rbenv you will need the ruby-build plugin. You may find that your system package manager can provide these. To see which versions of JRuby are available you should run:

$ rbenv install jruby

Note: if you do not regularly git update rbenv this list of versions may be out of date.

We recommend always selecting the latest version of JRuby from the list. You can install that particular version (9.2.13.0 is just for illustration):

$ rbenv install jruby-9.2.13.0

For rvm you can simply do:

$ rvm install jruby

Using Homebrew works too:

$ brew install jruby

You can also download packages from the JRuby website that you can unpack and run in place.

Building JRuby

See BUILDING for information about prerequisites, how to compile JRuby from source and how to test it.

Authors

Stefan Matthias Aust, Anders Bengtsson, Geert Bevin, Ola Bini, Piergiuliano Bossi, Johannes Brodwall, Rocky Burt, Paul Butcher, Benoit Cerrina, Wyss Clemens, David Corbin, Benoit Daloze, Thomas E Enebo, Robert Feldt, Chad Fowler, Russ Freeman, Joey Gibson, Kiel Hodges, Xandy Johnson, Kelvin Liu, Kevin Menard, Alan Moore, Akinori Musha, Charles Nutter, Takashi Okamoto, Jan Arne Petersen, Tobias Reif, David Saff, Subramanya Sastry, Chris Seaton, Nick Sieger, Ed Sinjiashvili, Vladimir Sizikov, Daiki Ueno, Matthias Veit, Jason Voegele, Sergey Yevtushenko, Robert Yokota, and many gracious contributors from the community.

JRuby uses code generously shared by the creator of the Ruby language, Yukihiro Matsumoto.

Project Contact: Thomas E Enebo

License

JRuby is licensed under a tri EPL/GPL/LGPL license. You can use it, redistribute it and/or modify it under the terms of the:

Eclipse Public License version 2.0 OR GNU General Public License version 2 OR GNU Lesser General Public License version 2.1

Some components have other licenses and copyright. See the COPYING file for more specifics.

joni's People

Contributors

anba, angelozerr, arthurscchan, bbrowning, chenzhang22, dependabot[bot], edwardbetts, enebo, haozhun, headius, henrich, henry-thompson, jirkamarsik, joelhockey, jordansissel, kares, kishorkunal-raj, lopex, michaelklishin, nezda, nicksieger, qmx, sebthom

joni's Issues

Joni spins forever on invalid input

If you create a Matcher from a byte array containing invalid UTF-8, the match() method will spin forever due to invalid characters not being handled by ByteCodeMachine. For example, in the method opAnyCharStar():

    while (s < range) {
        ...
        int n = enc.length(bytes, s, end);
        if (s + n > range) {opFail(); return;}
        ...
    }

The enc.length() call returns -1 for malformed input, but this value isn't checked for, so the loop never exits. I haven't looked at this deeply enough to know the correct solution, but there are a ton of calls and none of them are checked.
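As a rough illustration of the kind of check being described, here is a minimal sketch of a scan loop that bails out when the length lookup reports malformed input. The `charLength` helper is a hypothetical stand-in for `enc.length()`, not joni or jcodings code, and it only validates lead bytes and truncation.

```java
class GuardedScan {
    /**
     * Hypothetical stand-in for enc.length(): returns the byte length of the
     * UTF-8 character starting at s, or -1 if the lead byte is invalid or the
     * sequence runs past end. Continuation bytes are not validated here.
     */
    static int charLength(byte[] bytes, int s, int end) {
        int b = bytes[s] & 0xff;
        int n;
        if (b < 0x80) n = 1;
        else if (b >= 0xc2 && b <= 0xdf) n = 2;
        else if (b >= 0xe0 && b <= 0xef) n = 3;
        else if (b >= 0xf0 && b <= 0xf4) n = 4;
        else return -1;
        return s + n <= end ? n : -1;
    }

    /** Counts characters in [s, end), or returns -1 on malformed input. */
    static int countChars(byte[] bytes, int s, int end) {
        int count = 0;
        while (s < end) {
            int n = charLength(bytes, s, end);
            if (n <= 0) return -1; // fail fast so s always moves forward
            s += n;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] ok = {'h', (byte) 0xc3, (byte) 0xa9, 'l', 'l', 'o'}; // "héllo" in UTF-8
        byte[] bad = {0x2f, 0x2f, (byte) 0xaf};                     // stray continuation byte
        System.out.println(countChars(ok, 0, ok.length));
        System.out.println(countChars(bad, 0, bad.length));
    }
}
```

Whether failing the match is the right behaviour for every opcode is a separate question; the sketch only shows that checking the return value guarantees forward progress.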

Joni regex matcher hang.

When I run the following code, the program hangs. Is this a bug?

import org.jcodings.specific.UTF8Encoding;
import org.joni.Matcher;
import org.joni.Option;
import org.joni.Regex;

public class Demo {
    public static void main(String[] args) {
        byte[] str = "m1666666654656dsffddfssubscribeaaaaa_3499_g415780803".getBytes();
        byte[] pattern = "^([a-z0-9]+)+$".getBytes();

        Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        System.out.println("result: " + result);
    }
}
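For context, nested quantifiers such as `([a-z0-9]+)+` give a backtracking matcher many ways to split the same run of characters, which is a common cause of this kind of slowdown. A possible workaround, sketched here with `java.util.regex` rather than joni (an assumption made only to keep the example self-contained), is to flatten the pattern to a single quantifier. The flattened pattern accepts the same strings, but the contents of capture group 1 are no longer available.

```java
import java.util.regex.Pattern;

class NestedQuantifier {
    // Accepts the same strings as ^([a-z0-9]+)+$ but with one quantifier,
    // so there is only one way to consume each run of characters.
    static final Pattern FLAT = Pattern.compile("^[a-z0-9]+$");

    static boolean matches(String s) {
        return FLAT.matcher(s).find();
    }

    public static void main(String[] args) {
        // Input from the report; it contains '_' so no match is expected.
        String input = "m1666666654656dsffddfssubscribeaaaaa_3499_g415780803";
        System.out.println("matches: " + matches(input));
    }
}
```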

regexp causes hang in jruby but terminates in MRI

In MRI 2.6:

% ruby -v 
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-darwin18]
% ruby -e 'puts "foo========:bar baz================================================bingo".scan(/(?:=+=+)+:/)'
========:

With Latest JRuby snapshot:

/tmp/jruby-9.2.8.0-SNAPSHOT % java -version
openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)
/tmp/jruby-9.2.8.0-SNAPSHOT % jruby -v
jruby 9.2.8.0-SNAPSHOT (2.5.3) 2019-07-19 b416404 OpenJDK 64-Bit Server VM 11.0.1+13 on 11.0.1+13 +jit [darwin-x86_64]
/tmp/jruby-9.2.8.0-SNAPSHOT % jruby -e 'puts "foo========:bar baz================================================bingo".scan(/(?:=+=+)+:/)'
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.util.SecurityHelper to field java.lang.reflect.Field.modifiers
WARNING: Please consider reporting this to the maintainers of org.jruby.util.SecurityHelper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

[edit] Also hangs on jruby 1.7.27, 9.2.5.0, 9.2.7.0

how to interrupt hanging thread?

I call thread.interrupt() and nothing happens. How can I stop the hanging thread?

Charset _charset = Charset.forName("GB18030");
/* text containing irregular binary data will make thread hang */
Thread thread = new Thread(new Runnable() {
	@Override
	public void run() {
		try {
			String key = "a"; // any character
			byte[] pattern = key.getBytes(_charset);

			Regex regex = new Regex(pattern, 0, pattern.length, Option.IGNORECASE, GB18030Encoding.INSTANCE);

			byte[] source = new byte[]{0x2f, 0x2f, (byte) 0xaf}; // text content.
			/* Encoded by GB18030, It reads "//�" where � means that "0xaf" is wrong or unsupported? */
			System.out.println(new String(source, _charset));

			Matcher matcher = regex.matcher(source);
			// search Interruptible ?
			int idx=matcher.searchInterruptible(0, source.length, Option.DEFAULT);
			System.out.println(idx+"");
		} catch (InterruptedException e) {
			System.out.println("InterruptedException");
			e.printStackTrace();
		}
	}
});

thread.start();

new Thread(new Runnable() {
	@Override
	public void run() {
		try {
			Thread.sleep(500);
		} catch (InterruptedException e) {
			e.printStackTrace();
		}
		System.out.println("interrupt !!! ");
		thread.interrupt();  // called but not working. 
	}
}).start();
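Some background that may help here: Java interruption is cooperative, so `thread.interrupt()` only sets a flag, and a loop stops only if it checks that flag. The sketch below uses plain JDK classes and a hypothetical `spin()` loop (not joni code) to show the pattern of polling the flag and enforcing a deadline from outside.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class InterruptDemo {
    /** Hypothetical endless loop that polls the interrupt flag every 4096 steps. */
    static long spin() throws InterruptedException {
        long i = 0;
        while (true) {
            i++;
            if ((i & 0xfff) == 0 && Thread.interrupted()) {
                throw new InterruptedException("interrupted after " + i + " steps");
            }
        }
    }

    /** Runs spin() with a deadline; returns true if the worker stopped after cancellation. */
    static boolean runWithTimeout(long millis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<Long> task = InterruptDemo::spin;
        Future<Long> future = pool.submit(task);
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return false; // spin() never returns normally
        } catch (TimeoutException e) {
            future.cancel(true); // delivers Thread.interrupt() to the worker
            pool.shutdown();
            try {
                return pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("stopped: " + runWithTimeout(100));
    }
}
```

If the matching loop never reaches a point where the flag is checked, no amount of calling `interrupt()` from outside will stop it, which would be consistent with the behaviour reported above.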

v2.1.30

org.jcodings.exception.CharacterPropertyException: invalid character property name <graphemeclusterbreak=emodifier>

Recently, the test suite started failing on Debian unstable with the message below:

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.05 s - in org.joni.test.TestU
[INFO] Running org.joni.test.TestU8
Pattern: [/\X/] Str: ["
"] Encoding: [UTF-8] Option: [] Syntax: [TEST]
org.jcodings.exception.CharacterPropertyException: invalid character property name <graphemeclusterbreak=emodifier>
        at org.jruby.jcodings/org.jcodings.unicode.UnicodeEncoding.propertyNameToCType(UnicodeEncoding.java:99)
        at org.jruby.joni/org.joni.Parser$GraphemeNames.nameToCtype(Parser.java:954)
        at org.jruby.joni/org.joni.Parser.parseExtendedGraphemeCluster(Parser.java:1082)
        at org.jruby.joni/org.joni.Parser.parseExp(Parser.java:792)
        at org.jruby.joni/org.joni.Parser.parseBranch(Parser.java:1520)
        at org.jruby.joni/org.joni.Parser.parseSubExp(Parser.java:1546)
        at org.jruby.joni/org.joni.Parser.parseRegexp(Parser.java:1579)
        at org.jruby.joni/org.joni.Analyser.compile(Analyser.java:78)
        at org.jruby.joni/org.joni.Regex.<init>(Regex.java:155)
        at org.jruby.joni/org.joni.Regex.<init>(Regex.java:134)
        at org.jruby.joni/org.joni.test.Test.xx(Test.java:113)
        at org.jruby.joni/org.joni.test.Test.x2s(Test.java:223)
        at org.jruby.joni/org.joni.test.Test.x2s(Test.java:218)
        at org.jruby.joni/org.joni.test.TestU8.test(TestU8.java:112)
        at org.jruby.joni/org.joni.test.Test.testRegexp(Test.java:256)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:52)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
SEVERE ERROR: invalid character property name <graphemeclusterbreak=emodifier>
(snip)

However, it succeeded in October 2018.
Could you give me some advice, please?

Enable GitHub Discussions

GitHub Discussions provides a transparent place to discuss things that don't make as much sense directly as issues. For example:

... but Onigmo hasn't been updated for quite a while. Can this version (used by jruby) include a superset of Ruby features (and fixes)?

  • Q & A / how do I X?
  • djl includes a "Show and tell" area which could be cool. (I'm curious which high profile libraries and systems use joni!)

ArrayIndexOutOfBoundsException for valid input

JONI fails with ArrayIndexOutOfBoundsException for pattern ^show\s*(\b.+\b)\s*vs\s*(\b.+\b)$ and input show c.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
	at io.airlift.jcodings.specific.UTF8Encoding.length(UTF8Encoding.java:35)
	at io.airlift.jcodings.specific.BaseUTF8Encoding.mbcToCode(BaseUTF8Encoding.java:91)
	at io.airlift.jcodings.specific.UTF8Encoding.mbcToCode(UTF8Encoding.java:24)
	at io.airlift.jcodings.Encoding.isMbcWord(Encoding.java:469)
	at io.airlift.joni.ByteCodeMachine.opWordBound(ByteCodeMachine.java:1063)
	at io.airlift.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:239)
	at io.airlift.joni.Matcher.matchCheck(Matcher.java:304)
	at io.airlift.joni.Matcher.searchInterruptible(Matcher.java:457)
	at io.airlift.joni.Matcher.search(Matcher.java:318)

Reproduction:

        byte[] pattern = "^show\\s*(\\b.+\\b)\\s*vs\\s*(\\b.+\\b)$".getBytes(StandardCharsets.UTF_8);
        byte[] str = ("show c").getBytes(StandardCharsets.UTF_8);
        Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, UTF8Encoding.INSTANCE, Syntax.Java);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        System.out.println(result);

Multiline Option with ^ and $ anchors

Hi,

I am struggling to configure the Option passed to the search method correctly when using Syntax.ECMAScript. I would expect that, with Option.DEFAULT / Option.NONE, a regex that uses the ^ and $ anchors and contains no explicit newline would fail to match input containing a newline character. For example:

byte[] pattern = "^[a-z]{1,10}$".getBytes();
byte[] str = "a\nb".getBytes();

Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Matcher matcher = regex.matcher(str);
int result = matcher.search(0, str.length, Option.DEFAULT);

should return -1 but currently returns 0. Even passing Option.SINGLELINE does not change this. To make it work, I had to subtract Option.MULTILINE:

int result = matcher.search(0, str.length, -Option.MULTILINE)

I have tested this case with several online regex tools and with the JavaScript regex implementation in my browser, and the example never matches (as I expect). Only adding the multiline option gives a result similar to Joni's.

Setting the syntax to Java works as expected and gives the same result as this snippet using the built-in Java regex:

String pattern = "^[a-z]{1,10}$";
String str = "a\nb";

Pattern p = Pattern.compile(pattern);
java.util.regex.Matcher m = p.matcher(str);
boolean result = m.find();

Is MULTILINE enabled by default for the library's ECMAScript syntax, and should it be? I looked into ECMAScript, and it appears that multiline = false is the default (the user has to pass the m flag explicitly).
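To make the expected difference concrete, here is a small comparison using `java.util.regex` only (not joni), showing the same pattern and input with and without `Pattern.MULTILINE`. The helper name `find` is just for illustration.

```java
import java.util.regex.Pattern;

class AnchorModes {
    /** Returns true if regex, compiled with the given flags, matches anywhere in input. */
    static boolean find(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "a\nb";
        // Default mode: ^ and $ anchor to the whole input, so nothing matches.
        System.out.println("default:   " + find(regex, 0, input));
        // MULTILINE: ^ and $ also match at line boundaries, so "a" matches.
        System.out.println("multiline: " + find(regex, Pattern.MULTILINE, input));
    }
}
```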

joni interprets `[\w-#]` as `[ !"#0-9A-Z_a-z]`

Joni interprets [\w-#] as [ !"#0-9A-Z_a-z] in both default syntax and Java syntax. Java Pattern interprets it as [-0-9A-Z_a-z].

An additional question: in general, is it considered a bug if the interpretation doesn't match Java's Pattern when the syntax is set to Java?
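For reference, the Java side of this comparison can be checked with a short program. This sketch only exercises `java.util.regex.Pattern`; it makes no claim about what joni does internally.

```java
import java.util.regex.Pattern;

class ClassWithHyphen {
    // In java.util.regex, a '-' that follows a predefined class such as \w
    // cannot start a range, so it is taken as a literal hyphen.
    static final Pattern CLASS = Pattern.compile("[\\w-#]");

    static boolean accepts(String ch) {
        return CLASS.matcher(ch).matches();
    }

    public static void main(String[] args) {
        System.out.println("'-' accepted: " + accepts("-"));
        System.out.println("' ' accepted: " + accepts(" "));
        System.out.println("'!' accepted: " + accepts("!"));
    }
}
```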

namedBackrefIterator throws NPE when there is no named

I wrote the following tests:

import org.jcodings.specific.UTF8Encoding;
import org.joni.Matcher;
import org.joni.Option;
import org.joni.Regex;
import org.junit.Assert;
import org.junit.Test;


public class TestJoni {

    @Test
    public void testWithName() {
        byte[] pattern = "(?<name>a)a*".getBytes();
        byte[] str = "aaa".getBytes();

        Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        
        Assert.assertEquals(0, result);
        Assert.assertNotNull(regex.namedBackrefIterator());
        
    }

    @Test
    public void testNoName() {
        byte[] pattern = "(a)a*".getBytes();
        byte[] str = "aaa".getBytes();

        Regex regex = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        
        Assert.assertEquals(0, result);
        Assert.assertNotNull(regex.namedBackrefIterator());
        
    }

}

The first test (testWithName) succeeds, so my code is correct.
The second fails with an NPE:

java.lang.NullPointerException
	at org.joni.Regex.namedBackrefIterator(Regex.java:260)
	at TestJoni.testNoName(TestJoni.java:35)

I think namedBackrefIterator doesn't handle the case where the pattern has no named groups. It should return an empty iterator (or null) instead.
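The suggested behaviour could look roughly like the sketch below. `NameTable` is a hypothetical stand-in for whatever structure holds the group names, assumed here to be null when the pattern defines none; it is not joni's actual class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class NameTable {
    // Hypothetical name table: group name -> group numbers; null when no names exist.
    private final Map<String, int[]> names;

    NameTable(Map<String, int[]> names) {
        this.names = names;
    }

    /** Returns an iterator over named groups, or an empty iterator when there are none. */
    Iterator<Map.Entry<String, int[]>> namedEntries() {
        if (names == null) {
            return Collections.emptyIterator();
        }
        return names.entrySet().iterator();
    }

    public static void main(String[] args) {
        Map<String, int[]> one = new HashMap<>();
        one.put("name", new int[] {1});
        System.out.println(new NameTable(one).namedEntries().hasNext());
        System.out.println(new NameTable(null).namedEntries().hasNext());
    }
}
```

Returning an empty iterator rather than null lets callers loop without a null check, which is why it is sketched that way here.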

[Q] Is it possible to improve parse speed of the Joni regexp library.

Hello, team.

First of all, thank you very much for making the great JRuby software.

I am currently writing embulk-parser-joni_regexp.
I wanted to use an Oniguruma-compatible regular expression library,
which is why I'm using Joni.

Currently, my Joni code is over three times slower than the java.util.regex library.

My original code is here.
https://github.com/hiroyuki-sato/embulk-parser-joni_regexp/blob/master/src/main/java/org/embulk/parser/joni_regexp/JoniRegexpParserPlugin.java#L91-L119

I wrote test code, regexp_test, to compare regex speed.

The main part is shown below.

Is it possible to improve the parse speed of the Joni regex library?

Thank you for your advice.

Kind regards.

format string

        String format = "^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \\[(?<time>[^\\]]*)\\] \"(?<method>\\S+)(?: +(?<path>[^ ]*) +\\S*)?\" (?<code>[^ ]*) (?<size>[^ ]*)(?: \"(?<referer>[^\\\"]*)\" \"(?<agent>[^\\\"]*)\")?$";

Joni

        byte[] pattern = format.getBytes(StandardCharsets.UTF_8);
        Regex regexp = new Regex(pattern, 0, pattern.length, Option.NONE, UTF8Encoding.INSTANCE);
// ...
            while (true) {
// ...
                byte[] line_bytes = line.getBytes(StandardCharsets.UTF_8);
                Matcher matcher = regexp.matcher(line_bytes);
                int result = matcher.search(0, line_bytes.length, Option.DEFAULT);
// ...
            }

java.util.regex

        Pattern pattern = Pattern.compile(format);

            while (true) {
// ...
                Matcher matcher = pattern.matcher(line);
                if (matcher.matches()) {
// ...
                }
            }
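To show what the format string above is expected to extract, here is a small worked example using `java.util.regex`. The sample log line is made up for illustration and is not taken from the benchmark data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogFormat {
    // Format string copied from the report above.
    static final Pattern FORMAT = Pattern.compile(
        "^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \\[(?<time>[^\\]]*)\\] \"(?<method>\\S+)(?: +(?<path>[^ ]*) +\\S*)?\" (?<code>[^ ]*) (?<size>[^ ]*)(?: \"(?<referer>[^\\\"]*)\" \"(?<agent>[^\\\"]*)\")?$");

    /** Returns the named field of a log line, or null if the line or the field does not match. */
    static String field(String line, String name) {
        Matcher m = FORMAT.matcher(line);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, for illustration only.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(field(line, "host"));
        System.out.println(field(line, "path"));
        System.out.println(field(line, "code"));
    }
}
```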

ArrayIndexOutOfBoundsException for grapheme_clusters

Not sure if I should report this here or to JRuby but the error seems to come from joni:

$ bin/jruby -e 'p [0xA4].pack("C").force_encoding("UTF-8").grapheme_clusters'
Unhandled Java exception: java.lang.ArrayIndexOutOfBoundsException: -1
java.lang.ArrayIndexOutOfBoundsException: -1
                          length at org/jcodings/specific/UTF8Encoding.java:30
                       isMbcHead at org/jcodings/Encoding.java:497
                      opCClassMB at org/joni/ByteCodeMachine.java:793
                         execute at org/joni/ByteCodeMachine.java:203
                         matchAt at org/joni/ByteCodeMachine.java:167
                     matchCommon at org/joni/Matcher.java:115
                           match at org/joni/Matcher.java:92
       enumerateGraphemeClusters at org/jruby/RubyString.java:5859
               grapheme_clusters at org/jruby/RubyString.java:5872
                            call at org/jruby/RubyString$INVOKER$i$0$0$grapheme_clusters.gen:-1
                            call at org/jruby/internal/runtime/methods/JavaMethod.java:309
                    cacheAndCall at org/jruby/runtime/callsite/CachingCallSite.java:323
                            call at org/jruby/runtime/callsite/CachingCallSite.java:139
  invokeOther5:grapheme_clusters at -e:1
                          <main> at -e:1
             invokeWithArguments at java/lang/invoke/MethodHandle.java:627
                            load at org/jruby/ir/Compiler.java:94
                       runScript at org/jruby/Ruby.java:852
                     runNormally at org/jruby/Ruby.java:771
                     runNormally at org/jruby/Ruby.java:789
                     runFromMain at org/jruby/Ruby.java:601
                   doRunFromMain at org/jruby/Main.java:415
                     internalRun at org/jruby/Main.java:307
                             run at org/jruby/Main.java:234
                            main at org/jruby/Main.java:206

The new spec spec/ruby/core/string/shared/grapheme_clusters.rb added in jruby/jruby#5385 fails due to this.

Temporarily commented out 'character class has duplicated range'

In 2.1.14 we updated some data and this uncovered some issues with joni and JRuby interactions involving warnings. The main visible issue is some regexps are generating the warning:

character class has duplicated range

This warning sometimes comes from internal expansions (like \X). If an expansion duplicates a range internally, we definitely do not want end users to be warned. We fixed one case where we were making a regexp UTF-8 when it shouldn't have been, but we are still seeing some other missing cases.

Joni's design compounds this issue because some constructor paths use the DEFAULT WarnCallback, which is literally a System.err.println() call. This means we cannot change anything in JRuby specifically to avoid this potentially being used, since not all joni Regex code is from JRuby core. We also have native extension authors who might be calling a constructor using DEFAULT.

That probably was not a super clear description but the solution should be reasonably easy to follow:

  • uncomment warn for 'character class has duplicated range' (in ScanEnvironment)
  • Add ability to register a default WarnCallback handler
  • (on jruby side) use this new register API

Additional things to do:

  • (on jruby side) audit all regex constructors and figure out where our remaining duplicated class warnings are coming from
  • augment joni warning to provide the actual regexp which is generating the warning (MRI does print out the failing regexp). But warn(message, regexp) would be a great API for debugging issues like this so we should change joni to be like that.

\g with non-existent subexpression name leads to java.lang.StringIndexOutOfBoundsException

If the subexpression name doesn't exist (e.g. "\\gA"), a java.lang.StringIndexOutOfBoundsException is thrown:

offset 4, count 7, length 8 (java.lang.StringIndexOutOfBoundsException)
	from java.lang.String.checkBoundsOffCount(String.java:4587)
	from java.lang.String.<init>(String.java:523)
	from java.lang.String.<init>(String.java:1413)
	from org.joni.Lexer.syntaxWarn(Lexer.java:1327)
	from org.joni.Lexer.fetchTokenFor_subexpCall(Lexer.java:916)
	from org.joni.Lexer.fetchToken(Lexer.java:1152)
	from org.joni.Parser.parseRegexp(Parser.java:1383)
	from org.joni.Analyser.compile(Analyser.java:78)
	from org.joni.Regex.<init>(Regex.java:155)
	from org.joni.Regex.<init>(Regex.java:134)
...

Notes

It seems the issue is in the following code:

// src/org/joni/Lexer.java
    protected final void syntaxWarn(String message) {
        if (env.warnings != WarnCallback.NONE) {
            env.warnings.warn(message + ": /" + new String(bytes, getBegin(), getEnd()) + "/");
        }
    }

new String(bytes, getBegin(), getEnd()) should be replaced with new String(bytes, getBegin(), getEnd() - getBegin()), because the String constructor takes offset and length arguments rather than start and end indices.
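The offset/length contract can be demonstrated in isolation. In this sketch the byte array and the begin/end values are chosen to mirror the numbers in the stack trace (offset 4, count 7, length 8); the helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class SubstringBounds {
    /** Decodes bytes[begin, end) by converting the end index to a length. */
    static String slice(byte[] bytes, int begin, int end) {
        return new String(bytes, begin, end - begin, StandardCharsets.UTF_8);
    }

    /** Returns true if passing the end index as the length argument throws. */
    static boolean misuseThrows(byte[] bytes, int begin, int end) {
        try {
            new String(bytes, begin, end, StandardCharsets.UTF_8);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "abcd\\gAx".getBytes(StandardCharsets.UTF_8); // 8 bytes
        System.out.println(slice(bytes, 4, 7));        // the three bytes at indices 4..6
        System.out.println(misuseThrows(bytes, 4, 7)); // 4 + 7 exceeds the array length
    }
}
```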

Copyright issue

Hi, I'd like to use this library, but I can't find the copyright notice.
I think this library is under the MIT license, so there should be a 'Copyright (c) ' text.

Please tell me where I can find the copyright text.

Support for matches, replaceAll and split

Hi,
I'd like to know if there are any plans to support the matches and replaceAll operations that java.util.regex has, as well as split, which is a frequently used operation.

Guava has split, but it is based on java.util.regex.
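In the meantime, split can be layered on top of any search primitive that reports match start and end positions. The sketch below uses `java.util.regex.Matcher.find()` as a stand-in for such a primitive (an assumption for the sake of a runnable example); the same loop shape should carry over to a byte-oriented matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchSplit {
    /** Splits input around non-empty matches of p, keeping trailing empty pieces. */
    static List<String> split(Pattern p, String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = p.matcher(input);
        int last = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore empty matches; find() advances past them itself
            }
            parts.add(input.substring(last, m.start()));
            last = m.end();
        }
        parts.add(input.substring(last)); // remainder after the final separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split(Pattern.compile(",\\s*"), "a, b,c"));
    }
}
```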

Thanks.

Graphemes are not matched correctly using \X

Testing the letter à in the form of a grapheme cluster encoded as U+0061 U+0300 with Ruby MRI (2.3.1 here, but the version doesn't matter), \X matches the grapheme:

$ irb
2.3.1 :001 > x = "h\u0061\u0300llo"
 => "hàllo" 
2.3.1 :002 > x =~ /h\Xllo/
 => 0 

The match fails when testing the same thing using JRuby:

$ irb
jruby-9.1.7.0 :001 > x = "h\u0061\u0300llo"
 => "hàllo"
jruby-9.1.7.0 :002 > x =~ /h\Xllo/
 => nil 
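As a point of comparison on the JVM side, the JDK's own `java.text.BreakIterator` treats a base letter followed by a combining mark as a single user-perceived character. This sketch is independent of joni and only illustrates the expected grouping for the string used above.

```java
import java.text.BreakIterator;

class GraphemeCount {
    /** Counts user-perceived characters using the JDK character break iterator. */
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String x = "h\u0061\u0300llo"; // 'a' followed by a combining grave accent
        System.out.println("UTF-16 length: " + x.length());
        System.out.println("clusters:      " + count(x));
    }
}
```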

Failed to parse textmate regex: invalid pattern in look-behind

There is a pattern that matches a string from 0 to 71 characters long: (?<=^.{0,71}).
I use it in many places.
As I discovered, look-behind/look-ahead cannot be combined with the {min,max} quantifier.
Is there a workaround or any planned development for this? Thank you.

It's slow (hangs and causes OutOfMemoryError) in certain cases

        <dependency>
            <groupId>org.jruby.joni</groupId>
            <artifactId>joni</artifactId>
            <version>2.1.30</version>
        </dependency>

code:

	/** A half of a 32kb binary text block encoded in GB18030 among which I want to execute regex search */
	final static byte[] Data =new byte[]{
			(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x41,(byte)0x62,(byte)0x6f,(byte)0x75,(byte)0x74,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0x7f,(byte)0x41,(byte)0x62,(byte)0x72,(byte)0x69,(byte)0x20,(byte)0x48,(byte)0x65,(byte)0x72,(byte)0x62,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0x88,(byte)0x41,(byte)0x63,(byte)0x61,(byte)0x63,(byte)0x69,(byte)0x61,(byte)0x20,(byte)0x63,(byte)0x61,(byte)0x74,(byte)0x65,(byte)0x63,(byte)0x68,(byte)0x75,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0x8f,(byte)0x41,(byte)0x63,(byte)0x61,(byte)0x6e,(byte)0x74,(byte)0x68,(byte)0x6f,(byte)0x70,(byte)0x61,(byte)0x6e,(byte)0x61,(byte)0x63,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x63,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x43,(byte)0x6f,(byte)0x72,(byte)0x74,(byte)0x65,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0x98,(byte)0x41,(byte)0x63,(byte)0x61,(byte)0x6e,(byte)0x74,(byte)0x68,(byte)0x6f,(byte)0x70,(byte)0x61,(byte)0x6e,(byte)0x61,(byte)0x63,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x53,(byte)0x65,(byte)0x6e,(byte)0x74,(byte)0x69,(byte)0x63,(byte)0x6f,(byte)0x73,(byte)0x69,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x20,(byte)0x65,(byte)0x74,(byte)0x20,(byte)0x43,(byte)0x61,(byte)0x75,(byte)0x6c,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xa1,(byte)0x41,(byte)0x63,(byte)0x6f,(byte)0x6e,(byte)0x69,(byte)0x74,(byte)0x69,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xa8,(byte)0x41,(byte)0x63,(byte)0x6f,(byte)0x6e,(byte)0x69,(byte)0x74,(byte)0x69,(byte)
0x20,(byte)0x54,(byte)0x75,(byte)0x62,(byte)0x65,(byte)0x72,(byte)0x20,(byte)0x4c,(byte)0x61,(byte)0x74,(byte)0x65,(byte)0x72,(byte)0x61,(byte)0x6c,(byte)0x65,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xaf,(byte)0x41,(byte)0x63,(byte)0x6f,(byte)0x6e,(byte)0x69,(byte)0x74,(byte)0x75,(byte)0x6d,(byte)0x20,(byte)0x62,(byte)0x72,(byte)0x61,(byte)0x63,(byte)0x68,(byte)0x79,(byte)0x70,(byte)0x6f,(byte)0x64,(byte)0x75,(byte)0x6d,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xbc,(byte)0x41,(byte)0x63,(byte)0x6f,(byte)0x72,(byte)0x69,(byte)0x20,(byte)0x52,(byte)0x68,(byte)0x69,(byte)0x7a,(byte)0x6f,(byte)0x6d,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xc5,(byte)0x41,(byte)0x63,(byte)0x74,(byte)0x69,(byte)0x6e,(byte)0x6f,(byte)0x6c,(byte)0x69,(byte)0x74,(byte)0x75,(byte)0x6d,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xce,(byte)0x41,(byte)0x63,(byte)0x79,(byte)0x72,(byte)0x61,(byte)0x6e,(byte)0x74,(byte)0x68,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x42,(byte)0x69,(byte)0x64,(byte)0x65,(byte)0x6e,(byte)0x74,(byte)0x61,(byte)0x74,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xd5,(byte)0x41,(byte)0x64,(byte)0x65,(byte)0x6e,(byte)0x6f,(byte)0x70,(byte)0x68,(byte)0x6f,(byte)0x72,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xde,(byte)0x41,(byte)0x65,(byte)0x73,(byte)0x63,(byte)0x75,(byte)0x6c,(byte)0x69,(byte)0x20,(byte)0x46,(byte)0x72,(byte)0x75,(byte)0x63,(byte)0x74,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xe7,(byte)0x41,(byt
e)0x67,(byte)0x61,(byte)0x73,(byte)0x74,(byte)0x61,(byte)0x63,(byte)0x68,(byte)0x65,(byte)0x73,(byte)0x20,(byte)0x73,(byte)0x65,(byte)0x75,(byte)0x20,(byte)0x50,(byte)0x6f,(byte)0x67,(byte)0x6f,(byte)0x73,(byte)0x74,(byte)0x65,(byte)0x6d,(byte)0x69,(byte)0x20,(byte)0x48,(byte)0x65,(byte)0x72,(byte)0x62,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xee,(byte)0x41,(byte)0x67,(byte)0x6b,(byte)0x69,(byte)0x73,(byte)0x74,(byte)0x72,(byte)0x6f,(byte)0x64,(byte)0x6f,(byte)0x6e,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xf5,(byte)0x41,(byte)0x67,(byte)0x72,(byte)0x69,(byte)0x6d,(byte)0x6f,(byte)0x6e,(byte)0x69,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x48,(byte)0x65,(byte)0x72,(byte)0x62,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0e,(byte)0xfe,(byte)0x41,(byte)0x67,(byte)0x72,(byte)0x69,(byte)0x6d,(byte)0x6f,(byte)0x6e,(byte)0x69,(byte)0x61,(byte)0x20,(byte)0x70,(byte)0x69,(byte)0x6c,(byte)0x6f,(byte)0x73,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x07,(byte)0x41,(byte)0x69,(byte)0x64,(byte)0x69,(byte)0x63,(byte)0x68,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x10,(byte)0x41,(byte)0x69,(byte)0x6c,(byte)0x61,(byte)0x6e,(byte)0x74,(byte)0x68,(byte)0x69,(byte)0x20,(byte)0x43,(byte)0x6f,(byte)0x72,(byte)0x74,(byte)0x65,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x17,(byte)0x41,(byte)0x69,(byte)0x79,(byte)0x65,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x1e,(byte)0x41,(byte)0x6c,(byte)0x62,(byte)0x69,(byte)0x7a,(byte)0x7a,(byte)0x69,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x43,(byte)0x6f,(byte)0x72,(byte)0x74,(byte)0x65,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(b
yte)0x00,(byte)0x00,(byte)0x0f,(byte)0x27,(byte)0x41,(byte)0x6c,(byte)0x67,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x54,(byte)0x68,(byte)0x61,(byte)0x6c,(byte)0x6c,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x2e,(byte)0x41,(byte)0x6c,(byte)0x69,(byte)0x73,(byte)0x6d,(byte)0x61,(byte)0x74,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x52,(byte)0x68,(byte)0x69,(byte)0x7a,(byte)0x6f,(byte)0x6d,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x35,(byte)0x41,(byte)0x6c,(byte)0x6c,(byte)0x69,(byte)0x69,(byte)0x20,(byte)0x42,(byte)0x75,(byte)0x6c,(byte)0x62,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x3c,(byte)0x41,(byte)0x6c,(byte)0x6c,(byte)0x69,(byte)0x69,(byte)0x20,(byte)0x46,(byte)0x69,(byte)0x73,(byte)0x74,(byte)0x75,(byte)0x6c,(byte)0x6f,(byte)0x73,(byte)0x69,(byte)0x20,(byte)0x42,(byte)0x75,(byte)0x6c,(byte)0x62,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x43,(byte)0x41,(byte)0x6c,(byte)0x6c,(byte)0x69,(byte)0x69,(byte)0x20,(byte)0x53,(byte)0x61,(byte)0x74,(byte)0x69,(byte)0x76,(byte)0x69,(byte)0x20,(byte)0x42,(byte)0x75,(byte)0x6c,(byte)0x62,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x4a,(byte)0x41,(byte)0x6c,(byte)0x6c,(byte)0x69,(byte)0x69,(byte)0x20,(byte)0x54,(byte)0x75,(byte)0x62,(byte)0x65,(byte)0x72,(byte)0x6f,(byte)0x73,(byte)0x69,(byte)0x20,(byte)0x53,(byte)0x65,(byte)0x6d,(byte)0x65,(byte)0x6e,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x53,(byte)0x41,(byte)0x6c,(byte)0x6f,(byte)0x65,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x5a,(byte)0x41,(byte)0x6c,(byte)0x70,(byte)0x69,(byte)0x6e,(byte)0x69,(byte)0x61,(byte)0x65,
(byte)0x20,(byte)0x4b,(byte)0x61,(byte)0x74,(byte)0x73,(byte)0x75,(byte)0x6d,(byte)0x61,(byte)0x64,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x53,(byte)0x65,(byte)0x6d,(byte)0x65,(byte)0x6e,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x63,(byte)0x41,(byte)0x6c,(byte)0x70,(byte)0x69,(byte)0x6e,(byte)0x69,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x4f,(byte)0x66,(byte)0x66,(byte)0x69,(byte)0x63,(byte)0x69,(byte)0x6e,(byte)0x61,(byte)0x72,(byte)0x75,(byte)0x6d,(byte)0x20,(byte)0x52,(byte)0x68,(byte)0x69,(byte)0x7a,(byte)0x6f,(byte)0x6d,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x6c,(byte)0x41,(byte)0x6c,(byte)0x70,(byte)0x69,(byte)0x6e,(byte)0x69,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x4f,(byte)0x78,(byte)0x79,(byte)0x70,(byte)0x68,(byte)0x79,(byte)0x6c,(byte)0x6c,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x46,(byte)0x72,(byte)0x75,(byte)0x63,(byte)0x74,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x75,(byte)0x41,(byte)0x6c,(byte)0x75,(byte)0x6d,(byte)0x65,(byte)0x6e,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x7c,(byte)0x41,(byte)0x6d,(byte)0x6f,(byte)0x6d,(byte)0x69,(byte)0x20,(byte)0x43,(byte)0x61,(byte)0x72,(byte)0x64,(byte)0x61,(byte)0x6d,(byte)0x6f,(byte)0x6d,(byte)0x69,(byte)0x20,(byte)0x46,(byte)0x72,(byte)0x75,(byte)0x63,(byte)0x74,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x83,(byte)0x41,(byte)0x6d,(byte)0x6f,(byte)0x6d,(byte)0x69,(byte)0x20,(byte)0x53,(byte)0x65,(byte)0x6d,(byte)0x65,(byte)0x6e,(byte)0x20,(byte)0x73,(byte)0x65,(byte)0x75,(byte)0x20,(byte)0x46,(byte)0x72,(byte)0x75,(byte)0x63,(byte)0x74,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x8a,(byte)0x41,(byte)0x6d,(byte)0x6f,(byte)0x6
d,(byte)0x69,(byte)0x20,(byte)0x54,(byte)0x73,(byte)0x61,(byte)0x6f,(byte)0x2d,(byte)0x6b,(byte)0x6f,(byte)0x20,(byte)0x46,(byte)0x72,(byte)0x75,(byte)0x63,(byte)0x74,(byte)0x75,(byte)0x73,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x91,(byte)0x41,(byte)0x6d,(byte)0x70,(byte)0x65,(byte)0x6c,(byte)0x6f,(byte)0x70,(byte)0x73,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x98,(byte)0x41,(byte)0x6d,(byte)0x79,(byte)0x64,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x43,(byte)0x61,(byte)0x72,(byte)0x61,(byte)0x70,(byte)0x61,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0x9f,(byte)0x41,(byte)0x6e,(byte)0x64,(byte)0x72,(byte)0x6f,(byte)0x67,(byte)0x72,(byte)0x61,(byte)0x70,(byte)0x68,(byte)0x69,(byte)0x64,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x48,(byte)0x65,(byte)0x72,(byte)0x62,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xa8,(byte)0x41,(byte)0x6e,(byte)0x65,(byte)0x6d,(byte)0x61,(byte)0x72,(byte)0x72,(byte)0x68,(byte)0x65,(byte)0x6e,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x52,(byte)0x68,(byte)0x69,(byte)0x7a,(byte)0x6f,(byte)0x6d,(byte)0x61,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xaf,(byte)0x41,(byte)0x6e,(byte)0x67,(byte)0x65,(byte)0x6c,(byte)0x69,(byte)0x63,(byte)0x61,(byte)0x20,(byte)0x44,(byte)0x61,(byte)0x68,(byte)0x75,(byte)0x72,(byte)0x69,(byte)0x63,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xb6,(byte)0x41,(byte)0x6e,(byte)0x67,(byte)0x65,(byte)0x6c,(byte)0x69,(byte)0x63,(byte)0x61,(byte)0x20,(byte)0x44,(byte)0x75,(byte)0x68,(byte)0x75,(byte)0x6f,(byte)0x20,(byte)0x52,(byte)0
x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xbd,(byte)0x41,(byte)0x6e,(byte)0x67,(byte)0x65,(byte)0x6c,(byte)0x69,(byte)0x63,(byte)0x61,(byte)0x65,(byte)0x20,(byte)0x53,(byte)0x69,(byte)0x6e,(byte)0x65,(byte)0x6e,(byte)0x73,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x52,(byte)0x61,(byte)0x64,(byte)0x69,(byte)0x78,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xc4,(byte)0x41,(byte)0x6e,(byte)0x74,(byte)0x65,(byte)0x6c,(byte)0x6f,(byte)0x70,(byte)0x69,(byte)0x73,(byte)0x20,(byte)0x43,(byte)0x6f,(byte)0x72,(byte)0x6e,(byte)0x75,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xcd,(byte)0x41,(byte)0x70,(byte)0x6f,(byte)0x63,(byte)0x79,(byte)0x6e,(byte)0x75,(byte)0x6d,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x00,(byte)0x0f,(byte)0xd6,(byte)0x41,(byte)0x50,(byte)0x55,(byte)0x44,(byte)0xcf,(byte)0xb5,(byte)0xcd,(byte)0xb3,(byte)0xd6,(byte)0xd7,(byte)0xc1,(byte)0xf6,(byte)0xa3,(byte)0xa8,(byte)0xb2,(byte)0xa1,(byte)0xc0,(byte)0xed,(byte)0xd1,(byte)0xa7,(byte)0xa3,(byte)0xa9,(byte)0x00,
	};

	public static void main(String[] args) throws Exception {
		byte[] pattern = ".*happy".getBytes();
		Regex regex = new Regex(pattern, 0, pattern.length, Option.IGNORECASE, UTF8Encoding.INSTANCE);
		Matcher matcher = regex.matcher(data);
		try {
			// Attempt a match at offset 1177, limited to offset 1199.
			System.out.println(matcher.match(1177, 1199, Option.DEFAULT));
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

The range to match is only 1199 - 1177 = 22 bytes, so the call should return very quickly.

Valid UTF-8 input can cause infinite loop in JONI

In #7, @electrum identified a location that can cause an infinite loop in JONI. It was marked as won't fix because input can be sanitized beforehand and JONI assumes that the input is always valid.

When the pattern is "\uD8000", it can be pre-sanitized, as you suggested in #7. What if the pattern is "\\uD800"? How can the user sanitize it?

If JONI is willing to add a check, it would be the same fix for #7, checking whether the return value of enc.length is negative in OptExactInfo.concatStr.
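
To make the sanitizing argument concrete, here is a minimal sketch in plain Java of what a caller-side check could look like. It does not use JONI; the class and method names are hypothetical. It flags strings that contain an unpaired surrogate, and it also shows why an escaped sequence written out in the pattern text slips past such a check: the text consists only of ordinary ASCII characters.

```java
final class SurrogateCheck {
    /** Returns true if s contains a surrogate that is not part of a valid high/low pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String raw = String.valueOf((char) 0xD800); // one unpaired high surrogate
        String escaped = "\\" + "uD800";             // backslash, 'u', 'D', '8', '0', '0'
        System.out.println(hasUnpairedSurrogate(raw));     // true
        System.out.println(hasUnpairedSurrogate(escaped)); // false
    }
}
```

The second result is the point of the question above: a check on the pattern's characters cannot see a code point that only comes into existence when the regex parser interprets the escape.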

Support for specifying \G position

In Onigmo, it is possible to specify the position of \G by calling the function onig_search_gpos. I have implemented this functionality in my fork here.

If you think this could be a useful addition for Joni as well, I'm very happy to open a PR, and happy to make any modifications beforehand. Please just let me know.

I'm not sure what API you would want, but for now I have implemented it as overloaded implementations of Matcher#search and Matcher#searchInterruptible:

search(int gpos, int start, int range, int option) and searchInterruptible(int gpos, int start, int range, int option).

Thanks!
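
For readers less familiar with \G, the sketch below illustrates what the anchor does. It uses java.util.regex as a stand-in, not Joni or the proposed overloads, and the class and method names are made up for illustration. In java.util.regex, \G is tied to the end of the previous match (or the start of the search region on the first attempt) and cannot be set independently, which is the kind of control the gpos parameter is meant to provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GAnchorDemo {
    /**
     * Collects consecutive "digits plus optional comma" tokens starting at 'from'.
     * Each match must begin exactly where the previous one ended because of \G.
     */
    static List<String> contiguousTokens(String input, int from) {
        Pattern p = Pattern.compile("\\G\\d+,?");
        Matcher m = p.matcher(input);
        m.region(from, input.length()); // the first \G position is the region start
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Scanning stops at 'x': the digits after it are not contiguous with the last match.
        System.out.println(contiguousTokens("12,34,x56", 0)); // [12,, 34,]
    }
}
```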

Enable ability to escape from combinatorial explosion early

Joni's look-ahead/look-behind evaluation can end up in large recursive loops while matching, which causes problems such as elastic/elasticsearch#28731.

It would be nice to be able to enable Config.CEC so that the heuristic combinatorial explosion checks can be applied and keep certain matches from blowing up.

The ability to interrupt the engine thread is nice, but it would be great if one did not have to spawn a timer on a separate thread just to watch the engine and prevent it from taking up too many resources.

Any thoughts?
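
As an illustration of bounding work without a watchdog thread, here is a rough sketch built on java.util.regex rather than Joni. It is not Config.CEC and not a Joni API; every name in it is hypothetical. It wraps the input in a CharSequence that counts character reads and aborts the match, on the calling thread, once a budget is exceeded. Character reads are only a crude proxy for engine steps.

```java
import java.util.regex.Pattern;

final class BudgetedMatch {
    /** Hypothetical exception: the matcher read more characters than allowed. */
    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(long budget) {
            super("regex read budget of " + budget + " exceeded");
        }
    }

    /** CharSequence view that counts charAt calls and aborts once the budget is spent. */
    static final class BudgetedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long budget;
        private final long[] used; // shared with sub-sequences so the budget is global

        BudgetedCharSequence(CharSequence inner, long budget) {
            this(inner, budget, new long[1]);
        }

        private BudgetedCharSequence(CharSequence inner, long budget, long[] used) {
            this.inner = inner;
            this.budget = budget;
            this.used = used;
        }

        @Override
        public char charAt(int index) {
            if (++used[0] > budget) {
                throw new BudgetExceededException(budget);
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new BudgetedCharSequence(inner.subSequence(start, end), budget, used);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs a full match, throwing BudgetExceededException if it reads too many characters. */
    static boolean matchesWithinBudget(Pattern pattern, CharSequence input, long budget) {
        return pattern.matcher(new BudgetedCharSequence(input, budget)).matches();
    }
}
```

A caller would catch BudgetExceededException and treat it as "no result within the allowed effort". A check inside the engine, as requested above, could be both cheaper and more precise than this wrapper.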

joni seems to be 1.5 times slower than simple JNI bindings

Steps to reproduce

  1. onig4j-v003-src.zip
  2. Update jni/Makefile with proper JAVA_HOME and then call make
  3. Update lib location in src/onig4j/OnigRegex.java
  4. Run OnigPerformanceTest

We got the following results:
java: 4261ms
joni: 5798ms
onig: 3511ms
tm4e: 18ms

With a straightforward approach, joni is about 1.5 times slower than the Oniguruma bindings (5798 ms vs. 3511 ms, a ratio of roughly 1.65).

tm4e's large advantage seems to come from src/org/eclipse/tm4e/core/internal/oniguruma/OnigRegExp.java:49: if a regexp is called consecutively on the same string, it just returns the latest cached match result.
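
The caching effect described above can be sketched as follows. This is a simplified illustration using java.util.regex, not tm4e's actual code; the class, its fields, and the engineCalls counter are hypothetical. A benchmark that repeats the same search on the same string object would hit the cache on every call after the first, so it would mostly measure the cache lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastMatchCache {
    private final Pattern pattern;
    private String lastInput;   // compared by identity, so the check is O(1)
    private int lastStart;
    private int lastResult;
    private boolean hasCached;
    int engineCalls;            // how often the regex engine actually ran

    LastMatchCache(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Returns the start index of the first match at or after 'start', or -1 if none. */
    int search(String input, int start) {
        if (hasCached && input == lastInput && start == lastStart) {
            return lastResult; // same string object and position: reuse the result
        }
        engineCalls++;
        Matcher m = pattern.matcher(input);
        int result = m.find(start) ? m.start() : -1;
        lastInput = input;
        lastStart = start;
        lastResult = result;
        hasCached = true;
        return result;
    }
}
```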

Region named capture (-1--1)

Hi,

Can you give me some feedback on this issue?

I'm trying to match named capture groups in multiline byte[] content and I'm getting a -1 -1 index range for group 2 with pattern (A).

Pattern A is: (?<loss>[0-9.]{1,5}%)|(?<rtt>dev = .*)

However, pattern (B) works fine for content without line breaks (\n).

Pattern B is: (?<loss>[0-9.]{1,5}%).*(?<rtt>dev = .*)

Debug regex on: https://regex101.com/r/y2ER1a/1

Content (with multiline) is:
Content >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=57 time=12.934 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=13.145 ms

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 12.934/13.040/13.145/0.106 ms
Content <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

My debug output is:

result = 226
D region Region:
0: (226-230) 1: (226-230) 2: (-1--1)
D nameEntry loss 1
loss -> 226, 230
0.0%
D nameEntry rtt 2
java.lang.StringIndexOutOfBoundsException: String index out of range: -1

My code is:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.jcodings.specific.UTF8Encoding;
import org.joni.Matcher;
import org.joni.NameEntry;
import org.joni.Option;
import org.joni.Regex;
import org.joni.Region;

public class RegexOutputPluginHandler {

	private byte[] patternBytes;
	private Regex regex = null;

	public RegexOutputPluginHandler() {
		String pattern = "(?<loss>[0-9\\.]{1,5}%)|(?<rtt>dev = .*)";

		if (pattern != null) {
			this.patternBytes = pattern.getBytes();
			this.regex = new Regex(this.patternBytes, 0, this.patternBytes.length, Option.MULTILINE, UTF8Encoding.INSTANCE);
		}
	}

	public Map<String, Object> extract(byte[] content) {
		System.err.println("D content " + content);

		if (content == null) {
			return null;
		}

		System.err.println("D content len " + content.length);

		Map<String, Object> fields = new HashMap<String, Object>();

		Matcher matcher = regex.matcher(content);
		int result = matcher.search(0, content.length, Option.MULTILINE);

		System.out.println("result = " + result);

		if (result != -1) {
			Region region = matcher.getEagerRegion();

			System.out.println("D region " + region.toString());

			for (Iterator<NameEntry> entry = regex.namedBackrefIterator(); entry.hasNext();) {
				NameEntry e = entry.next();

				System.out.println("D nameEntry " + e.toString());

				int number = e.getBackRefs()[0]; // can have many refs per name
				int begin = region.beg[number];
				int end = region.end[number];

				String fieldName = new String(e.name, e.nameP, e.nameEnd - e.nameP);
				// Throws StringIndexOutOfBoundsException when the group did not participate (begin == -1).
				String fieldContent = new String(content, begin, end - begin);

				System.out.println(fieldName + " -> " + begin + ", " + end);
				System.out.println(fieldContent);
			}
		} else {
			System.err.println("D matcher none");
		}

		return fields;
	}
}
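
A note on the likely cause, with a small sketch. In pattern A the two named groups are alternatives, so a single search can fill only one of them. The other group is reported as (-1--1), which matches the debug output above, and passing -1 to new String(content, begin, end - begin) then raises the exception. The sketch below reproduces this with java.util.regex instead of Joni (an assumption made to keep it self-contained; the class name is hypothetical). It searches repeatedly and skips groups that did not participate. The analogous guard in the code above would be to skip entries whose region.beg[number] is -1 and to search again from the end of the previous match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationGroups {
    static final Pattern PATTERN =
            Pattern.compile("(?<loss>[0-9.]{1,5}%)|(?<rtt>dev = .*)");
    static final String[] NAMES = {"loss", "rtt"};

    /** Collects the first value seen for each named group across all matches. */
    static Map<String, String> extract(String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PATTERN.matcher(content);
        while (m.find()) {
            for (String name : NAMES) {
                // start(name) is -1 when this alternative was not the one that matched.
                if (m.start(name) != -1) {
                    fields.putIfAbsent(name, m.group(name));
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String content = "2 packets transmitted, 2 packets received, 0.0% packet loss\n"
                + "round-trip min/avg/max/stddev = 12.934/13.040/13.145/0.106 ms\n";
        System.out.println(extract(content));
    }
}
```

Pattern B behaves differently because both groups are part of one sequence, so one successful match fills both. Note also that in Ruby-style syntax the MULTILINE option makes the dot match newlines, which affects how far .* can reach.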
