Comments (12)
Often I have found the differences between engines on the same regexp boil down to small variations in the handling of greed (in particular: whether operators are greedy by default), zero-width assertions, anchoring (e.g., whether it is on by default), case-sensitivity, and other meta-properties of the execution model that aren't well-specified. Obviously I don't know what the causes are in this case yet, but that's where I would start looking for examples.
from enry.
Even in C, Oniguruma consistently produces strange results in UTF8 mode for input bytes that are not valid UTF8, like the ones above. This seems like a possible upstream bug.
None of the tokenization regexes in enry use Unicode character classes, so all RE2-based matches are conducted in ASCII-only mode, while go-oniguruma has UTF8 hardcoded.
For our use case the fix would be to force Oniguruma to use ASCII mode as well; that indeed produces identical results even for bytes that are not valid UTF8.
I will submit a patch to the cgo part of https://github.com/src-d/go-oniguruma that adds an option to override the hardcoded UTF8 and exposes it in Go as MustCompileWithEncoding (similar to MustCompileWithOption). As soon as it's merged, I will move enry's regex/oniguruma.go to use it in #227.
Meanwhile, only a better test case has been added there, in a724a2f.
from enry.
Steps to reproduce
$ go test ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
ok gopkg.in/src-d/enry.v1/internal/code-generator/generator 35.537s
$ go test -tags oniguruma ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
FAIL gopkg.in/src-d/enry.v1/internal/code-generator/generator 34.583s
Both runs should produce the same results if the tokenisation results for all Linguist samples are the same.
I'm going to add extra tests to verify just that.
from enry.
Tracked the problem down to a difference in handling what seems to be a latin1-encoded file.
While tokenising thå filling, encoded in latin1 and read as UTF-8:
RE2 gets
th
filling
�
Oniguruma gets
th
illing
�
f
The flex-based tokeniser from #218 gets
th
filling
from enry.
Of the three, only Flex really seems to be doing anything reasonable here. RE2 is close—but the order of the tokens is weird. I have no idea what Oniguruma is doing there, but it seems obviously broken in at least two ways.
from enry.
Yup. And if the content is decoded from latin1 and re-encoded as UTF-8 with charmap.ISO8859_1.NewDecoder().Bytes(content), the result is the expected one:
RE2
th
filling
å
oniguruma
th
filling
å
That would not bother me much if Linguist had not added such a case to their samples 2 months ago. Those samples are what the content classifier is trained on, and we keep "gold standard" results for them as part of our test fixtures.
I know that Linguist uses an ICU-based character encoding detector, https://github.com/brianmario/charlock_holmes, but I am not sure yet whether it is part of the tokenisation.
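The re-encoding step mentioned above can be sketched without external packages. This is only an illustration, not enry's code: latin1ToUTF8 and normalize are hypothetical names, and treating every input that is not valid UTF-8 as latin1 is an assumption that a real encoding detector would replace.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 re-encodes ISO-8859-1 bytes as UTF-8. Each latin1 byte maps
// directly to the Unicode code point with the same numeric value.
// Hypothetical helper, standing in for charmap.ISO8859_1.NewDecoder().Bytes.
func latin1ToUTF8(content []byte) []byte {
	out := make([]byte, 0, len(content))
	var buf [utf8.UTFMax]byte
	for _, b := range content {
		n := utf8.EncodeRune(buf[:], rune(b))
		out = append(out, buf[:n]...)
	}
	return out
}

// normalize passes valid UTF-8 through unchanged and ASSUMES that anything
// else is latin1. That assumption is for illustration only.
func normalize(content []byte) []byte {
	if utf8.Valid(content) {
		return content
	}
	return latin1ToUTF8(content)
}

func main() {
	raw := []byte("th\xe5 filling") // "thå filling" encoded in latin1
	fmt.Printf("%q\n", normalize(raw))
}
```

After this step the input is valid UTF-8, so the å reaches the tokeniser as a single well-formed character rather than as a stray byte.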
from enry.
Yeah, I think either we should normalize the encoding or find a way to treat the Unicode replacement character as part of the token, e.g., xxx�yyy ⇒ xxx�yyy, or use it as a separator and discard it, e.g., xxx�yyy ⇒ xxx � yyy ⇒ xxx yyy. It seems like Linguist does the latter, maybe.
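The "separator and discard" option could look roughly like the sketch below. splitOnInvalid is a hypothetical name, and this is not what Linguist or enry actually implements; it only shows that the standard library can turn both invalid bytes and literal U+FFFD characters into token boundaries.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitOnInvalid treats invalid UTF-8 bytes and literal U+FFFD characters
// as separators and discards them. Hypothetical helper for illustration.
func splitOnInvalid(content []byte) [][]byte {
	// Replace each run of invalid bytes with a single space.
	cleaned := bytes.ToValidUTF8(content, []byte(" "))
	// Also drop replacement characters that are already present in the input.
	cleaned = bytes.ReplaceAll(cleaned, []byte("\uFFFD"), []byte(" "))
	// Split on whitespace; the separators themselves are not kept.
	return bytes.Fields(cleaned)
}

func main() {
	for _, tok := range splitOnInvalid([]byte("xxx\xe5yyy")) {
		fmt.Printf("%q\n", tok)
	}
}
```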
from enry.
True. And Linguist with the flex-based tokeniser does not have this issue, so there is no need for encoding detection there. Thank you for the suggestions, let me think a bit more about that.
from enry.
After digging deeper, it seems that the offending tokenization rule is extractAndReplaceRegular(content), which matches [0-9A-Za-z_\.@#\/\*]+.
extractRemainders is then called on its output and does bytes.Fields(content):
- in the RE2 case the content is " \xe5 ", which results in an extra �
- in the Oniguruma case it is "\xe5 f"
from enry.
For the record: doing the equivalent operation in Ruby, where regexes are backed by the Oniguruma library, results in ArgumentError: invalid byte sequence in UTF-8:
$ irb
"th\xdd filling".scan(/[0-9A-Za-z_\.@#\/\*]+/)
from enry.
Digging a little deeper with Oniguruma's C API, using the awesome examples, it starts to look like this may be a bug in the go-oniguruma Regexp.findAll() implementation 🤔
from enry.
Fixed by #227
from enry.