Comments (8)
It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex.
Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer.
from hyperscan.
The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/
Is that a particularly “bad” pattern for compilation? I also tried it with only strings of random letters, and that took just 25 seconds to compile.
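For anyone who wants to reproduce a set like this, here is a minimal sketch of a generator for the pattern style described above. The function names, the fixed seed, and the choice of lowercase ASCII letters are assumptions for illustration; the original set was not necessarily produced this way.

```python
import random
import string


def make_pattern(rng):
    """Build one pattern: 6-30 random lowercase letters, then [0-9]{2,10}."""
    length = rng.randint(6, 30)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters + "[0-9]{2,10}"


def make_pattern_set(count, seed=0):
    """Return `count` patterns; a fixed seed keeps the set reproducible."""
    rng = random.Random(seed)
    return [make_pattern(rng) for _ in range(count)]


if __name__ == "__main__":
    # Print a handful of patterns in the /.../ form used in the example above.
    for pattern in make_pattern_set(5):
        print(f"/{pattern}/")
```

Calling `make_pattern_set(100000)` would give a set of the size discussed in this thread. Dropping the `[0-9]{2,10}` suffix gives the letters-only variant.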
from hyperscan.
Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many patterns with similarity, and we have an analysis phase that is spending a lot of time and memory trying to optimise for these patterns.
We are now investigating improvements to this particular phase.
from hyperscan.
We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile.
This will be included in the next Hyperscan release.
from hyperscan.
Thanks Matt, that sounds great. Looking forward to the release.
from hyperscan.
I faced the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 270,000 lines, and the lines all look similar. Compiling the database ran for about 2 hours on a machine with a 1.8 GHz CPU and 64 GB of memory, and then the process was killed for running out of memory. A sample of the regex file is below:
76870 中(\p{Han}?|\P{Han}{0,5})文(\p{Han}?|\P{Han}{0,5})汉(\p{Han}?|\P{Han}{0,5})字
76871 一(\p{Han}?|\P{Han}{0,5})个(\p{Han}?|\P{Han}{0,5})例(\p{Han}?|\P{Han}{0,5})子
76872 就(\p{Han}?|\P{Han}{0,5})像(\p{Han}?|\P{Han}{0,5})这(\p{Han}?|\P{Han}{0,5})样(\p{Han}?|\P{Han}{0,5})的
compile flags:
HS_FLAG_CASELESS|HS_FLAG_DOTALL|HS_FLAG_UTF8|HS_FLAG_UCP
Are these regexes too complicated for Hyperscan, or is something else wrong?
from hyperscan.
Hi @YueHonghui,
Your issue is a little bit different from the earlier one in this report, actually, and should probably become its own separate issue on GitHub -- can you create a new issue? (Or if you would prefer to contact us directly, feel free to email [email protected].)
Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like \p{Han} and \P{Han}. During pattern compilation, Hyperscan expands these constructs into the byte sequences that can match them, which leads to very large pattern graphs -- especially when these constructs are wrapped in bounded repeats like {0,5}.
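To give a rough feel for the size of such an expansion, the sketch below counts the UTF-8 byte prefixes needed to cover a range of Han characters. It only illustrates UTF-8 encoding and is not Hyperscan's actual internal expansion. As an assumption, it uses the CJK Unified Ideographs block (U+4E00 to U+9FFF) as a stand-in for \p{Han}, which really covers more code points.

```python
# Assumption: the CJK Unified Ideographs block stands in for \p{Han}.
HAN_START, HAN_END = 0x4E00, 0x9FFF


def utf8_prefixes(start, end):
    """Distinct two-byte prefixes of the UTF-8 encodings in [start, end]."""
    return {chr(cp).encode("utf-8")[:2] for cp in range(start, end + 1)}


code_points = HAN_END - HAN_START + 1
prefixes = utf8_prefixes(HAN_START, HAN_END)
lead_bytes = sorted({p[0] for p in prefixes})

print("code points in range:", code_points)
print("distinct 2-byte prefixes:", len(prefixes))
print("lead bytes:", [hex(b) for b in lead_bytes])
```

Every character in this range takes three bytes in UTF-8, so a single character class becomes a multi-byte structure. The negated class \P{Han} has to cover everything else, from one to four bytes per character, so a repeat like {0,5} can span anywhere from 0 to 20 bytes.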
While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here.
Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern (\p{Han}?|\P{Han}{0,5}) precisely what you need to match, or could it be weakened to a simpler sub-pattern? Do your patterns need separate IDs, or could they use a single ID or small group of IDs? (This might allow for some merging of data structures at pattern compile time.)
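As a rough illustration of what "weakening" could look like, the sketch below compares the quoted sub-pattern with a hypothetical simpler gap of up to five arbitrary characters. Both the weaker form and the use of Python's `re` module are assumptions for illustration. Python's `re` has no \p{Han}, so the range U+4E00 to U+9FFF stands in for it.

```python
import re

# Assumption: [\u4e00-\u9fff] approximates \p{Han} for this demonstration.
HAN = "\u4e00-\u9fff"
strict_gap = f"(?:[{HAN}]?|[^{HAN}]{{0,5}})"  # the sub-pattern quoted above
weak_gap = ".{0,5}"                            # hypothetical weaker form

strict = re.compile(f"中{strict_gap}文", re.DOTALL)
weak = re.compile(f"中{weak_gap}文", re.DOTALL)

samples = ["中文", "中x文", "中 a-b 文", "中汉文", "中汉字文"]
for text in samples:
    # Show whether each form matches the sample.
    print(text, bool(strict.search(text)), bool(weak.search(text)))
```

The weaker form accepts everything the strict form accepts, but it also accepts inputs such as two Han characters in the gap, so it is only an option if those extra matches are acceptable to the application.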
Finally, if you can share a larger sample of your patterns, either in a GitHub issue or via email, that would make it easier to see if there are improvements we can make that reduce the memory requirements for this case.
from hyperscan.
@jviiret Thank you for your reply. I've sent an email to [email protected].
from hyperscan.