Giter VIP home page Giter VIP logo

Comments (8)

mdb256 avatar mdb256 commented on May 31, 2024

It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex.

Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer.

from hyperscan.

madswj avatar madswj commented on May 31, 2024

The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/

Is that a particularly “bad” pattern for the compilation? I just tried it with just strings of random letters, and that only took 25 seconds to compile.

from hyperscan.

mdb256 avatar mdb256 commented on May 31, 2024

Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many patterns with similarity, and we have an analysis phase that is spending a lot of time and memory trying to optimise for these patterns.

We are now investigating improvements to this particular phase.

from hyperscan.

mdb256 avatar mdb256 commented on May 31, 2024

We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile.

This will be included in the next Hyperscan release.

from hyperscan.

madswj avatar madswj commented on May 31, 2024

Thanks Matt, that sounds great. Looking forward to the release.

from hyperscan.

YueHonghui avatar YueHonghui commented on May 31, 2024

I faced the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 27w lines, each line seems similar. Compiling the db takes about 2 hours at a machine with 1.8 GHz CPU and 64G memory, but the process was killed because of out of memory. Piece of regex file below:

76870   中(\p{Han}?|\P{Han}{0,5})文(\p{Han}?|\P{Han}{0,5})汉(\p{Han}?|\P{Han}{0,5})字
76871   一(\p{Han}?|\P{Han}{0,5})个(\p{Han}?|\P{Han}{0,5})例(\p{Han}?|\P{Han}{0,5})子
76872   就(\p{Han}?|\P{Han}{0,5})像(\p{Han}?|\P{Han}{0,5})这(\p{Han}?|\P{Han}{0,5})样(\p{Han}?|\P{Han}{0,5})的

compile flags:

HS_FLAG_CASELESS|HS_FLAG_DOTALL|HS_FLAG_UTF8|HS_FLAG_UCP

Is these regex too complicated to hyperscan, or something else wrong?

from hyperscan.

jviiret avatar jviiret commented on May 31, 2024

Hi @YueHonghui,

Your issue is a little bit different from the earlier one in this report, actually, and should probably become its own separate issue on Github -- can you create a new issue? (Or if you would prefer to contact us directly, feel free to email [email protected].)

Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like \p{Han} and \P{Han}. During pattern compilation, Hyperscan expands these constructs into the byte sequences that can match them, which leads to very large pattern graphs -- especially when these constructs are wrapped in bounded repeats like {0,5}.

While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here.

Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern (\p{Han}?|\P{Han}{0,5}) precisely what you need to match, or could it be weakened to a simpler sub-pattern? Do your patterns need separate IDs, or could they use a single ID or small group of IDs? (This might allow for some merging of data structures at pattern compile time.)

Finally, if you can share a larger sample of your patterns, either in a Github issue or via email, that would make it easier to see if there are improvements we can make that reduce the memory requirements for this case.

from hyperscan.

YueHonghui avatar YueHonghui commented on May 31, 2024

@jviiret Thank you for your reply, I'v posted an email to [email protected].

from hyperscan.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.