Compiling a database using hs_compile_multi() uses a lot of memory. I tried to compile

Database compilation memory leak? (hyperscan, closed)

intel commented on May 31, 2024

Database compilation memory leak?

from hyperscan.

Comments (8)

mdb256 commented on May 31, 2024

It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex.

Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer.

madswj commented on May 31, 2024

The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/

Is that a particularly “bad” pattern for the compilation? I just tried it with just strings of random letters, and that only took 25 seconds to compile.
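For anyone who wants to reproduce a similar test set, here is a minimal Python sketch of the generator described above. The function name, the fixed seed, and the choice of lowercase ASCII letters are assumptions; the original generator was not posted in this thread.

```python
import random
import string


def make_patterns(count, seed=0):
    """Return `count` patterns: 6-30 random lowercase letters + [0-9]{2,10}."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    patterns = []
    for _ in range(count):
        length = rng.randint(6, 30)  # randint is inclusive on both ends
        prefix = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        patterns.append(prefix + "[0-9]{2,10}")
    return patterns


if __name__ == "__main__":
    # Print a few patterns in the /.../ form used in the example above.
    for pattern in make_patterns(3):
        print("/" + pattern + "/")
```

Calling `make_patterns(100000)` would give a set of roughly the size discussed here; dropping the `[0-9]{2,10}` suffix gives the plain-string variant.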

mdb256 commented on May 31, 2024

Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many patterns with similarity, and we have an analysis phase that is spending a lot of time and memory trying to optimise for these patterns.

We are now investigating improvements to this particular phase.

mdb256 commented on May 31, 2024

We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile.

This will be included in the next Hyperscan release.

madswj commented on May 31, 2024

Thanks Matt, that sounds great. Looking forward to the release.

YueHonghui commented on May 31, 2024

I hit the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 270,000 lines, and the lines are all similar. Compilation ran for about 2 hours on a machine with a 1.8 GHz CPU and 64 GB of memory, and then the process was killed for running out of memory. A piece of the regex file is below:

76870   中(\p{Han}?|\P{Han}{0,5})文(\p{Han}?|\P{Han}{0,5})汉(\p{Han}?|\P{Han}{0,5})字
76871   一(\p{Han}?|\P{Han}{0,5})个(\p{Han}?|\P{Han}{0,5})例(\p{Han}?|\P{Han}{0,5})子
76872   就(\p{Han}?|\P{Han}{0,5})像(\p{Han}?|\P{Han}{0,5})这(\p{Han}?|\P{Han}{0,5})样(\p{Han}?|\P{Han}{0,5})的

compile flags:

HS_FLAG_CASELESS|HS_FLAG_DOTALL|HS_FLAG_UTF8|HS_FLAG_UCP

Are these regexes too complicated for Hyperscan, or is something else wrong?

jviiret commented on May 31, 2024

Hi @YueHonghui,

Your issue is a little bit different from the earlier one in this report, actually, and should probably become its own separate issue on GitHub -- can you create a new issue? (Or if you would prefer to contact us directly, feel free to email [email protected].)

Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like \p{Han} and \P{Han}. During pattern compilation, Hyperscan expands these constructs into the byte sequences that can match them, which leads to very large pattern graphs -- especially when these constructs are wrapped in bounded repeats like {0,5}.
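To give a rough feel for that expansion, the Python sketch below summarises the UTF-8 encodings of a single block of Han characters. It assumes the CJK Unified Ideographs block, U+4E00 to U+9FFF, as a stand-in for \p{Han} (the real property covers more ranges), and it only illustrates the encoding side; it says nothing about how Hyperscan builds its graphs internally.

```python
# Assumption: U+4E00..U+9FFF stands in for \p{Han}; the real property is larger.
HAN_FIRST, HAN_LAST = 0x4E00, 0x9FFF


def utf8_summary(first, last):
    """Summarise the UTF-8 encodings of code points first..last (inclusive)."""
    encodings = [chr(cp).encode("utf-8") for cp in range(first, last + 1)]
    return {
        "code_points": len(encodings),
        "byte_lengths": sorted({len(e) for e in encodings}),
        "lead_bytes": sorted({e[0] for e in encodings}),
    }


if __name__ == "__main__":
    info = utf8_summary(HAN_FIRST, HAN_LAST)
    print(info["code_points"], "code points")
    print("encoded lengths:", info["byte_lengths"])
    print("lead bytes:", [hex(b) for b in info["lead_bytes"]])
```

Every code point in this block takes three bytes, so under that assumption a single \p{Han} stands for a set of three-byte sequences rather than one byte class, and each pattern above contains several of them.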

While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here.

Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern (\p{Han}?|\P{Han}{0,5}) precisely what you need to match, or could it be weakened to a simpler sub-pattern? Do your patterns need separate IDs, or could they use a single ID or small group of IDs? (This might allow for some merging of data structures at pattern compile time.)

Finally, if you can share a larger sample of your patterns, either in a GitHub issue or via email, that would make it easier to see if there are improvements we can make that reduce the memory requirements for this case.

YueHonghui commented on May 31, 2024

@jviiret Thank you for your reply, I've sent an email to [email protected].
