Comments (10)
If the finalizer mechanism isn't sufficient to avoid memory pressure, then the API is incompatible (despite superficial similarity) and shouldn't be treated as a drop-in replacement IMO.
from enry.
you mean incompatible with go's std. library?
This is go binding for c implementation of oniguruma. For every c implementation you have to manually free memory and finalizer in bindings is just a wish. Native go implementation doesn't have this problem.
from enry.
This is go binding for c implementation of oniguruma. For every c implementation you have to manually free memory and finalizer in bindings is just a wish. Native go implementation doesn't have this problem.
Yes, agreed. My point is that one cannot replace a use of (say) the Go regexp
package with Oniguruma C bindings without re-working the control flow of the library that depends on it, if the integration with the GC is not sufficient to manage memory. In particular, if the library demands manual release of allocated memory, it's not sufficient to update the function calls and types, we also need to change the way we track and plumb regexp matching types through the code that uses it.
Which is mostly to say that fixing this may require more changes than just plugging in free calls. The point of using Oniguruma in the first place was supposed to be performance, but that relies in large part on the ability to cache reused expressions.
from enry.
So what do you suggest?
- get rid of oniguruma and just stick to go's std. library
- keep oniguruma and just rely on finalizers
- add free/close/dispose function (next to
MustCompile
) to enry's API
from enry.
So what do you suggest?
- get rid of oniguruma and just stick to go's std. library
- keep oniguruma and just rely on finalizers
- add free/close/dispose function (next to
MustCompile
) to enry's API
Any of those seems viable. My inclination would be to rely on finalizers and not worry about it, but the main issue raises the concern that this isn't sufficient. If we aren't having memory leaks or other runtime issues because of this, I think it should be sufficient to trust the finalizer; otherwise we should do something different.
Adding a free function isn't enough: Everywhere a wrapped object is allocated we would have to match it with a release, which means adding an explicit ownership path through the API. The obvious way to do that is to release eagerly whenever a value goes out of scope. I predict that would result in a performance degradation, since we'd have to recompile the same expressions repeatedly.
I'm not sure why we switched to Oniguruma in the first place, but I suspect it's either a matter of syntax (since regexp
supports only RE2, which doesn't handle common PCRE-style extensions) or of performance. If it's for syntax we might be stuck; if it's for performance I think we can do better without needing a C binding.
from enry.
The author of the original oniguruma port here. The goal was to get a performance boost, which appeared to be great - 2x to 4x as far as I remember. I picked it because somebody mentioned it would be fun to try, and it was fun for me to write some code. Nobody knew we would have annoying memory problems beforehand.
oniguruma is a much faster regexp engine, and knowing the nature of the benchmarks/real usage, I doubt that any regexp recompilation will mitigate the benefit. I bet that it will be negligible.
from enry.
So far, we didn't notice memory leaks (that's why I emphasized potential). Mysql has a slightly different use case. For cached pre-compiled matchers we rely on finalizers, otherwise we call dispose
. So, I'm ok to keep what we have in enry.
Regarding switch to oniguruma, as far as I know the main reason was performance (oniguruma is at least 4x faster). Also for mysql oniguruma is a default regex engine because of go's regex is kind of special, but in enry it doesn't have to be a case.
from enry.
I would advise not to add complexity to this common library until this is proven necessary for all downstream use cases. Personally, so far I have not seen:
if the integration with the GC is not sufficient to manage memory
this to be the case yet and esp. because, as Michale noted above, that would most probably require API surface changes.
What makes me particularly reluctant is
So far, we didn't notice memory leaks (that's why I emphasized potential)
On the performance
The goal was to get a performance boost, which appeared to be great - 2x to 4x as far as I remember
Although that is true for general regexp matching, I would really love to have some data points first to support that claim specifically on enry performance e.g as a part of benchmark suite (#210 that could be extended to potentially spotlight any memory leaks)
from enry.
OK, so I'm closing this issue.
from enry.
performance must be considered seriously here.
It's really not a joke, namely when i am designed a centralized pipeline template that all teams use it.
Just imagine the miserable pipeline system, one guy push a huge git repo !!
from enry.
Related Issues (20)
- benchmark: refactor to be usable on CI for releases
- CLI app: refactor and improve performance
- Add IsGenerated function to the API HOT 1
- CLI: -mode=line reports bad results HOT 2
- enry - linguist single file report logic mismatch HOT 3
- Release: use non-EOLed JDK version HOT 1
- Different tokenisation results between oniguruma and RE2 HOT 12
- CLI: inconsistent path filtering HOT 1
- Reports gitignore as vendor HOT 2
- IsVendor could be changed to use a single regexp HOT 1
- Make build not working in golang:alpine container HOT 3
- import is a program, not an importable package HOT 4
- cannot find package "github.com/src-d/enry/v2" HOT 3
- Clarify installation instruction HOT 1
- Language detection accuracy measurements
- Not handling extensions HOT 2
- Get color from "parent language" if there is no definition for the language itself HOT 6
- Is this and all other srcd projects dead? HOT 13
- How to get a report of a whole directory when using it as Go module
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from enry.