Comments (11)
Here is another input I found that gets stuck in the while loop:
String test = "7h... nh? a nhìu th?t nhìu ^^~..........................................................................."
Looks like the "....." pattern is the problem?
from ark-tweet-nlp.
This is not the only bad example. Also, I don't think the while loop is the problem; it might be catastrophic backtracking in one of the regex matches, which can take years.
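To make the backtracking suspicion concrete, here is a minimal sketch that counts how many characters the regex engine reads while failing to match. The two patterns are textbook examples chosen for illustration, not the tokenizer's actual regexes, and `BacktrackingProbe`, `CountingSequence`, `countReads`, and `repeat` are hypothetical names. How fast the counts grow depends on the pattern and on the JDK version, so no expected output is shown.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative only: these patterns are NOT taken from ark-tweet-nlp.
class BacktrackingProbe {

    /** Wraps a string and counts how often the regex engine reads a character. */
    static final class CountingSequence implements CharSequence {
        private final CharSequence text;
        long reads = 0;

        CountingSequence(CharSequence text) { this.text = text; }

        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public int length() { return text.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return text.subSequence(start, end);
        }
        @Override public String toString() { return text.toString(); }
    }

    /** Runs find() once and returns the number of charAt calls it triggered. */
    static long countReads(Pattern pattern, String input) {
        CountingSequence seq = new CountingSequence(input);
        pattern.matcher(seq).find();
        return seq.reads;
    }

    /** Builds a string of n copies of c. */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Nested quantifier; the input never contains 'b', so every attempt fails.
        Pattern nested = Pattern.compile("(a+)+b");
        // Same shape plus a backreference (assumed example, see lead-in).
        Pattern withBackref = Pattern.compile("(a+)+b\\1");
        for (int n = 4; n <= 16; n += 4) {
            String input = repeat('a', n);
            System.out.println("n=" + n
                    + " nested=" + countReads(nested, input)
                    + " withBackref=" + countReads(withBackref, input));
        }
    }
}
```

If the read count multiplies each time the input grows by a few characters, the pattern is backtracking badly on that input, and a long run of repeated characters like the dots above is the kind of input that triggers it.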
Yeah it's one of the regexes for sure...
If you could assemble as many as you can find, it would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is nondeterminism in the regex engine: once you start using lookaheads/lookbehinds it's no longer a DFA, and horrible things can happen. I'm afraid this is looking complicated to figure out.)
Actually, no, don't bother. I think I see the problem (the eastern emoticon system).
OK, it's due to the use of a backreference (or some sort of regex pattern reference) in "basicface" when it is embedded inside "eastEmote". I have to consult with the author of this to see what's going on.
Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15).
And re: "This is not the only bad example."
Of course. Welcome to large-scale text processing: every one-in-a-billion bug will happen :) So, especially in a distributed system, you have to log which section of the dataset your runner is working on.
When I used to use Hadoop for NLP software that hit bugs like this once per 10 or 100 million inputs or so, we had to run the NLP system out-of-process and monitor it for hangs like this one, so that you can notice when the process isn't making progress and kill it. Not a fun situation, sorry :(
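For readers who cannot run the tagger out-of-process, one in-process alternative (a different technique from the one described above) is to give each regex match a time budget. This is a minimal sketch, assuming a wrapper around the input that checks a deadline on every character read; `DeadlineRegex`, `DeadlineSequence`, `RegexTimeoutException`, and `findWithin` are hypothetical names, and the pattern in `main` is a stand-in, not one of the tokenizer's regexes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

class DeadlineRegex {

    /** Thrown when a match runs past its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) { super(message); }
    }

    /** CharSequence view that aborts the match once the deadline has passed. */
    static final class DeadlineSequence implements CharSequence {
        private final CharSequence text;
        private final long deadlineNanos;

        DeadlineSequence(CharSequence text, long deadlineNanos) {
            this.text = text;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // The regex engine reads input through charAt, so this check runs
            // throughout the match, including while it backtracks.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return text.charAt(index);
        }
        @Override public int length() { return text.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineSequence(text.subSequence(start, end), deadlineNanos);
        }
        @Override public String toString() { return text.toString(); }
    }

    /** Returns whether pattern occurs in input; throws RegexTimeoutException if over budget. */
    static boolean findWithin(Pattern pattern, String input, long budgetMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        return pattern.matcher(new DeadlineSequence(input, deadline)).find();
    }

    public static void main(String[] args) {
        Pattern dots = Pattern.compile("\\.{2,}"); // stand-in pattern
        List<String> tweets = Arrays.asList("hello...", "no dots here");
        for (int i = 0; i < tweets.size(); i++) {
            try {
                System.out.println(i + ": " + findWithin(dots, tweets.get(i), 1000));
            } catch (RegexTimeoutException e) {
                // Log the position in the dataset so the bad input can be found later.
                System.err.println("timed out on input #" + i);
            }
        }
    }
}
```

Checking the clock on every character read adds overhead, and the guard only covers code that goes through this wrapper, so an external watchdog is still the more robust option for a whole pipeline.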
I just ran it on 1489999 or so tweets, and here is the distribution of the number of milliseconds per tweet. The 8 worst ones take more than 500 ms, which isn't good (max 921 ms) ... this sounds like another regex bug, perhaps ... but the rest seem to terminate OK. Maybe if you go to a billion tweets the maxes are worse? I started a larger job, but it won't finish for a while. (Another caveat with this experiment: this includes JSON parsing, done in-process with --input-format json.)
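A per-tweet timing distribution like this can be collected with a small harness. The sketch below is an assumption about how one might do it, not the script used for the numbers above: `LatencyHistogram`, `timeEach`, and `bucketCounts` are hypothetical names, the bucket bounds are arbitrary, and the `split` call in `main` is a stand-in for the tagger.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class LatencyHistogram {

    /** Runs task once per input and returns the elapsed whole milliseconds for each. */
    static long[] timeEach(List<String> inputs, Consumer<String> task) {
        long[] millis = new long[inputs.size()];
        for (int i = 0; i < inputs.size(); i++) {
            long start = System.nanoTime();
            task.accept(inputs.get(i));
            millis[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }

    /**
     * Counts values per bucket. Bucket i holds values below upperBounds[i] that did
     * not fit an earlier bucket; the final bucket holds everything at or above the
     * last bound. upperBounds must be in ascending order.
     */
    static long[] bucketCounts(long[] millis, long[] upperBounds) {
        long[] counts = new long[upperBounds.length + 1];
        for (long ms : millis) {
            int bucket = 0;
            while (bucket < upperBounds.length && ms >= upperBounds[bucket]) {
                bucket++;
            }
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("first tweet", "second tweet ^^~");
        // Stand-in workload; replace with the call being measured.
        long[] millis = timeEach(tweets, t -> t.split("\\s+"));
        long[] counts = bucketCounts(millis, new long[] {2, 10, 500, 1000});
        System.out.println(Arrays.toString(counts));
    }
}
```

Keeping the slowest inputs themselves, not just their counts, makes it much easier to turn an outlier into a minimal test case.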
Cool. I'm going to test it on the daily.10k tonight.
It passed the test on daily.10k with 14720000 tweets. I will try a larger dataset tomorrow.
Jay
OK, from a bigger sample of 143mil tweets ("daily 100k" here) I'm seeing a similar distribution in the low range (<2 ms).
There are 4146 tweets that took more than 10 ms, and 96 that took longer than 1000 ms (max 143 seconds); those last ones should be investigated at some point.