
Comments (11)

haijieg commented on August 23, 2024

Another input I found that gets stuck in the while-loop:

String test = "7h... nh? a nhìu th?t nhìu ^^~...........................................................................";

Looks like the "....." pattern is the problem?

from ark-tweet-nlp.

haijieg commented on August 23, 2024

This is not the only bad input. Also, I don't think the problem is the while loop itself; it looks like catastrophic regex backtracking, which can take practically forever.


brendano commented on August 23, 2024

Yeah it's one of the regexes for sure...


brendano commented on August 23, 2024

If you could assemble as many bad inputs as you can find, it would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is in the nondeterminism of the regex engine... once you start using lookaheads/lookbehinds it's no longer a DFA and horrible things can happen. I'm afraid this is looking complicated to figure out.)
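None of the actual ark-tweet-nlp patterns are reproduced here, but the failure mode itself is easy to demonstrate with a deliberately pathological nested quantifier:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers let the engine try exponentially many ways to
        // split the input between the inner and outer '+' before giving up.
        Pattern bad = Pattern.compile("(a+)+b");

        // Succeeds instantly when a 'b' is present.
        System.out.println(bad.matcher("aaaab").matches());   // true

        // With no 'b', every split must be tried: roughly 2^(n-1) attempts.
        String input = "a".repeat(22);
        long t0 = System.nanoTime();
        boolean matched = bad.matcher(input).matches();       // false, but slow
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("matched=" + matched + " after " + ms + " ms");
        // Each extra 'a' roughly doubles the time, so a ~100-char run of
        // repeated characters against a similarly structured pattern never
        // finishes in practice -- which is what a long "....." tail triggers.
    }
}
```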


brendano commented on August 23, 2024

Actually, no, don't bother. I think I see the problem (the Eastern-style emoticon patterns).


brendano commented on August 23, 2024

OK, it's due to a backreference (or some sort of regex pattern reference) in "basicface" when it's embedded inside "eastEmote". I'll have to consult the author of this to see what's going on.
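I don't have the actual basicface/eastEmote definitions in front of me, but the general hazard with a numbered backreference inside a sub-pattern that gets spliced into a larger pattern string can be sketched (the quote-matching sub-pattern below is hypothetical):

```java
import java.util.regex.Pattern;

public class GroupShiftDemo {
    public static void main(String[] args) {
        // A sub-pattern that works on its own: group 1 captures the opening
        // quote, and \1 requires the same character to close it.
        String quoted = "(['\"]).*?\\1";
        System.out.println(Pattern.compile(quoted)
                .matcher("'x'").matches());                  // true

        // Splice it after another capturing group and the group numbering
        // shifts: \1 now points at (\w+) instead of the quote group, so the
        // composed pattern silently means something different.
        Pattern composed = Pattern.compile("(\\w+)" + quoted);
        System.out.println(composed.matcher("abc'x'").find()); // false: \1 would
                                                               // have to repeat "abc"
        // Named groups, e.g. (?<q>['\"]) ... \k<q>, sidestep this when
        // sub-patterns are composed as strings.
    }
}
```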


brendano commented on August 23, 2024

Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15).

and re:

This is not the only bad example.

Of course. Welcome to large-scale text processing: every one-in-a-billion bug will happen :) ... so, especially in a distributed system, you have to log which section of the dataset each runner is working on...

When I used to use Hadoop for NLP software that had bugs like this once per 10 or 100 million documents or so, we had to run the NLP system out-of-process and monitor for deadlocks like this one, so you can notice when the process isn't making progress and kill it. Not a fun situation, sorry :(
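Out-of-process supervision is the robust fix, since java.util.regex never checks thread interrupts on its own. A common in-process approximation (a sketch, not what ark-tweet-nlp ships) is to feed the matcher a CharSequence that polls for interrupts, and run the match under a Future with a timeout:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class RegexWatchdog {
    // The regex engine calls charAt() constantly, so a wrapper CharSequence
    // gives us a hook to abort a runaway match when the thread is interrupted.
    static class InterruptibleCharSequence implements CharSequence {
        final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int i) {
            if (Thread.currentThread().isInterrupted())
                throw new RuntimeException("regex match interrupted");
            return inner.charAt(i);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int s, int e) {
            return new InterruptibleCharSequence(inner.subSequence(s, e));
        }
        public String toString() { return inner.toString(); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Pattern bad = Pattern.compile("(a+)+b");   // pathological on purpose
        String tweet = "a".repeat(60);             // would otherwise run for years
        Future<Boolean> f = pool.submit(() ->
                bad.matcher(new InterruptibleCharSequence(tweet)).matches());
        try {
            System.out.println(f.get(200, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupt -> charAt() throws, match dies promptly
            System.out.println("killed runaway match, skipping this tweet");
        }
        pool.shutdown();
    }
}
```

This still costs one thread per match attempt, which is why running the tokenizer as a separate killable process scales better for billion-tweet jobs.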


brendano commented on August 23, 2024

I just ran it on 1,489,999 or so tweets just now and here is the distribution of milliseconds per tweet. The 8 worst took more than 500 ms (max 921 ms), which isn't good ... this sounds like another regex bug, perhaps ... but the rest seem to terminate OK. Maybe if you go to a billion tweets the maxima are worse? I started a larger job but it won't finish for a while. (Another caveat with this experiment: the timings include JSON parsing, done in-process with --input-format json.)


haijieg commented on August 23, 2024

Cool. I'm going to test it on the daily.10k tonight.


haijieg commented on August 23, 2024

It passed the test of daily.10k with 14,720,000 tweets. I'll try a larger dataset tomorrow.

Jay


brendano commented on August 23, 2024

OK, from a bigger sample of 143 million tweets ("daily 100k" here) I'm seeing a similar distribution for the lower (<2 ms) range.

There are 4146 tweets that took more than 10ms, and 96 that took longer than 1000ms (max 143 seconds); those last ones should be investigated at some point.


