
Comments (11)

haijieg commented on August 23, 2024

Another input I found that gets stuck in the while-loop:

String test = "7h... nh? a nhìu th?t nhìu ^^~...........................................................................";

Looks like the "....." pattern is the problem?

from ark-tweet-nlp.

haijieg commented on August 23, 2024

This is not the only bad input. Also, I don't think the problem is the while loop itself; it looks like catastrophic regex backtracking, which can take practically forever.


brendano commented on August 23, 2024

Yeah it's one of the regexes for sure...


brendano commented on August 23, 2024

If you could assemble as many bad inputs as you can find, it would be helpful. I'm trying to create minimal test cases and it's very subtle. (The problem is in the nondeterminism of the regex engine... once you start using lookaheads/lookbehinds it's no longer a DFA and horrible things can happen. I'm afraid this is looking complicated to figure out.)
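None of the actual ark-tweet-nlp patterns are reproduced here, but the failure mode itself is easy to demonstrate with a deliberately pathological nested quantifier:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers let the engine try exponentially many ways to
        // split the input between the inner and outer '+' before giving up.
        Pattern bad = Pattern.compile("(a+)+b");

        // Succeeds instantly when a 'b' is present.
        System.out.println(bad.matcher("aaaab").matches());   // true

        // With no 'b', every split must be tried: roughly 2^(n-1) attempts.
        String input = "a".repeat(22);
        long t0 = System.nanoTime();
        boolean matched = bad.matcher(input).matches();       // false, but slow
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("matched=" + matched + " after " + ms + " ms");
        // Each extra 'a' roughly doubles the time, so a ~100-char run of
        // repeated characters against a similarly structured pattern never
        // finishes in practice -- which is what a long "....." tail triggers.
    }
}
```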


brendano commented on August 23, 2024

Actually, no, don't bother. I think I see the problem (the Eastern-style emoticon patterns).


brendano commented on August 23, 2024

OK, it's due to a backreference (or some sort of regex pattern reference) in "basicface" when it's embedded inside "eastEmote". I'll have to consult the author of this to see what's going on.
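I don't have the actual basicface/eastEmote definitions in front of me, but the general hazard with a numbered backreference inside a sub-pattern that gets spliced into a larger pattern string can be sketched (the quote-matching sub-pattern below is hypothetical):

```java
import java.util.regex.Pattern;

public class GroupShiftDemo {
    public static void main(String[] args) {
        // A sub-pattern that works on its own: group 1 captures the opening
        // quote, and \1 requires the same character to close it.
        String quoted = "(['\"]).*?\\1";
        System.out.println(Pattern.compile(quoted)
                .matcher("'x'").matches());                  // true

        // Splice it after another capturing group and the group numbering
        // shifts: \1 now points at (\w+) instead of the quote group, so the
        // composed pattern silently means something different.
        Pattern composed = Pattern.compile("(\\w+)" + quoted);
        System.out.println(composed.matcher("abc'x'").find()); // false: \1 would
                                                               // have to repeat "abc"
        // Named groups, e.g. (?<q>['\"]) ... \k<q>, sidestep this when
        // sub-patterns are composed as strings.
    }
}
```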


brendano commented on August 23, 2024

Hi, I just pushed a bugfix to master that makes these examples work. Could you try it on the large dataset? Let me know if building it is annoying (see the comment I wrote on #15).

and re:

This is not the only bad example.

Of course. Welcome to large-scale text processing: every one-in-a-billion bug will happen :) ... so, especially in a distributed system, you have to log which section of the dataset each runner is working on...

When I used to use Hadoop for NLP software that had bugs like this once per 10 or 100 million documents or so, we had to run the NLP system out-of-process and monitor for deadlocks like this one, so you can notice when the process isn't making progress and kill it. Not a fun situation, sorry :(
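Out-of-process supervision is the robust fix, since java.util.regex never checks thread interrupts on its own. A common in-process approximation (a sketch, not what ark-tweet-nlp ships) is to feed the matcher a CharSequence that polls for interrupts, and run the match under a Future with a timeout:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class RegexWatchdog {
    // The regex engine calls charAt() constantly, so a wrapper CharSequence
    // gives us a hook to abort a runaway match when the thread is interrupted.
    static class InterruptibleCharSequence implements CharSequence {
        final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int i) {
            if (Thread.currentThread().isInterrupted())
                throw new RuntimeException("regex match interrupted");
            return inner.charAt(i);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int s, int e) {
            return new InterruptibleCharSequence(inner.subSequence(s, e));
        }
        public String toString() { return inner.toString(); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Pattern bad = Pattern.compile("(a+)+b");   // pathological on purpose
        String tweet = "a".repeat(60);             // would otherwise run for years
        Future<Boolean> f = pool.submit(() ->
                bad.matcher(new InterruptibleCharSequence(tweet)).matches());
        try {
            System.out.println(f.get(200, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            f.cancel(true);  // interrupt -> charAt() throws, match dies promptly
            System.out.println("killed runaway match, skipping this tweet");
        }
        pool.shutdown();
    }
}
```

This still costs one thread per match attempt, which is why running the tokenizer as a separate killable process scales better for billion-tweet jobs.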


brendano commented on August 23, 2024

I just ran it on 1,489,999 or so tweets just now and here is the distribution of milliseconds per tweet. The 8 worst took more than 500 ms (max 921 ms), which isn't good ... this sounds like another regex bug, perhaps ... but the rest seem to terminate OK. Maybe if you go to a billion tweets the maxima are worse? I started a larger job but it won't finish for a while. (Another caveat with this experiment: the timings include JSON parsing, done in-process with --input-format json.)


haijieg commented on August 23, 2024

Cool. I'm going to test it on the daily.10k tonight.


haijieg commented on August 23, 2024

It passed the test of daily.10k with 14,720,000 tweets. I'll try a larger dataset tomorrow.

Jay


brendano commented on August 23, 2024

OK, from a bigger sample of 143 million tweets ("daily 100k" here) I'm seeing a similar distribution for the lower (<2 ms) range.

There are 4146 tweets that took more than 10ms, and 96 that took longer than 1000ms (max 143 seconds); those last ones should be investigated at some point.


