
Comments (5)

artemyarulin commented on August 25, 2024

Hi, I recently found your project and am trying to get familiar with it. Great project, and thanks for making it open source!

Related to this issue: I've noticed that HTML parsing happens in its own thread, and I wonder why. With a crawler we don't care about latency, only about the throughput of the whole system. Wouldn't we get the same result if we gave the parsing thread back to the crawler and did the parsing right away in all the threads?

It would block IO, that's for sure, but the throughput of the whole system should still be at the same level, no? Assuming that we have some sort of back pressure from the parsing thread.

from crusty-core.

let4be commented on August 25, 2024

Hi, thanks for the input!

Doing the parsing in place in all threads (inside TaskProcessor) is certainly possible, and this is how it was implemented in the beginning. However, it means unpredictably blocking the current thread, which may have other async code pending on it.

As you noticed, we don't care much about latency, but I like keeping it manageable. After all, we also have DB connections (clickhouse/redis, in the scope of Crusty), and if those happen to land on a misbehaving thread (say, one kept busy by some weirdly constructed content that takes way longer to process than a typical HTML page) we could have disconnects.
There's also the problem that we could have more disconnects overall from the websites we crawl, so we'd probably have to raise connect timeouts.

Also, when parsing is done in its own threadpool, we can control how much work there is and put a cap on it (the channel buffer).
In broad web crawling (Crusty) I see the following situation all the time: we get a spike in average HTML page complexity, either due to an actual complexity increase or due to a size increase, which causes the buffer to start filling up. We do not slow down crawling when this happens, in the hope that the situation is temporary and will resolve itself.
=> it helps even out the load and utilize the CPU more effectively. I find it's way easier to saturate hardware when we always have jobs in the parser buffer to keep all parser threads busy 99.99% of the time.
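
To make the capped buffer idea concrete, here is a minimal std-only sketch. It is not crusty-core's actual code: `ParseJob`, `count_links` and `spawn_parser_pool` are hypothetical names, and counting links merely stands in for real HTML parsing. The bounded `sync_channel` plays the role of the channel buffer: producers block on `send` once `buffer` jobs are queued, which is where back pressure would come from.

```rust
use std::sync::mpsc::{channel, sync_channel, Receiver, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical job type: the raw HTML of one fetched page.
pub struct ParseJob {
    pub html: String,
}

/// Stand-in for real HTML parsing (assumption): counts `<a ` occurrences.
fn count_links(html: &str) -> usize {
    html.matches("<a ").count()
}

/// Spawns `workers` parser threads fed by a bounded queue holding at most
/// `buffer` pending jobs. Returns the job sender, the result receiver and
/// the worker thread handles.
pub fn spawn_parser_pool(
    workers: usize,
    buffer: usize,
) -> (SyncSender<ParseJob>, Receiver<usize>, Vec<thread::JoinHandle<()>>) {
    let (job_tx, job_rx) = sync_channel::<ParseJob>(buffer);
    // std's Receiver is single-consumer, so workers share it behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = channel::<usize>();

    let handles: Vec<thread::JoinHandle<()>> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so parsing itself runs without holding the lock.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(job) => {
                        let _ = res_tx.send(count_links(&job.html));
                    }
                    // All job senders were dropped: shut this worker down.
                    Err(_) => break,
                }
            })
        })
        .collect();

    (job_tx, res_rx, handles)
}
```

A crawler thread would call `send` on the returned sender for every fetched page. Using `try_send` instead would let it keep crawling and decide for itself what to do when the buffer is full.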


let4be commented on August 25, 2024

We'll probably also need to put select parsing under a feature flag.


artemyarulin commented on August 25, 2024

Thanks for such a detailed response!

I wonder if https://github.com/servo/html5ever has problems like that, with parsing taking an unpredictable amount of time. It uses callbacks, so essentially every time one fires we can decide whether to continue or give the event loop a spin, to avoid blocking the thread for a long time.
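
A rough sketch of that "do a bit of work, then give the loop a spin" idea, using std only. This is not html5ever's API: `IncrementalTagCounter`, `Progress` and `run_cooperatively` are made-up names, and counting `<` bytes is just a placeholder for real parsing. The point is only that work is done in bounded steps, with a hook between steps where an async task would yield.

```rust
/// Result of one bounded unit of work.
#[derive(Debug, PartialEq)]
pub enum Progress {
    /// More input remains; the caller may yield before the next step.
    Pending,
    /// Input is exhausted; carries the final tag count.
    Done(usize),
}

/// Hypothetical resumable "parser": counts `<` bytes, looking at no more
/// than `budget` bytes per call to `step`.
pub struct IncrementalTagCounter<'a> {
    input: &'a [u8],
    pos: usize,
    tags: usize,
}

impl<'a> IncrementalTagCounter<'a> {
    pub fn new(input: &'a str) -> Self {
        Self { input: input.as_bytes(), pos: 0, tags: 0 }
    }

    /// Processes at most `budget` bytes (at least one, so progress is
    /// guaranteed) and reports whether the input is finished.
    pub fn step(&mut self, budget: usize) -> Progress {
        let end = (self.pos + budget.max(1)).min(self.input.len());
        self.tags += self.input[self.pos..end]
            .iter()
            .filter(|&&b| b == b'<')
            .count();
        self.pos = end;
        if self.pos == self.input.len() {
            Progress::Done(self.tags)
        } else {
            Progress::Pending
        }
    }
}

/// Drives the counter to completion, calling `between_steps` at every point
/// where an async task would yield back to its executor.
pub fn run_cooperatively(
    html: &str,
    budget: usize,
    mut between_steps: impl FnMut(),
) -> usize {
    let mut counter = IncrementalTagCounter::new(html);
    loop {
        match counter.step(budget) {
            Progress::Done(n) => return n,
            Progress::Pending => between_steps(),
        }
    }
}
```

With a 16-byte input and a budget of 5 bytes, `between_steps` runs three times before the final step finishes the input.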


let4be commented on August 25, 2024

Feel free to open a separate issue, it might be worth considering:

  • whether it's possible to abstract away threadpool vs. in-place parsing
  • whether it's worth switching to in-place parsing: do we waste any resources by sending tasks to a dedicated threadpool, and if so, how much? Does that warrant a switch to in-place parsing, or do we get more benefits from the threadpool (like possibly more even resource saturation under some configurations)?
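
One possible shape for the first point, as a std-only sketch rather than a proposal for the actual crusty-core API. `ParseExecutor`, `InPlace`, `Pooled`, `parse_html` and `crawl_step` are all hypothetical names, the pool has a single worker for brevity, and `Pooled::parse` blocks on the reply where async code would await it.

```rust
use std::sync::mpsc::{channel, sync_channel, Sender, SyncSender};
use std::thread;

/// Stand-in for real HTML parsing (assumption): counts `<` characters.
fn parse_html(html: &str) -> usize {
    html.matches('<').count()
}

/// Hypothetical abstraction over where parsing runs.
pub trait ParseExecutor {
    fn parse(&self, html: String) -> usize;
}

/// Parses on the calling thread.
pub struct InPlace;

impl ParseExecutor for InPlace {
    fn parse(&self, html: String) -> usize {
        parse_html(&html)
    }
}

/// Sends work to a dedicated parser thread through a bounded queue.
pub struct Pooled {
    jobs: SyncSender<(String, Sender<usize>)>,
}

impl Pooled {
    /// `buffer` caps how many jobs may wait in the queue.
    pub fn new(buffer: usize) -> Self {
        let (jobs, rx) = sync_channel::<(String, Sender<usize>)>(buffer);
        // The worker exits once `Pooled` (the only job sender) is dropped.
        thread::spawn(move || {
            for (html, reply) in rx {
                let _ = reply.send(parse_html(&html));
            }
        });
        Self { jobs }
    }
}

impl ParseExecutor for Pooled {
    fn parse(&self, html: String) -> usize {
        // Each job carries its own reply channel.
        let (reply_tx, reply_rx) = channel();
        self.jobs.send((html, reply_tx)).expect("parser thread is alive");
        reply_rx.recv().expect("parser thread replied")
    }
}

/// Crawler-side code depends only on the trait, so the strategy can be
/// swapped by configuration or a feature flag.
pub fn crawl_step(executor: &dyn ParseExecutor, html: String) -> usize {
    executor.parse(html)
}
```

Running the same workload through `crawl_step(&InPlace, ...)` and `crawl_step(&Pooled::new(n), ...)` would be one way to measure the cost asked about in the second point.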

