
The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration

zopieux commented on May 25, 2024

The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration


Comments (2)

zopieux commented on May 25, 2024

Uhh, I think I got it. On a whim I decided to monitor file accesses to the request_queues directory, and it turns out Crawlee is doing a full open/read/write of every single JSON file in there, as well as its .lock, including the ones that are already done. That's 447'994 × N file operations per cycle, and each cycle gets shorter and shorter as we reach deep pages with just a few new links discovered per cycle.

N is some internal number around 2, meaning it takes roughly a million file operations before Crawlee can do the next iteration at my scale. I moved request_queues to a ramfs but it's barely helping; it just helps with my SSD wear I guess :-)

I see that a V2 request queue is slated for a future release. Will this new request queue move the done requests to a less I/O-intensive place, e.g. an in-memory hash set of the uniqueKeys? Storing 400k keys in a hash set is peanuts. That would help tremendously with performance (and disk wear!).
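To make the idea concrete, here is a minimal sketch of keeping handled uniqueKeys in an in-memory set, so that deduplicating a newly discovered link never touches the disk. This is not Crawlee's API or its V2 design: `InMemoryDedupQueue`, `QueuedRequest` and all method names are hypothetical, chosen only for illustration.

```typescript
// Hypothetical sketch: none of these names are Crawlee APIs.
interface QueuedRequest {
  uniqueKey: string;
  url: string;
}

class InMemoryDedupQueue {
  // uniqueKeys of requests that are completely done.
  private readonly handled = new Set<string>();
  // uniqueKeys of requests that are pending or currently being processed.
  private readonly queuedKeys = new Set<string>();
  // Requests waiting to be fetched, in insertion order.
  private readonly pending: QueuedRequest[] = [];

  /** Enqueue a request unless its uniqueKey is already queued or handled. */
  add(request: QueuedRequest): boolean {
    if (this.handled.has(request.uniqueKey) || this.queuedKeys.has(request.uniqueKey)) {
      return false;
    }
    this.queuedKeys.add(request.uniqueKey);
    this.pending.push(request);
    return true;
  }

  /** Take the oldest pending request, or undefined when nothing is pending. */
  fetchNext(): QueuedRequest | undefined {
    return this.pending.shift();
  }

  /** Record a fetched request as done so it is never enqueued again. */
  markHandled(request: QueuedRequest): void {
    this.queuedKeys.delete(request.uniqueKey);
    this.handled.add(request.uniqueKey);
  }

  get pendingCount(): number {
    return this.pending.length;
  }

  get handledCount(): number {
    return this.handled.size;
  }
}
```

With this shape, the cost of a cycle depends on the number of pending requests and newly discovered links, not on the 450k requests that are already done: each `add` is a couple of set lookups.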


zopieux commented on May 25, 2024

I was hoping I could solve this with:

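import { MemoryStorage } from '@crawlee/memory-storage'
import { PlaywrightCrawler, RequestQueue } from 'crawlee'

// Keep persistence on, but skip writing the per-storage metadata files.
const memoryStorage = new MemoryStorage({ persistStorage: true, writeMetadata: false })
// Open the default request queue backed by this storage client.
const requestQueue = await RequestQueue.open(null, { storageClient: memoryStorage })
const crawler = new PlaywrightCrawler({
  requestQueue,
})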

but sadly persistStorage: true does not mean what I thought ("read all at init, save all at exit"): instead it continues to do a full scan of the persistence directory for each cycle.

But at this stage of my scrape (towards the very end) I obviously cannot afford to start from scratch and never persist the state: the list of already-scraped URLs is very important.
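For reference, the semantics I had in mind ("read all at init, save all at exit") would look roughly like the sketch below. It is not how Crawlee's MemoryStorage works: `loadHandledKeys`, `saveHandledKeys` and the single `handled.json` snapshot file are hypothetical, and the temporary directory is used only to make the example self-contained.

```typescript
import { existsSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

/** Hypothetical helper: read every handled uniqueKey from one snapshot file, once, at startup. */
function loadHandledKeys(snapshotPath: string): Set<string> {
  if (!existsSync(snapshotPath)) {
    return new Set<string>();
  }
  const keys: string[] = JSON.parse(readFileSync(snapshotPath, 'utf8'));
  return new Set<string>(keys);
}

/** Hypothetical helper: write the whole set back in one pass, for example on exit. */
function saveHandledKeys(snapshotPath: string, handled: Set<string>): void {
  const tmpPath = `${snapshotPath}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(Array.from(handled)));
  // Write to a temporary file first, then rename it over the old snapshot,
  // so a crash in the middle of the write does not truncate the previous one.
  renameSync(tmpPath, snapshotPath);
}

// Demonstration in a throwaway directory (an assumption of this sketch).
const snapshotDir = mkdtempSync(join(tmpdir(), 'handled-keys-'));
const snapshotPath = join(snapshotDir, 'handled.json');

const handledKeys = loadHandledKeys(snapshotPath); // empty set on the first run
handledKeys.add('https://example.com/page-1');
saveHandledKeys(snapshotPath, handledKeys);
```

That is two file operations per run for the handled set, regardless of how many requests are done, instead of one open/read/write per request per cycle.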
