Comments (2)
Uhh, I think I got it. On a whim I decided to monitor file accesses to the request_queues directory, and it turns out Crawlee is doing a full open/read/write of every single JSON file in there, as well as its .lock file, including requests that are already done. That's 447,994 × N file operations per cycle, and each cycle becomes shorter and shorter as we reach deep pages with just a few new links discovered per cycle. N is some internal number around 2, meaning at my scale it takes a good million file operations before Crawlee can start the next iteration. I moved request_queues to a ramfs, but it's barely helping; it just spares my SSD some wear, I guess :-)
I see that a V2 request queue is slated for a future release. Will this new request queue keep the handled requests in a less I/O-intensive place, e.g. an in-memory hash set of the uniqueKeys? Storing 400k keys in a hash set is peanuts, and it would help tremendously with performance (and disk wear!).
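For illustration, the in-memory deduplication suggested above could look roughly like this. This is a minimal sketch of the idea, not Crawlee's actual implementation; the class and method names are hypothetical:

```typescript
// Sketch: keep handled uniqueKeys in an in-memory Set so that
// "is this request already done?" is an O(1) hash lookup instead
// of a scan over hundreds of thousands of JSON files on disk.
class HandledRequests {
  private keys = new Set<string>();

  markHandled(uniqueKey: string): void {
    this.keys.add(uniqueKey);
  }

  isHandled(uniqueKey: string): boolean {
    return this.keys.has(uniqueKey);
  }

  get size(): number {
    return this.keys.size;
  }
}

const handled = new HandledRequests();
handled.markHandled('https://example.com/page-1');
console.log(handled.isHandled('https://example.com/page-1')); // true
console.log(handled.isHandled('https://example.com/page-2')); // false
```

Even at 400k+ keys, a Set like this costs tens of megabytes of RAM at most, versus millions of file operations per cycle.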
I was hoping I could solve this with:

```js
const memoryStorage = new MemoryStorage({ persistStorage: true, writeMetadata: false });
const requestQueue = await RequestQueue.open(null, { storageClient: memoryStorage });
const crawler = new PlaywrightCrawler({
    requestQueue,
});
```

but sadly `persistStorage: true` does not mean what I thought ("read everything at init, save everything at exit"): instead it still does a full scan of the persistence directory on each cycle.
But at this stage of my scrape (towards the very end) I obviously cannot afford either to start from scratch or to never persist the state: the list of already-scraped URLs is very important.
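The persistence model I expected ("read all at init, save all at exit") can be sketched like this. These are hypothetical helpers, not a Crawlee or MemoryStorage API; they just show one read at startup and one write at shutdown, with zero per-cycle I/O:

```typescript
import * as fs from 'node:fs';

// Load the full set of handled uniqueKeys once, at startup.
function loadHandledKeys(path: string): Set<string> {
  if (!fs.existsSync(path)) return new Set();
  return new Set(JSON.parse(fs.readFileSync(path, 'utf8')) as string[]);
}

// Write the full set back once, at shutdown (or on a coarse interval).
function saveHandledKeys(path: string, keys: Set<string>): void {
  fs.writeFileSync(path, JSON.stringify([...keys]));
}

// Usage: load at init, mutate in memory during the crawl, save at exit.
const statePath = 'handled-keys.json';
const keys = loadHandledKeys(statePath);
keys.add('https://example.com/page-1');
saveHandledKeys(statePath, keys);
console.log(loadHandledKeys(statePath).has('https://example.com/page-1')); // true
```

Crash safety could be recovered by also saving on a timer or on SIGINT, which still costs one bulk write per interval rather than one file operation per request per cycle.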