Comments (10)
Perhaps we can set the default value of availableMemoryRatio in the Apify SDK's Configuration (here)?
We might use the APIFY_IS_AT_HOME envvar to switch between the original default and the new default ratio (~0.9?).
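A minimal sketch of that switch, assuming the default would be picked from the environment. The helper name `defaultAvailableMemoryRatio` is hypothetical, and the `0.9` value is only the tentative figure floated above; `APIFY_IS_AT_HOME` and the `0.25` default are the only names and values taken from this thread.

```typescript
// Hypothetical helper: not part of crawlee or the Apify SDK.
const ORIGINAL_DEFAULT_RATIO = 0.25; // current default, as reported in this thread
const PLATFORM_DEFAULT_RATIO = 0.9; // tentative value suggested above, not a decided number

function defaultAvailableMemoryRatio(env: Record<string, string | undefined>): number {
    // Assumption: on the platform APIFY_IS_AT_HOME holds a truthy string such as "1".
    const raw = env.APIFY_IS_AT_HOME;
    const isAtHome = raw !== undefined && raw !== '' && raw !== '0' && raw !== 'false';
    return isAtHome ? PLATFORM_DEFAULT_RATIO : ORIGINAL_DEFAULT_RATIO;
}

const localRatio = defaultAvailableMemoryRatio({});
const platformRatio = defaultAvailableMemoryRatio({ APIFY_IS_AT_HOME: '1' });
```

An explicit user setting (env var or Configuration option) would presumably still need to take precedence over whichever default is chosen.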
from crawlee.
This seems to be caused by the following snippet (line 178):
crawlee/packages/core/src/autoscaling/snapshotter.ts
Lines 176 to 180 in 6f2e6b0
The availableMemoryRatio is by default 0.25, which checks out with our observations. This is probably alright for non-Apify users, but a bit dumb for Actors on Apify, where we should utilize all the available resources.
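To make the effect concrete, here is a rough sketch of the arithmetic. It assumes the memory budget is simply the total memory multiplied by the ratio; the referenced snippet is not reproduced in this thread, so treat that as an assumption rather than the actual implementation. `memoryBudgetMbytes` is a hypothetical helper.

```typescript
// Hypothetical helper illustrating the assumed relationship:
// budget = total memory * availableMemoryRatio.
function memoryBudgetMbytes(totalMbytes: number, availableMemoryRatio: number): number {
    return Math.ceil(totalMbytes * availableMemoryRatio);
}

// For an Actor run with 4096 MB of memory:
const withDefaultRatio = memoryBudgetMbytes(4096, 0.25); // 1024
const withFullRatio = memoryBudgetMbytes(4096, 1); // 4096
```

Under this assumption, a run with the default ratio would scale as if it had only a quarter of its memory.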
In such cases, this can be remedied by overriding the defaults with the CRAWLEE_AVAILABLE_MEMORY_RATIO envvar or by passing a customized Configuration instance to the crawler:
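import { Configuration, PlaywrightCrawler } from 'crawlee';

// Use all of the available memory instead of the default ratio.
new PlaywrightCrawler(
    {},
    new Configuration({
        availableMemoryRatio: 1,
    }),
);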
from crawlee.
Humph, it'd make sense to me if Actor.init set the ratio to 1. Or can we set the default value of the env var for all runs on the platform without forcing everybody to update dependencies?
from crawlee.
I'd say let's set it at the base image level, with cheerio and normal node having a higher ratio than browser images, but what do you think would be better?
from crawlee.
I'd say let's set it at the base image level, with cheerio and normal node having a higher ratio than browser images, but what do you think would be better?
I'm probably missing important info here - if I start a new crawlee project, I get a Dockerfile based on one of the base images, correct?
If I change the crawler type in my code (perfectly legit thing IMO), won't configuration done in the base image just stick? That seems hard to track down...
from crawlee.
If I change the crawler type in my code (perfectly legit thing IMO), won't configuration done in the base image just stick? That seems hard to track down..
This is true, but you should also update the image in that case... I guess this is a rough thing to fix... Maybe we can find a middle ground? Expose an env variable from the base images that specifies the image type, and Actor.init decides on the default ratio based on it?
Or maybe I'm just high and there's a better solution! I'm just throwing ideas here :D
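Roughly, that idea could look like the sketch below. Everything in it is hypothetical: the `APIFY_BASE_IMAGE_TYPE` variable, its values, the helper name, and the concrete ratios are made up for illustration and do not exist in the base images or the SDK.

```typescript
// Hypothetical sketch: the env var name, its values and the ratios are illustrative only.
function resolveDefaultRatio(
    env: Record<string, string | undefined>,
    explicitRatio?: number,
): number {
    // A ratio set explicitly by the user always wins over image-based defaults.
    if (explicitRatio !== undefined) return explicitRatio;

    switch (env.APIFY_BASE_IMAGE_TYPE) {
        case 'node': // plain Node.js / cheerio images: no browser competing for memory
            return 1;
        case 'browser': // browser images: leave some headroom (illustrative value)
            return 0.75;
        default: // unknown image: fall back to the current default
            return 0.25;
    }
}

const nodeImageRatio = resolveDefaultRatio({ APIFY_BASE_IMAGE_TYPE: 'node' });
const browserImageRatio = resolveDefaultRatio({ APIFY_BASE_IMAGE_TYPE: 'browser' });
const overriddenRatio = resolveDefaultRatio({ APIFY_BASE_IMAGE_TYPE: 'browser' }, 0.5);
```

This still depends on the image matching the crawler type, which is the concern raised above about switching crawlers without switching images.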
from crawlee.
I realized I haven't commented on this anywhere, we only discussed this with Jindra on Thursday. So here is the thing: we already set this value to 1 on the platform, and it worked just fine until recently. It's done in the SDK in Actor.init here:
https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L203
What I think might have happened is that a wrong config is resolved via AsyncLocalStorage (as by default all places use the global config which resolves to a scoped one via ALS). If that's the case, it could be caused by #2371.
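As a toy illustration of that mechanism (not crawlee's actual code), the sketch below shows how a lookup that goes through AsyncLocalStorage can return a different config than the global one that was modified. The `Config` shape and the names `globalConfig`, `scopedConfigStorage` and `resolveConfig` are all hypothetical.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Hypothetical, simplified config shape.
interface Config {
    availableMemoryRatio: number;
}

// The "global" config, e.g. the one an init step has set to 1.
const globalConfig: Config = { availableMemoryRatio: 1 };

// Storage for a scoped config that shadows the global one inside run().
const scopedConfigStorage = new AsyncLocalStorage<Config>();

// Lookups prefer the scoped config when one is active, else fall back to the global one.
function resolveConfig(): Config {
    return scopedConfigStorage.getStore() ?? globalConfig;
}

// Outside any scope, the global value is visible.
const outsideRatio = resolveConfig().availableMemoryRatio;

// Inside a scope holding a fresh config with defaults, the override is not visible.
const insideRatio = scopedConfigStorage.run(
    { availableMemoryRatio: 0.25 },
    () => resolveConfig().availableMemoryRatio,
);
```

If code ends up running inside an unexpected scope like this, it would read the default ratio even though the global config was set to 1, which would match the symptom described here.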
from crawlee.
What I think might have happened is that a wrong config is resolved via AsyncLocalStorage (as by default all places use the global config which resolves to a scoped one via ALS). If that's the case, it could be caused by #2371.
Could you elaborate how? That ALS is not even in place when you're not working with AdaptivePlaywrightCrawler. Or is this just a hunch that two supposedly independent instances of AsyncLocalStorage may interfere in weird ways?
from crawlee.
Yes, it's a hunch, based on years of experience working with ALS, seeing all the weird edge cases myself (been using it before it became stable).
What I am sure about:
- we were setting the ratio to 1 (inside Actor.init) since inception - it was working just fine for a very long time (since the initial 3.0 release)
- the config resolution depends on ALS
- only recently we started adding more ALS usage
It could be as well about some other refactoring, but that particular PR sounds like the ideal first candidate to check.
I haven't tried to reproduce this yet, so I'm not sure whether it always surfaces or whether it's just a fluke. If it happens the same way every time, I would first try to revert that PR via patch-package to see if it helps.
Next time, let's please at least add a link to the Slack discussions to the OP for more context.
from crawlee.