Comments (3)
→ We can crawl using HTTP by default, and if we try selecting an element that is not rendered or run a JS command (like scrolling), an error will be raised and the crawler will retry with JS rendering next, is that correct? This would need tweaking the detectionRatio to always run HTTP by default, right?
Well, almost, just a few clarifications:
- The crawler starts crawling with a browser and tries to switch to HTTP occasionally. It then rechecks whether the results still match every so often, which can result in it reverting to browser crawling. The frequency of these rechecks can be adjusted with detectionRatio.
- Scrolling is currently not possible in the adaptive crawler, but we're considering it. If we implement it, it will behave as you say: it will fail in HTTP mode and the crawler will retry it in a browser.
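The recheck behavior described above can be pictured as a simple sampling decision. The following is only an illustrative model, assuming a detection ratio interpreted as a recheck probability; the names shouldRecheck and rng are invented here and are not crawlee internals:

```typescript
// Illustrative model of how a detection ratio could drive periodic
// browser-based rechecks. Not the actual crawlee implementation.
function shouldRecheck(detectionRatio: number, rng: () => number): boolean {
    // With probability `detectionRatio`, re-run the request in a browser
    // to verify that plain-HTTP results still match.
    return rng() < detectionRatio;
}

// A ratio of 0 never rechecks; a ratio of 1 always does.
console.log(shouldRecheck(0, Math.random)); // false
console.log(shouldRecheck(1, () => 0.5));   // true
```

Raising the ratio makes the crawler more cautious (more browser rechecks); lowering it keeps it on cheap HTTP crawls longer.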
→ In terms of API design I imagine combining a default, crawler-level rendering parameter, and then tying the rendering type to the handler that will handle the request could be an easy solution, something like:

const crawler = new AdaptivePlaywrightCrawler({
    renderJs: true,
    // renderingTypeDetectionRatio: 0.1,
    requestHandler: router,
});

await enqueueLinks({
    selector: "<selector>",
    label: "<handler name>",
    renderJs: false,
});
Well, specifying a crawler-wide rendering type default doesn't seem useful to me. If your results can be extracted with plain HTTP, the crawler will detect that soon enough, and the few initial browser crawls should not present a problem.
We may consider something like the renderJs flag in the future though, that would make sense.
→ I don't really get this rendering type hint, I'd rather be in full control, but maybe my use case is too specific.
Yeah, I believe we mean the same thing - by passing the hint, you'd basically enforce the rendering type for a particular request.
from crawlee.
Hello, and thank you for trying out AdaptivePlaywrightCrawler! We'll be happy to hear any further feedback you might have in the future.
Regarding manually specifying the rendering type, we were considering adding it, but we felt it would be better not to offer too many options at first. Furthermore, the general idea is that the crawler should do this automatically. We are open to discussing this though; we might end up with something cool in the end 🙂
When considering this initially, our idea of the solution was another callback in the crawler options that would return a rendering type hint for a Request object. What do you think about that?
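A hypothetical shape for such a callback could look like the sketch below. The option name renderingTypeHint, the RequestLike interface, and the label values are all invented for illustration; none of this is an actual crawlee API:

```typescript
// Sketch of a per-request rendering type hint callback (hypothetical API).
interface RequestLike {
    url: string;
    label?: string;
}

// `undefined` means "let the adaptive crawler decide on its own".
type RenderingHint = "http" | "browser" | undefined;

const renderingTypeHint = (request: RequestLike): RenderingHint => {
    // Force browser rendering for catalog pages, plain HTTP for product
    // details, and leave everything else to automatic detection.
    if (request.label === "CATALOG") return "browser";
    if (request.label === "DETAIL") return "http";
    return undefined;
};

console.log(renderingTypeHint({ url: "https://example.com/p/1", label: "DETAIL" })); // "http"
```

The appeal of a callback over a per-request flag is that the hint logic stays in one place and can fall through to automatic detection by returning undefined.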
→ use HTTP crawling by default, but if the request is blocked (for example, finding the word 'captcha' in the loaded URL), switch to JS rendering and try to unblock the page
This is, to an extent, done automatically: if an HTTP crawl throws an exception, it is retried in a browser.
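That fallback can be modeled roughly as follows. This is a sketch only; crawlHttp and crawlBrowser are stand-in functions, not real crawlee calls:

```typescript
// Illustrative fallback: try a cheap HTTP crawl first and retry in a
// browser when it throws (e.g. a blocked or JS-dependent page).
type Crawl = (url: string) => Promise<string>;

async function crawlWithFallback(
    url: string,
    crawlHttp: Crawl,
    crawlBrowser: Crawl,
): Promise<string> {
    try {
        return await crawlHttp(url);
    } catch {
        // The HTTP attempt failed (blocked, missing content, ...):
        // retry the same request in a browser.
        return await crawlBrowser(url);
    }
}

// Example: the HTTP crawl detects a captcha and throws, so we fall back.
const http: Crawl = async () => { throw new Error("captcha"); };
const browser: Crawl = async () => "<html>real content</html>";
crawlWithFallback("https://example.com", http, browser).then(console.log);
// → "<html>real content</html>"
```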
→ navigate catalog pages of an e-commerce website with JS, but then extract data from product pages with HTTP only (product pages represent >90% of the pages of such a website, hence the motivation to use HTTP only)
Since the data used for rendering type prediction is strictly categorized by request label, this should also work out of the box if you use different labels for product listings and details. It is true that this should be documented before we consider the feature stable.
→ many more
Please elaborate if you like 🙂
Thanks for your detailed answer, I think I understand better now:
→ We can crawl using HTTP by default, and if we try selecting an element that is not rendered or run a JS command (like scrolling), an error will be raised and the crawler will retry with JS rendering next, is that correct? This would need tweaking the detectionRatio to always run HTTP by default, right?
In terms of API design I imagine combining a default, crawler-level rendering parameter, and then tying the rendering type to the handler that will handle the request could be an easy solution, something like:
const crawler = new AdaptivePlaywrightCrawler({
    renderJs: true,
    // renderingTypeDetectionRatio: 0.1,
    requestHandler: router,
});

await enqueueLinks({
    selector: "<selector>",
    label: "<handler name>",
    renderJs: false,
});
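The label-to-handler pairing this proposal relies on can be modeled with a minimal self-contained router. The MiniRouter class below is an invented stand-in that mirrors the dispatch idea of a crawlee-style router, not crawlee's actual Router:

```typescript
// Minimal model of label-based request routing: each label maps to a
// handler, with a default handler for unlabeled requests.
type Handler = (url: string) => string;

class MiniRouter {
    private handlers = new Map<string, Handler>();
    private fallback: Handler = (url) => `unhandled: ${url}`;

    addHandler(label: string, handler: Handler): void {
        this.handlers.set(label, handler);
    }

    addDefaultHandler(handler: Handler): void {
        this.fallback = handler;
    }

    route(label: string | undefined, url: string): string {
        const handler = (label && this.handlers.get(label)) || this.fallback;
        return handler(url);
    }
}

const miniRouter = new MiniRouter();
miniRouter.addHandler("PRODUCT", (url) => `extract product from ${url}`);
miniRouter.addDefaultHandler((url) => `enqueue links on ${url}`);
console.log(miniRouter.route("PRODUCT", "https://shop.example/p/1"));
// → "extract product from https://shop.example/p/1"
```

Attaching a rendering preference to the label, as the snippet above proposes, would then piggyback on exactly this dispatch point.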
→ When considering this initially, our idea of the solution was another callback in the crawler options that would return a rendering type hint for a Request object. What do you think about that?
I don't really get this rendering type hint, I'd rather be in full control, but maybe my use case is too specific.