Comments (3)
Hello, @batu-archive! (and sorry for the wait).
Indeed, this page is a bit weird: navigation is fully JS-based, so enqueueLinks cannot do much here. Unfortunately, it also uses some more-complicated-than-usual logic for tab management, so enqueueLinksByClickingElements doesn't work out of the box either (although this method should work with opening popup tabs too).
I found one thing when playing with DevTools on the page: it loads all the (important) data with XHR requests from https://www.metacareers.com/graphql. With preNavigationHooks and page.route() you can do something like this:
import { PlaywrightCrawler } from 'crawlee';

// `router` is assumed to be defined elsewhere (e.g. with createPlaywrightRouter())
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    preNavigationHooks: [async ({ page, sendRequest, enqueueLinks, request: pageRequest }) => {
        // do this only on the first page
        if (pageRequest.url !== 'https://www.metacareers.com/jobs') return;

        // catch the XHR requests to the API endpoint
        await page.route('https://www.metacareers.com/graphql', async (route) => {
            const request = route.request();

            // check if we caught the correct API request (with all the job data)
            if (request.headers()['x-fb-friendly-name'] === 'CareersJobSearchResultsQuery') {
                // if so, we make the request again separately
                const data = await sendRequest({
                    url: request.url(),
                    method: request.method() as any,
                    headers: request.headers(),
                    body: request.postData() as any,
                }).then((x) => x.body);

                const openings = JSON.parse(data).data.job_search; // get the data

                // enqueue all of the urls of the job openings
                await enqueueLinks({
                    urls: openings.map((x) => new URL(x.id, 'https://www.metacareers.com/jobs/').toString()),
                });
            }

            await route.continue(); // resume the original API request
        });
    }],
    headless: false,
});
The cool thing is that you don't need pagination here, as all of the job data (all 1600+ positions) is loaded in the first and only API request.
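If it helps, the "parse the response and build the job URLs" step from the hook can be pulled out into a small pure function, which makes it easy to check against a saved response without launching a browser. This is only a sketch: the helper name jobUrlsFromResponse and the JobSearchResponse type are made up for illustration, and the response shape (data.job_search as an array of objects with an id) is assumed from the snippet above rather than taken from any documented API.

```typescript
// Response shape assumed from the snippet above, not from a documented API.
interface JobSearchResponse {
    data: { job_search: Array<{ id: string }> };
}

const JOBS_BASE_URL = 'https://www.metacareers.com/jobs/';

// Hypothetical helper: turns the raw API response body into absolute job URLs.
function jobUrlsFromResponse(body: string, baseUrl: string = JOBS_BASE_URL): string[] {
    const parsed = JSON.parse(body) as Partial<JobSearchResponse>;
    const openings = parsed.data?.job_search;

    // Fail loudly if the endpoint ever changes its response format.
    if (!Array.isArray(openings)) {
        throw new Error('Unexpected response shape: data.job_search is not an array');
    }

    // Resolve each job id against the base URL, as in the original hook.
    return openings.map((opening) => new URL(String(opening.id), baseUrl).toString());
}
```

Inside the hook, the enqueueLinks call would then take urls: jobUrlsFromResponse(data), and an unexpected response surfaces as a clear error instead of a TypeError on undefined.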
Another thing I noticed is that the separate job pages also load without JS, so you might get much better performance if you process these requests with the CheerioCrawler (I didn't test this though; it's possible that the server might block a request from a non-browser HTTP client).
We won't make any adjustments to the enqueueLinks methods just now, as this is a really specific case... but I'd love to hear more about your use case!
Did the code above help? I'll close this issue as wontfix now, but in case you have any additional questions, fire away!
Thanks!
from crawlee.
Looking at the source code, I found a way that should work, imho: enqueueLinksByClickingElements. However, I think this method listens for navigation in the current page context. In my case, clicking on a job ad tries to open a new tab, which is then closed within milliseconds for some reason.
Hi @barjin, thank you very much for the detailed answer ❤️💎🫶 and yes, it definitely helped, even though I was looking for "how to fish" rather than "getting fish ready" 😅