Comments (22)
Is there any way you could send us a minimum reproduction sample we can use to debug this further? 🙏
from crawlee.
Sure.
// For more information, see https://crawlee.dev/
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { firefox } from 'playwright';
import { router } from './routes.js';
import cookies from './cookies.json' assert { type: "json" };
import storage from './storage.json' assert { type: "json" };

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            const { page } = crawlingContext;
            // Load the saved cookies into the browser context.
            await page.context().addCookies(cookies.map(cookie => ({ ...cookie, sameSite: 'None' })));
            {
                // Copy the saved key/value pairs into the page's localStorage.
                const data = storage;
                const code = items =>
                    Object
                        .entries(items)
                        .forEach(([key, value]) => localStorage[key] = value);
                await page.evaluate(code, data).catch(error => console.warn(error.message));
            }
        },
    ],
    postNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            await crawlingContext.closeCookieModals();
        },
    ],
    headless: false,
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: router,
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
    requestHandlerTimeoutSecs: 5 * 60,
    useSessionPool: true,
    persistCookiesPerSession: true,
    launchContext: {
        launcher: firefox,
        useChrome: true,
        useIncognitoPages: false,
    },
});

await crawler.run(startUrls);
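The sample forces every cookie to `sameSite: 'None'`. As a rough sketch of a gentler alternative, the helper below maps the `sameSite` value of each cookie onto the three spellings Playwright's `addCookies` accepts (`'Strict'`, `'Lax'`, `'None'`) and drops anything else so the browser default applies. The name `normalizeCookies` and the `no_restriction` spelling are assumptions for illustration; the actual contents of `cookies.json` are not shown in the thread.

```javascript
// Hypothetical lookup table: lower-cased input spelling -> Playwright spelling.
// 'no_restriction' is an assumed export-tool spelling, not taken from the thread.
const SAME_SITE = new Map([
    ['strict', 'Strict'],
    ['lax', 'Lax'],
    ['none', 'None'],
    ['no_restriction', 'None'],
]);

// Returns new cookie objects; the input array is left untouched.
function normalizeCookies(cookies) {
    return cookies.map((cookie) => {
        const { sameSite: original, ...rest } = cookie;
        const mapped = SAME_SITE.get(String(original ?? '').toLowerCase());
        // Keep sameSite only when it maps to a recognized value.
        return mapped ? { ...rest, sameSite: mapped } : rest;
    });
}
```

In the hook it would replace the inline `map`, for example `await page.context().addCookies(normalizeCookies(cookies));`.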
Maybe something smaller? Or, if it's easier for you, a GitHub repository we can clone? Either way, we'll take a look.
Great - thanks. It's not really dependent on any specific code. The issue is simply that crawlers apparently use incognito contexts with cookies and other site data features disabled. The question is how to change that.
FWIW, `useIncognitoPages` defaults to false, so I doubt your problem is about that.
Indeed. Desperate attempt to induce some change in behavior.
Don't you need to use `page.evaluate` here to actually execute the code in the browser? That is where the `localStorage` object lives.
Yes, there is `page.evaluate`. The problem is that it's evaluating within an incognito context with access to these window props disabled.
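For readers following along, here is a minimal sketch of a function that could be handed to `page.evaluate` to write key/value pairs into `localStorage`. It assumes the page has already navigated to a real origin when it runs (for example from a `postNavigationHooks` entry), which the thread does not confirm as the fix. The name `seedLocalStorage` and the optional `storage` parameter are illustrative only; the parameter exists so the function can be exercised outside a browser.

```javascript
// Runs inside the page when passed to page.evaluate(fn, arg).
// `storage` defaults to the page's own localStorage; passing a stand-in
// object is only meant for exercising the function outside a browser.
function seedLocalStorage(items, storage = globalThis.localStorage) {
    const written = [];
    for (const [key, value] of Object.entries(items)) {
        // Web Storage holds strings, so coerce explicitly.
        storage.setItem(key, String(value));
        written.push(key);
    }
    return written;
}

// Assumed usage inside the crawler options (not taken from the thread):
// postNavigationHooks: [
//     async ({ page }) => {
//         const keys = await page.evaluate(seedLocalStorage, storage);
//         console.log(`Seeded ${keys.length} localStorage entries`);
//     },
// ],
```

Returning the written keys makes it easy to log from the Node side whether anything was actually stored, instead of swallowing a failure in `.catch`.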
And another suspicious thing, why would you use `useChrome: true` with firefox?
launcher: firefox,
useChrome: true,
Maybe firefox opens in incognito context by default, not sure about that. Does it work when you actually use chrome?
Have tried shuffling various different properties around so there might be some leftovers but pretty sure that has no effect here.
Doesn't matter which browser is used. The problem seems to be at a higher level - `browserPool` most likely.
Well, I am more than sure that what you say is not true with chrome, we would be well aware if we open in incognito by default, as that hurts performance badly (~50% overhead).
Negative on chrome. Same thing across all browsers.
Not sure what's wrong. The setup seems pretty standard. Caught me by surprise to find out about the above.
Didn't see any reason for it other than some flag the library is using on startup, since it's happening with any browser. Should be quite straightforward to reproduce by visiting chrome://settings/content/cookies
Standard Playwright project setup produced by npx crawlee create
I don't see how the settings page is connected to this. When I open settings locally in incognito mode, it opens them in the normal window.
Most likely, that is what's causing the problem with access to local storage as described in https://www.chromium.org/for-testers/bug-reporting-guidelines/uncaught-securityerror-failed-to-read-the-localstorage-property-from-window-access-is-denied-for-this-document
I don't see any other reason why all browsers should have this setting enabled unless the library is forcing that behavior, through a launch flag...
I can confirm that in a local browser, the settings open in a non-incognito window for me as well. However, I'm not entirely sure that an incognito window and an incognito context in Playwright are the same thing and can be compared in such a way.
I seem to have misunderstood the third-party cookie settings page by glancing over it. This option seems to be enabled always, since it's about third-party cookies. The question then is about the possibility to change this setting in order to avoid the error. Since this no longer seems to be related to the library, I'll need to dig deeper and find out if this setting can be changed through flags.
If this setting is checked, third-party scripts cookies are disallowed and access to localStorage may result in thrown SecurityError exceptions.
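To tell a blocked `localStorage` apart from other failures, a small probe can be evaluated in the page and its result inspected on the Node side. This is only a sketch: the name `probeStorage` and its `getStorage` parameter are hypothetical, and it does not change any browser setting, it just reports whether access throws.

```javascript
// Tries a write/remove round trip and reports the outcome instead of throwing.
// `getStorage` is a parameter only so the probe can be exercised outside a
// browser; in a page the default reads the real localStorage.
function probeStorage(getStorage = () => globalThis.localStorage) {
    const probeKey = '__storage_probe__';
    try {
        const storage = getStorage();
        storage.setItem(probeKey, '1');
        storage.removeItem(probeKey);
        return { ok: true, error: null };
    } catch (error) {
        // A denied access would surface here, e.g. as a SecurityError.
        return { ok: false, error: `${error.name}: ${error.message}` };
    }
}

// Assumed usage from a hook or request handler:
// const result = await page.evaluate(probeStorage);
// if (!result.ok) console.warn(`localStorage unavailable: ${result.error}`);
```

Because the result is a plain object, it serializes cleanly back from the page and can be logged alongside the URL that was being crawled.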