puppeteer-cluster's People

Contributors

apn-carmine, cd9, daniellevinson, dependabot-preview[bot], dependabot-support, greenkeeper[bot], honzamac, hugopoi, ilantc, jackmac92, mhaseebkhan, shannonmoeller, thomasdondorf

puppeteer-cluster's Issues

Is it possible to create task queue dynamically

I want to create a Node server for scraping with puppeteer (passing the search term in a GET request to scrape Google search results).

Currently my server cannot process more than 5 parallel requests; beyond that it runs out of memory.

Using in Jest context / Node.js version 6 support

I am interested in exploring using puppeteer cluster in a Jest test context.

I am not able to import or require - without getting an Unexpected identifier error on that line.

import Cluster from 'puppeteer-cluster';
// or
const { Cluster } = require('puppeteer-cluster');

Error:

static async launch(options) {
                     ^^^^^^
    SyntaxError: Unexpected identifier

Thanks...

Inter Process Communication

Hello There.
I have multiple instances of puppeteer that scrape data from some sites. After scraping, each instance uses process.send() to output the data so that it can be saved to a database. I would love to know whether it's possible to listen for the data/messages sent by each instance so that they can be saved to the DB, the same way we have the cluster.on('taskerror') event handler, and how to implement it. Regards.

Unable to return a variable from a queued function

Hi,

I am having a little trouble figuring out a way to return a variable from a queued function.

Given the sample function-queuing-complex.js example, I have tried using both return and resolve in extractTitle since I read from the README that cluster.queue returns a Promise. Both resulted in undefined being returned. A Promise.all doesn't seem to work either. Is this a bug or am I doing something wrong?

const extractTitle = async ({ page, data: url }) => {
    await page.goto(url);
    const pageTitle = await page.evaluate(() => document.title);

    // How do I return pageTitle to use outside this async function?
};
const task1 = await cluster.queue("https://reddit.com/", extractTitle);
const task2 = await cluster.queue("https://twitter.com/", extractTitle);
Promise.all([task1, task2]).then(result => console.log(result)); // returns undefined
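
One possible workaround that does not rely on the return value of cluster.queue is to let the task write into a collection owned by the caller. The following is only a sketch: makeTitleCollector and makeFakePage are hypothetical names, and the fake page stands in for a real puppeteer Page so the snippet runs without a browser.

```javascript
// Hypothetical helper: builds a task function that records each page title
// in a Map owned by the caller, instead of relying on a return value.
function makeTitleCollector(titles) {
  return async ({ page, data: url }) => {
    await page.goto(url);
    const pageTitle = await page.evaluate(() => document.title);
    titles.set(url, pageTitle); // readable outside the task afterwards
  };
}

// Stand-in for a puppeteer Page (assumption for this sketch only).
// A real page would run the evaluated function inside the browser.
function makeFakePage(title) {
  return {
    goto: async () => {},
    evaluate: async () => title,
  };
}

async function demo() {
  const titles = new Map();
  const extractTitle = makeTitleCollector(titles);
  await extractTitle({ page: makeFakePage('reddit'), data: 'https://reddit.com/' });
  await extractTitle({ page: makeFakePage('twitter'), data: 'https://twitter.com/' });
  return titles;
}
```

With a real cluster, the same task function would be passed as cluster.queue(url, makeTitleCollector(titles)), and the Map would be read after await cluster.idle().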

Extensions are not loading on any concurrency model

I'm trying to run puppeteer in a cluster using this library. When I try the following I get no errors, but the plugin itself doesn't load. The same arguments work perfectly with puppeteer directly.

Anyone have an idea why this is happening?

    cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
        monitor: false,
        puppeteerOptions: {
            headless: true,
            args: [
                '--no-sandbox',  
                '--disable-gpu',
                '--enable-usermedia-screen-capturing',
                '--allow-http-screen-capture',
                '--auto-select-desktop-capture-source=ppc',
                '--load-extension=' + __dirname+'/chrome-plugin',
                '--disable-extensions-except=' + __dirname+'/chrome-plugin',
                '--disable-infobars',
                '--window-size=1920,1080',
            ],
        }
    });

Add more events

Something like:

    cluster.on('monitor', (data) => {
        console.log(data);
    });

Program hang when maxConcurrency is set over 50

First of all, great project!
I tried the example with maxConcurrency set to 50 and to 100. With 100, the program hangs somewhere pretty much every time. With 50, it hangs sometimes. I'm not sure what causes this. Thanks for any input.

Improve error documentation or maybe even think about catching "stupid" errors

Currently the library does not catch asynchronously thrown errors. That means code like this can lead to errors:

page.on('dialog', async dialog => {
  await dialog.dismiss();
});

The correct way right now is to put a try/catch block around the code inside the function. This is a problem, as the library might still come to a stop when the code is badly written.

  • Option 1: Improve documentation regarding asynchronous errors.
  • Option 2: Use something like process.on('uncaughtException') and/or process.on('unhandledRejection') to handle all kinds of errors. This might interfere with bigger applications that already have this kind of handling built in.

Not sure which one is the way to go. Open for ideas and opinions.
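
Until one of the options is chosen, user code can guard its own handlers. Below is a minimal sketch of such a guard; safeHandler is a hypothetical helper (not part of the library), and a plain EventEmitter stands in for a puppeteer Page.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: wraps an async event handler so that a rejection is
// passed to onError instead of becoming an unhandled promise rejection.
function safeHandler(handler, onError) {
  return (...args) =>
    Promise.resolve()
      .then(() => handler(...args)) // also captures synchronous throws
      .catch(onError);
}

function demo() {
  const errors = [];
  const emitter = new EventEmitter(); // stands in for a puppeteer Page
  emitter.on('dialog', safeHandler(
    async (dialog) => { await dialog.dismiss(); },
    (err) => errors.push(err.message),
  ));

  // Simulate a dialog whose dismiss() rejects.
  const failingDialog = { dismiss: async () => { throw new Error('dismiss failed'); } };
  emitter.emit('dialog', failingDialog);

  // Resolve after pending promise callbacks have run.
  return new Promise((resolve) => setImmediate(() => resolve(errors)));
}
```

The trade-off is that every handler has to be wrapped explicitly, which is roughly what Option 1 would have to document.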

Timeout config not honored

Related code:

  const cluster = await Cluster.launch({
    puppeteerOptions: {
      headless: true,
      ignoreHTTPSErrors: env.IGNORE_HTTPS || false,
      args: ['--disable-http2'],
      timeout: env.PUPPETEER_TIMEOUT || 60000,            //attempt 2
    },
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: parseInt(env.MAX_WORKER) || 4,
    skipDuplicateUrls: false,
    monitor: env.MONITOR === 'true' || false,
    timeout: env.PUPPETEER_TIMEOUT || 60000,              //attempt 1
  });

I tried setting timeout both in the cluster launch options and in puppeteerOptions; neither worked. The log says the timeout was still 30000 ms.

app:cluster:err TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
  app:cluster:err     at Promise.then (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/FrameManager.js:1276:21)
  app:cluster:err   -- ASYNC --
  app:cluster:err     at Frame.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:144:27)
  app:cluster:err     at Page.goto (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/Page.js:624:49)
  app:cluster:err     at Page.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:145:23)
  app:cluster:err     at GenericHandler.processPage (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:47:21)
  app:cluster:err     at GenericHandler.process (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:94:16)
  app:cluster:err     at module.exports (/home/bambang/project/om-screenshoot/src/handlers/site.js:27:24)
  app:cluster:err     at Worker.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:56:54)
  app:cluster:err     at Generator.next (<anonymous>)
  app:cluster:err     at fulfilled (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:4:58)
  app:cluster:err     at process.internalTickCallback (internal/process/next_tick.js:77:7) +789ms

Any guidance on how to trace/ fix this issue?
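
The stack trace points at Page.goto, so the 30000 ms may be puppeteer's default navigation timeout rather than the cluster's task timeout. A sketch under that assumption follows: resolveTimeout and gotoWithTimeout are hypothetical helpers, and the only puppeteer feature relied on is the `timeout` option of page.goto. Note also that environment variables are strings, so they are parsed here before use.

```javascript
// Environment variables are strings; parse them before using them as numbers.
function resolveTimeout(raw, fallbackMs) {
  const parsed = Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

// Hypothetical task helper: applies the timeout to the navigation itself.
// `page` can be any object with a puppeteer-compatible goto method.
async function gotoWithTimeout(page, url, timeoutMs) {
  return page.goto(url, { timeout: timeoutMs });
}
```

Inside the task function this could be called as gotoWithTimeout(page, url, resolveTimeout(env.PUPPETEER_TIMEOUT, 60000)).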

An in-range update of debug is breaking the build 🚨

Version 3.2.0 of debug was just published.

Branch: Build failing 🚨
Dependency: debug
Current Version: 3.1.0
Type: dependency

This version is covered by your current version range and after updating it in your project the build failed.

debug is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).
  • coverage/coveralls: First build on greenkeeper/debug-3.2.0 at 74.478% (Details).

Release Notes 3.2.0

A long-awaited release to debug is available now: 3.2.0.

Due to the delay in release and the number of changes made (including bumping dependencies in order to mitigate vulnerabilities), it is highly recommended maintainers update to the latest package version and test thoroughly.


Credits

Huge thanks to @DanielRuf, @EirikBirkeland, @KyleStay, @Qix-, @abenhamdine, @alexey-pelykh, @DiegoRBaquero, @febbraro, @kwolfy, and @TooTallNate for their help!

Commits

The new version differs by 25 commits.

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Should queued task take care about closing the page?

My use case is the following: create a cluster with Cluster.CONCURRENCY_BROWSER and never close it.

const { connect } = require('amqplib');
const { Cluster } = require('puppeteer-cluster');
const { crawler, puppeteerOptions, redis } = require('./docroot');
const { Resource } = require('./docroot/Component');

(async ({ RABBITMQ_USER, RABBITMQ_PASS, RABBITMQ_HOST, RABBITMQ_PORT, RABBITMQ_QUEUE, RABBITMQ_THREADS, REDIS_LIST }) => {
  const cluster = await Cluster.launch({
    monitor: true,
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: Number(RABBITMQ_THREADS),
    puppeteerOptions,
  });

  const channel = await (await connect(`amqp://${RABBITMQ_USER}:${RABBITMQ_PASS}@${RABBITMQ_HOST}:${RABBITMQ_PORT}`)).createChannel();

  channel.assertQueue(RABBITMQ_QUEUE, {
    durable: false,
  });

  await cluster.task(async ({ data, page }) => {
    const { resource, message } = data;
    const metadata = await crawler.crawl(resource, page);

    await redis.rpush(REDIS_LIST, JSON.stringify(metadata));

    channel.ack(message);
  });

  channel.consume(RABBITMQ_QUEUE, message => {
    const content = JSON.parse(message.content.toString('utf8'));
    const resource = new Resource(content.resource);

    if (Array.isArray(content.links_to_check_for)) {
      resource.setLinks(content.links_to_check_for);
    }

    cluster.queue({ resource, message });
  });
})(process.env);

As you can see above, the cluster's queue gets filled whenever RabbitMQ sends something. This means the process is effectively a daemon and shouldn't be stopped. I'm wondering whether the pages that the cluster creates should be closed (await page.close() after const metadata = await crawler.crawl(resource, page);) once they are no longer needed, or whether that is done automatically.

Same URL Concurrency

This goes away from the traditional idea of "New browser per task" or "New page per task". This one is more about keeping a cluster of pages open the entire time and periodically refreshing them.

Why would I want to do this, you ask?

Let's say I have a page that has d3 charts and I want to turn all the charts into images (my actual product isn't d3 charts). If the charts update in real time and I want a screenshot every 5 minutes (assuming there are 100s of charts), opening a page / browser each time takes a while. If I just kept the tab open and kept screenshotting, then I'd have the screenshots a lot sooner.

Now for my more techy way: I'm exposing a function to the site I'm screenshotting, and that function retrieves arguments from puppeteer/chrome to render specific items on the page.

Pseudo-code

// browser
if (typeof window.getRenderOpts === 'function') {
    window.getRenderOpts().then((opts) => updateChart(opts));
}

// puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

async function getPageAndLock(): Promise<Page> {
  // .. gets a page that's idle, or waits until one becomes idle...
}

async function pageIsReady(page: Page): Promise<Page> {
  // ...
}

// request handler
... async (req, res) => {
    const page = await getPageAndLock();

    await page.evaluate(`render(${JSON.stringify({/* ... */})})`);

    const screenshot = await page.screenshot(/* ... */);

    pageIsReady(page);

    res.send(screenshot);
}

It's probably out of the scope of this library, but I'm not sure if anyone would be interested in this type of concurrency.

I did benchmarks of "New browser per task", "New page per task", "Same page per task", and keeping the page open and taking screenshots periodically is A LOT FASTER. I can get these benchmarks back if you want me to. This was when I was experimenting.
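
The "get an idle page or wait for one" part of the pseudo-code could look roughly like the pool below. This is only a sketch: IdlePool is a hypothetical name, and strings stand in for the long-lived puppeteer pages.

```javascript
// Minimal sketch of an idle-resource pool. `resources` would be puppeteer
// pages that were opened once and are kept alive between requests.
class IdlePool {
  constructor(resources) {
    this.idle = [...resources];
    this.waiters = [];
  }

  // Resolves with an idle resource, waiting if all of them are busy.
  acquire() {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop());
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Hands the resource to the longest-waiting caller, or marks it idle.
  release(resource) {
    const next = this.waiters.shift();
    if (next) next(resource);
    else this.idle.push(resource);
  }
}

async function demo() {
  const pool = new IdlePool(['page-1']);
  const order = [];
  const first = await pool.acquire();
  // The second caller has to wait until the first one releases the page.
  const pending = pool.acquire().then((p) => order.push(`second got ${p}`));
  order.push(`first got ${first}`);
  pool.release(first);
  await pending;
  return order;
}
```

In the request handler, acquire() would play the role of getPageAndLock() and release(page) the role of pageIsReady(page); calling release in a finally block would keep a failed screenshot from leaking the page.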

multiple crawl does not crawl all my urls.

When I run puppeteer-cluster with 100 URLs, it only crawls 98 or 99 of them.
Here is my code:

const { Cluster } = require('puppeteer-cluster');
var link = [];
var total = 0;
var start = 3;
const size = process.argv[2];
for (let i = 0; i < size; i++) {
  link.push(process.argv[start++]);
}

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 50,
    timeout: 400000,
    monitor: false
  });

  await cluster.task(async ({ page, data: url }) => {
    const response = await page.goto(url, { timeout: 100000, waitUntil: 'networkidle2' });
    console.log(response.url());

    if (response.status() == 404) {
      console.log('program encountered error');
      return;
    }
    total++; // counts the number of urls

    const hrefs = await page.evaluate(() => {
      const anchors = document.querySelectorAll('a');
      return [].map.call(anchors, a => a.href);
    });
  });

  for (let i = 0; i < size; i++) {
    await cluster.queue(link[i]);
  }

  await cluster.idle();
  await cluster.close();

  console.log(total);
  process.exit(0);
})();

Change license to MIT

This software seems really interesting and useful.

Do you have any plans on changing your open source license from the GNU General Public License 3.0 to something else, such as Apache License 2.0, BSD or MIT?

I'm asking since many individuals and organisations cannot use GPL-licensed software. Thanks.

Usage in an HTTP environment.

Can I use this in micro / express / etc and be able to have an endpoint process a "screenshot" task and return a value when the task completes?

Is this a thing?

CONCURRENCY_PAGE with headless: false hangs up and breaks

const { Cluster } = require('../dist');

(async () => {
    // Create a cluster with 2 workers
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
        puppeteerOptions: {headless: false}
    });

    // Define a task (in this case: screenshot of page)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);

        const path = url.replace(/[^a-zA-Z]/g, '_') + '.png';
        await page.screenshot({ path });
        console.log(`Screenshot of ${url} saved: ${path}`);
    });

    // Add some pages to queue
    await cluster.queue('https://www.google.com');
    await cluster.queue('https://www.wikipedia.org');
    await cluster.queue('https://github.com/');

    // Shutdown after everything is done
    await cluster.idle();
    await cluster.close();
})();

This only generated screenshots for wiki and github. Browser also hung for some time.

Long-term runs of puppeteer-cluster

I'm gonna document some puppeteer-cluster test runs, to see how the different concurrency types and options work together.

Feel free to add your own runs

Problem with headless: true

Hello there.
I'm testing puppeteer-cluster in a project I'm working on and I have problem in headless mode.
Because I cannot send the original code, I tried to reproduce the problem using the simple Queuing functions example. When headless is false it works like a charm. When set to true, nothing happens.
Am I missing something?

const { Cluster } = require('puppeteer-cluster');

(async () => {

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 3,
        puppeteerOptions: {
            headless: true
        },
        monitor: true
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('http://www.wikipedia.org');
        await page.screenshot({path: 'wikipedia.png'});
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('https://www.google.com/');
        const pageTitle = await page.evaluate(() => document.title);
        console.log('google');
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('https://www.imdb.com/');
        console.log('IMDB');
    });
    await cluster.idle();
    await cluster.close();
})();

puppeteer v1.9.0,
puppeteer-cluster v0.11.2

I will appreciate your help. Thank you

Crawler on demand instead of a queue

Hi guys,

I am trying to use express to wrap a small REST API around puppeteer, but as far as I can see the only way to add a new URL is to use the cluster queue. My concern is that if I make "parallel" requests I will receive the wrong answer, i.e. the content of another URL.

My question is: is it possible to run synchronous tasks?

Thanks, and sorry for my bad English.

about CONCURRENCY_PAGE

Here is my scenario: I have a number of URLs that I want to open in parallel. If several URLs are under the same domain, each of those pages should be opened a few seconds after the previous one, to avoid being blocked by the target site. How can I do that? The following settings do not seem to work:

concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 10,
retryLimit: 5, // retry 5 times on failure
retryDelay: 2000, // 2 seconds between retries
sameDomainDelay: 30*1000, // delay opening pages under the same domain by 10 seconds; does not seem to work
skipDuplicateUrls: true, // skip duplicate URLs
workerCreationDelay: 500, // delay when opening tabs

Will it be delayed for 20 seconds?

The code is as follows:
const { Cluster } = require('puppeteer-cluster');
(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    sameDomainDelay: 20*1000 // Will it be delayed for 20 seconds?
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  await cluster.queue('http://www.google.com/a.html');
  await cluster.queue('http://www.google.com/b.html');
  await cluster.queue('http://www.google.com/c.html');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

My question is: if a.html opens first, will b.html and c.html each be delayed by 20 seconds before opening?
I do not understand how sameDomainDelay is used.
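
To make the question concrete, here is a sketch of how a per-domain delay could behave. It is not the library's actual implementation; DomainThrottle and waitFor are hypothetical names, and timestamps are passed in explicitly so the behavior is easy to follow.

```javascript
// Hypothetical scheduler: tracks when each hostname was last started and
// reports how long a new job for that hostname should wait.
class DomainThrottle {
  constructor(delayMs) {
    this.delayMs = delayMs;
    this.lastStart = new Map(); // hostname -> timestamp in ms
  }

  // Returns the wait in ms (0 if the job may start at `now`). When the job
  // may start, `now` is recorded as the hostname's latest start time.
  waitFor(url, now) {
    const { hostname } = new URL(url);
    const last = this.lastStart.get(hostname);
    if (last !== undefined && now - last < this.delayMs) {
      return this.delayMs - (now - last);
    }
    this.lastStart.set(hostname, now);
    return 0;
  }
}
```

Under this model, with a 20 second delay, a.html starts immediately, b.html requested 5 seconds later would wait another 15 seconds, and a URL on a different hostname would not wait at all.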

Can you add an idle event

My requirement: a database table contains a lot of URLs that need to be visited one by one. I want to read them in batches so that memory use stays low. The program runs continuously and only reads from the database every once in a while, then processes what it read. Can you add an idle event for this loop, so that the database is only read when the cluster is idle? I wonder if this approach is feasible.
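
Without a dedicated event, a loop built around awaiting idleness might already cover this. The sketch below is an assumption about how that could be wired: processInBatches is a hypothetical helper, and fetchBatch, queueJob and waitUntilIdle are placeholders for a database read, cluster.queue and cluster.idle.

```javascript
// Hypothetical batch loop: reads the next batch only after the previous one
// has been fully processed.
async function processInBatches({ fetchBatch, queueJob, waitUntilIdle }) {
  let processed = 0;
  for (;;) {
    const batch = await fetchBatch();
    if (batch.length === 0) break; // nothing left to read
    for (const url of batch) {
      await queueJob(url);
    }
    await waitUntilIdle(); // only continue once the queue has drained
    processed += batch.length;
  }
  return processed;
}
```

With a real cluster this would be called with queueJob: (url) => cluster.queue(url) and waitUntilIdle: () => cluster.idle(); a long-running service would sleep and retry instead of breaking when a batch is empty.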

Use "puppeteer-core" instead of "puppeteer"

Is it possible to use "puppeteer-core" instead of "puppeteer" for the sake of not having to specify the environment variable to exclude a chrome download? I have to manually remove the chrome package from my distribution.

Does puppeteer-cluster have "worker_index" to work with Task Function?

I'm looking for a worker index: simply the number of the worker that is currently calling the task function.

This code below always captures a screenshot to the same file screen.png.

const puppeteer = require('puppeteer-core');

const {
    Cluster,
} = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
        puppeteer,
        puppeteerOptions: {
            executablePath: 'C:\\Users\\..\\AppData\\Local\\Google\\Chrome SxS\\Application\\chrome.exe',
        },
    });

    await cluster.task(async ({
        page,
        data: url,
    }) => {
        await page.goto(url);
        await page.screenshot({
            path: './screen.png',
        });
        // Store screenshot, do something else
    });

    await cluster.queue('http://www.google.com/');
    await cluster.queue('http://www.wikipedia.org/');
    // many more pages

    await cluster.idle();
    await cluster.close();
})();

I want something like:

await cluster.task(async ({
    page,
    data: url,
    wIndex,
}) => {
    await page.goto(url);
    await page.screenshot({
        path: `./screen_${wIndex}.png`,
    });
    // Store screenshot, do something else
});

where wIndex is the number of the current worker.

A simple solution for this example is to use the URL of the current job (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/minimal.js)

But what if every queued job uses the same URL?

P.S.: I would also like to launch puppeteer with different launch options for each worker.

Minor type checking improvement for cluster.queue method

First of all, thank you for your work!
I have a minor suggestion on improving type checking for the cluster.queue() method.
Now we have this:

public async queue(
        data: JobData | TaskFunction,
        taskFunction?: TaskFunction,
    ): Promise<void> {
...
}

As one can inspect, JobData is of type any and it is used both as a first argument to cluster.queue() method as well as the data property of the TaskFunctionArguments interface. This approach does not provide sufficient type checking when we call the cluster.queue() method with two arguments. I'd suggest to use generic types here like this:

type QueueFunction<T> = (arg: QueueFunctionArguments<T>) => Promise<void>;

interface QueueFunctionArguments<T> {
  page: puppeteer.Page;
  data: T;
  worker: {
    id: number;
  };
}

public async queue<T>(
    data: T | TaskFunction,
    taskFunction?: QueueFunction<T>,
): Promise<void> {
...
}

An in-range update of ts-jest is breaking the build 🚨

The devDependency ts-jest was updated from 23.1.4 to 23.10.0.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

ts-jest is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build failed (Details).
  • coverage/coveralls: First build on greenkeeper/ts-jest-23.10.0 at 0.0% (Details).

Release Notes for 23.10.0

ts-jest, reloaded!

  • lots of new features including full type-checking and internal cache (see changelog)
  • improved performances
  • Babel not required anymore
  • improved (and growing) documentation
  • a ts-jest Slack community where you can find some instant help
  • end-to-end isolated testing over multiple jest, typescript and babel versions

Commits

The new version differs by 293 commits.

  • 0e5ffed chore(release): 23.10.0
  • 3665609 Merge pull request #734 from huafu/appveyor-optimizations
  • 45d44d1 Merge branch 'master' into appveyor-optimizations
  • 76e2fe5 ci(appveyor): cache npm versions as well
  • 191c464 ci(appveyor): try to improve appveyor's config
  • 0f31b42 Merge pull request #733 from huafu/fix-test-snap
  • 661853a Merge branch 'master' into fix-test-snap
  • aa7458a Merge pull request #731 from kulshekhar/dependabot/npm_and_yarn/tslint-plugin-prettier-2.0.0
  • 70775f1 ci(lint): run lint scripts in series instead of parallel
  • a18e919 style(fix): exclude package.json from tslint rules
  • 011b580 test(config): stop using snapshots for pkg versions
  • 7e5a3a1 build(deps-dev): bump tslint-plugin-prettier from 1.3.0 to 2.0.0
  • fbe90a9 Merge pull request #730 from kulshekhar/dependabot/npm_and_yarn/@types/node-10.10.1
  • a88456e build(deps-dev): bump @types/node from 10.9.4 to 10.10.1
  • 54fd239 Merge pull request #729 from kulshekhar/dependabot/npm_and_yarn/prettier-1.14.3

There are 250 commits in total.

See the full diff

Roadmap for v1.0

I'm thinking about what kind of functionality this library should provide before it should be released as v1. I might edit the list in the future:

My goals:

  • (#25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)
  • (#9) Improve sameDomainDelay and skipDuplicateUrls. Detection of domains should use TLD.js for example. Documentation should be better. And there should be a way to provide the URL without using data or { url: ... }. (Not a goal for 1.0 anymore.)
  • (#28) Optimize the code, fix code smells
  • More tests, get code coverage up to > 90%
  • More documentation on the concurrency types. Maybe make CONCURRENCY_BROWSER the default as it is more robust?
  • More code snippets in the documentation page (for Cluster.queue for example)
  • Provide a cluster.execute function which executes the job with higher priority (does not queue it at the end) and returns a Promise which is resolved when the job is finished. Might also solve this confusion: #10 (comment)
  • Statistics API: How many jobs in queue, how many jobs processed, etc.
  • #41 Offer more functionality, maybe provide a way to use puppeteer-extra?
  • #36 Sandbox: Offer a way to run code from users in a sandbox, maybe even Docker? => This can now be implemented via custom concurrency implementations (although there are no custom implementations right now)
  • #70 Improve types

Maybe:

  • Provide a simple but robust data store with the library
  • Rename API: Some parts of API are rather unfortunate
    • concurrency should be concurrencyType
    • maxConcurrency maybe maxWorkers?
  • Provide queue function to the task function for a more functional syntax (so that you don't need to access cluster from inside the task)

Not planned (for now):

  • #8 (comment) Mixed concurrency models
    • Reason: It does not work well together with the idea of having a sandbox (which part of the browser/page/context stuff should be sandboxed then?)

Cannot find module ../dist

I'm trying to use puppeteer-cluster with the minimal.js example, on this setup:

  • Windows 7
  • node: v10.15.0
  • npm: v6.4.1

I'm getting the following error:

D:\Developpement\NodeJS\minimal>node minimal.js
internal/modules/cjs/loader.js:583
throw err;
^

Error: Cannot find module '../dist'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
at Function.Module._load (internal/modules/cjs/loader.js:507:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object.<anonymous> (D:\Developpement\NodeJS\minimal\minimal.js:1:83)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)

With my configuration the directory ../dist does not exist.
I have

24/01/2019 15:10 .
24/01/2019 15:10 ..
24/01/2019 15:10 minimal
24/01/2019 15:04 node_modules

If I replace const { Cluster } = require('../dist'); with const { Cluster } = require('puppeteer-cluster');, it works.

Is It possible to get browser version?

Hello,

Is it possible to get the version of the browser?
This is what I do:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    monitor: true,
    retryLimit: 0,
    timeout: 180000,
  });

// Define the task before queuing it
const main = async ({ page, data: url }) => {
    await page.goto(url);
    const results = await page.evaluate(async () => {
      debugger;
      let title = document.title;
      return title;
    }).then((data) => {
      console.log(data);
    });
  };

// page.goto needs a full URL including the protocol
await cluster.queue('https://www.example.com', main);

// Display browser version
// console.log(cluster.browser.version()) ?
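
One way this might be done from inside a task is through the page itself, since puppeteer exposes page.browser() and browser.version(). A minimal sketch follows; reportBrowserVersion and makeFakePage are hypothetical names, and the fake page only exists so the snippet runs without a browser.

```javascript
// Task function sketch: reads the browser version through the page.
// page.browser() and browser.version() are puppeteer methods.
async function reportBrowserVersion({ page }) {
  const version = await page.browser().version();
  console.log(`Browser version: ${version}`);
  return version;
}

// Stand-in for a puppeteer Page (assumption for this sketch only).
function makeFakePage(version) {
  return {
    browser: () => ({ version: async () => version }),
  };
}
```

With a real cluster, reportBrowserVersion could be queued like any other task function, for example as the second argument of cluster.queue.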

Usage with JEST tests in different files?

A common use-case would be to have many different tests spread out over multiple files.

This seems to be exactly what I need to speed up my tests, but I don't understand how to use it to run tests in different files in parallel.

Ex;
One test suite in home/tests/e2e/LoginPage.test.js
Another test suite in loan/tests/e2e/OverviewPage.test.js

I understand I could use it within the same test suite - but what about running different test suites in parallel?

why `await`?

Cool project, but I am confused about why you use await in your example:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });
  
  // Is `await` necessary?
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  await cluster.queue('http://www.google.com/');
  await cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

When you define a task or add items to the queue, why await? I tried removing the awaits and it still works.

Use TLD.js for sameDomainDelay

So far the domain extraction just takes the hostname from Node.js which includes subdomains.

Should be using TLD.js to make it work with normal top level domains and also for *.co.uk.
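
To illustrate the limitation: Node's URL parser returns the full hostname, so two subdomains of the same site are treated as different domains. The helper name hostnameOf and the example URLs are made up for this sketch.

```javascript
// Node's URL parser returns the full hostname, subdomains included.
function hostnameOf(url) {
  return new URL(url).hostname;
}

// Both URLs belong to example.co.uk, but the hostnames differ, so a delay
// keyed on hostname would not apply between them.
const first = hostnameOf('http://a.example.co.uk/page');
const second = hostnameOf('http://b.example.co.uk/page');
console.log(first === second); // false
```

A public-suffix-aware lookup is needed to reduce both hostnames to example.co.uk, which is the part TLD.js would provide.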

Limit number of tasks per browser instance

Is there any way to limit the number of tasks run per browser instance? I'm thinking of something along the lines of tasksPerInstance: 1000: the cluster would track the number of tasks run in a specific browser instance and, whenever that limit is reached, kill that instance and launch another, as a (potential) shield against browser memory growth. It's a technique I've seen used in other process pooling models (I think some of the Apache web server modules let you specify a maximum number of requests a worker process will serve before it is terminated and replaced with a fresh process).
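
The idea could be sketched outside the library roughly as follows. RecyclingInstance is a hypothetical wrapper, and create and destroy stand in for launching and closing a browser; this is not an existing puppeteer-cluster option.

```javascript
// Hypothetical wrapper: hands out one instance and replaces it after
// `maxUses` tasks.
class RecyclingInstance {
  constructor({ create, destroy, maxUses }) {
    this.create = create;
    this.destroy = destroy;
    this.maxUses = maxUses;
    this.instance = null;
    this.uses = 0;
  }

  // Returns the current instance, recycling it first if the limit was hit.
  async get() {
    if (this.instance !== null && this.uses >= this.maxUses) {
      await this.destroy(this.instance);
      this.instance = null;
    }
    if (this.instance === null) {
      this.instance = await this.create();
      this.uses = 0;
    }
    this.uses += 1;
    return this.instance;
  }
}
```

This sketch assumes tasks run one at a time per instance; a version for concurrent tasks would also have to wait for in-flight tasks before destroying the old instance.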

Browser closes during debugging

Hello,

I have a few questions; I'm not sure whether I should have created multiple issues.

Question: With the example code below, the browser window closes suddenly while I am debugging, so I cannot finish stepping through my code. Am I missing a config option?

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    monitor: true,
    retryLimit: 0,
    puppeteerOptions: {
      headless: false,
      devtools: true,
      defaultViewport: {
        width: 1920,
        height: 1080
      }
    }
  });

// Define the task before queuing it
const main = async ({ page, data: url }) => {
    await page.goto(url);
    const results = await page.evaluate(async () => {
      debugger;
      let title = document.title;
      return title;
    }).then((data) => {
      console.log(data);
    });
  };

// page.goto needs a full URL including the protocol
await cluster.queue('https://www.example.com', main);

thanks

Tests might silently fail

A failing expect call will not lead to an error if it gets caught. See jestjs/jest#3917 for discussion. Currently this might lead to failing tests that are not reported, because the generous error handling catches them.

Three options:

  1. Rename taskerror to error which will make sure that Node.js crashes in that case. Users will have to take care of the error handler then.
  2. Enable an option throwOnTaskerror so that task errors will not get caught
  3. Just take care of it in the tests
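
Option 1 relies on standard Node.js EventEmitter behavior: emitting 'error' with no listener attached throws, while any other event name is silently ignored. A small sketch of that difference, using a hypothetical helper emitThrowsWithoutListener:

```javascript
const { EventEmitter } = require('events');

// Returns true if emitting `eventName` without any listener throws.
function emitThrowsWithoutListener(eventName) {
  const emitter = new EventEmitter();
  try {
    emitter.emit(eventName, new Error('task failed'));
    return false;
  } catch (err) {
    return true;
  }
}

console.log(emitThrowsWithoutListener('taskerror')); // false
console.log(emitThrowsWithoutListener('error'));     // true
```

This is why a rename would surface failures by default, at the cost of requiring every user to register an error handler.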

importing / requiring Cluster

Hi,

thanks for this awesome library :)

Unfortunately, I do not seem to get it to work, as none of the importing / requiring mechanisms seem to work:

const { Cluster } = require('puppeteer-cluster'); -> Cluster = undefined
import { Cluster } from 'puppeteer-cluster'; -> Cluster = undefined
import Cluster from 'puppeteer-cluster'; -> Cluster = {}

I'm on Node v8.11.4

What am I doing wrong?
