
Comments (2)

danielnieto commented on May 21, 2024

I think this might have something to do with memory or internet bandwidth. I cannot reproduce it consistently: I tried your code and it works fine for me, but if I make 15+ requests it does show this behavior. I suspect it has something to do specifically with nytimes.com and Electron's events (not Scrapman), since I could reproduce this bug on Nightmare as well, with this code:

var Nightmare = require('nightmare');

// note: the `io` socket argument is unused here because the io.emit call below is commented out
var nightmare = function(io) {
    var nm = Nightmare({
        show: false,
        'web-preferences': {'web-security': false}
    });

    return nm
        .goto('https://www.nytimes.com/')
        .evaluate(function () {
            // collect the href of every <a> element on the page
            var links = document.querySelectorAll('a');
            var hrefs = [];
            for (var i=0; i<links.length; i++){
                hrefs.push(links[i].href);
            }
            return hrefs;
        })
        .end()
        .then(function (result) {
            console.log(result.length);
            //io.emit('node', result);
        })
        .catch(function (error) {
            console.error('Search failed:', error);
        });
};

const amount = 20;

console.log(`NIGHTMARE > Scraping ${amount} times`);

// launch `amount` Nightmare (Electron) instances at the same time
for (var i=0; i<amount; i++){
    nightmare();
}

and this is the output

[screenshot of the console output]

So here I'm booting 20 instances of Nightmare (Electron) at the same time. As you can see, it fails a lot, of course, and even when it doesn't fail it does not show all the links (because picky nytimes.com's HTML is not fully rendered yet)...
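Just to illustrate what a concurrency cap does here, below is a minimal sketch (assuming the nightmare() helper defined above): instead of firing all 20 runs at once, it launches them in batches of a fixed size and waits for each batch to settle before starting the next. The batch size of 5 is an arbitrary illustration, not a tuned value.

// Hypothetical batch runner: caps how many Nightmare/Electron instances
// exist at once instead of launching all of them in one go.
function runInBatches(total, batchSize) {
    var started = 0;

    function nextBatch() {
        if (started >= total) return Promise.resolve();

        // launch at most `batchSize` scrapes in parallel
        var batch = [];
        for (var i = 0; i < batchSize && started < total; i++, started++) {
            batch.push(nightmare());
        }

        // wait for the whole batch to finish before starting the next one
        return Promise.all(batch).then(nextBatch);
    }

    return nextBatch();
}

runInBatches(20, 5).then(function () {
    console.log('NIGHTMARE > all runs finished');
});

With a cap like this, each batch gets the machine's full bandwidth and memory instead of 20 Electron processes fighting over them.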

I think this can be worked around with the maxConcurrentOperations option, which limits the number of requests that can be performed in parallel. The default value is 50, which seems fine for every other site I've tested Scrapman against, but for nytimes.com especially I found it to be just too much. With a value of 10 it works great, as you can see in this output: the first run used the default of 50 maxConcurrentOperations and the second run only allowed 10 concurrent operations.

[screenshot comparing the two runs' output]
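For reference, only the option names maxConcurrentOperations and wait come from the discussion in this thread; the require/config shape below is a hypothetical sketch of how such a configuration might look, so check the Scrapman README for the actual function names before relying on it.

var scrapman = require('scrapman');

// Hypothetical configuration call -- the exact API shape may differ,
// but these are the two options discussed in this thread.
scrapman.config({
    maxConcurrentOperations: 10, // default is 50; 10 behaved much better for nytimes.com
    wait: 2000                   // assumed value: extra settle time for slow pages
});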

I think this is a bandwidth or memory issue because I've run the code you posted here and it always gives me the same number of links, as you can see below. That suggests this bug may be related to the specific machine or OS you are using, or how much RAM you have; I'm not sure...

[screenshot of the link counts]

If you do find a way to reproduce this, or any other site that fails to render its HTML, re-open this issue and I will take a look. nytimes.com behaves kind of odd: if you load the page and immediately attach an event listener for the "ready" state (through the Chrome developer tools), it shows you ~300 links, but if you query it a couple of seconds later it shows all 600+. So I'm not sure how Chromium decides the page is completely loaded; sometimes it gives you all the links, and sometimes (when it is running low on memory or bandwidth) it returns before the page is actually fully loaded. For that reason I can also recommend the maxConcurrentOperations and wait configuration options in case you still encounter issues using Scrapman against nytimes.com.
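To illustrate the "query it a couple of seconds later" behaviour with plain Nightmare (not Scrapman's own wait option), one option is to wait before evaluating, either for a fixed delay or until the page reports enough anchors. Nightmare's .wait() accepts a number of milliseconds or a predicate evaluated inside the page; the 600 threshold below is only taken from the link counts mentioned above, not a general value.

var Nightmare = require('nightmare');

Nightmare({ show: false })
    .goto('https://www.nytimes.com/')
    // give the page a fixed head start to finish loading
    .wait(3000)
    // then wait until "enough" anchors exist
    // (the 600 threshold is just an illustration based on the counts above)
    .wait(function () {
        return document.querySelectorAll('a').length > 600;
    })
    .evaluate(function () {
        return document.querySelectorAll('a').length;
    })
    .end()
    .then(function (count) {
        console.log('links found:', count);
    })
    .catch(function (error) {
        console.error('Scrape failed:', error);
    });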


kirchnch commented on May 21, 2024

Yikes. Based on your run of the Nightmare script, it does seem the issue is not limited to Scrapman. The funny thing is, that Nightmare script gives a consistent number of links on my computer, exactly 647, contrary to your results somehow. Following up on your bandwidth suggestion, the link counts do appear to correlate with bandwidth, but inversely. Running the Nightmare script on the 5 GHz Wi-Fi band (~200 Mb/s) gave low link counts and timeouts, probably because it was maxing out the CPU. Conversely, running it on the 2.4 GHz Wi-Fi band (~30 Mb/s) gave consistent link counts without any timeouts. However, I am currently only seeing small differences in link counts when running 5 parallel requests to nytimes.com through Scrapman on either band.

