
node-scraper's Introduction

node-scraper

A little module that makes scraping websites a little easier. Uses Node.js and jQuery.

Installation

Via npm:

$ npm install scraper

Examples

Simple

The first argument is a URL string. The second is a callback whose arguments are an error, a jQuery object loaded with the scraped page ("body"), and an object from the underlying request containing info about the URL.

var scraper = require('scraper');
scraper('http://search.twitter.com/search?q=javascript', function(err, jQuery) {
    if (err) {throw err}

    jQuery('.msg').each(function() {
        console.log(jQuery(this).text().trim()+'\n');
    });
});

"Advanced"

The first argument is an object containing settings for the "request" instance used internally. The second is a callback whose arguments are an error, a jQuery object loaded with the scraped page ("body"), and an object from the underlying request containing info about the URL.

var scraper = require('scraper');
scraper(
    {
        'uri': 'http://search.twitter.com/search?q=nodejs'
        , 'headers': {
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
        }
    }
    , function(err, $) {
        if (err) {throw err}

        $('.msg').each(function() {
            console.log($(this).text().trim()+'\n');
        });
    }
);

Parallel

The first argument is an array containing strings and/or request objects. The second is a callback whose arguments are an error, a jQuery object loaded with each scraped page ("body"), and an object from the underlying request containing info about the URL; it is called once per fetched URL.

You can also rate-limit the fetcher by passing an options object as the third argument containing 'reqPerSec': float.

var scraper = require('scraper');
scraper(
    [
        'http://search.twitter.com/search?q=javascript'
        , 'http://search.twitter.com/search?q=css'
        , {
            'uri': 'http://search.twitter.com/search?q=nodejs'
            , 'headers': {
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
            }
        }
        , 'http://search.twitter.com/search?q=html5'
    ]
    , function(err, $) {
        if (err) {throw err;}

        $('.msg').each(function() {
            console.log($(this).text().trim()+'\n');
        });
    }
    , {
        'reqPerSec': 0.2 // Wait 5sec between each external request
    }
);

Arguments

First (required)

Describes the page or pages that will be scraped. It can be one of the following:

string

'http://www.nodejs.org'

or

request object

{
    'uri': 'http://search.twitter.com/search?q=nodejs'
    , 'headers': {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
    }
}

or

Array (if you want to do fetches on multiple URLs)

[
    urlString
    , urlString
    , requestObject
    , urlString
]

Second (optional)

The callback that allows you to use the data retrieved from the fetch.

function(err, $) {
    if (err) { throw err; }

    $('.msg').each(function() {
        console.log($(this).text().trim() + '\n');
    });
}

Third (optional)

This argument is an object containing settings for the fetcher overall.

  • reqPerSec: float — throttles your fetches so you don't hammer the server you are scraping (see the sketch below)
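
For reference, a compact sketch of where that options object goes in the call (the 'title' selector is just a placeholder):

var scraper = require('scraper');
scraper(
    [
        'http://search.twitter.com/search?q=javascript'
        , 'http://search.twitter.com/search?q=css'
    ]
    , function(err, $) {
        if (err) { throw err; }
        console.log($('title').text());
    }
    , {
        'reqPerSec': 1 // at most one external request per second
    }
);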

Depends on

  • request (used internally for fetching)
  • jsdom (with a bundled copy of jQuery in deps/)

node-scraper's People

Contributors

mape


node-scraper's Issues

jsdom.jQueryify() not passing correct directory syntax on Windows node environments

The following code in your project does not work on a Windows Node.js environment:

jsdom.jQueryify(window, __dirname + '/../deps/jquery-1.6.1.min.js', function (win, $) {
    $('head').append($(body).find('head').html());
    $('body').append($(body).find('body').html());
    callback(null, $);
});

Please add var path = require('path'); to scraper.js and then make this slight modification to the code above:

jsdom.jQueryify(window, path.join(__dirname, '../deps/jquery-1.6.1.min.js'), function (win, $) {
    $('head').append($(body).find('head').html());
    $('body').append($(body).find('body').html());
    callback(null, $);
});

With that tiny change, scraper.js works on both Windows and *nix environments.

Sorry, I'm new to GitHub and Git in general, otherwise I'd submit the change myself.

Final Callback

This is not an issue so much as a feature request, and I don't really know where else to put it, but I was wondering whether it would be a good idea to have a universal final callback that fires after all the scraping is done.

For example, if I am hitting 20 URLs, there is a callback for every hit, but what I want is to close a db connection once all 20 hits and the data manipulation are finished. Thought it might be helpful at some point.

Thanks in advance
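
A minimal sketch of how this can be approximated today with a counter around the per-URL callback, assuming the array form of the API documented above (the URLs are just examples):

var scraper = require('scraper');

var urls = [
    'http://search.twitter.com/search?q=javascript'
    , 'http://search.twitter.com/search?q=nodejs'
];
var remaining = urls.length;

scraper(urls, function(err, $) {
    if (!err) {
        // per-page data handling goes here
    }

    remaining -= 1;
    if (remaining === 0) {
        // every URL has been handled: safe to close the db connection etc.
        console.log('all done');
    }
});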

How to add a URL to scrape?

Hi,

I'm scraping a page that contains links to other pages I'd like to scrape too (e.g. "next" links for paginated results).

I was wondering if there is a way to inject more URLs to parse into the current scraper?

X+

P.S. jQuery and Node.js are by far the most pleasant way to parse pages (I still feel the pain of regexes and other string operations ;)
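
There is no documented way to push URLs into a running scrape, but a simple workaround sketch is to call scraper again from inside the callback for each discovered link (the 'a.next' selector and URL below are placeholders for whatever matches the pagination links):

var scraper = require('scraper');

function scrapePage(url) {
    scraper(url, function(err, $) {
        if (err) { return console.error(err); }

        // handle the current page here
        console.log($('title').text());

        // queue every pagination link found on this page
        $('a.next').each(function() {
            var href = $(this).attr('href');
            // note: relative hrefs would need to be resolved first,
            // e.g. with require('url').resolve(url, href)
            if (href) { scrapePage(href); }
        });
    });
}

scrapePage('http://example.com/results?page=1');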

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

I have a recursive script running and after about 100 scrapes I always get:

FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

Initially I thought it was some JSON.stringify code that periodically saves the scraped data to a text file, but now I suspect it's the scraper library. Have you experienced this at all?

jquery not included correctly

I get an error that jquery wasn't found.

ENOENT, No such file or directory '/usr/local/lib/node/.npm/scraper/0.0.2/node_modules/jsdom/usr/local/lib/node/.npm/scraper/0.0.2/package/deps/jquery-1.4.2.min.js'

How do I know the parsed url?

Hi,

This one is probably obvious, but how do I know which URL was parsed (if I fetch several in parallel)?
Also, the record described below has several attributes that can't be guessed by parsing the page.

What's the best way to know, from within the callback, which URL was parsed, and to pass extra arguments along with it?

eg:
url: https://github.com/mape/node-scraper/issues/new
install: "npm install scraper"
author: "mape"

From within the callback, besides the status and the jQuery DOM, I'd like to have url, install and author available.
X+
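
One workaround sketch: instead of passing the whole array in one call, loop over your own records and make one scraper call per record, so the closure keeps the URL and any extra fields in scope (the record shape below just mirrors the example above). Note that separate calls don't share the reqPerSec throttling of a single array call:

var scraper = require('scraper');

var records = [
    {
        url: 'https://github.com/mape/node-scraper/issues/new'
        , install: 'npm install scraper'
        , author: 'mape'
    }
    // ... more records
];

records.forEach(function(record) {
    scraper(record.url, function(err, $) {
        if (err) { return console.error(record.url, err); }

        // record.url, record.install and record.author are all in scope here
        console.log(record.url, record.author, $('title').text());
    });
});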

Error when running demo code provided

Scraper was working fine; suddenly tonight it started giving me errors. I rebuilt my Node.js server, which didn't resolve the issue. If I run the following demo code, I get the same error. Any ideas?

Demo Code:
var scraper = require('scraper');
scraper('http://search.twitter.com/search?q=javascript', function(err, jQuery) {
    if (err) { throw err; }

    jQuery('.msg').each(function() {
        console.log(jQuery(this).text().trim() + '\n');
    });
});

Error:
TypeError: Cannot read property '1' of null
at [object Context]:6348:45
at [object Context]:8316:2
at Object.javascript (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/languages/javascript.js:17:14)
at Object._eval (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:1195:46)
at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:43:20
at Object.check (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:235:34)
at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:251:12
at IncomingMessage.<anonymous> (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:59:20)
at HTTPParser.onMessageComplete (http.js:111:23)
TypeError: undefined is not a function
at CALL_NON_FUNCTION (native)
at /usr/local/lib/node/.npm/scraper/0.0.7/package/lib/scraper.js:57:7
at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom.js:151:30
at Object.<anonymous> (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/events.js:274:17)
at Object.dispatchEvent (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:415:55)
at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:56:15
at Object.check (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:235:34)
at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:251:12
at IncomingMessage.<anonymous> (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:85:11)
at IncomingMessage.emit (events.js:59:20)

Charset ??

When I scrape a webpage with iso-8859-1 charset I get encoding problems...

Uncatchable errors

When scraping google.de, I get:

Error: Invalid character: Invalid character in tag name: ){
at Object.createElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/level1/core.js:1190:13)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:128:35)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at HtmlToDom.appendHtmlToElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:77:9)
at Object.innerHTML (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/index.js:420:27)
at Function.clean (/Users/andi/node_modules/scraper/deps/jquery-1.6.1.min.js:18:317)
at Function.buildFragment (/Users/andi/node_modules/scraper/deps/jquery-1.6.1.min.js:17:31854)
at [object Object].init (/Users/andi/node_modules/scraper/deps/jquery-1.6.1.min.js:16:7963)
Error: Invalid character: Invalid character in tag name: )
at Object.createElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/level1/core.js:1190:13)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:128:35)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at HtmlToDom.appendHtmlToElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:77:9)
at Object.innerHTML (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/index.js:420:27)
at Function.clean (/Users/andi/node_modules/scraper/deps/jquery-1.6.1.min.js:18:317)
at Function.buildFragment (/Users/andi/node_modules/scraper/deps/jquery-1.6.1.min.js:17:31854)
Error: Invalid character: Invalid character in tag name: )
at Object.createElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/level1/core.js:1190:13)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:128:35)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at HtmlToDom.appendHtmlToElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:77:9)
Error: Invalid character: Invalid character in tag name: ;
at Object.createElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/level1/core.js:1190:13)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:128:35)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at setChild (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:171:7)
at HtmlToDom.appendHtmlToElement (/Users/andi/node_modules/scraper/node_modules/jsdom/lib/jsdom/browser/htmltodom.js:77:9)
...

Sure, HTML in the wild is usually not well-formed, but clients are expected to be syntax-tolerant. The scraping and analysis seem to work, but the errors cannot be suppressed.

Source:

  var scraper;
  scraper = require('scraper');
  try {
    scraper({
      uri: 'http://google.de/',
      headers: {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
      }
    }, function(e, $) {
      if (e) {
        throw e;
      }
      return $('body').each(function() {
        return console.log($(this).text() + '\n\n');
      });
    });
  } catch (e) {
    console.log('ERROR');
  }

Undefined function error in Node.js v0.12

My application uses scraper as a dependency, and it was running fine with Node.js 0.10.2x.

Now I have upgraded to Node.js 0.12 and ran into the following error:

Error:

/home/node/dependencytracker/node_modules/scraper/lib/scraper.js:56
var window = jsdom.jsdom().createWindow();
^
TypeError: undefined is not a function
at Request._callback (/home/node/dependencytracker/node_modules/scraper/lib/scraper.js:56:33)
at Request.self.callback (/home/node/dependencytracker/node_modules/scraper/node_modules/request/request.js:344:22)

In my package.json I am using "*" for the scraper version, i.e. the latest version.

I should probably pin an older version, but I'm unsure which one I was using, or which one I could use to bypass the jsdom issue.

Any help is appreciated.
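
For reference, pinning a version in package.json looks like the snippet below; the version is just a placeholder (0.0.8 is the last one mentioned in these issues), not a confirmed fix:

{
  "dependencies": {
    "scraper": "0.0.8"
  }
}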

How Do You Use This?

Do you have to create an HTML page in order for these scripts to function? What is the process you have to follow in order to use this module? Please describe.
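
No HTML page is needed; the module runs inside a plain Node.js script. A minimal sketch, using the README example above (save as scrape.js and run it with node scrape.js):

var scraper = require('scraper');

scraper('http://www.nodejs.org', function(err, $) {
    if (err) { throw err; }
    console.log($('title').text());
});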

doesn't work with jsdom 0.2.0

I'm getting errors when trying to run the example app with Node 0.4.0 and jsdom 0.2.0.

Here's the stack trace:

TypeError: Cannot read property 'prototype' of undefined
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/browser/index.js:84:16
    at String.<anonymous> ([object Context]:2500:12)
    at Function.each ([object Context]:692:29)
    at Object.add ([object Context]:2478:10)
    at [object Context]:2907:17
    at Function.each ([object Context]:692:29)
    at Object.each ([object Context]:155:17)
    at Object.one ([object Context]:2906:15)
    at Object.bind ([object Context]:2896:34)
    at [object Context]:3106:18
    at [object Context]:4376:2
    at Object.javascript (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/languages/javascript.js:17:14)
    at Object._eval (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:1195:46)
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:43:20
    at Object.check (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:235:34)
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:251:12
    at [object Object].<anonymous> (fs.js:86:5)
    at [object Object].emit (events.js:39:17)
    at afterRead (fs.js:840:12)
TypeError: undefined is not a function
    at CALL_NON_FUNCTION (native)
    at /usr/local/lib/node/.npm/scraper/0.0.8/package/lib/scraper.js:58:7
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom.js:151:30
    at Object.<anonymous> (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/events.js:274:17)
    at Object.dispatchEvent (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:415:55)
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:56:15
    at Object.check (/usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:235:34)
    at /usr/local/lib/node/.npm/jsdom/0.2.0/package/lib/jsdom/level2/html.js:251:12
    at [object Object].<anonymous> (fs.js:86:5)
    at [object Object].emit (events.js:39:17)

ReferenceError when scraping

Hi,

var scraper = require('scraper');
scraper('http://bbc.co.uk/', function(err, jQuery) {
    if (err) { throw err; }
});

... throws many errors of the form 'ReferenceError: [some tag] is not defined'.

I can't copy & paste the whole error message here since the Windows command line won't let me.

trying to fetch ñ.ñ

I used an array of URLs to scrape; I need to fetch around 7500 URLs ñ.ñ. I begin scraping and everything goes fine until about the first 120 URLs are fetched, then it starts going slower and slower until it crashes and cannot fetch anything more. Does the library support heavy scraping? I need to scrape tons of things :(
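
One workaround sketch while this is open: split the list into small batches and only start the next batch once the previous one has finished, so each batch's pages can be released before more work is queued (the batch size and URLs below are arbitrary):

var scraper = require('scraper');

var allUrls = [
    'http://example.com/1'
    , 'http://example.com/2'
    // ... the rest of the ~7500 URLs
];

function scrapeInBatches(urls, batchSize, onDone) {
    if (urls.length === 0) { return onDone(); }

    var batch = urls.slice(0, batchSize);
    var rest = urls.slice(batchSize);
    var remaining = batch.length;

    scraper(batch, function(err, $) {
        if (!err) {
            // per-page handling goes here
        }

        remaining -= 1;
        if (remaining === 0) {
            // this batch is done, move on to the next one
            scrapeInBatches(rest, batchSize, onDone);
        }
    });
}

scrapeInBatches(allUrls, 50, function() {
    console.log('finished');
});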

Getting css background-image

jQuery('*').each(function() {
  if (jQuery(this).css("background-image") !== "none") {
    console.log(jQuery(this).css("background-image"));
  }
});

This should print the background-image of every element on the fetched page, but instead the css() call returns an empty string. Any chance for a quick fix?

Remote error: TypeError: Cannot read property '_ownerDocument' of undefined

I have two nested scraping calls, and this is thrown on the second one. Unfortunately, this seems to be happening only recently.

It would be great if you could find out what's happening.

To start off with, I can tell you that execution does not get past line 59 of scraper.js, because $(body) throws the error I've pasted as the title. (Again, this happens only in the second, nested call, not in the first one.)
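
For clarity, a stripped-down sketch of the kind of nesting meant here (the URL and selectors are placeholders); the failure happens inside the second, inner call:

var scraper = require('scraper');

scraper('http://example.com/list', function(err, $) {
    if (err) { throw err; }

    var href = $('a.detail').first().attr('href');

    // second, nested scrape of a page linked from the first one
    scraper(href, function(err2, $2) {
        if (err2) { throw err2; }
        console.log($2('h1').text());
    });
});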

Array access doesn't seem to work?

var $headers = jQuery('header a');
var $first = $($headers[0]);
console.log($headers.text()); // logs a bunch of header text
console.log($first.text()); // freezes but doesn't throw an error

Maybe I'm doing something wrong? All of the examples use .each(), but I was wondering if it's possible to just use regular array access.
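
A possible workaround sketch, using jQuery's .eq() so everything stays inside the wrapped jQuery object instead of re-wrapping a raw DOM node:

var scraper = require('scraper');

scraper('http://example.com/', function(err, jQuery) {
    if (err) { throw err; }

    var $headers = jQuery('header a');

    // .eq(0) returns a jQuery object wrapping only the first match
    console.log($headers.eq(0).text());
});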

Parallel scraping results in misses & duplicates

What an awesome scraper platform! Got all geared up in no time.

However, while single-page scraping works just fine, parallel scraping with many URLs (I had 79) fails, resulting in missed URLs and duplicates, even though the total number of fetched URLs is correct.

I suspect the queuing implementation is the reason. I tried a little fix in scraper.js that produced the results I was hoping for.
