
node-website-scraper's People

Contributors

aivus, almightyju, carlosflorencio, cerlancism, cslee, dependabot[bot], greenkeeper[bot], greenkeeperio-bot, ikeyan, phawxby, raurir, ryanolee, s0ph1e, skvggor, snyk-bot, ykhandor


node-website-scraper's Issues

No extension in Google Fonts CSS

Example:

<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'>

This CSS URL has no extension, so the file is not saved to the CSS directory configured in subdirectories:

      subdirectories: [
        {directory: 'css', extensions: ['.css']}
      ],
      sources: [
        {selector: 'link[rel="stylesheet"]', attr: 'href'},
      ],
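
For context, the extension-based rule cannot match here because the stylesheet URL simply has no extension; a minimal standalone check with plain Node (not the module's internals) illustrates this:

var path = require('path');
var url = require('url');

// The Google Fonts URL's pathname is just '/css', so it has no file extension
// and a rule like {directory: 'css', extensions: ['.css']} never matches it.
var href = 'https://fonts.googleapis.com/css?family=Ubuntu:400,300&subset=latin,cyrillic';
console.log(path.extname(url.parse(href).pathname)); // => '' (no extension)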

Why underscore instead of lodash?

I wondered why this project is using Underscore instead of Lodash. It is the first time that I see a Node.js module which uses Underscore. Lodash seems to be the default choice for nearly all projects.
There is a good reason for this: Lodash has a lot of features, which Underscore lacks.

When making my changes in this projects, I found that sometimes I wanted to use a Lodash function, which Underscore didn't have. This is bad because using such utility functions leads to easier understandable code.

Would you accept a pull request which replaced Underscore with Lodash?

Improve error handling

Need to have an option to choose what to do when an error occurs (currently the scraping process fails on the first error and the directory is removed).
Need to provide an opportunity to ignore the error and continue scraping (maybe something like stopOnError) and to maintain a list of errored resources.
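
A hypothetical usage sketch of that proposed behaviour (the stopOnError option and any errors field are illustration only, not an existing API):

var scrape = require('website-scraper');

scrape({
  urls: ['http://example.com/'],
  directory: '/path/to/save',
  stopOnError: false               // proposed: ignore errors and keep scraping
}).then(function (result) {
  // proposed: the result could also carry the list of resources that failed
  console.log(result);
});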

issues with recursive without `depth`

var scraper = require('website-scraper');
var options = {
  urls: ['http://nodejs.org/'],
  directory: __dirname + '/nodejs.org',
  recursive: true
};

// or with promise
scraper.scrape(options).then(function (result) {
    console.log(result);
});

This code causes Segmentation fault: 11. I will dig into the code to see if any memory can be freed or written to a cache file to be retrieved later.

Support making static copy of entire website

I think an important use case of this module would be to make a static copy of an entire website. For this use case the following functionality would be needed:

  • The scraper is able to download the HTML of a page.
  • The scraper is able to download resources which are used on a page.
  • The scraper is able to crawl the entire website.
  • The scraper can be configured to ignore external websites.
  • The scraper can save the files in the same structure as the paths on the website.

E.g.:

  • / => index.html
  • /css/layout.css => css/layout.css
  • /about => about/index.html

The scraper can be configured to ignore external websites
This can be implemented by creating an 'urlFilter' option which takes a function. Before a request to a URL is made, this function is called with the URL (and possibly other info which could be useful). The URL is only requested if the urlFilter returns true.

The scraper can save the files in the same structure as the paths on the website
This can be implemented by creating an 'outputPathGenerator' (or a better name) option which takes a function. If the user specifies this option, the output path is generated by passing the resource to this function.

Example
The following is an example of how you could make a scraper which saves an entire website, using the functionality described above.

var _ = require('lodash');
var pathUtil = require('path');
var urlUtil = require('url');
var websiteUrl = 'http://nodejs.org/';
var options = {
  urls: [websiteUrl],
  directory: '/path/to/save/',
  recursive: true,
  maxDepth: 10,
  urlFilter: function(url){
    return _.startsWith(url, websiteUrl);
  },
  outputPathGenerator: function(resource, directory){
    var urlObject = urlUtil.parse(resource.url);
    // Todo: add logic to check if it is an HTML page which does not end in '.html'
    // If so, append '/index.html'
    var relativePath = '.' + urlObject.path; 
    return pathUtil.resolve(directory, relativePath);
  }
};
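
For the Todo above, a minimal sketch of the missing check (assumption: any path without a file extension is an HTML page and gets '/index.html' appended):

var pathUtil = require('path'); // as in the example above

function toOutputPath(urlPath) {
  // Assumption: a path with no file extension is an HTML page,
  // so it is saved as <path>/index.html.
  if (pathUtil.extname(urlPath) === '') {
    return pathUtil.join(urlPath, 'index.html');
  }
  return urlPath;
}

toOutputPath('/');               // => '/index.html'
toOutputPath('/about');          // => '/about/index.html'
toOutputPath('/css/layout.css'); // => '/css/layout.css'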

srcset not parsed correctly

Hi,
srcset in img tags is not parsed correctly.
Example:

<img src="http://example.com/prev4-45x45.jpg" srcset="http://example.com/prev4-150x150.jpg 150w, http://example.com/prev4-45x45.jpg 45w" sizes="(max-width: 45px) 100vw, 45px" width="45" height="45">

Also, please add an option to convert all other links from relative to absolute.
Thank you
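
For the srcset part, a minimal parsing sketch (plain JS, not the module's actual code; it splits naively on commas, so candidate URLs that themselves contain commas would need extra care):

function parseSrcset(srcset) {
  // "url1 150w, url2 45w" -> [{url: 'url1', descriptor: '150w'}, ...]
  return srcset.split(',').map(function (candidate) {
    var parts = candidate.trim().split(/\s+/);
    return { url: parts[0], descriptor: parts[1] || null };
  });
}

parseSrcset('http://example.com/prev4-150x150.jpg 150w, http://example.com/prev4-45x45.jpg 45w');
// => [ { url: '.../prev4-150x150.jpg', descriptor: '150w' },
//      { url: '.../prev4-45x45.jpg',   descriptor: '45w' } ]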

Awesome! Would be cool to save cookies!

I love this! Please make it save cookies! Request can do this, but it would be a nice default option. By the way, it would also be nice if there was a way to make it get only one kind of file from a site and not the rest.
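
Since the request library is mentioned above, here is a minimal standalone sketch of its cookie-jar feature (not wired into website-scraper):

var request = require('request');

// A jar makes request remember cookies between calls.
var jar = request.jar();

request({ url: 'http://example.com/login', jar: jar }, function (err, res) {
  // Cookies set by the first response are sent automatically on the next call.
  request({ url: 'http://example.com/members', jar: jar }, function (err2, res2) {
    console.log(res2 && res2.statusCode);
  });
});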

An in-range update of eslint is breaking the build 🚨

Version 3.10.2 of eslint just got published.

Branch: Build failing 🚨
Dependency: eslint
Current Version: 3.10.1
Type: devDependency

This version is covered by your current version range and after updating it in your project the build failed.

As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools – preventing new deploys or publishes.

I recommend you give this issue a high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build is in progress Details

  • ❌ continuous-integration/appveyor/branch AppVeyor build failed Details

Commits

The new version differs by 8 commits.

  • 0840068 3.10.2
  • 621ee0a Build: package.json and changelog update for 3.10.2
  • 0643bfe Fix: correctly handle commented code in indent autofixer (fixes #7604) (#7606)
  • bd0514c Fix: syntax error after key-spacing autofix with comment (fixes #7603) (#7607)
  • f56c1ef Fix: indent crash on parenthesized global return values (fixes #7573) (#7596)
  • 100c6e1 Docs: Fix example for curly "multi-or-nest" option (#7597)
  • 6abb534 Docs: Update code of conduct link (#7599)
  • 8302cdb Docs: Update no-tabs to match existing standards & improve readbility (#7590)

See the full diff.

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

UTF-8 encoding

How do I set an option to download the site in UTF-8 encoding?

An in-range update of lodash is breaking the build 🚨

Version 4.17.1 of lodash just got published.

Branch: Build failing 🚨
Dependency: lodash
Current Version: 4.17.0
Type: dependency

This version is covered by your current version range and after updating it in your project the build failed.

As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

  • ❌ continuous-integration/appveyor/branch AppVeyor build failed Details

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

Protocol-relative "//" links should inherit the protocol of the page they are scraped from

Hey s0ph1e,

This is an enhancement, not a bug. Also I'm not completely sure it's in the scope of this project, but wanted to see what you thought:

When pulling down a site that is hosted on HTTPS and has its links specified as '//' (which inherit https when served over https), it would be nice if '//' were replaced with the protocol 'https://'. In other words, just inherit the protocol for all links of the resource being downloaded.
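
For reference, Node's url module already resolves protocol-relative references this way when given the page they came from (a minimal sketch, not the scraper's current behaviour):

var url = require('url');

// A '//' link inherits the protocol of the page it was found on:
url.resolve('https://example.com/page.html', '//cdn.example.com/script.js');
// => 'https://cdn.example.com/script.js'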

As it is now, when someone downloads the site but not the resources, those resources are unavailable unless the user also hosts their copy on HTTPS. Of course, we can easily load the page up in cheerio and change this ourselves, but it may be best to incorporate it into the project. Either way, thank you for the hard work; it is appreciated all the way from San Diego, CA!

Thanks,

Trevor

Error when scraping URL with email address

I get the following error when scraping a site with an email address in the URL. This may just be a coincidence. I've renamed the domain and email address for data protection purposes.

Unhandled rejection Error: ENOENT: no such file or directory, open '/tmp/www.example.org/forums/users/[email protected]/

An in-range update of lodash is breaking the build 🚨

Version 4.17.2 of lodash just got published.

Branch: Build failing 🚨
Dependency: lodash
Current Version: 4.17.1
Type: dependency

This version is covered by your current version range and after updating it in your project the build failed.

As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build is in progress Details

  • ❌ coverage/coveralls Coverage pending from Coveralls.io Details

  • ✅ codeclimate/coverage 100% test coverage Details

  • ❌ continuous-integration/appveyor/branch AppVeyor build failed Details

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

CLI

Think about turning this project into a CLI, or adding functionality to run it like one 👍

Handle svg external links

Examples:

<svg>
  <use xlink:href="sprite.svg#icon-1"></use>
</svg>
<svg width="4in" height="3in" version="1.1">
  <image x="200" y="200" width="100px" height="100px"
         xlink:href="myimage.png">
  </image>
</svg>

Notes:

  • xlink:href may contain a hash which is an element id ("#[referenced element ID]"); this hash needs to be kept (see the sketch after these notes)
  • SVG 2 removed the need for the xlink namespace, so instead of xlink:href you should use href.
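
A minimal sketch of collecting those references with cheerio, keeping the hash separate from the resource URL (assumption: loading in XML mode so attribute names like xlink:href are preserved verbatim):

var cheerio = require('cheerio');

var svgHtml = '<svg><use xlink:href="sprite.svg#icon-1"></use></svg>' +
              '<svg><image xlink:href="myimage.png"></image></svg>';
var $ = cheerio.load(svgHtml, { xmlMode: true });
var refs = [];

$('use, image').each(function () {
  // SVG 2 allows plain href; older markup uses xlink:href.
  var ref = $(this).attr('href') || $(this).attr('xlink:href');
  if (!ref) return;
  var hashIndex = ref.indexOf('#');
  refs.push({
    url: hashIndex === -1 ? ref : ref.slice(0, hashIndex), // resource to download
    hash: hashIndex === -1 ? '' : ref.slice(hashIndex)     // '#icon-1' must be kept
  });
});
// refs => [ { url: 'sprite.svg', hash: '#icon-1' }, { url: 'myimage.png', hash: '' } ]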

Rename assets to children in Resource

assets was introduced in c88b8b9 in order to fix a recursion bug in utils.createOutputObject and keep backward compatibility for version 1.*

In version 2.* it should be renamed back to children

Not able to scrape pages

I am getting the error below.
========= ERROR ============
TypeError: Cannot read property 'getUrl' of null
at Scraper.loadResource (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:65:20)
at receivedResponse (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:161:10)
at tryCatcher (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\util.js:16:23)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:510:31)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:691:18)
at Promise._fulfill (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:636:18)
at Promise._resolveCallback (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:453:14)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:522:17)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:687:18)
at Async._drainQueue (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:133:16)
at Async._drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:143:10)
at Immediate.Async.drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:17:14)

============== CODE (of course website name changed) =============
var scrape = require('website-scraper');
scrape({
  urls: ['http://www.my-web-site.com/'],
  directory: './goo',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);

Documentation is wrong about how to call this...

I had to use this code in my index.js to run the scraper from the command line in a terminal window:

var scraper = require('website-scraper');
scraper.scrape(options);

But your samples on the front page show this:

var scrape = require('website-scraper');
scrape(options);

So, I had to use it like an object, but your code shows using it like a function. I saw where there was a previous issue that is now closed that discussed this, but I pulled my hair out trying to call it like a function according to your docs. When I used the object syntax, it worked fine.

What am I missing here?

Using version: 0.3.3

An in-range update of lodash is breaking the build 🚨

Version 4.17.0 of lodash just got published.

Branch: Build failing 🚨
Dependency: lodash
Current Version: 4.16.6
Type: dependency

This version is covered by your current version range and after updating it in your project the build failed.

As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

  • ❌ continuous-integration/appveyor/branch AppVeyor build failed Details

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

Unify requestResource and loadResource usage

In file-handlers, requestResource().then(loadResource) is used, but in the scraper itself, when downloading the original resources, loadResource is used, which calls requestResource inside it.

It is not obvious why in some cases the methods are used in the order request -> load, but in other cases vice versa.

Need to unify the approach to using these methods. loadResource should not call requestResource inside it.

An in-range update of debug is breaking the build 🚨

Version 2.4.0 of debug just got published.

Branch: Build failing 🚨
Dependency: debug
Current Version: 2.3.3
Type: dependency

This version is covered by your current version range and after updating it in your project the build failed.

As debug is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/appveyor/branch Waiting for AppVeyor build to complete Details

  • ❌ continuous-integration/travis-ci/push The Travis CI build failed Details

Commits

The new version differs by 7 commits.

  • b82d4e6 release 2.4.0
  • 41002f1 Update bower.json (#342)
  • e58d54b Node: configurable util.inspect() options (#327)
  • 00f3046 Node: %O (big O) pretty-prints the object (#322)
  • bd9faa1 allow colours in workers (#335)
  • 501521f Use same color for same namespace. (#338)
  • e2a1955 Revert "handle regex special characters"

See the full diff.

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴

Add logger

Add logger + option like logLevel (debug, warning, error, etc.)

Some bugs and suggestions

1. Resources without extensions are not saved
Example:

<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'> 
<img alt="" src="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=140&amp;d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&amp;r=g" srcset="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=280&amp;d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&amp;r=g 2x" class="avatar avatar-140 photo avatrb" width="140" height="140">

2. Add a config option to disable saving everything into a single folder

3. Emoji images are not parsed
Example:

<img src="https://s.w.org/images/core/emoji/72x72/2764.png" alt="❀" class="emoji" draggable="false">

4. Add an option to specify a pause in seconds before parsing

Scrape site from snapshot

It would be very nice if you could add support for parsing a local HTML file (or a buffer with HTML data) saved from the web.

Example usage: I can make an HTML snapshot of a website using PhantomJS, SlimerJS or TrifleJS, after script execution and after some actions. Then I pass that HTML to the scraper and tell it that the original URL is "http://example.com", so the scraper can resolve resources and download them.

I think this is a more flexible way to work around the limitations of headless browsers.

Relative path images in css files not getting saved

Hey, thank you for making this module! It's really awesome, and overcomes a lot of issues other tools like it seem to have.

I'm having an issue on some pages (ex: http://.com/index.html) where images with a relative path in the css file aren't getting downloaded.

Has this been seen before?

Trevor

Export function instead of object

Change

var scraper = require('website-scraper');
scraper.scrape(options);

to

var scrape = require('website-scraper');
scrape(options);
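
A minimal sketch of what the new entry point could look like (assumptions: the internal Scraper class lives in lib/scraper.js and exposes a promise-returning scrape method):

// index.js -- export a function instead of an object with a scrape method
var Scraper = require('./lib/scraper'); // assumed location of the internal class

module.exports = function scrape(options, callback) {
  var promise = new Scraper(options).scrape();
  if (typeof callback === 'function') {
    // keep optional callback support alongside the promise API
    promise.then(function (result) { callback(null, result); }, callback);
  }
  return promise;
};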

Ability to pipe download to a client

I was wondering if it is possible to not download anything on the server side and just forward the download directly to the requesting client?

Basically it would pipe it as a response back to the client.

An in-range update of eslint is breaking the build 🚨

Version 3.10.1 of eslint just got published.

Branch: Build failing 🚨
Dependency: eslint
Current Version: 3.10.0
Type: devDependency

This version is covered by your current version range and after updating it in your project the build failed.

As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools – preventing new deploys or publishes.

I recommend you give this issue a high priority. I'm sure you can resolve this 💪


Status Details
  • ❌ continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

  • ❌ continuous-integration/appveyor/branch AppVeyor build failed Details

Commits

The new version differs by 4 commits.

  • 9cbfa0b 3.10.1
  • 4bb6215 Build: package.json and changelog update for 3.10.1
  • 8a0e92a Fix: handle try/catch correctly in no-return-await (fixes #7581) (#7582)
  • c4dd015 Fix: no-useless-return stack overflow on unreachable loops (fixes #7583) (#7584)

See the full diff.

Not sure how things should work exactly?

There is a collection of frequently asked questions and of course you may always ask my humans.


Your Greenkeeper Bot 🌴
