website-scraper / node-website-scraper
Download website to local directory (including all css, images, js, etc.)
Home Page: https://www.npmjs.org/package/website-scraper
License: MIT License
Currently, extensions and HTML attributes are used to decide how resources are saved.
Consider using mime-types instead of the current approach, or in addition to it.
Example:
<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'>
This resource is CSS, but its URL has no extension, so it is not saved to the css directory when subdirectories are configured as:
subdirectories: [
  {directory: 'css', extensions: ['.css']}
],
sources: [
  {selector: 'link[rel="stylesheet"]', attr: 'href'},
],
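A minimal sketch of how a Content-Type based lookup could work. The mapping below is only illustrative (a real fix would probably use the mime-types package as suggested above):

```javascript
// Hypothetical helper: choose a subdirectory from the response
// Content-Type header instead of the URL's extension.
var TYPE_TO_DIR = {
  'text/css': 'css',
  'application/javascript': 'js',
  'image/png': 'images',
  'image/jpeg': 'images'
};

function subdirectoryFor(contentTypeHeader) {
  // Strip parameters such as "; charset=utf-8" before the lookup
  var type = (contentTypeHeader || '').split(';')[0].trim().toLowerCase();
  return TYPE_TO_DIR[type] || null;
}
```

With this, the Google Fonts stylesheet above would land in the css directory even though its URL has no .css extension.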
I wondered why this project uses Underscore instead of Lodash. It is the first time I have seen a Node.js module that uses Underscore; Lodash seems to be the default choice for nearly all projects.
There is a good reason to switch: Lodash has a lot of features that Underscore lacks.
While making my changes in this project, I sometimes wanted to use a Lodash function that Underscore doesn't have. This is unfortunate, because using such utility functions leads to more understandable code.
Would you accept a pull request that replaces Underscore with Lodash?
Need an option to choose what to do when an error occurs (currently the scraper process fails on the first error and the directory is removed).
Need to provide a way to ignore errors and continue scraping (maybe something like a stopOnError option) and maintain a list of errored resources.
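A rough sketch of what "continue on error" could look like, independent of the library internals (scrapeAll and download are hypothetical names, not the real API):

```javascript
// Hypothetical wrapper: download every URL, ignore individual failures,
// and collect the errored resources instead of aborting the whole run.
function scrapeAll(urls, download) {
  var errored = [];
  return Promise.all(urls.map(function (url) {
    return download(url).catch(function (err) {
      errored.push({ url: url, error: err.message });
      return null; // swallow the error and keep going
    });
  })).then(function (results) {
    return { results: results.filter(Boolean), errored: errored };
  });
}
```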
var scraper = require('website-scraper');
var options = {
  urls: ['http://nodejs.org/'],
  directory: __dirname + '/nodejs.org',
  recursive: true
};

// or with promise
scraper.scrape(options).then(function (result) {
  console.log(result);
});
This code causes Segmentation fault: 11.
I will dig into the code to see if any memory can be freed or written to a cache file to be retrieved later.
I think an important use case of this module would be to make a static copy of an entire website. For this use case the following functionality would be needed:
E.g.:
The scraper can be configured to ignore external websites.
This can be implemented by creating a 'urlFilter' option which takes a function. Before a request to a URL is made, this function is called with the URL (and possibly other info which could be useful). The URL is only requested if the urlFilter returns true.
The scraper can save the files in the same structure as the paths on the website.
This can be implemented by creating an 'outputPathGenerator' (or a better name) option which takes a function. If the user specifies this option, the output path is generated by passing the resource to this function.
Example
The following is an example of how you could make a scraper which saves an entire website, using the functionality described above.
var _ = require('lodash');
var pathUtil = require('path');
var urlUtil = require('url');

var websiteUrl = 'http://nodejs.org/';

var options = {
  urls: [websiteUrl],
  directory: '/path/to/save/',
  recursive: true,
  maxDepth: 10,
  urlFilter: function (url) {
    return _.startsWith(url, websiteUrl);
  },
  outputPathGenerator: function (resource, directory) {
    var urlObject = urlUtil.parse(resource.url);
    // TODO: add logic to check if it is an HTML page which does not end in '.html'
    // If so, append '/index.html'
    var relativePath = '.' + urlObject.path;
    return pathUtil.resolve(directory, relativePath);
  }
};
Hi,
srcset in the img tag is not parsed correctly.
Example:
<img src="http://example.com/prev4-45x45.jpg" srcset="http://example.com/prev4-150x150.jpg 150w, http://example.com/prev4-45x45.jpg 45w" sizes="(max-width: 45px) 100vw, 45px" width="45" height="45">
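For reference, a srcset value is a comma-separated list of URL + descriptor pairs, so each candidate URL can be pulled out like this (a minimal sketch, not the library's parser; it ignores edge cases such as commas inside URLs):

```javascript
// Split a srcset attribute into its candidate URLs, dropping the
// width/density descriptors ("150w", "2x", ...).
function parseSrcset(srcset) {
  return srcset.split(',').map(function (candidate) {
    return candidate.trim().split(/\s+/)[0];
  }).filter(Boolean);
}
```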
Please add an option to convert all other links from relative to absolute.
Thank you
Is there a way to throttle the crawler to make its requests more manageable?
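Until the library supports this directly, one way to throttle is a small concurrency limiter wrapped around whatever function issues the requests (a sketch; all names are illustrative):

```javascript
// Run at most `maxConcurrent` promise-returning jobs at a time;
// extra jobs wait in a queue until a slot frees up.
function createLimiter(maxConcurrent) {
  var active = 0;
  var queue = [];

  function next() {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    var job = queue.shift();
    job.fn().then(
      function (value) { active--; job.resolve(value); next(); },
      function (err) { active--; job.reject(err); next(); }
    );
  }

  return function limit(fn) {
    return new Promise(function (resolve, reject) {
      queue.push({ fn: fn, resolve: resolve, reject: reject });
      next();
    });
  };
}
```

Usage: `var limit = createLimiter(2); limit(function () { return fetchResource(url); });` — each call queues behind at most two in-flight requests.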
I love this! Please make it save cookies! Request can do this, but it would be a nice default option. By the way, it would also be nice if there was a way to make it get only one kind of file from a site and not the rest.
Branch | Build failing 🚨 |
---|---|
Dependency | eslint |
Current Version | 3.10.1 |
Type | devDependency |
This version is covered by your current version range and after updating it in your project the build failed.
As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools, preventing new deploys or publishes.
I recommend you give this issue a high priority. I'm sure you can resolve this 💪
The new version differs by 8 commits.

0840068 3.10.2
621ee0a Build: package.json and changelog update for 3.10.2
0643bfe Fix: correctly handle commented code in indent autofixer (fixes #7604) (#7606)
bd0514c Fix: syntax error after key-spacing autofix with comment (fixes #7603) (#7607)
f56c1ef Fix: indent crash on parenthesized global return values (fixes #7573) (#7596)
100c6e1 Docs: Fix example for curly "multi-or-nest" option (#7597)
6abb534 Docs: Update code of conduct link (#7599)
8302cdb Docs: Update no-tabs to match existing standards & improve readbility (#7590)
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
How do I set an option to download the site in UTF-8 encoding?
css-url-parser
Hey there! Thanks for making such a rad tool :)
I've found that if an image path is in a JavaScript file, the scraper doesn't download the image.
Here is an example URL: http://womenshealth.com-yourarticles.co/breaking-news/angelina.html
The comment images are not downloaded because the paths are loaded at runtime via a local JSON file (comments.js)
Would love to hear your thoughts, thanks!
Would it make sense for this project to support returning the URLs of the parsed assets, rather than physically downloading the files?
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.17.0 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Like the --convert-links option in wget.
Add options passed to the scrape function (see lib/config/defaults.js) for auth, setting the user-agent, cookies, etc.
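A sketch of what this could look like, assuming a `request` object that gets forwarded to the underlying HTTP client (the exact shape is an assumption; check lib/config/defaults.js for what is actually supported):

```javascript
// Assumed option shape: `request` is passed through to the HTTP client,
// so headers like User-Agent and Cookie can be set per scrape.
var options = {
  urls: ['http://example.com/'],
  directory: '/path/to/save/',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; my-scraper)',
      'Cookie': 'session=abc123'
    }
  }
};
```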
Hey sOph1e,
This is an enhancement, not a bug. Also, I'm not completely sure it's in the scope of this project, but I wanted to see what you thought:
When pulling down a site that is hosted on https and has its links specified as '//' (which inherit https when hosted on https), it would be nice if '//' were replaced with the protocol 'https://'. In other words, just inherit the protocol for all links of the resource being downloaded.
As it stands, when someone downloads the site but not the resources, the resources are unavailable unless the user hosts the site on https. Of course, we can easily load the page up in cheerio and change this ourselves, but it may be best to incorporate it into the project. Either way, thank you for the hard work; it is appreciated all the way from San Diego, CA!
Thanks,
Trevor
I get the following error when scraping a site with an email address in the URL. This may just be a coincidence. I've renamed the domain and email address for data protection purposes.
Unhandled rejection Error: ENOENT: no such file or directory, open '/tmp/www.example.org/forums/users/[email protected]/
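One way to avoid filesystem errors from characters like '@' in URLs is to sanitize each path segment before writing (an illustrative sketch, not the library's current behavior; the email below is made up):

```javascript
// Replace characters that are unsafe in filenames with '_',
// segment by segment, and drop empty segments (e.g. a trailing '/').
function toSafePath(urlPath) {
  return urlPath.split('/').filter(Boolean).map(function (segment) {
    return segment.replace(/[^a-zA-Z0-9._-]/g, '_');
  }).join('/');
}
```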
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.17.1 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
How can I download a website that needs to have a login session?
Consider turning this project into a CLI, or adding functionality so it can be run like one.
Examples:
<svg>
<use xlink:href="sprite.svg#icon-1"></use>
</svg>
<svg width="4in" height="3in" version="1.1">
<image x="200" y="200" width="100px" height="100px"
xlink:href="myimage.png">
</image>
</svg>
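If the sources option supports attribute selectors for these elements, covering both cases above might look like the following (the exact selectors are an assumption to verify against how cheerio matches namespaced attributes):

```javascript
sources: [
  { selector: 'use', attr: 'xlink:href' },
  { selector: 'image', attr: 'xlink:href' }
]
```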
Notes:
assets were introduced in c88b8b9 in order to fix a recursive bug in utils.createOutputObject and keep backward compatibility for version 1.*.
In version 2.* it should be renamed back to children.
I am getting the error below:
========= ERROR ============
TypeError: Cannot read property 'getUrl' of null
at Scraper.loadResource (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:65:20)
at receivedResponse (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:161:10)
at tryCatcher (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\util.js:16:23)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:510:31)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:691:18)
at Promise._fulfill (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:636:18)
at Promise._resolveCallback (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:453:14)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:522:17)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:687:18)
at Async._drainQueue (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:133:16)
at Async._drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:143:10)
at Immediate.Async.drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:17:14)
============== CODE (of course website name changed) =============
var scrape = require('website-scraper');

scrape({
  urls: ['http://www.my-web-site.com/'],
  directory: './goo',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);
I had to use this code in my index.js to run scraper from the command in a terminal window:
var scraper = require('website-scraper');
scraper.scrape(options);
But your samples on the front page show this:
var scrape = require('website-scraper');
scrape(options);
So I had to call it as a method on an object, but your code shows calling it as a function. I saw a previous, now-closed issue that discussed this, but I pulled my hair out trying to call it like a function according to your docs. When I used the object syntax, it worked fine.
What am I missing here?
Using version: 0.3.3
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.16.6 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
In file-handlers, requestResource().then(loadResource) is used, but in the scraper itself, when downloading the original resources, loadResource is used, which calls requestResource inside it.
It is not obvious why in some cases the methods are used in the order request -> load, but in other cases vice versa.
Need to unify the approach to using these methods. loadResource should not call requestResource inside it.
I noticed when scraping https://dribbble.com/about that the CSS files and some JS files are saved improperly. Below is an example screenshot of a CSS file.
Any idea what is causing this? These files are pulled from Amazon's CloudFront service and are served over SSL.
Branch | Build failing 🚨 |
---|---|
Dependency | debug |
Current Version | 2.3.3 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As debug is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you itβs very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
The new version differs by 7 commits.

b82d4e6 release 2.4.0
41002f1 Update bower.json (#342)
e58d54b Node: configurable util.inspect() options (#327)
00f3046 Node: %O (big O) pretty-prints the object (#322)
bd9faa1 allow colours in workers (#335)
501521f Use same color for same namespace. (#338)
e2a1955 Revert "handle regex special characters"
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Cannot crawl any link from the YouTube mobile site; it returns a blank page with incorrect HTML.
Add a logger plus an option like logLevel (debug, warning, error, etc.).
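A minimal sketch of what a level-aware logger could look like (all names here are illustrative, not the project's API):

```javascript
// Log levels in increasing severity; messages below `logLevel` are dropped.
var LEVELS = { debug: 0, info: 1, warning: 2, error: 3 };

function createLogger(logLevel) {
  var threshold = LEVELS[logLevel];
  return function log(level, message) {
    if (LEVELS[level] < threshold) return null; // filtered out
    var line = '[' + level + '] ' + message;
    console.log(line);
    return line;
  };
}
```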
1. Resources without extensions are not saved.
Example:
<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'>
<img alt="" src="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=140&d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&r=g" srcset="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=280&d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&r=g 2x" class="avatar avatar-140 photo avatrb" width="140" height="140">
2. A config option to disable the enforcement of saving everything into one folder.
3. Emoji images are not parsed.
Example:
<img src="https://s.w.org/images/core/emoji/72x72/2764.png" alt="❤" class="emoji" draggable="false">
4. Add an option to specify a pause in seconds before parsing.
https://dev.opera.com/articles/responsive-images/#different-image-types-use-case
https://developers.google.com/speed/webp/gallery
Save .webp files to the images subdirectory by default; add sources for srcset.
From PR #92: when filenames are too long, the filename needs to be shortened, e.g. http://www.aplusi.com/ using Typekit.
Also need to check limitations related to the full pathname, not only the filename.
Wikipedia: Comparison of file systems
It would be very nice if you could add support for parsing a local HTML file (or a buffer with HTML data) saved from the web.
Example usage: I can make an HTML snapshot of a website using PhantomJS, SlimerJS, or TrifleJS, after script execution and after some actions. Then I pass that HTML to the scraper and tell it that the native URL is "http://example.com", so the scraper can resolve resources and download them.
I think it is a more flexible way to work around the limitations of headless browsers.
Hey, thank you for making this module! It's really awesome, and overcomes a lot of issues other tools like it seem to have.
I'm having an issue on some pages (ex: http://.com/index.html) where images with a relative path in the CSS file aren't getting downloaded.
Has this been seen before?
Trevor
Change
var scraper = require('website-scraper');
scraper.scrape(options);
to
var scrape = require('website-scraper');
scrape(options);
I was wondering if it is possible to not download anything on the server side and just forward the download directly to the requesting client.
Basically, it would pipe it as a response back to the client.
Branch | Build failing 🚨 |
---|---|
Dependency | eslint |
Current Version | 3.10.0 |
Type | devDependency |
This version is covered by your current version range and after updating it in your project the build failed.
As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools, preventing new deploys or publishes.
I recommend you give this issue a high priority. I'm sure you can resolve this 💪
The new version differs by 4 commits.

9cbfa0b 3.10.1
4bb6215 Build: package.json and changelog update for 3.10.1
8a0e92a Fix: handle try/catch correctly in no-return-await (fixes #7581) (#7582)
c4dd015 Fix: no-useless-return stack overflow on unreachable loops (fixes #7583) (#7584)
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Don't try to download resources like the following:
<a href="mailto:[email protected]?Subject=Hello%20again">Send mail!</a>
<a href="javascript:alert('Hello World!');">Execute JavaScript</a>
A generic URI is of the form: scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Need to filter by scheme, see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax
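A minimal scheme filter along those lines (a sketch; the http/https whitelist is an assumption):

```javascript
// Return true only for URLs the scraper should attempt to download:
// scheme-less (relative) references and http(s) URLs. Schemes like
// mailto: and javascript: are rejected.
function isDownloadableUrl(href) {
  var match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(href);
  if (!match) return true; // no scheme: a relative URL
  return ['http', 'https'].indexOf(match[1].toLowerCase()) !== -1;
}
```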