website-scraper / node-website-scraper
Download website to local directory (including all css, images, js, etc.)
Home Page: https://www.npmjs.org/package/website-scraper
License: MIT License
Currently, extensions and HTML attributes are used to decide how resources are saved.
Consider using mime-types instead of the current approach, or in addition to it.
Example:
<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'>
This resource is CSS, but its URL has no extension, so it is not saved to the css directory when subdirectories are configured as:
subdirectories: [
  {directory: 'css', extensions: ['.css']}
],
sources: [
  {selector: 'link[rel="stylesheet"]', attr: 'href'},
],
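A minimal sketch of how a Content-Type based lookup could work. The mapping below is only illustrative (a real fix would probably use the mime-types package as suggested above):

```javascript
// Hypothetical helper: choose a subdirectory from the response
// Content-Type header instead of the URL's extension.
var TYPE_TO_DIR = {
  'text/css': 'css',
  'application/javascript': 'js',
  'image/png': 'images',
  'image/jpeg': 'images'
};

function subdirectoryFor(contentTypeHeader) {
  // Strip parameters such as "; charset=utf-8" before the lookup
  var type = (contentTypeHeader || '').split(';')[0].trim().toLowerCase();
  return TYPE_TO_DIR[type] || null;
}
```

With this, the Google Fonts stylesheet above would land in the css directory even though its URL has no .css extension.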
I wondered why this project uses Underscore instead of Lodash. It is the first time I have seen a Node.js module that uses Underscore; Lodash seems to be the default choice for nearly all projects.
There is a good reason to switch: Lodash has a lot of features that Underscore lacks.
While making my changes in this project, I sometimes wanted to use a Lodash function that Underscore doesn't have. This is unfortunate, because using such utility functions leads to more understandable code.
Would you accept a pull request that replaces Underscore with Lodash?
Need an option to choose what to do when an error occurs (currently the scraper process fails on the first error and the directory is removed).
Need to provide a way to ignore errors and continue scraping (maybe something like a stopOnError option) and maintain a list of errored resources.
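A rough sketch of what "continue on error" could look like, independent of the library internals (scrapeAll and download are hypothetical names, not the real API):

```javascript
// Hypothetical wrapper: download every URL, ignore individual failures,
// and collect the errored resources instead of aborting the whole run.
function scrapeAll(urls, download) {
  var errored = [];
  return Promise.all(urls.map(function (url) {
    return download(url).catch(function (err) {
      errored.push({ url: url, error: err.message });
      return null; // swallow the error and keep going
    });
  })).then(function (results) {
    return { results: results.filter(Boolean), errored: errored };
  });
}
```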
var scraper = require('website-scraper');
var options = {
  urls: ['http://nodejs.org/'],
  directory: __dirname + '/nodejs.org',
  recursive: true
};

// or with promise
scraper.scrape(options).then(function (result) {
  console.log(result);
});
This code causes Segmentation fault: 11.
I will dig into the code to see if any memory can be freed or written to a cache file to be retrieved later.
I think an important use case of this module would be to make a static copy of an entire website. For this use case the following functionality would be needed:
E.g.:
The scraper can be configured to ignore external websites.
This can be implemented by creating a 'urlFilter' option which takes a function. Before a request to a URL is made, this function is called with the URL (and possibly other info which could be useful). The URL is only requested if the urlFilter returns true.
The scraper can save the files in the same structure as the paths on the website.
This can be implemented by creating an 'outputPathGenerator' (or a better name) option which takes a function. If the user specifies this option, the output path is generated by passing the resource to this function.
Example
The following is an example of how you could make a scraper which saves an entire website, using the functionality described above.
var _ = require('lodash');
var pathUtil = require('path');
var urlUtil = require('url');

var websiteUrl = 'http://nodejs.org/';

var options = {
  urls: [websiteUrl],
  directory: '/path/to/save/',
  recursive: true,
  maxDepth: 10,
  urlFilter: function (url) {
    return _.startsWith(url, websiteUrl);
  },
  outputPathGenerator: function (resource, directory) {
    var urlObject = urlUtil.parse(resource.url);
    // TODO: add logic to check if it is an HTML page which does not end in '.html'
    // If so, append '/index.html'
    var relativePath = '.' + urlObject.path;
    return pathUtil.resolve(directory, relativePath);
  }
};
Hi,
srcset in the img tag is not parsed correctly.
Example:
<img src="http://example.com/prev4-45x45.jpg" srcset="http://example.com/prev4-150x150.jpg 150w, http://example.com/prev4-45x45.jpg 45w" sizes="(max-width: 45px) 100vw, 45px" width="45" height="45">
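For reference, a srcset value is a comma-separated list of URL + descriptor pairs, so each candidate URL can be pulled out like this (a minimal sketch, not the library's parser; it ignores edge cases such as commas inside URLs):

```javascript
// Split a srcset attribute into its candidate URLs, dropping the
// width/density descriptors ("150w", "2x", ...).
function parseSrcset(srcset) {
  return srcset.split(',').map(function (candidate) {
    return candidate.trim().split(/\s+/)[0];
  }).filter(Boolean);
}
```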
Please add an option to convert all other links from relative to absolute.
Thank you
Is there a way to throttle the crawler to make its requests more manageable?
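Until the library supports this directly, one way to throttle is a small concurrency limiter wrapped around whatever function issues the requests (a sketch; all names are illustrative):

```javascript
// Run at most `maxConcurrent` promise-returning jobs at a time;
// extra jobs wait in a queue until a slot frees up.
function createLimiter(maxConcurrent) {
  var active = 0;
  var queue = [];

  function next() {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    var job = queue.shift();
    job.fn().then(
      function (value) { active--; job.resolve(value); next(); },
      function (err) { active--; job.reject(err); next(); }
    );
  }

  return function limit(fn) {
    return new Promise(function (resolve, reject) {
      queue.push({ fn: fn, resolve: resolve, reject: reject });
      next();
    });
  };
}
```

Usage: `var limit = createLimiter(2); limit(function () { return fetchResource(url); });` — each call queues behind at most two in-flight requests.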
I love this! Please make it save cookies! Request can do this, but it would be a nice default option. By the way, it would also be nice if there was a way to make it get only one kind of file from a site and not the rest.
Branch | Build failing 🚨 |
---|---|
Dependency | eslint |
Current Version | 3.10.1 |
Type | devDependency |
This version is covered by your current version range and after updating it in your project the build failed.
As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools, preventing new deploys or publishes.
I recommend you give this issue a high priority. I'm sure you can resolve this 💪
The new version differs by 8 commits.

0840068 3.10.2
621ee0a Build: package.json and changelog update for 3.10.2
0643bfe Fix: correctly handle commented code in indent autofixer (fixes #7604) (#7606)
bd0514c Fix: syntax error after key-spacing autofix with comment (fixes #7603) (#7607)
f56c1ef Fix: indent crash on parenthesized global return values (fixes #7573) (#7596)
100c6e1 Docs: Fix example for curly "multi-or-nest" option (#7597)
6abb534 Docs: Update code of conduct link (#7599)
8302cdb Docs: Update no-tabs to match existing standards & improve readbility (#7590)
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
How do I set an option to download the site in UTF-8 encoding?
css-url-parser
Hey there! Thanks for making such a rad tool :)
I've found that if an image path is in a JavaScript file, the scraper doesn't download the image.
Here is an example URL: http://womenshealth.com-yourarticles.co/breaking-news/angelina.html
The comment images are not downloaded because the paths are loaded at runtime via a local JSON file (comments.js)
Would love to hear your thoughts, thanks!
Would it make sense for this project to support returning the URLs of the parsed assets, rather than physically downloading the files?
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.17.0 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Like the --convert-links option in wget.
Add options passed to the scrape function (see lib/config/defaults.js) for auth, setting the user-agent, cookies, etc.
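A sketch of what this could look like, assuming a `request` object that gets forwarded to the underlying HTTP client (the exact shape is an assumption; check lib/config/defaults.js for what is actually supported):

```javascript
// Assumed option shape: `request` is passed through to the HTTP client,
// so headers like User-Agent and Cookie can be set per scrape.
var options = {
  urls: ['http://example.com/'],
  directory: '/path/to/save/',
  request: {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; my-scraper)',
      'Cookie': 'session=abc123'
    }
  }
};
```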
Hey sOph1e,
This is an enhancement, not a bug. Also, I'm not completely sure it's in the scope of this project, but I wanted to see what you thought:
When pulling down a site that is hosted on https and has its links specified as '//' (which inherit https when hosted on https), it would be nice if '//' were replaced with the protocol 'https://'. In other words, just inherit the protocol for all links of the resource being downloaded.
As it stands, when someone downloads the site but not the resources, the resources are unavailable unless the user hosts the site on https. Of course, we can easily load the page up in cheerio and change this ourselves, but it may be best to incorporate it into the project. Either way, thank you for the hard work; it is appreciated all the way from San Diego, CA!
Thanks,
Trevor
I get the following error when scraping a site with an email address in the URL. This may just be a coincidence. I've renamed the domain and email address for data protection purposes.
Unhandled rejection Error: ENOENT: no such file or directory, open '/tmp/www.example.org/forums/users/[email protected]/
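One way to avoid filesystem errors from characters like '@' in URLs is to sanitize each path segment before writing (an illustrative sketch, not the library's current behavior; the email below is made up):

```javascript
// Replace characters that are unsafe in filenames with '_',
// segment by segment, and drop empty segments (e.g. a trailing '/').
function toSafePath(urlPath) {
  return urlPath.split('/').filter(Boolean).map(function (segment) {
    return segment.replace(/[^a-zA-Z0-9._-]/g, '_');
  }).join('/');
}
```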
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.17.1 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
How can I download a website that needs to have a login session?
Consider turning this project into a CLI, or adding functionality so it can be run like one.
Examples:
<svg>
<use xlink:href="sprite.svg#icon-1"></use>
</svg>
<svg width="4in" height="3in" version="1.1">
<image x="200" y="200" width="100px" height="100px"
xlink:href="myimage.png">
</image>
</svg>
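If the sources option supports attribute selectors for these elements, covering both cases above might look like the following (the exact selectors are an assumption to verify against how cheerio matches namespaced attributes):

```javascript
sources: [
  { selector: 'use', attr: 'xlink:href' },
  { selector: 'image', attr: 'xlink:href' }
]
```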
Notes:
assets were introduced in c88b8b9 in order to fix a recursive bug in utils.createOutputObject and keep backward compatibility for version 1.*.
In version 2.* it should be renamed back to children.
I am getting the error below:
========= ERROR ============
TypeError: Cannot read property 'getUrl' of null
at Scraper.loadResource (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:65:20)
at receivedResponse (D:\my_dir\migration\ooa\node_modules\website-scraper\lib\scraper.js:161:10)
at tryCatcher (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\util.js:16:23)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:510:31)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:691:18)
at Promise._fulfill (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:636:18)
at Promise._resolveCallback (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:453:14)
at Promise._settlePromiseFromHandler (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:522:17)
at Promise._settlePromise (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:567:18)
at Promise._settlePromise0 (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:612:10)
at Promise._settlePromises (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\promise.js:687:18)
at Async._drainQueue (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:133:16)
at Async._drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:143:10)
at Immediate.Async.drainQueues (D:\my_dir\migration\ooa\node_modules\bluebird\js\release\async.js:17:14)
============== CODE (of course website name changed) =============
var scrape = require('website-scraper');

scrape({
  urls: ['http://www.my-web-site.com/'],
  directory: './goo',
  recursive: true,
  maxDepth: 1
}).then(console.log).catch(console.log);
I had to use this code in my index.js to run scraper from the command in a terminal window:
var scraper = require('website-scraper');
scraper.scrape(options);
But your samples on the front page show this:
var scrape = require('website-scraper');
scrape(options);
So I had to call it as a method on an object, but your code shows calling it as a function. I saw a previous, now-closed issue that discussed this, but I pulled my hair out trying to call it like a function according to your docs. When I used the object syntax, it worked fine.
What am I missing here?
Using version: 0.3.3
Branch | Build failing 🚨 |
---|---|
Dependency | lodash |
Current Version | 4.16.6 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As lodash is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you it's very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
In file-handlers, requestResource().then(loadResource) is used, but in the scraper itself, when downloading the original resources, loadResource is used, which calls requestResource inside it.
It is not obvious why in some cases the methods are used in the order request -> load, but in other cases vice versa.
Need to unify the approach to using these methods. loadResource should not call requestResource inside it.
I noticed when scraping https://dribbble.com/about that the CSS files and some JS files are saved improperly. Below is an example screenshot of a CSS file.
Any idea what is causing this? These files are pulled from Amazon's CloudFront service and are served over SSL.
Branch | Build failing 🚨 |
---|---|
Dependency | debug |
Current Version | 2.3.3 |
Type | dependency |
This version is covered by your current version range and after updating it in your project the build failed.
As debug is a direct dependency of this project this is very likely breaking your project right now. If other packages depend on you itβs very likely also breaking them.
I recommend you give this issue a very high priority. I'm sure you can resolve this 💪
The new version differs by 7 commits.

b82d4e6 release 2.4.0
41002f1 Update bower.json (#342)
e58d54b Node: configurable util.inspect() options (#327)
00f3046 Node: %O (big O) pretty-prints the object (#322)
bd9faa1 allow colours in workers (#335)
501521f Use same color for same namespace. (#338)
e2a1955 Revert "handle regex special characters"
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Cannot crawl any link from the YouTube mobile site; it returns a blank page with incorrect HTML.
Add a logger plus an option like logLevel (debug, warning, error, etc.).
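A minimal sketch of what a level-aware logger could look like (all names here are illustrative, not the project's API):

```javascript
// Log levels in increasing severity; messages below `logLevel` are dropped.
var LEVELS = { debug: 0, info: 1, warning: 2, error: 3 };

function createLogger(logLevel) {
  var threshold = LEVELS[logLevel];
  return function log(level, message) {
    if (LEVELS[level] < threshold) return null; // filtered out
    var line = '[' + level + '] ' + message;
    console.log(line);
    return line;
  };
}
```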
1. Resources without extensions are not saved.
Example:
<link href='https://fonts.googleapis.com/css?family=Ubuntu:400,300,400italic,500,500italic,300italic,700,700italic&subset=latin,cyrillic' rel='stylesheet' type='text/css'>
<img alt="" src="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=140&d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&r=g" srcset="http://1.gravatar.com/avatar/4d63e4a045c7ff22accc33dc08442f86?s=280&d=%2Fwp-content%2Fuploads%2F2015%2F05%2FGood-JOb-150x150.jpg&r=g 2x" class="avatar avatar-140 photo avatrb" width="140" height="140">
2. A config option to disable the enforcement of saving everything into one folder.
3. Emoji images are not parsed.
Example:
<img src="https://s.w.org/images/core/emoji/72x72/2764.png" alt="❤" class="emoji" draggable="false">
4. Add an option to specify a pause in seconds before parsing.
https://dev.opera.com/articles/responsive-images/#different-image-types-use-case
https://developers.google.com/speed/webp/gallery
Save .webp files to the images subdirectory by default; add sources for srcset.
From PR #92: when filenames are too long, the filename needs to be shortened, e.g. http://www.aplusi.com/ using Typekit.
Also need to check limitations related to the full pathname, not only the filename.
Wikipedia: Comparison of file systems
It would be very nice if you could add support for parsing a local HTML file (or a buffer with HTML data) saved from the web.
Example usage: I can make an HTML snapshot of a website using PhantomJS, SlimerJS, or TrifleJS, after script execution and after some actions. Then I pass that HTML to the scraper and tell it that the native URL is "http://example.com", so the scraper can resolve resources and download them.
I think it is a more flexible way to work around the limitations of headless browsers.
Hey, thank you for making this module! It's really awesome, and overcomes a lot of issues other tools like it seem to have.
I'm having an issue on some pages (ex: http://.com/index.html) where images with a relative path in the CSS file aren't getting downloaded.
Has this been seen before?
Trevor
Change
var scraper = require('website-scraper');
scraper.scrape(options);
to
var scrape = require('website-scraper');
scrape(options);
I was wondering if it is possible to not download anything on the server side and just forward the download directly to the requesting client.
Basically, it would pipe it as a response back to the client.
Branch | Build failing 🚨 |
---|---|
Dependency | eslint |
Current Version | 3.10.0 |
Type | devDependency |
This version is covered by your current version range and after updating it in your project the build failed.
As eslint is "only" a devDependency of this project it might not break production or downstream projects, but "only" your build or test tools, preventing new deploys or publishes.
I recommend you give this issue a high priority. I'm sure you can resolve this 💪
The new version differs by 4 commits.

9cbfa0b 3.10.1
4bb6215 Build: package.json and changelog update for 3.10.1
8a0e92a Fix: handle try/catch correctly in no-return-await (fixes #7581) (#7582)
c4dd015 Fix: no-useless-return stack overflow on unreachable loops (fixes #7583) (#7584)
See the full diff.
There is a collection of frequently asked questions and of course you may always ask my humans.
Your Greenkeeper Bot 🌴
Don't try to download resources like the following:
<a href="mailto:[email protected]?Subject=Hello%20again">Send mail!</a>
<a href="javascript:alert('Hello World!');">Execute JavaScript</a>
A generic URI is of the form: scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Need to filter by scheme, see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax
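A minimal scheme filter along those lines (a sketch; the http/https whitelist is an assumption):

```javascript
// Return true only for URLs the scraper should attempt to download:
// scheme-less (relative) references and http(s) URLs. Schemes like
// mailto: and javascript: are rejected.
function isDownloadableUrl(href) {
  var match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(href);
  if (!match) return true; // no scheme: a relative URL
  return ['http', 'https'].indexOf(match[1].toLowerCase()) !== -1;
}
```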