Comments (6)
The important question is: how would you want to do type recognition, apart from checking the extension? I think it is hard to implement a solution which covers all scenarios. To do that you would have to combine the following techniques:
- Recognize type based on extension.
- Recognize type based on where it occurs (e.g. link[rel="stylesheet"] => css).
- Recognize type based on magic file bytes.
I don't think it would be realistic to implement all that. It would make more sense for users to write a custom filenameGenerator to deal with their scenario.
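To make the three techniques above concrete, here is a minimal sketch of how they could be combined in plain Node.js. Everything in it is an assumption for illustration: the function names (`guessType`, `typeFromExtension`, `typeFromMagicBytes`), the lookup tables, and the shape of the input object are hypothetical and are not part of website-scraper's API.

```javascript
const path = require('path');

// Hypothetical lookup tables for this sketch; not taken from website-scraper.
const TYPE_BY_EXTENSION = new Map([
  ['.html', 'html'], ['.htm', 'html'], ['.css', 'css'], ['.js', 'script'],
  ['.png', 'image'], ['.jpg', 'image'], ['.gif', 'image'],
]);
const TYPE_BY_SELECTOR = new Map([
  ['link[rel="stylesheet"]', 'css'], ['script', 'script'], ['img', 'image'],
]);
// A few well-known file signatures (leading "magic" bytes).
const MAGIC = [
  { type: 'image', bytes: [0x89, 0x50, 0x4e, 0x47] }, // PNG
  { type: 'image', bytes: [0xff, 0xd8, 0xff] },       // JPEG
  { type: 'image', bytes: [0x47, 0x49, 0x46, 0x38] }, // GIF ("GIF8")
];

// 1. Type from the extension of the URL path (query string is ignored).
function typeFromExtension(url) {
  const ext = path.extname(new URL(url).pathname).toLowerCase();
  return TYPE_BY_EXTENSION.get(ext) || null;
}

// 3. Type from the first bytes of the downloaded body.
function typeFromMagicBytes(buffer) {
  const match = MAGIC.find(({ bytes }) => bytes.every((b, i) => buffer[i] === b));
  return match ? match.type : null;
}

// Try extension first, then the selector the URL was found in (2.), then magic bytes.
function guessType({ url, selector, body }) {
  return typeFromExtension(url)
    || (selector && TYPE_BY_SELECTOR.get(selector))
    || (body && typeFromMagicBytes(body))
    || null;
}
```

For example, `guessType({ url: 'http://example.com/style', selector: 'link[rel="stylesheet"]' })` falls through the extension check and resolves to `'css'` from the selector. Even this small sketch shows why full coverage is hard: text formats such as CSS have no magic bytes, so the third technique only helps for binary resources.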
from node-website-scraper.
Hi
Currently the scraper takes file names from the URL. The URL in the example has no extension, so the filename is generated without an extension. This is known behavior.
Possible ways to solve it:
- Add an option like addExtensions. If it is true, the scraper adds missing extensions for resources with known types (.html for html files and .css for css files).
- Get rid of sorting by extensions in subdirectories and sort by type instead, for example:
subdirectories: [
{directory: 'css', types: ['css']}
],
- Extend the existing sorting by extensions with types, so that resources can be sorted by extension, by type, or by both, for example:
subdirectories: [
{directory: 'images', types: ['image'], extensions: ['.jpg', '.png']}, // resource must satisfy both extension and type
{directory: 'css', types: ['css']}, // resource must satisfy only the type
{directory: 'scripts', extensions: ['.js']}, // resource must satisfy only the extension
],
Options 2 and 3 require additional types (image, script, font, etc.).
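The matching rule proposed in option 3 could look roughly like the sketch below. This is only an illustration of the proposal, not existing module behavior: `matchesRule`, `getSubdirectory`, and the `{type, extension}` resource shape are hypothetical names chosen for this example.

```javascript
// A rule matches when every condition it declares is satisfied.
// A rule that declares neither types nor extensions matches nothing.
function matchesRule(resource, rule) {
  const hasCondition = Boolean(rule.types || rule.extensions);
  const typeOk = !rule.types || rule.types.includes(resource.type);
  const extOk = !rule.extensions || rule.extensions.includes(resource.extension);
  return hasCondition && typeOk && extOk;
}

// Return the directory of the first matching rule, or null if none match.
function getSubdirectory(resource, subdirectories) {
  const rule = subdirectories.find((r) => matchesRule(resource, r));
  return rule ? rule.directory : null;
}

const subdirectories = [
  { directory: 'images', types: ['image'], extensions: ['.jpg', '.png'] },
  { directory: 'css', types: ['css'] },
  { directory: 'scripts', extensions: ['.js'] },
];
```

With these rules, a stylesheet served from an extensionless URL (`{type: 'css', extension: ''}`) would land in `css`, while a `.gif` image would match no rule, because the `images` rule requires both the type and one of the listed extensions.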
@marklagendijk can I ask for your advice on this, please?
from node-website-scraper.
Thanks for the answer
subdirectories: [
{directory: 'css', types: ['css']}
],
This code is not working. CSS is not stored in the CSS folder.
URL example: http://newtest.beauby.ru
My config:
subdirectories: [
{directory: 'img', extensions: ['.jpg', '.png', '.svg', '.gif']},
{directory: 'source', extensions: ['.jpg', '.png', '.svg', '.gif']},
{directory: 'js', extensions: ['.js']},
{directory: 'fonts', extensions: ['.ttf', '.woff', '.woff2', '.eot', '.otf']},
{directory: 'css', types: ['css']}
],
sources: [
{selector: 'img', attr: 'src'},
{selector: 'source', attr: 'srcset'},
{selector: 'img', attr: 'srcset'},
{selector: 'link[rel="stylesheet"]', attr: 'href'},
{selector: 'script', attr: 'src'}
],
Option 3 is better:
Extend the existing sorting by extensions with types, so that resources can be sorted by extension, by type, or by both.
Do you speak Russian? I see you are from Kyiv, and so am I. I think it would be easier in Russian.
Option 3 is the best, and even better would be something like {directory: 'css', [types: ['css'], extensions: ['.css']]}
There are also a few other points I would like to discuss.
from node-website-scraper.
Hi again
Sorry, it seems I didn't explain it properly. None of the suggestions above work yet. The module would have to be extended to support one of them.
I just wanted to ask @marklagendijk and @aivus to look at the possible solutions and discuss which one is better.
I suggest we continue using English on GitHub to keep it clear for everyone.
We can discuss anything you want in Russian on Gitter (https://gitter.im/s0ph1e/node-website-scraper) and then post a summary here.
from node-website-scraper.
The problem is in how files are distributed into folders. It is fine if files are saved without an extension, as long as they are sorted into folders by type.
from node-website-scraper.
Adding extensions for html and css files is implemented in #59.
It will be released in 1.0.0.
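For readers who want to see the idea, the sketch below shows one way missing extensions could be appended for known types. It is not the code from #59: `addMissingExtension` and the `EXTENSION_BY_TYPE` table are hypothetical names used only to illustrate the behavior described above.

```javascript
const path = require('path');

// Hypothetical mapping from resource type to default extension.
const EXTENSION_BY_TYPE = new Map([
  ['html', '.html'],
  ['css', '.css'],
]);

// Append the default extension only when the filename has none
// and the resource type is one we know an extension for.
function addMissingExtension(filename, type) {
  if (path.extname(filename) !== '') {
    return filename; // already has an extension, leave it alone
  }
  const ext = EXTENSION_BY_TYPE.get(type);
  return ext ? filename + ext : filename;
}
```

Under these assumptions, `addMissingExtension('style', 'css')` yields `style.css`, which an extension-based subdirectories rule such as `{directory: 'css', extensions: ['.css']}` could then match.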
from node-website-scraper.