Comments (6)

marklagendijk commented on May 20, 2024

The important question is: how would you want to do type recognition, apart from checking the extension? I think it is hard to implement a solution which covers all scenarios. To do that you would have to combine the following techniques:

  1. Recognize type based on extension.
  2. Recognize type based on where it occurs (e.g. link[rel="stylesheet"] => css).
  3. Recognize type based on magic file bytes.

I don't think it would be realistic to implement all of that. It would make more sense for users to write a custom filenameGenerator that deals with their scenario.
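To make the three techniques concrete, here is a minimal sketch of how they could be chained, trying the cheapest check first. Everything in it is an assumption for illustration: the guessType function, its argument shape, and the lookup tables are hypothetical and are not part of node-website-scraper.

```javascript
const path = require('path');

// Hypothetical lookup tables, for illustration only.
const TYPE_BY_EXTENSION = new Map([
  ['.css', 'css'], ['.js', 'script'], ['.html', 'html'],
  ['.png', 'image'], ['.jpg', 'image'], ['.gif', 'image']
]);
const TYPE_BY_SELECTOR = new Map([
  ['link[rel="stylesheet"]', 'css'], ['script', 'script'], ['img', 'image']
]);
const MAGIC_BYTES = [
  { type: 'image', bytes: [0x89, 0x50, 0x4e, 0x47] }, // PNG signature
  { type: 'image', bytes: [0xff, 0xd8, 0xff] },       // JPEG signature
  { type: 'image', bytes: [0x47, 0x49, 0x46, 0x38] }  // GIF signature ("GIF8")
];

// Returns a type name, or null when none of the techniques recognizes the resource.
function guessType({ url, selector, body }) {
  // 1. Extension of the URL path (the query string is ignored).
  const ext = path.extname(new URL(url).pathname).toLowerCase();
  if (TYPE_BY_EXTENSION.has(ext)) return TYPE_BY_EXTENSION.get(ext);

  // 2. Where the resource occurs in the page.
  if (selector && TYPE_BY_SELECTOR.has(selector)) return TYPE_BY_SELECTOR.get(selector);

  // 3. Magic bytes at the start of the downloaded body (a Buffer).
  if (body) {
    const match = MAGIC_BYTES.find(({ bytes }) => bytes.every((b, i) => body[i] === b));
    if (match) return match.type;
  }
  return null;
}
```

Even this small sketch needs three tables that must be kept up to date, which illustrates why covering all scenarios inside the module is hard.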

from node-website-scraper.

s0ph1e commented on May 20, 2024

Hi,
Currently the scraper takes filenames from the URL. The URL in the example has no extension, so the filename is generated without one. This is known behavior.

Possible ways to solve it:

  • Add an option such as addExtensions. If it is true, the scraper adds missing extensions for resources with known types (.html for html files and .css for css files).
  • Get rid of sorting by extension in subdirectories and sort by type instead, for example:
subdirectories: [
        {directory: 'css', types: ['css']}
],
  • Extend the existing sorting by extension with types, so that resources can be sorted by extension, by type, or by both, for example:
subdirectories: [
        {directory: 'images', types: ['image'], extensions: ['.jpg', '.png']},    // resource must satisfy both extension and type
        {directory: 'css', types: ['css']}, // resource must satisfy only the type
        {directory: 'scripts', extensions: ['.js']}, // resource must satisfy only the extensions
],

Options 2 and 3 require additional types (image, script, font, etc.).
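The matching rule behind the third option could look roughly like the sketch below. It assumes each resource exposes a type and an extension, and it picks the first rule whose constraints are all satisfied. The pickDirectory helper and the shape of the resource object are hypothetical; nothing like this exists in the module yet.

```javascript
// Hypothetical helper, not part of node-website-scraper.
// Returns the directory of the first rule the resource satisfies, or null.
function pickDirectory(subdirectories, resource) {
  const rule = subdirectories.find(({ types, extensions }) => {
    // A rule with neither constraint matches nothing.
    if (!types && !extensions) return false;
    // A missing constraint counts as satisfied; a present one must match.
    const typeOk = !types || types.includes(resource.type);
    const extOk = !extensions || extensions.includes(resource.extension);
    return typeOk && extOk;
  });
  return rule ? rule.directory : null;
}

// The rules from the third option above.
const subdirectories = [
  { directory: 'images', types: ['image'], extensions: ['.jpg', '.png'] },
  { directory: 'css', types: ['css'] },
  { directory: 'scripts', extensions: ['.js'] }
];

console.log(pickDirectory(subdirectories, { type: 'css', extension: '' }));
```

With these rules, a stylesheet whose URL has no extension still lands in the css directory, while an image with an extension that is not listed (for example .svg) matches no rule.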

@marklagendijk could I ask for your advice on this, please?

Grafs commented on May 20, 2024

Thanks for the answer.

subdirectories: [
        {directory: 'css', types: ['css']}
],

This config does not work: CSS files are not stored in the css folder.
URL example: http://newtest.beauby.ru
My config:

subdirectories: [
        {directory: 'img', extensions: ['.jpg', '.png', '.svg', '.gif']},
        {directory: 'source', extensions: ['.jpg', '.png', '.svg', '.gif']},
        {directory: 'js', extensions: ['.js']},
        {directory: 'fonts', extensions: ['.ttf', '.woff', '.woff2', '.eot', '.otf']},
        {directory: 'css', types: ['css']}
],
sources: [
        {selector: 'img', attr: 'src'},
        {selector: 'source', attr: 'srcset'},
        {selector: 'img', attr: 'srcset'},
        {selector: 'link[rel="stylesheet"]', attr: 'href'},
        {selector: 'script', attr: 'src'}
],

I think option 3 is better:

Extend existing sorting by extensions with types to provide ability for sorting by extension, by type or by both of them

Do you speak Russian? I see you are from Kyiv; I am from Kyiv too. I think it would be easier in Russian.
Option 3 is the best, and something like this would be even better: {directory: 'css', [types: ['css'], extensions: ['.css']]}

There are also a few other points I would like to discuss.

s0ph1e commented on May 20, 2024

Hi again,
Sorry, it seems I didn't explain it properly. None of the suggestions above works yet; the module would have to be extended to support one of them.
I just wanted to ask @marklagendijk and @aivus to look at the possible solutions and discuss which is better.

I suggest we continue using English on GitHub to keep the discussion clear for everyone.

We can discuss anything you want in Russian on Gitter (https://gitter.im/s0ph1e/node-website-scraper) and then post a summary here.

Grafs commented on May 20, 2024

The problem is with distributing files into folders. OK, the files can stay without extensions, but resources should still be sorted into folders by type.

s0ph1e commented on May 20, 2024

Adding extensions for html and css files is implemented in #59.
It will be released in 1.0.0.
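For readers who want the idea in code, the behavior described here (append a default extension when a resource of a known type has none) could be sketched as follows. This is not the code from #59; the addMissingExtension function and the mapping are hypothetical and cover only the two types named in this thread.

```javascript
const path = require('path');

// Assumed mapping, limited to the two types mentioned in the thread.
const DEFAULT_EXTENSION = new Map([['html', '.html'], ['css', '.css']]);

// Hypothetical helper: appends the default extension for a known type
// when the filename has none, and leaves every other filename untouched.
function addMissingExtension(filename, type) {
  if (path.extname(filename) !== '') return filename;
  const ext = DEFAULT_EXTENSION.get(type);
  return ext ? filename + ext : filename;
}
```

A filename that already has an extension is never changed, so existing extension-based subdirectory rules would keep working.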
