Comments (5)

phawxby commented on June 11, 2024

I think it's risky to assume a charset unless we know for sure what it is. What if it's ANSI, ASCII, etc.? We could break existing users. However, that then brings up a new issue: there are multiple ways to specify the charset of a file.

  1. On the response header, like we use now.
  2. On a meta-tag for html
  3. As a rule in CSS
  4. And although deprecated, as an attribute.

I think we have 2 options.

  1. Default to 'utf-8' for all text types, which could potentially break things for existing users, especially for those scraping older applications.
  2. Add some basic additional rules which we switch to based on the content type. I think 2 & 3 above are going to cover 99% of use cases. I think this is the best approach personally.

The best place to do this is here.

  1. If the encoding is still binary after checking the headers then get the mime type of the response.
  2. Add a switch based on mimetype for css/html.
  3. Add 2 new functions, getMimeFromHtml and getMimeFromCss.
    a. HTML: Use cheerio to parse the response body and see if you find a <meta charset="utf-8">
    b. CSS: The CSS spec is incredibly strict, you should just be able to do .includes('@charset "UTF-8"');
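
The steps above could be sketched roughly as follows. This is a minimal, dependency-free illustration, not the scraper's actual code: the names `getCharsetFromHtml`, `getCharsetFromCss` and `guessEncoding` are hypothetical, and a regular expression stands in for the cheerio parse suggested above so that the snippet stays self-contained.

```javascript
// Hypothetical helpers; names and structure are assumptions for illustration only.

// Finds <meta charset="..."> or the older
// <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
function getCharsetFromHtml(html) {
  const match = /<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// The @charset rule must be the very first thing in the stylesheet and
// must use exactly this form, so a strict anchored match is enough.
function getCharsetFromCss(css) {
  const match = /^@charset "([^"]+)";/.exec(css);
  return match ? match[1].toLowerCase() : null;
}

// Called only when the response headers did not provide a charset.
function guessEncoding(mimeType, body) {
  // Decode just the head of the body as latin1, which maps bytes 1:1 to
  // characters, so ASCII markup can be inspected before the real charset is known.
  const head = Buffer.from(body).subarray(0, 1024).toString('latin1');
  switch (mimeType) {
    case 'text/html':
      return getCharsetFromHtml(head) || 'binary';
    case 'text/css':
      return getCharsetFromCss(head) || 'binary';
    default:
      return 'binary';
  }
}
```

Falling back to `'binary'` keeps the current behaviour for anything that does not declare a charset. The returned label (for example `iso-8859-1`) is whatever the document declares, so it would still need to be mapped to an encoding name that the decoding step understands.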

I would take a stab at it, but I go on vacation in a few days and I'm intentionally not taking a laptop.

from node-website-scraper.

s0ph1e commented on June 11, 2024

Hi @Jeremytijal 👋

Thank you for reporting an issue.
Could you please check if that happens with the latest version, 5.2.0? I'll be able to test it myself later today or tomorrow.

Jeremytijal commented on June 11, 2024

Hi @s0ph1e,

I tested it with the latest version 5.2.0 and I have the same problem.

s0ph1e commented on June 11, 2024

Yep, now I see it. The fix works when the HTML response contains a charset in the Content-Type header. In this case there is no charset in the header, so the file is saved as binary, as it was in previous versions.
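
To illustrate the case described here, a header-only check might look something like the sketch below. The helper names `getCharsetFromContentType` and `pickEncoding` are hypothetical and are not taken from the scraper's code; the snippet only shows why a response without a charset parameter ends up as binary.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
function getCharsetFromContentType(contentType) {
  if (!contentType) return null;
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Assumed fallback behaviour: no charset in the header means binary.
function pickEncoding(contentType) {
  return getCharsetFromContentType(contentType) || 'binary';
}

console.log(pickEncoding('text/html; charset=UTF-8')); // utf-8
console.log(pickEncoding('text/html'));                // binary
```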

Looks like it makes sense to use utf8 for html files if there was no charset. @phawxby WDYT (as the author of the latest fix)?

s0ph1e commented on June 11, 2024

I like the idea of getting the encoding from html or css content, it sounds better than using utf8 as default and should not be difficult to do.

Most probably I will not have time to implement it in the near future, so I would really appreciate any help with it.

Thank you for your input and enjoy your vacation without laptop @phawxby 🏖️🌴
