
html-encoding-sniffer's Introduction

Determine the Encoding of a HTML Byte Stream

This package implements the HTML Standard's encoding sniffing algorithm in all its glory. The most interesting part of this is how it pre-scans the first 1024 bytes in order to search for certain <meta charset>-related patterns.

const htmlEncodingSniffer = require("html-encoding-sniffer");
const fs = require("fs");

const htmlBytes = fs.readFileSync("./html-page.html");
const sniffedEncoding = htmlEncodingSniffer(htmlBytes);

The passed bytes are given as a Uint8Array; the Node.js Buffer subclass of Uint8Array will also work, as shown above.

The returned value will be a canonical encoding name (not a label). You might then combine this with the whatwg-encoding package to decode the result:

const whatwgEncoding = require("whatwg-encoding");
const htmlString = whatwgEncoding.decode(htmlBytes, sniffedEncoding);

Options

You can pass two potential options to htmlEncodingSniffer:

const sniffedEncoding = htmlEncodingSniffer(htmlBytes, {
  transportLayerEncodingLabel,
  defaultEncoding
});

These represent two possible inputs into the encoding sniffing algorithm:

  • transportLayerEncodingLabel is an encoding label that is obtained from the "transport layer" (probably a HTTP Content-Type header), which overrides everything but a BOM.
  • defaultEncoding is the ultimate fallback encoding used if no valid encoding is supplied by the transport layer, and no encoding is sniffed from the bytes. It defaults to "windows-1252", as recommended by the algorithm's table of suggested defaults for "All other locales" (including the en locale).
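
The precedence described above can be sketched roughly as follows. This is a minimal illustration, not the package's source: sniffBOM, naivePrescan, and pickEncoding are hypothetical names, the pre-scan is reduced to a regular expression over the first 1024 bytes (the real algorithm is a byte-level parser), and labels are returned as-is without the validation and name canonicalization the package performs.

```javascript
// Hypothetical helpers for illustration only; not the package's implementation.

// Returns the encoding indicated by a byte order mark, or null if there is none.
function sniffBOM(bytes) {
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) return "UTF-8";
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  return null;
}

// Crude stand-in for the pre-scan: look for charset=... inside a <meta> tag
// within the first 1024 bytes. Returns the raw label, or null.
function naivePrescan(bytes) {
  const head = String.fromCharCode(...bytes.subarray(0, 1024));
  const match = /<meta[^>]*charset\s*=\s*["']?([^"'\s;/>]+)/i.exec(head);
  return match ? match[1] : null;
}

// Precedence: BOM, then transport layer, then pre-scan, then the default.
function pickEncoding(
  bytes,
  { transportLayerEncodingLabel = null, defaultEncoding = "windows-1252" } = {}
) {
  return (
    sniffBOM(bytes) ??
    transportLayerEncodingLabel ??
    naivePrescan(bytes) ??
    defaultEncoding
  );
}

const bytes = new TextEncoder().encode('<meta charset="utf-8">');
console.log(pickEncoding(bytes)); // "utf-8" (raw label from the naive pre-scan)
console.log(pickEncoding(bytes, { transportLayerEncodingLabel: "iso-8859-2" })); // "iso-8859-2"
```

Note that the sketch only inspects a fixed-size prefix, so a declaration that appears later in the document would not be seen by it.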

Credits

This package was originally based on the excellent work of @nicolashenry, in jsdom. It has since been pulled out into this separate package.

html-encoding-sniffer's People

Contributors

amishne, domenic, openandclose


html-encoding-sniffer's Issues

Why does it get this page wrong?

wget https://ehoba.hatenablog.com/entry/2020/10/25/190724

const htmlEncodingSniffer = require("html-encoding-sniffer");
const fs = require("fs");

// wget saves the page under the last path segment of the URL.
const sourcepath = "190724";
const htmlBytes = fs.readFileSync(sourcepath);
const sniffedEncoding = htmlEncodingSniffer(htmlBytes);
console.log(sniffedEncoding); // "windows-1252"

The HTML source is kinda messy, but there is a <meta charset="utf-8"/> in there, and Firefox correctly renders the page as UTF-8.

Revamped package

Hi @domenic et al,

I was looking into using this package for cheerio, but had several issues that I wanted to fix first. As extending the current html-encoding-sniffer package turned out to be cumbersome, I opted to write a new module instead:

https://github.com/fb55/encoding-sniffer

This new package implements the current version of the encoding sniffing algo as a state machine. That allows streams to be supported without much effort. Features this supports, which aren't present in html-encoding-sniffer:

  • XML encoding types (UTF-16 prefixes and <?xml encoding="...">)
  • Configurable sniff depth (see #10)
  • x-user-defined in <meta> tags (turns out html-encoding-sniffer's support is broken)
  • Streams / sniffing of incomplete documents

I would love to join forces and have a single package that both jsdom and cheerio can use going forward. Let me know if this is something you'd be interested in!

Fail to parse short bogus comments

Even in the prescan, comments that trigger abrupt-closing-of-empty-comment should be treated as normal comments.

For <!-->, the spec states this rather explicitly:

"The two 0x2D bytes can be the same as those in the '<!--' sequence."

But I think this also applies to <!--->.

A failure test example, just in case.
openandclose@0370865

prescan:
https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
abrupt-closing-of-empty-comment:
https://html.spec.whatwg.org/multipage/parsing.html#parse-error-abrupt-closing-of-empty-comment
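
One way to read the quoted sentence is that the search for the closing "-->" may begin at the first "-" of the opening "<!--". The sketch below illustrates that reading under that assumption. skipComment is a hypothetical helper, not a function from the package.

```javascript
// Hypothetical helper for illustration; not the package's implementation.
// `start` must point at the "<" of a "<!--" sequence. Returns the index just
// past the closing ">" of the comment, or bytes.length if it is unterminated.
function skipComment(bytes, start) {
  // Begin at the first "-" of "<!--", so that the two 0x2D bytes before ">"
  // may be the same ones that opened the comment.
  for (let i = start + 2; i + 2 < bytes.length; i++) {
    if (bytes[i] === 0x2d && bytes[i + 1] === 0x2d && bytes[i + 2] === 0x3e) {
      return i + 3;
    }
  }
  return bytes.length;
}

const encode = (s) => new TextEncoder().encode(s);
console.log(skipComment(encode("<!-->x"), 0)); // 5
console.log(skipComment(encode("<!--->x"), 0)); // 6
console.log(skipComment(encode("<!-- a -->x"), 0)); // 10
```

With this starting offset, both <!--> and <!---> end at their first ">", so a <meta> tag that follows them stays visible to the rest of the scan.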

Fix bug: content attribute charset value, trailing whitespace

The following example should return 'ISO-8859-2'
(note the last whitespace before '"').

<meta http-equiv="Content-Type" content="text/html; charset=iso8859-2 ">

Instead, html-encoding-sniffer returns defaultEncoding,
because it ends up calling whatwgEncoding.labelToName with "l; charset=".

line 283:
let end = string.substring(position + 1).search(/\x09|\x0A|\x0C|\x0D|\x20|;/);

end is an index into the shortened string (the substring), not into string itself.
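
The effect can be reproduced in isolation. In this sketch the variable names and the way position is computed are assumptions for illustration. Only the search() expression mirrors the quoted line.

```javascript
// Illustration only: `position` is assumed to point at the first character
// of the charset value, as in the report above.
const content = "text/html; charset=iso8859-2 ";
const position = content.indexOf("=") + 1; // 19

// search() returns an index relative to the substring it was called on.
const relativeEnd = content
  .substring(position + 1)
  .search(/[\x09\x0A\x0C\x0D\x20;]/); // 8

// Buggy: using the relative index as if it were absolute. Because 8 < 19,
// substring() swaps its arguments and returns the text before the value.
const buggy = content.substring(position, relativeEnd);

// Fixed: add the offset of the substring back before slicing.
const fixedEnd =
  relativeEnd === -1 ? content.length : position + 1 + relativeEnd;
const fixed = content.substring(position, fixedEnd);

console.log(JSON.stringify(buggy)); // "l; charset="
console.log(JSON.stringify(fixed)); // "iso8859-2"
```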

Fix bug: content attribute, second charset index

The following example should return 'ISO-8859-2'.

<meta http-equiv="Content-Type" content="charsetcharset=iso-8859-2">

Instead, html-encoding-sniffer freezes.

line 242:
let subPosition = string.substring(position).search(/charset/i);

subPosition is an index into the shortened string, not into string itself
(from the second or third iteration on, it is always 1).
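
A loop that converts the relative index back to an absolute one always moves forward and therefore terminates. The following is a minimal sketch of that idea, assuming the surrounding logic looks for "charset" followed by optional whitespace and "=". findCharsetValueStart is a hypothetical name, not a function from the package.

```javascript
// Hypothetical helper for illustration; not the package's implementation.
// Returns the absolute index just past the first "charset" that is followed
// (after optional whitespace) by "=", or -1 if there is none.
function findCharsetValueStart(content) {
  let position = 0;
  while (true) {
    const relative = content.substring(position).search(/charset/i);
    if (relative === -1) return -1;
    // search() is relative to the substring, so add the offset back.
    position += relative + "charset".length;
    // Skip ASCII whitespace between "charset" and a possible "=".
    while (/[\x09\x0A\x0C\x0D\x20]/.test(content[position] ?? "")) position++;
    if (content[position] === "=") return position + 1;
    // Otherwise keep searching from the current position.
  }
}

const content = "charsetcharset=iso-8859-2";
const start = findCharsetValueStart(content);
console.log(start); // 15
console.log(content.slice(start)); // "iso-8859-2"
```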

Fix bug: second and invalid http-equiv

The following example should return null or the default encoding.
The spec uses an 'attribute list' to keep track of the attribute names it has already parsed, so the second http-equiv should be ignored.

<meta http-equiv="refresh" http-equiv="Content-Type" content="text/html; charset=iso8859-2">

Instead, html-encoding-sniffer returns 'ISO-8859-2'.
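
The attribute-list behaviour can be sketched as below. This is a simplified reading of the spec's steps, not the package's code: charsetFromMetaAttributes is a hypothetical helper, it takes already-parsed [name, value] pairs with lowercased names, it uses a simplified regular expression for the content value, and it returns the raw label without validating it.

```javascript
// Hypothetical helper for illustration; not the package's implementation.
// `attributes` is a list of [name, value] pairs in source order.
function charsetFromMetaAttributes(attributes) {
  const attributeList = new Set(); // names already processed
  let gotPragma = false;
  let needPragma = null;
  let charset = null;

  for (const [name, value] of attributes) {
    // A repeated attribute name is ignored; only the first occurrence counts.
    if (attributeList.has(name)) continue;
    attributeList.add(name);

    if (name === "http-equiv") {
      if (value.toLowerCase() === "content-type") gotPragma = true;
    } else if (name === "content") {
      const match = /charset\s*=\s*["']?([^"'\s;]+)/i.exec(value);
      if (match && charset === null) {
        charset = match[1];
        needPragma = true;
      }
    } else if (name === "charset") {
      charset = value;
      needPragma = false;
    }
  }

  if (needPragma === null) return null; // no charset information at all
  if (needPragma && !gotPragma) return null; // content without the pragma
  return charset;
}

console.log(
  charsetFromMetaAttributes([
    ["http-equiv", "refresh"],
    ["http-equiv", "Content-Type"],
    ["content", "text/html; charset=iso8859-2"],
  ])
); // null
```

Because the first http-equiv is "refresh", the pragma flag is never set and the charset found in content is discarded.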
