mathiasbynens / he

A robust HTML entity encoder/decoder written in JavaScript.

Home Page: https://mths.be/he

License: MIT License

JavaScript 99.21% HTML 0.79%

html-entities javascript encode encoder decode decoder

he's Introduction

he

he (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and — contrary to many other JavaScript solutions — he handles astral Unicode symbols just fine. An online demo is available.

Installation

Via npm:

npm install he

Via Bower:

bower install he

Via Component:

component install mathiasbynens/he

In a browser:

<script src="he.js"></script>

In Node.js, io.js, Narwhal, and RingoJS:

var he = require('he');

In Rhino:

load('he.js');

Using an AMD loader like RequireJS:

require(
  {
    'paths': {
      'he': 'path/to/he'
    }
  },
  ['he'],
  function(he) {
    console.log(he);
  }
);

API

`he.version`

A string representing the semantic version number.

`he.encode(text, options)`

This function takes a string of text and encodes (by default) any symbols that aren’t printable ASCII symbols, as well as &, <, >, ", ', and `, replacing them with character references.

he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'
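To show what handling astral symbols involves, here is a minimal sketch that turns every symbol above U+007E into a hexadecimal escape. It is an illustration only, not he's implementation, and the name `toHexEscapes` is hypothetical. It does not escape &, <, and the other unsafe ASCII symbols.

```javascript
// Hypothetical helper for illustration; this is not he's implementation.
function toHexEscapes(text) {
  let output = '';
  // for...of iterates by code point, so a surrogate pair stays together.
  for (const symbol of text) {
    const codePoint = symbol.codePointAt(0);
    output += codePoint > 0x7E
      ? '&#x' + codePoint.toString(16).toUpperCase() + ';'
      : symbol;
  }
  return output;
}

console.log(toHexEscapes('foo © bar ≠ baz 𝌆 qux'));
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'
```

Iterating by UTF-16 code unit instead would split 𝌆 into two lone surrogates and produce two escapes, which is the kind of mistake the introduction alludes to.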

As long as the input string contains allowed code points only, the return value of this function is always valid HTML. Any (invalid) code points that cannot be represented using a character reference in the input are not encoded:

he.encode('foo \0 bar');
// → 'foo \0 bar'

However, enabling the strict option causes invalid code points to throw an exception. With strict enabled, he.encode either throws (if the input contains invalid code points) or returns a string of valid HTML.

The options object is optional. It recognizes the following properties:

`useNamedReferences`

The default value for the useNamedReferences option is false. This means that encode() will not use any named character references (e.g. &copy;) in the output; hexadecimal escapes (e.g. &#xA9;) will be used instead. Set it to true to enable the use of named references.

Note that if compatibility with older browsers is a concern, this option should remain disabled.

// Using the global default setting (defaults to `false`):
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly disallow named references:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'useNamedReferences': false
});
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly allow named references:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'useNamedReferences': true
});
// → 'foo &copy; bar &ne; baz &#x1D306; qux'

`decimal`

The default value for the decimal option is false. If the option is enabled, encode will generally use decimal escapes (e.g. &#169;) rather than hexadecimal escapes (e.g. &#xA9;). Apart from this replacement, the basic behavior remains the same when combined with other options. For example, if both the useNamedReferences and decimal options are enabled, named references (e.g. &copy;) are used over decimal escapes. Symbols without a named reference are encoded using decimal escapes.

// Using the global default setting (defaults to `false`):
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly disable decimal escapes:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'decimal': false
});
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly enable decimal escapes:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'decimal': true
});
// → 'foo &#169; bar &#8800; baz &#119558; qux'

// Passing an `options` object to `encode`, to explicitly allow named references and decimal escapes:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'useNamedReferences': true,
  'decimal': true
});
// → 'foo &copy; bar &ne; baz &#119558; qux'
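The two numeric forms name the same code point in different bases. As a rough illustration, the sketch below rewrites hexadecimal escapes as decimal ones. The helper name `hexEscapesToDecimal` is hypothetical; with he itself you would use the decimal option rather than post-processing the output.

```javascript
// Hypothetical helper; he offers this choice through its `decimal` option.
function hexEscapesToDecimal(html) {
  return html.replace(/&#x([0-9a-f]+);/gi, function (match, hex) {
    // parseInt with radix 16 reads the hexadecimal digits as a number.
    return '&#' + parseInt(hex, 16) + ';';
  });
}

console.log(hexEscapesToDecimal('foo &#xA9; bar &#x2260; baz &#x1D306; qux'));
// → 'foo &#169; bar &#8800; baz &#119558; qux'
```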

`encodeEverything`

The default value for the encodeEverything option is false. This means that encode() will not use any character references for printable ASCII symbols that don’t need escaping. Set it to true to encode every symbol in the input string. When set to true, this option takes precedence over allowUnsafeSymbols (i.e. setting the latter to true in such a case has no effect).

// Using the global default setting (defaults to `false`):
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly encode all symbols:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'encodeEverything': true
});
// → '&#x66;&#x6F;&#x6F;&#x20;&#xA9;&#x20;&#x62;&#x61;&#x72;&#x20;&#x2260;&#x20;&#x62;&#x61;&#x7A;&#x20;&#x1D306;&#x20;&#x71;&#x75;&#x78;'

// This setting can be combined with the `useNamedReferences` option:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'encodeEverything': true,
  'useNamedReferences': true
});
// → '&#x66;&#x6F;&#x6F;&#x20;&copy;&#x20;&#x62;&#x61;&#x72;&#x20;&ne;&#x20;&#x62;&#x61;&#x7A;&#x20;&#x1D306;&#x20;&#x71;&#x75;&#x78;'

`strict`

The default value for the strict option is false. This means that encode() will encode any HTML text content you feed it, even if it contains any symbols that cause parse errors. To throw an error when such invalid HTML is encountered, set the strict option to true. This option makes it possible to use he as part of HTML parsers and HTML validators.

// Using the global default setting (defaults to `false`, i.e. error-tolerant mode):
he.encode('\x01');
// → '&#x1;'

// Passing an `options` object to `encode`, to explicitly enable error-tolerant mode:
he.encode('\x01', {
  'strict': false
});
// → '&#x1;'

// Passing an `options` object to `encode`, to explicitly enable strict mode:
he.encode('\x01', {
  'strict': true
});
// → Parse error

`allowUnsafeSymbols`

The default value for the allowUnsafeSymbols option is false. This means that characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. When set to true, only non-ASCII characters will be encoded. If the encodeEverything option is set to true, this option will be ignored.

he.encode('foo © and & ampersand', {
  'allowUnsafeSymbols': true
});
// → 'foo &#xA9; and & ampersand'

Overriding default `encode` options globally

The global default setting can be overridden by modifying the he.encode.options object. This saves you from passing in an options object for every call to encode if you want to use the non-default setting.

// Read the global default setting:
he.encode.options.useNamedReferences;
// → `false` by default

// Override the global default setting:
he.encode.options.useNamedReferences = true;

// Using the global default setting, which is now `true`:
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &copy; bar &ne; baz &#x1D306; qux'

`he.decode(html, options)`

This function takes a string of HTML and decodes any named and numerical character references in it using the algorithm described in section 12.2.4.69 of the HTML spec.

he.decode('foo &copy; bar &ne; baz &#x1D306; qux');
// → 'foo © bar ≠ baz 𝌆 qux'

The options object is optional. It recognizes the following properties:

`isAttributeValue`

The default value for the isAttributeValue option is false. This means that decode() will decode the string as if it were used in a text context in an HTML document. HTML has different rules for parsing character references in attribute values — set this option to true to treat the input string as if it were used as an attribute value.

// Using the global default setting (defaults to `false`, i.e. HTML text context):
he.decode('foo&ampbar');
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly assume an HTML text context:
he.decode('foo&ampbar', {
  'isAttributeValue': false
});
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly assume an HTML attribute value context:
he.decode('foo&ampbar', {
  'isAttributeValue': true
});
// → 'foo&ampbar'

`strict`

The default value for the strict option is false. This means that decode() will decode any HTML text content you feed it, even if it contains any entities that cause parse errors. To throw an error when such invalid HTML is encountered, set the strict option to true. This option makes it possible to use he as part of HTML parsers and HTML validators.

// Using the global default setting (defaults to `false`, i.e. error-tolerant mode):
he.decode('foo&ampbar');
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly enable error-tolerant mode:
he.decode('foo&ampbar', {
  'strict': false
});
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly enable strict mode:
he.decode('foo&ampbar', {
  'strict': true
});
// → Parse error

Overriding default `decode` options globally

The global default settings for the decode function can be overridden by modifying the he.decode.options object. This saves you from passing in an options object for every call to decode if you want to use a non-default setting.

// Read the global default setting:
he.decode.options.isAttributeValue;
// → `false` by default

// Override the global default setting:
he.decode.options.isAttributeValue = true;

// Using the global default setting, which is now `true`:
he.decode('foo&ampbar');
// → 'foo&ampbar'

`he.escape(text)`

This function takes a string of text and escapes it for use in text contexts in XML or HTML documents. Only the following characters are escaped: &, <, >, ", ', and `.

he.escape('<img src=\'x\' onerror="prompt(1)">');
// → '&lt;img src=&#x27;x&#x27; onerror=&quot;prompt(1)&quot;&gt;'
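For readers who want to see the mechanics, the following sketch reproduces the documented behaviour for the six characters listed above. It is an illustrative re-creation, not he's source code, and the names `escapeMap` and `escapeSketch` are assumptions.

```javascript
// Illustrative re-creation of the documented behaviour; not he's source code.
const escapeMap = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
  '`': '&#x60;'
};

function escapeSketch(text) {
  // Replace each of the six unsafe characters with its character reference.
  return text.replace(/[&<>"'`]/g, function (symbol) {
    return escapeMap[symbol];
  });
}

console.log(escapeSketch('<img src=\'x\' onerror="prompt(1)">'));
// → '&lt;img src=&#x27;x&#x27; onerror=&quot;prompt(1)&quot;&gt;'
```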

`he.unescape(html, options)`

he.unescape is an alias for he.decode. It takes a string of HTML and decodes any named and numerical character references in it.

Using the `he` binary

To use the he binary in your shell, simply install he globally using npm:

npm install -g he

After that you will be able to encode/decode HTML entities from the command line:

$ he --encode 'föo ♥ bår 𝌆 baz'
f&#xF6;o &#x2665; b&#xE5;r &#x1D306; baz

$ he --encode --use-named-refs 'föo ♥ bår 𝌆 baz'
f&ouml;o &hearts; b&aring;r &#x1D306; baz

$ he --decode 'f&ouml;o &hearts; b&aring;r &#x1D306; baz'
föo ♥ bår 𝌆 baz

Read a local text file, encode it for use in an HTML text context, and save the result to a new file:

$ he --encode < foo.txt > foo-escaped.html

Or do the same with an online text file:

$ curl -sL "http://git.io/HnfEaw" | he --encode > escaped.html

Or, the opposite — read a local file containing a snippet of HTML in a text context, decode it back to plain text, and save the result to a new file:

$ he --decode < foo-escaped.html > foo.txt

Or do the same with an online HTML snippet:

$ curl -sL "http://git.io/HnfEaw" | he --decode > decoded.txt

See he --help for the full list of options.

Support

he has been tested in at least:

Chrome 27-50
Firefox 3-45
Safari 4-9
Opera 10-12, 15–37
IE 6–11
Edge
Narwhal 0.3.2
Node.js v0.10, v0.12, v4, v5
PhantomJS 1.9.0
Rhino 1.7RC4
RingoJS 0.8-0.11

Unit tests & code coverage

After cloning this repository, run npm install to install the dependencies needed for he development and testing. You may want to install Istanbul globally using npm install istanbul -g.

Once that’s done, you can run the unit tests in Node using npm test or node tests/tests.js. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use grunt test.

To generate the code coverage report, use grunt cover.

Acknowledgements

Thanks to Simon Pieters (@zcorpan) for the many suggestions.

Author


Mathias Bynens

License

he is available under the MIT license.

he's Issues

encode(): Add option to avoid named references (for better compatibility)

Suggested by @zcorpan:

13:47:03 <zcorpan> maybe it'd be useful to opt to avoid named references if one wants better compat with old browsers
13:47:40 <zcorpan> e.g. old IE doesn't support &apos;
13:47:56 <zcorpan> nor the 1000s of mathml entities

Incorrect error message for unknown named character references

A minor issue:

he.decode('&abc;', {strict: true}) throws error with this message: Parse error: named character reference was not terminated by a semicolon, when in fact neither a nor ab are valid legacy named character references and &abc; is terminated by ;. I think an error message to the effect of Parse error: named character reference is not spec-defined would be better in this case.

This and #50 notwithstanding, he has been a great companion to the HTML5 spec as I learn about and write a spec-compliant HTML entity decoder for Swift :)

Is there an option to escape html without double escaping if the html content has already been escaped?

Currently I'm just decoding before escaping, but I wonder if there's a faster, more elegant solution.
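One alternative to decoding first is to leave alone any ampersand that already starts something shaped like a character reference. The sketch below is a standalone approach, not a feature of he, and `escapeOnce` is a hypothetical name. It treats anything that merely looks like a reference (including unknown names such as &foo;) as already escaped, so it cannot tell when the literal text "&amp;" was meant to be displayed.

```javascript
// Hypothetical helper: escapes text but skips ampersands that already
// begin something shaped like a character reference.
function escapeOnce(text) {
  return text
    .replace(/&(?!(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);)/gi, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#x27;')
    .replace(/`/g, '&#x60;');
}

console.log(escapeOnce('a &amp; b & c <d>'));
// → 'a &amp; b &amp; c &lt;d&gt;'
```

Applying the function to its own output returns the same string, which is the property the question asks for.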

Thanks and great library!

HTML Entity Decoder should require trailing semicolon

Is it by design that '&ampersand' is converted to '&ersand'? If my intuition is correct, it should not be.

[he.encode] Avoid error if arg string is null

Hi, good library!
But I don't want to test whether my string is null before passing it to he.encode (lazy boy :)).
Strings coming from left-joined SQL queries are often null...

So I add, line 139, just after "var encode = function(string, options) { ... "
if(string === null)return '';
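The same guard can live in a small wrapper instead of a patched copy of the library. This is a sketch under the assumption that an empty string is the desired result for missing values. `encodeOrEmpty` is a hypothetical name, and the encoder is passed in so the wrapper does not depend on he's internals.

```javascript
// `encode` is whichever encoder you use, for example `he.encode`.
function encodeOrEmpty(encode, value, options) {
  // Treat missing values (null or undefined) as an empty string.
  if (value === null || value === undefined) {
    return '';
  }
  // Coerce everything else to a string before encoding.
  return encode(String(value), options);
}

// Usage, assuming he is installed:
// const he = require('he');
// encodeOrEmpty(he.encode, null); // returns ''
```

Because the value is coerced with String(), numbers and other non-string values are also handled.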

Alternatives in the browser?

Hey, I'm looking for alternatives to he for the browser, any recommendations? It's just for UX, I'd still be using he in the server.

Parser issue on IE after JS minification.

1.1.0 not published

I want to use the new decimal option, but it seems that he@1.1.0 is not published to npm yet.

Don’t escape printable ASCII symbols (except <>"'&)

Currently, some symbols like + get escaped (e.g. &plus;) simply because there exists a named character reference for them. However, it’s not really necessary to encode these symbols, since they’re printable ASCII already.

We should filter these out in data.js and make sure they don’t end up in encodeMap.

For strictly browser-side code, is there any reason to use this library in favour of hacks involving DOM elements' innerHTML and innerText properties?

Consider the following Stack Overflow answer to the question How to decode HTML entities using jQuery?

Just do:
var decoded = $('<textarea/>').html(encoded).val();
where encoded is your string containing HTML entities that you wish to decode.

This works similarly to the accepted answer, but is safe to use with untrusted user input.

As noted by Mike Samuel, doing this with a <div> instead of a <textarea> with untrusted user input is an XSS vulnerability, even if the <div> is never added to the DOM:
// Shows the alert in Firefox and Safari (and returns an empty string)
$("<div/>").html(
    '<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>'
).text()
However, this attack is not possible against a <textarea> because there are no HTML elements that are permitted content of a <textarea>. Consequently, any HTML tags still present in the 'encoded' string will be automatically entity-encoded by the browser.
// This is safe (and returns the right answer)
$("<textarea/>").html(
    '<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>'
).text()

Previously, the answer just included the first code snippet. I recently edited the answer to note the rationale behind using a textarea instead of a div. However, I'm a little uneasy, because I know that your library exists and is not (as far as I can tell) strictly targeting node users. I find myself wondering why.

I'll probably post a link to this library as an answer (unless you'd like to do so yourself) to that question regardless, since I figure that people who are using node may benefit from having a single solution that is usable both clientside and serverside. But how about everyone else? What reason is there for anyone to serve a 300 line script to serve a purpose that can - it seems to my naive eyes - be done in 50 characters with a clever hack?

Are there any situations at all in which the textarea hack fails (or at least is not guaranteed by spec to succeed)? I confess to being slightly uneasy about it since I don't know where (or for that matter, if) the spec determines the behaviour of browsers when presented with HTML elements containing disallowed children, like

<textarea>
    <p>I'm not really supposed to be here.</p>
</textarea>

but from the testing I've done, it seems to work.

Sorry to offload a question like this onto you, but it seems to be right in your area of expertise and is relevant when figuring out to whom this library is useful. (Indeed, if there is something profoundly wrong with the textarea hack, it almost seems worth noting that in this library's README - otherwise, the case for using a library for this purpose at all is unclear).

Why decode method doesn't do "%2c" "%20"
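Sequences such as %2c and %20 are URL percent-encoding rather than HTML character references, which is presumably why an HTML entity decoder leaves them untouched. JavaScript's built-in decodeURIComponent handles them, as this short example shows.

```javascript
// Percent-escapes are handled by the built-in URL functions,
// not by an HTML entity decoder.
const fromUrl = decodeURIComponent('a%2c%20b');

console.log(fromUrl);
// → 'a, b'
```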

Entities are decoded multiple times

First of all: Great project, you're taking an interesting approach.

Apparently, you have your own version of a bug I had with my entities module (I'm referring to fb55/entities#8 ):

Decoding '&#x26;amp;' will first decode the hexadecimal escape, and afterwards the named entity.

Even worse, &amp;amp; will also be decoded twice; you'll end up with &.

Improve `options` parsing

Currently either the options argument or the global he.{en,de}code.options object is used, but not both.

(This used to work until now in most cases because there was only a single option available for each he function (decode, encode).)

Make `&aaa;` a parse error

See whatwg/html#1257:

&aaa; or <p title="&ampersand;"> should be parse errors

Error with numerical strings

If the string passed into "encode()" is a number, there's an error occurring on line no. 193.

TypeError: string.replace is not a function
string = string.replace(regexEscape, hexEscape);

Edge case: does not decode example string on w3 spec

I was testing encode/decode via https://mothereff.in/html-entities while cross-referencing the spec, and I noticed that he is not able to decode certain named references correctly. On the w3 spec page, it lists this example string, I'm &notit; I tell you, which should be parsed into I'm ¬it; I tell you with a parse error. he returns the string un-parsed. It appears that he is not able to parse legacy named references if there are one or more alphanumeric characters after the legacy named reference followed by a semicolon ; character. he parses correctly if the tail of alphanumeric characters ends with a character other than semicolon.

decode(): Make sure decimal and hexadecimal escapes are within the Unicode range

he.decode('&#x9999999999999999;'); // '\uFFFD'

Add XHTML/XML option

This may not be worth it, but here goes…

E.g. &#x85; → U+0085 in XHTML, while in HTML it’s U+2026.

http://www.w3.org/TR/xml/#d0e3895

Entities for these symbols are allowed in XML: http://www.w3.org/TR/xml/#NT-Char

SyntaxError: expected expression, got '<'

I'm trying to use this library but every time I include it (WordPress Project) I get the following error in the console in both FF & Chrome:

SyntaxError: expected expression, got '<'
http://******************************/js/he.js?ver=0.5.0
Line 32

Cheers!

`he.decode()` performance compared to browser-based hack

Hello,

Thanks for all your hard work on the library and the awesome documentation!

I did a performance test recently between he.decode() and using this trick to use the browser's <textarea> element to do the conversion for me.

Surprisingly, I found that he.decode() was 2x slower for my string than using the browser's textarea. Here is the code I used to run my benchmarks. The <script> src at the top should be changed to point to your he.js script location:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <script src="/he.js"></script>
    <title></title>
</head>
<body>
<script>
    var txtArea = document.createElement("textarea");
    function decodeHtmlSameTxtArea(html) {
        txtArea.innerHTML = html;
        return txtArea.value;
    }

    function decodeHtml(html) {
        var txt = document.createElement("textarea");
        txt.innerHTML = html;
        return txt.value;
    }

    var count = 100;
    var stringToDecode = "hes a&#039;s a&#039;&#039;s a&#039;&#039;&#039;s a&#039;&#039;&#039;&#039;s b&quot;s b&quot;&quot;s b&quot;&quot;&quot;s b&quot;&quot;&quot;&quot;s \\ // &#039;&#039; ::&quot;&quot;&amp;*^ &lt; &gt; &lt;&lt; &gt;&gt;";



    var a = performance.now();
    for (var i = 0; i < count; i++) {
        var decodedString = decodeHtml(stringToDecode);
    }
    var b = performance.now();
    console.info("Time Taken (using new txtarea each time):", (b - a)/1000, 'seconds.');



    var a = performance.now();
    for (var i = 0; i < count; i++) {
        var decodedString = decodeHtmlSameTxtArea(stringToDecode);
    }
    var b = performance.now();
    console.info("Time Taken (using same txtarea):", (b - a)/1000, 'seconds.');


    var a = performance.now();
    for (var i = 0; i < count; i++) {
        var decodedString = window.he.decode(stringToDecode);
    }
    var b = performance.now();
    console.info("Time Taken (using HtmlEntities library function):", (b - a)/1000, 'seconds.');
</script>
</body>
</html>

Just wanted to point out this interesting comparison. It isn't really an issue so you can close this!

TypeError: html.replace is not a function.

When I tested the module with the following code:

// file: decode.js
const Transform = require('stream').Transform;
const he = require('he');

const parser = new Transform();
parser._transform = function(data, encoding, done) {
    this.push(he.decode(data));
    done();
};
process.stdin.pipe(parser).pipe(process.stdout);
process.stdout.on('error', process.exit);

and run it by decoding any text file, say, decoding its own:

$ cat decode.js | node decode.js

I got the following error:

./node_modules/he/he.js:232
return html.replace(regexDecode, function($0, $1, $2, $3, $4, $5, $6, $7) {
^
TypeError: html.replace is not a function
at Object.decode (./node_modules/he/he.js:232:15)

and if I changed html.replace to String(html).replace, it fixes the TypeError. Is it a valid bug/fix?
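The error is consistent with the stream handing Buffer chunks, rather than strings, to the transform. A sketch of a transform that converts each chunk to a string first is shown below. The names `chunkToText` and `createDecodeStream` are hypothetical, and the decoder is passed in as a parameter. The sketch ignores the case where a character reference or a multi-byte character is split across two chunks.

```javascript
const { Transform } = require('stream');

// Convert a stream chunk (Buffer or string) to a string before decoding.
function chunkToText(chunk) {
  return Buffer.isBuffer(chunk) ? chunk.toString('utf8') : String(chunk);
}

// `decode` is the decoder to apply, for example `he.decode`.
function createDecodeStream(decode) {
  return new Transform({
    transform(chunk, encoding, done) {
      // Pass the decoded text downstream via the callback.
      done(null, decode(chunkToText(chunk)));
    }
  });
}

// Usage, assuming he is installed:
// const he = require('he');
// process.stdin.pipe(createDecodeStream(he.decode)).pipe(process.stdout);
```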

Make scripts write data files + export the data

Like https://github.com/mathiasbynens/swapcase/blob/8ded201ff6456e72192dd39e6b3e8260ea6762db/scripts/swap-map.js#L46-L53 — it’s much nicer, and it avoids the additional process-data.js step by just making it part of export-data.js.

How to download and use in app

Hi. Thanks for this great library. I'm having a challenge downloading this library and using it directly (not via npm). I downloaded he.js from here but it seems to have other dependencies, as it errors on line 32 with Unexpected token:

var encodeMap = <%= encodeMap %>;

Those % tags look like they belong server side. How do you download this library directly? What am I doing wrong? Thanks.

encode: Add `{encode,escape}Everything` option

Implement functionality
Document in README
Add option to binary

What should happen when code points from the overrides table are encoded?

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides

Current behavior:

> he.encode('\x80')
'&#x80;'

&#x80; is an invalid character reference (parse error) but then again, using the raw U+0080 character is just as invalid. The difference is that U+0080 in HTML source gets ignored, while &#x80; becomes € due to the overrides table.

Should we continue to return invalid entities, knowing they might map to a completely different symbol? Or should we not escape any invalid code points in the input? Or should we strip invalid characters from the input?
Should there be a strict option for encode as well (just like there is for decode) which errors in case an invalid character is part of the source?

cc @zcorpan

Cache regular expressions

IE-only named character references

Posting these here for future reference. he is not going to support these until they’re standardized or supported in more than one browser.

Source: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-July/012235.html

&aafs;    U+206D  ACTIVATE ARABIC FORM SHAPING
&ass;     U+206B  ACTIVATE SYMMETRIC SWAPPING
&iafs;    U+206C  INHIBIT ARABIC FORM SHAPING
&iss;     U+206A  INHIBIT SYMMETRIC SWAPPING
&lre;     U+202A  LEFT-TO-RIGHT EMBEDDING
&lro;     U+202D  LEFT-TO-RIGHT OVERRIDE
&nads;    U+206E  NATIONAL DIGIT SHAPES
&nods;    U+206F  NOMINAL DIGIT SHAPES
&pdf;     U+202C  POP DIRECTIONAL FORMATTING
&rle;     U+202B  RIGHT-TO-LEFT EMBEDDING
&rlo;     U+202E  RIGHT-TO-LEFT OVERRIDE
&zwsp;    U+200B  ZERO WIDTH SPACE
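Anyone who needs these references in the meantime could handle them in a separate pass of their own. The sketch below is one way to do that, using the code points from the list above. The names `ieOnlyReferences` and `decodeIeOnlyReferences` are hypothetical and not part of he.

```javascript
// Code points taken from the list above; the names are hypothetical.
const ieOnlyReferences = {
  aafs: '\u206D', ass: '\u206B', iafs: '\u206C', iss: '\u206A',
  lre: '\u202A', lro: '\u202D', nads: '\u206E', nods: '\u206F',
  pdf: '\u202C', rle: '\u202B', rlo: '\u202E', zwsp: '\u200B'
};

function decodeIeOnlyReferences(html) {
  return html.replace(/&([a-z]+);/g, function (match, name) {
    // Leave every reference that is not in the IE-only list untouched.
    return Object.prototype.hasOwnProperty.call(ieOnlyReferences, name)
      ? ieOnlyReferences[name]
      : match;
  });
}
```

Running this pass before the regular decoder, rather than after it, avoids turning an escaped ampersand followed by one of these names (for example &amp;zwsp;) into the control character by accident.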

Uncaught TypeError: Cannot read property 'replace' of undefined

When I use this function I get this error...
he.js:202 Uncaught TypeError: Cannot read property 'replace' of undefined

Where is 'regexEscape' being initialized?

He.decode : keep some part encoded

Hi,

Using he.decode, I need to keep a part of the code encoded. Is it possible ?

For example, I do he.encode on :

<h1>Title</h1>
<pre>
<p>Code</p>
</pre>

So I get :

&lt;h1&gt;Title&lt;/h1&gt;
&lt;pre&gt;
&lt;p&gt;Code&lt;/p&gt;
&lt;/pre&gt;

And here is what I need to get doing a he.decode :

<h1>Title</h1>
<pre>
&lt;p&gt;Code&lt;/p&gt;
</pre>
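One way to get this result is to decode the text in pieces and skip whatever sits between the encoded pre tags. The sketch below assumes the tags appear exactly as &lt;pre&gt; and &lt;/pre&gt;, with no attributes and no nesting. `decodeExceptPre` is a hypothetical name, and the decoder is passed in as a parameter.

```javascript
// `decode` is the decoder to apply, for example `he.decode`.
function decodeExceptPre(decode, encoded) {
  const pattern = /(&lt;pre&gt;)([\s\S]*?)(&lt;\/pre&gt;)/g;
  let result = '';
  let last = 0;
  for (const match of encoded.matchAll(pattern)) {
    // Decode the text before the block and the two tags themselves,
    // but keep the block's content encoded.
    result += decode(encoded.slice(last, match.index));
    result += decode(match[1]) + match[2] + decode(match[3]);
    last = match.index + match[0].length;
  }
  // Decode whatever follows the last block.
  return result + decode(encoded.slice(last));
}

// Usage, assuming he is installed:
// const he = require('he');
// decodeExceptPre(he.decode, encodedText);
```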

Different results

I created a Grunt script to rename files.

In my object collection the original filename is:

0D0001E2-9AB0-C8D7-D1E8-4F264F3492E3/Adrichem - el Templo de Solom&#243;n.jpg

But when I use grunt-contrib-copy to read the src, the encoded result is:

 0D0001E2-9AB0-C8D7-D1E8-4F264F3492E3/Adrichem - el Templo de Solom&#x301;n.jpg

Why?

Thanks

This is the code:

cdl: {
  files: [{
    expand: true,
    dot: true,
    cwd: 'brain/files',
    src: '**/*.{jpg,JPG,png,PNG,gif,jpeg,webp,tiff,mp3,wav,avi,mp4}',
    dest: '<%= yeoman.app %>/pages',
    rename: function(dest, src) {
      var attachments = grunt.config.get('CDL.attachments'),
        newFilename;

      grunt.log.writeln(['filename:', he.encode(src)]);

      if (typeof attachments[src] !== 'undefined') {
        newFilename = attachments[src].guid + attachments[src].format;
        grunt.log.writeln([newFilename, src.split('/')[1], dest + src.replace(src.split('/')[1], newFilename)]);
        return dest + '/' + src.replace(src.split('/')[1], newFilename);
      }

      return dest + '/' + src;
    }
  }]
}
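One possible explanation, which the report does not confirm, is that the file system returns the name in decomposed Unicode form: a plain "o" followed by U+0301 COMBINING ACUTE ACCENT instead of the single code point U+00F3. If so, normalizing the string to NFC before encoding would give the expected escape. The example below only demonstrates the normalization step.

```javascript
// 'o' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
const decomposed = 'Solomo\u0301n';
// NFC composes the pair into the single code point U+00F3.
const composed = decomposed.normalize('NFC');

console.log(decomposed.length, composed.length);
// → 8 7
```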

Syntax error in both Firefox and Chrome

Try to use this at client side. Console throws up an error.

Firefox:
SyntaxError: expected expression, got '<'[Learn More] he.js:32:17

Chrome:
Uncaught SyntaxError: Unexpected token < h2.js:32

differences with underscore.string.escapeHtml

underscore.string has a function escapeHtml https://github.com/epeli/underscore.string#string-functions

jQuery also.

I guess he is more robust (as stated in the intro), but may we have facts ?

How about perfs ? http://jsperf.com/string-escapehtml

Ignore emojis

Is there a way to ignore emojis, like

he.encode('You\'re so young 😏', { ignoreEmoji: true })
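The ignoreEmoji option in the question is a request, not something the README documents. A rough workaround is to split the text on emoji and encode only the parts in between. The sketch below relies on the Unicode property Extended_Pictographic, which also matches some symbols such as ©, and it does not treat joiners or variation selectors specially. `encodeExceptEmoji` is a hypothetical name.

```javascript
// `encode` is the encoder for the non-emoji parts, for example `he.encode`.
function encodeExceptEmoji(encode, text, options) {
  // The capturing group makes split() keep each emoji at an odd index.
  return text
    .split(/(\p{Extended_Pictographic})/u)
    .map(function (part, index) {
      return index % 2 === 1 ? part : encode(part, options);
    })
    .join('');
}

// Usage, assuming he is installed:
// const he = require('he');
// encodeExceptEmoji(he.encode, 'You\'re so young 😏');
```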

module

Representing `escape()` as `htmlEncode()`

Your site implies an html encoder, when it is really just escaping characters, not html encoding.

https://mothereff.in/html-entities

Internet Explorer 11 + Edge 12-13-14 JavaScript parsing issues

Hello,

I'm using this library in production and get reports of JavaScript errors (in TrackJS.com) from IE & Edge browsers like:

Expected identifier, string or number
Unterminated string constant
Expected ':', '}' or ';'
illegal character

And when I check the error file / line, it's sometimes in he.js, like here :

And sometimes it's in other parts of the code (like in YUI library)... seems completely random, by the way :

The only idea I have would be that IE shits on himself when loading the very long JSON defined in he.js. Or maybe the fact that there's very long strings processed in regex ?

Here's the exact version of he.js I'm using in my project : he.js.zip
I've modified it a bit, trying to cut the very long lines into shorter lines. Which didn't fixed it.

Have you ever had any issue of this kind ?

Thanks a lot for your help.
gnutix

TypeError: html is undefined

He repo is da REAL MVP

Thanks so much for this repo!

That is all :)

`` leaks NULL character

As per spec, number should be parsed before mapping against the table, so  should be decoded just in the same way as  /  / ..., that is, replaced with \uFFFD.

Currently it instead returns actual "unsafe" \u0000 string.

Make `<a href="guitar?pedal=foo&amp=bar">x</a>` no longer be a parse error

…once whatwg/html#326 (comment) is resolved.

Invalid character in IE

In IE10, I get an error when trying to include he via a Browserify bundle.

The error is:

Invalid character
mergedAssets.min.8516d96.js (5,33406)

Which points to here in he:

"caret",Ç:"caron"

Time for a new release?

I see that there hasn't been a release since Aug 24, 2014.
Any reason for this and will there be a release soon?

Handle lone surrogates as per the spec + implement lookup table

From http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references:

Otherwise, if the number is in the range 0xD800 to 0xDFFF or is greater than 0x10FFFF, then this is a parse error. Return a U+FFFD REPLACEMENT CHARACTER.

Some examples:

he.decode('&#xD834;&#xDF06;') → '\uFFFD\uFFFD'
he.decode('&#xD800;') → '\uFFFD'

Also check out the table in the spec, e.g.:

he.decode('&#x0;') → '\uFFFD'
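The rule quoted above can be sketched as a function from a parsed number to the resulting symbol. This is an illustration of the spec text, not he's code. The name `numericReferenceToSymbol` is hypothetical, and only two rows of the spec's table are included as examples.

```javascript
// Two example rows from the spec's table; the full table has more entries.
const overrides = { 0x80: 0x20AC, 0x85: 0x2026 };

function numericReferenceToSymbol(codePoint) {
  // Zero, surrogates, and out-of-range numbers become U+FFFD.
  if (codePoint === 0 ||
      (codePoint >= 0xD800 && codePoint <= 0xDFFF) ||
      codePoint > 0x10FFFF) {
    return '\uFFFD';
  }
  if (Object.prototype.hasOwnProperty.call(overrides, codePoint)) {
    codePoint = overrides[codePoint];
  }
  return String.fromCodePoint(codePoint);
}

console.log(numericReferenceToSymbol(0xD800) === '\uFFFD');
// → true
```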

Reported by the amazing @zcorpan in #whatwg.

Error for Markup characters pass through when `allowUnsafeSymbols: true`

Hello,

I ran the test for "Markup characters pass through when allowUnsafeSymbols: true" on the demo site, but I got a value different from what is expected in the test:

he.encode('foo\xA9<bar\uD834\uDF06>baz\u2603"qux', { 'allowUnsafeSymbols': true })
// results
'foo&#xA9;&#x3C;bar&#x1D306;&#x3E;baz&#x2603;&#x22;qux' // actual
'foo&#xA9;<bar&#x1D306;>baz&#x2603;"qux' // expected

Add `strict` option to `decode` that throws on parse errors

This would make it possible to use he as part of a HTML parser/validator, for example.

he.decode(`&copy123`)

he.decode('&copy123');
// → '&copy123'

he.decode('&notin; &noti;');
// → '∉ &noti;'

It should be '∉ ¬i;'.

See regexCharacterReferencesThatHaveASemicolonFreeCharacterReferenceAsSubstring (lolol) in https://github.com/mathiasbynens/mothereff.in/blob/master/ampersands/eff.js.

decode(): Support the different parsing mode that is used in attribute values

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-attribute-value-state

If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is either a U+003D EQUALS SIGN character (=) or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a U+003D EQUALS SIGN character (=), then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
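The quoted rule boils down to a small predicate over two characters: the last character matched and the one that follows it. The sketch below expresses only that check. `shouldUnconsumeInAttribute` is a hypothetical name, not a function in he.

```javascript
// Returns true when, per the quoted rule, the matched characters must be
// unconsumed inside an attribute value.
function shouldUnconsumeInAttribute(lastMatchedChar, nextChar) {
  if (lastMatchedChar === ';') {
    return false;
  }
  return nextChar === '=' || /^[0-9A-Za-z]$/.test(nextChar);
}

// In 'foo&ampbar' the match '&amp' ends in 'p' and is followed by 'b'.
console.log(shouldUnconsumeInAttribute('p', 'b'));
// → true
```

This is why he.decode('foo&ampbar', { 'isAttributeValue': true }) leaves the input unchanged in the isAttributeValue example earlier in this document.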

This issue was brought to you by @zcorpan’s Quality Assurance Services™.

should \u2026 be encoding into hellip not mldr?

gentlemen,
\u2026 is being decoded correctly both from hellip and mldr named entities.
However, when encoding, \u2026 is encoded into mldr.
I traversed quite few Unicode reference pages and everywhere the default named entity is referenced as hellip.
Please consider switching the encoding to hellip named entity because it is easier to remember and matches majority of reference sources on the Internet.
thank you

v1.1.0 not available to bower

Can we have v1.1.0 exposed for Bower? Thanks!

streaming implementation

Hello,

Have you considered a streaming implementation of he.

Reading the code, it seems like there are a lot of expression types to check and the streamer would probably need to be code generated out of a description of all the entities.

From your experience on he, do you think that is feasible / that would make sense ?

Thanks
