haskell-xss-sanitize's Introduction

Summary

xss-sanitize allows you to accept html from untrusted sources by first filtering it through a white list. The white list filtering is fairly comprehensive, including support for css in style attributes, but there are limitations enumerated below.

Sanitizing allows a web application to safely use a rich text editor, allow html in comments, or otherwise display untrusted HTML.

If you trust the HTML (you wrote it), you do not need to use this. If you don't trust the html you probably also do not trust that the tags are balanced and should use the sanitizeBalance function.

Usage

xss-sanitize provides two functions in the module Text.HTML.SanitizeXSS:

  • sanitize - filters html to prevent XSS attacks.
  • sanitizeBalance - same as sanitize but makes sure there are no lone opening/closing tags - useful to protect against a user's html messing up your page
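As a rough illustration of the two entry points, the sketch below runs both over the same input. It assumes both functions have type Text -> Text, and the untrusted value is made up for the example; the exact rendered output is not shown because it depends on the library version.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.HTML.SanitizeXSS (sanitize, sanitizeBalance)

-- Hypothetical untrusted input: an event-handler attribute (not on the
-- white list) and a <b> tag that is never closed.
untrusted :: Text
untrusted = "<p onclick=\"steal()\">Hello <b>world</p>"

-- White-list filtering only.
cleaned :: Text
cleaned = sanitize untrusted

-- White-list filtering plus tag balancing.
cleanedAndBalanced :: Text
cleanedAndBalanced = sanitizeBalance untrusted
```

Loading this in GHCi and printing cleaned and cleanedAndBalanced shows the difference between the two functions on unbalanced input.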

Details

This is not escaping! Escaping html does prevent XSS attacks. Strings (that aren't meant to be HTML) should be HTML escaped to show up properly and to prevent XSS attacks. However, escaping will ruin the display of actual HTML.
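To make the contrast concrete, here is a minimal sketch of escaping that uses only base. escapeHtml is a hypothetical helper written for this example and is not part of xss-sanitize.

```haskell
-- Replace the characters that are significant in HTML with entities.
-- The result is safe to embed, but any markup in the input is shown
-- literally instead of being rendered.
escapeHtml :: String -> String
escapeHtml = concatMap escapeChar
  where
    escapeChar '&'  = "&amp;"
    escapeChar '<'  = "&lt;"
    escapeChar '>'  = "&gt;"
    escapeChar '"'  = "&quot;"
    escapeChar '\'' = "&#39;"
    escapeChar c    = [c]
```

Escaping "<b>hi</b>" this way yields "&lt;b&gt;hi&lt;/b&gt;", which a browser displays as the literal text <b>hi</b> instead of bold text. That is the right behavior for plain strings and the wrong one for rich text, which is where sanitizing comes in.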

This function removes any HTML tags or attributes that are not in its white-list. This may sound picky, but most HTML should make it through unchanged, making the process unnoticeable to the user but giving us safe HTML.

Integration

It is recommended to integrate this so that it is automatically used whenever an application receives untrusted html data (instead of before it is displayed). See the Yesod web framework as an example.
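One possible way to sanitize on receipt is a wrapper type whose only constructor function runs the sanitizer. The sketch below is an assumption about how an application might do this; SanitizedHtml and fromUntrusted are hypothetical names, and this is not how Yesod itself is implemented.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.HTML.SanitizeXSS (sanitizeBalance)

-- Hypothetical wrapper: a value of this type has always been sanitized,
-- provided the constructor is not exported from the defining module.
newtype SanitizedHtml = SanitizedHtml { unSanitizedHtml :: Text }
  deriving (Eq, Show)

-- Call this at the point where untrusted HTML enters the application
-- (form handler, API endpoint), before it is stored.
fromUntrusted :: Text -> SanitizedHtml
fromUntrusted = SanitizedHtml . sanitizeBalance
```

Storing and passing around SanitizedHtml instead of Text means the display code never has to remember to sanitize.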

Limitations

Lowercase

All tag names and attribute names are converted to lower case as a matter of convenience. If you have a use case where this is undesirable let me know.

Balancing - sanitizeBalance

The goal of this function is to prevent your html from breaking when (unknown) html with unbalanced tags is placed inside it. I would expect it to work very well in practice and don't see a downside to using it unless you have an alternative approach. However, this function does not at all guarantee valid html. In fact, it is likely that the result of balancing will still be invalid HTML. There is no guarantee for how a browser will display invalid HTML, so there is no guarantee that this function will protect your HTML from being broken by a user's html.

Other possible approaches would be to run the HTML through a library like libxml2, which understands HTML, or to first render the HTML in a hidden iframe or hidden div at the bottom of the page so that it is isolated, and then use JavaScript to insert it into the page where you want it.
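The general idea of balancing can be sketched with a stack of open tag names. This is a toy illustration over a made-up Token type using only base, not the library's implementation; in particular, giving a lone closing tag an empty opening partner is an assumption of the sketch.

```haskell
-- Toy token type for the sketch (the library works on TagSoup tags).
data Token = Open String | Close String | Txt String
  deriving (Eq, Show)

-- Walk the tokens while tracking the currently open tag names.
-- Tags still open at the end are closed; a closing tag that does not
-- match the innermost open tag gets an empty opening partner.
balance :: [Token] -> [Token]
balance = go []
  where
    go open [] = map Close open
    go open (Open n : rest) = Open n : go (n : open) rest
    go (o : open) (Close n : rest)
      | o == n = Close n : go open rest
    go open (Close n : rest) = Open n : Close n : go open rest
    go open (t : rest) = t : go open rest
```

Note that the result has matching open/close pairs but is not necessarily valid HTML: nothing here knows which elements may be nested inside which.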

TagSoup Parser

TagSoup is used to parse the HTML, and it does a good job. However TagSoup does not maintain all white space. TagSoup does not distinguish between the following cases:

<a href="foo">, <a href=foo>
<a   href>, <a href>
<a></a>, <a/>

In the third case, img and br tags will be output as a single self-closing tag. Other self-closing tags will be output as an open and closing pair. So <img /> or <img></img> converts to <img />, and <a></a> or <a/> converts to <a></a>. There are future updates to TagSoup planned so that TagSoup will be able to render tags exactly the same as they were parsed.
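The three cases listed above can be checked directly against TagSoup. This is a small sketch using parseTags from Text.HTML.TagSoup with its default options; parsesAlike and the three Bool names are made up for the example.

```haskell
import Text.HTML.TagSoup (Tag, parseTags)

-- True when TagSoup produces the same tag list for both inputs.
parsesAlike :: String -> String -> Bool
parsesAlike a b = (parseTags a :: [Tag String]) == parseTags b

-- One check per case from the list above.
quotingIgnored, spacingIgnored, selfClosingIgnored :: Bool
quotingIgnored     = parsesAlike "<a href=\"foo\">" "<a href=foo>"
spacingIgnored     = parsesAlike "<a   href>" "<a href>"
selfClosingIgnored = parsesAlike "<a></a>" "<a/>"
```

Because the parsed tag lists are identical, the sanitizer cannot know which spelling the input used, so the rendered output may differ textually from the input even when nothing was filtered.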

Security

Where is the white list from?

Ultimately this is where your security comes from. I would expect that even a faulty white list would act as a strong deterrent, but this library strives for correctness.

The source code of html5lib is the source of the white list and my implementation reference. If you feel a tag is missing from the white list, check to see if it has been added there.

If anyone knows of better sources or thinks a particular tag/attribute/value may be vulnerable, please let me know. HTML Purifier does have a more permissive and configurable (yet safe) white list if you are looking to add anything.

Where is the code from?

Original code was taken from John MacFarlane's Pandoc (with permission), but modified by Greg Weber to be faster and with parsing redone using TagSoup, and to use html5lib's white list. Michael Snoyman added the balanced tags functionality and released css-text specifically to help with css parsing. html5lib's sanitizer.py is used as a reference implementation, and most of the code should look the same. The css parsing is different: as mentioned we use a css parser, not regexes like html5lib.

style attribute

style attributes are now parsed with the css-text and attoparsec-text dependencies. They are then run through a white list for properties and keywords. Whitespace is not preserved. This code was again translated from sanitizer.py, but uses attoparsec instead of regexes. If you don't care about stripping css you can avoid the attoparsec dependency by using the older < 0.3 version of this library.

data attributes

data attributes are not on the white list. The href and style attributes are white listed, but their values must pass through a white list also. This is how the data attributes could work also.
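Attribute-level filtering can be tried on its own through the exposed sanitizeAttribute function. The sketch below assumes its type is (Text, Text) -> Maybe (Text, Text), returning Nothing for a rejected attribute; keepSafeAttributes and the candidates list are hypothetical names for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Maybe (mapMaybe)
import Data.Text (Text)
import Text.HTML.SanitizeXSS (sanitizeAttribute)

-- Keep only the (name, value) pairs that pass the white list.
keepSafeAttributes :: [(Text, Text)] -> [(Text, Text)]
keepSafeAttributes = mapMaybe sanitizeAttribute

-- An allowed attribute with an acceptable value, the same attribute
-- with a javascript: URI, and a data attribute.
candidates :: [(Text, Text)]
candidates =
  [ ("href", "http://example.com/")
  , ("href", "javascript:alert(1)")
  , ("data-id", "42")
  ]
```

Under these assumptions only the first pair should survive keepSafeAttributes candidates: the second has a white-listed name but a rejected value, and the third has a name that is not on the white list at all.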

svg and mathml

A mathml white list is fully implemented. There is some support for svg styling. There is a full white list for svg elements and attributes. However, some elements are not included because they need further filtering (just like the data attributes) and this has not been done yet.

haskell-xss-sanitize's People

Contributors

andreasabel, asayers, drmartingaswork, duairc, gregwebs, juhp, parthshah31, skyfold, snoyberg, sol, tarrasch, tysonzero, valpackett

haskell-xss-sanitize's Issues

MathML support needs improvement

This was discovered when using Gitit, see jgm/gitit#479.

Summary

Valid MathML tags are sanitized.

Steps To Reproduce

  1. Generate a simple MathML element with Pandoc.

    $ pandoc -f markdown -t html --mathml <<EOF
    \$x\$
    EOF
    <p><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>x</mi><annotation encoding="application/x-tex">x</annotation></semantics></math></p>
    
  2. Copy the MathML element and sanitize it.

Actual Results

<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi>x</math>

This MathML element is invalid.

Expected Results

<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mi>x</mi><annotation encoding="application/x-tex">x</annotation></semantics></math>

This is the unchanged output from Pandoc.

Additional Information

The semantics and annotation tags are valid MathML 3, see http://www.w3.org/TR/MathML/chapter5.html#mixing.semantic.annotations for more information.

Environment Information

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
$ ghc-pkg list gitit 
   gitit-0.10.6.1
$ ghc-pkg list pandoc
   pandoc-1.13.3
$ ghc-pkg list texmath
   texmath-0.8.0.1
$ ghc-pkg list xss-sanitize
   xss-sanitize-0.3.5.4

Export clearTags and explain safeTags is not enough

From the documentation:

You can insert your own custom filtering, but make sure you compose your filtering function with [safeTags] or [safeTagsCustom]

Really this should say:

You can insert your own custom filtering, but make sure you compose your filtering function with (safeTags . clearTags) or (safeTagsCustom . clearTagsCustom)

The problem is, without applying clearTags first, the input may not be sanitized:

Prelude Text.HTML.SanitizeXSS> filterTags safeTags "<iframe></iframe>"
""
Prelude Text.HTML.SanitizeXSS> filterTags safeTags "<script><iframe></iframe>"
"<iframe></iframe>"

This isn't an issue with sanitizeXSS because it composes safeTags with clearTags.

Prelude Text.HTML.SanitizeXSS> sanitizeXSS "<script><iframe></iframe>"
""

I'm happy to make a pull request to fix this.

xss-sanitize 0.3.7 causes downstream tests to fail

Here are snippets of what I observed on the stackage build server. I can try to provide better repro instructions upon request.

yesod-markdown-0.12.6.11

    Yesod.Markdown
      converts Markdown to sanitized HTML FAILED [1]
      converts Markdown to unsanitized HTML

    Failures:

      test/Spec.hs:33:9:
      1) Yesod.Markdown converts Markdown to sanitized HTML
           expected: "<h1 id=\"title\">Title</h1><ul><li>one</li><li>two</li><li>three</li></ul>\n  alert('xxs');\n"
            but got: "<h1 id=\"title\">Title</h1><ul><li>one</li><li>two</li><li>three</li></ul>"


markdown-0.1.17.4

      test/main.hs:230:26:
      1) html block xss
           expected: "alert(&quot;evil&quot;)"
            but got: ""

Issue with style details still being included

sanitizeBalance sometimes still includes the content within the style tag.

Here's a minimum reproducible example that shows the issue:

λ> sanitizeBalance "<!DOCTYPE><html><style>html{width:100%;max-width:100%}</style></html></DOCTYPE>"
"html{width:100%;max-width:100%}"

This should just return "" and a simple example like this works:

λ> sanitizeBalance "<style>html{width:100%}</style>"

Script and style tag contents should likely be cleared

Currently sanitizing script and style tags preserves the internal content as escaped html:

sanitize "<script>console.log('foo');</script>"
-- "console.log(&#39;foo&#39;);"

sanitize "<style>* { color: red }</style>"
-- "* { color: red }"

This is of course perfectly safe. However it seems very unlikely to be the desired resulting html.

Accordingly it seems like an explicit clear-list of html tags that should be emptied instead of escaped would be useful, with ["script", "style"] as the default list.

One could reasonably argue that <head> should make the list as well.

filterTags escapes HTML entities because of TagSoup's defaults

*Text.HTML.SanitizeXSS> filterTags safeTags "text&nbsp;more text"
"text&amp;nbsp;more text"

This would display as "text&nbsp;more text" instead of "text more text".

If you add optEscape = id to the renderOptions then TagSoup will stop trying to escape &"<>

*Text.HTML.SanitizeXSS> filterTags safeTags "text&nbsp;more text"
"text&nbsp;more text"

If you are ok with this fix, I'll create a pull request with my changes.

img attribute getting removed

I'm using the img srcset attribute to support high-resolution display images: https://webkit.org/demos/srcset/

<img src="image.jpg" srcset="image-1x.jpg 1x, image-2x.jpg 2x, image-3x.jpg 3x">

xss-sanitize strips this down to

<img src="image.jpg">

It just gets rid of the srcset. Why, and if srcset isn't really dangerous, can you consider whitelisting it?

Parser improperly handles optional tags under html5 rules

https://html.spec.whatwg.org/multipage/syntax.html#optional-tags

Minimal incorrect example:

-- current
sanitizeBalance "<td>foo<td>bar" == "<td>foo<td>bar</td></td>"

-- correct
sanitizeBalance "<td>foo<td>bar" == "<td>foo</td><td>bar</td>"

-- potentially an option depending on semantics of "balanced"
sanitizeBalance "<td>foo<td>bar" == "<td>foo<td>bar"

If html4 / xhtml5 sanitizing is also desired then separate functions/modules may be needed.

0.3.5.7 install failure on ubuntu

garbetsp@biostat1427:~/Projects$ cabal install xss-sanitize
Resolving dependencies...
Downloading xss-sanitize-0.3.5.7...
Failed to install xss-sanitize-0.3.5.7
Build log ( /home/garbetsp/.cabal/logs/xss-sanitize-0.3.5.7.log ):
cabal: /home/garbetsp/.cabal/logs/xss-sanitize-0.3.5.7.log: does not exist

I tried a direct download from git and this fails as well. It creates no log and aborts on install. Tried the earlier 0.3.5.6 release and got the same result.

Low-level API only for attributes?

The sanitizeAttribute function is exposed with the following comment:

low-level API if you have your own HTML parser

But safeTagName or sanitaryTags aren't exposed :-(

Function to get rid of invalid html

I am looking into rendering some user-submitted html, so unsurprisingly I'm planning on using this library to sanitize it.

However:

There is no guarantee for how a browser will display invalid HTML, so there is no guarantee that this function will protect your HTML from being broken by a user's html.

This makes me pretty nervous. Preventing invalid HTML from breaking the rest of the page is pretty essential.

How difficult would it be to add functionality that removes or fixes invalid html to avoid this problem? Or at least invalid html that causes problems in practice.

Are there any known examples of html that is problematic even after sanitization? I was hoping that a parent div with some overflow: hidden would be sufficient.
