
html-sanitizer's People

Contributors

fbastien, javiereguiluz, lctrs, martijnve, norkunas, olegatro, paragonie-security, snebes, sukant-kar, tgalopin, voku

html-sanitizer's Issues

Add support for u (unarticulated) tag

Although the u tag (as in "underline") was deprecated in HTML 4.01, it has been redefined in HTML 5 ("u" as in "unarticulated annotation"). I therefore think it should be included in the basic extension.

Is it possible to use this library without an extension?

Hello, I want to disallow all HTML tags by default.
I have read the docs but could not find an option to use this library without an extension (each extension allows a lot of tags by default).

Is it possible to allow only a specific list of tags? I only want to allow:
p, br, strong, i, u

Thanks.

Remove multiple <br> tags

Hi!

Thanks for this great package!

A lot of our HTML contains long runs of repeated <br> tags, and I would like to reduce each run to just one (or maybe two) line breaks. Is there any way to achieve this using this package?

Thanks!
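
One possible approach, independent of the sanitizer itself, is a small regular-expression pass over the HTML before or after sanitizing. The following is only a minimal sketch in plain PHP: collapseBreaks is a hypothetical helper name, not part of html-sanitizer, and it assumes the <br> tags carry no attributes.

```php
<?php
declare(strict_types=1);

/**
 * Collapse each run of consecutive <br> tags (optionally separated by
 * whitespace) down to at most $max tags.
 * Hypothetical helper for illustration; not part of html-sanitizer.
 */
function collapseBreaks(string $html, int $max = 1): string
{
    // Matches <br>, <br/> or <br /> followed by at least one more such tag.
    $pattern = '~<br\s*/?>(?:\s*<br\s*/?>)+~i';

    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($max): string {
            // Count how many <br> tags the run contains.
            $count = preg_match_all('~<br\s*/?>~i', $match[0]);

            return str_repeat('<br>', min($count, $max));
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

echo collapseBreaks("Hello<br><br />\n<br/>world"), PHP_EOL;
```

Passing 2 as the second argument keeps up to two breaks per run. Single <br> tags are left untouched.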

Add rel/target tags for a

Thanks for the great library! Given the intended use-case of passing in HTML from something untrusted and then being able to sanitize and display it securely, it would be really cool if we could get some options to force rel and target on a tags. While rel="nofollow" and target="_blank" are "nice to have", rel="noopener" is certainly important for security and would be great if we could force it.

Config options could look something like this:

$sanitizer = HtmlSanitizer\Sanitizer::create([
    'extensions' => ['basic', 'image', 'iframe'],
    'tags' => [
        'a' => [
            /*
             * If an array is provided, links targeting other hosts than one in this array
             * will be disabled (the `href` attribute will be blank). This can be useful if you want
             * to prevent links targeting external websites. Keep null to allow all hosts.
             * Any allowed domain also includes its subdomains.
             *
             * Example:
             *      'allowed_hosts' => ['trusted1.com', 'google.com'],
             */
            'allowed_hosts' => null,
            
            /*
             * If true, mailto links will be accepted.
             */
            'allow_mailto' => false,

            /*
             * Forces rel=nofollow in links.
             */
            'force_rel_nofollow' => false,

            /*
             * Forces rel=noopener in links.
             */
            'force_rel_noopener' => false,

            /*
             * Forces target=value unless set to false.
             */
            'force_target' => false,
        ],
...

I just quickly threw this issue together so if you would prefer something in the form of a PR I can look into doing that instead.

<u> tag not supported

Hi,
I am using html/sanitizer version 1.4.0 in a Symfony project, and the Sanitizer filters out <u> tags even though they are explicitly configured in the tags section, as in the configuration reference file.

      $sanitizer = Sanitizer::create([
          'extensions' => ['basic', 'code', 'image', 'list', 'table', 'details', 'extra'],
          'tags' => [
              'u' => [
                  'allowed_attributes' => [],
              ],
          ],
      ]);

I noticed there was a PR (#61) for adding <u> tags into the basic extension but it's not been released yet. Are tags only available from within extensions? So until that commit is released there will be no support for <u> tags?
Could someone please advise?

Thanks a lot!

Images src attribute removed

Hi,
I'm trying to use html-sanitizer to let users create articles in a blog-style application I'm building.
I can't figure out why the sanitizer is removing the src attribute from img tags.

This is the config I'm using:

$this->sanitizerConfig = [
    'extensions' => ['basic', 'code', 'image', 'list', 'table'],
    'tags' => [
        'a' => [
            'allowed_hosts' => null,
            'allow_mailto' => true,
        ],
        'img' => [
            'allowed_attributes' => ['src', 'alt', 'title', 'width', 'height'],
            'allowed_hosts' => null,
            'allow_data_uri' => true,
            'force_https' => false,
        ],
        'div' => ['allowed_attributes' => ['class']],
        'span' => ['allowed_attributes' => ['class']],
        'table' => ['allowed_attributes' => ['class']],
        'p' => ['allowed_attributes' => ['class']],
        'h1' => ['allowed_attributes' => ['class']],
        'h2' => ['allowed_attributes' => ['class']],
        'h3' => ['allowed_attributes' => ['class']],
        'h4' => ['allowed_attributes' => ['class']],
    ],
];

This is the HTML before sanitizing:

"<p><img src="/images/uploaded/articles/1b75dd06bf92c5e04e1491af441491fe9a7d7bab.png" alt="Test image" width="960" height="638" /></p>"

And this is what I get from the sanitize method:

"<p><img alt="Test image" width="960" height="638" /></p>"

Thanks for your help.

CI is not running anymore

With Travis dropping free CI for public repos, the CI is not running anymore.

I suggest migrating to GitHub Actions.

Comparison with HTMLPurifier

It would be great to have a comparison between this package and the HTMLPurifier library, which has been around for a long time (what the differences are in the features they support, etc.).

Remove span

Is there any way to remove the span tag when using the 'basic' extension?

relative URLs

Note: the Sanitizer does not allow relative URLs: they are always filtered out for security reasons.

Is there any way to disable this behaviour?

Out of memory on malformed string

Hello,

I'm getting out-of-memory errors when using HTML sanitizer (version 1.3.0) on a malformed string.

php.CRITICAL: Fatal Error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 352256 bytes) {"exception":"[object] (Symfony\\Component\\ErrorHandler\\Error\\OutOfMemoryError(code: 0): Error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 352256 bytes) at /srv/web/vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:1054)"

Here is a snippet to reproduce the issue:

The message comes from a chat which truncates messages when they are too long, leading to some invalid html content. (Fun fact, this message comes from production).

<?php
/** @var SanitizerInterface $sanitizer   */
$sanitizer->sanitize("<p>Apr\u00e8s s&#039;il y a un gros bug et que tout le monde en profite, mon avis l\u00e0 dessus peut changer. Mais normalement non, pas de reset pour les joueurs arriv\u00e9s avec la beta publique.<\/p>\n\n<p>Par contre certains \u00e9quilibrages changeront, c&#");

How to prevent HTML encoding (e.g. @ => &#64;)

Thank you for the library.

I just have one question that I could not find an answer to anywhere (so far). Is there any way to prevent the HTML encoding of characters (e.g. to stop < from becoming &lt;, > from becoming &gt;, @ from becoming &#64;, etc.)? I use the Symfony bundle (if it matters).

Thanks.

Create a static method to pass extensions and `$html`

In my project I currently need to purify two properties in different entities with the same extensions, so I created a simple helper.

class SanitizerHelper
{
    private const SANITIZER_WHITE_LIST = ['basic', 'code', 'image', 'list', 'table', 'iframe', 'extra'];

    public static function sanitize(string $html)
    {
        $sanitizer = Sanitizer::create(['extensions' => self::SANITIZER_WHITE_LIST]);

        return $sanitizer->sanitize($html);
    }
}

The idea is to add a static method that takes the extensions as the first argument and the content as the second argument:
Current

 $sanitizer = Sanitizer::create(['extensions' => self::SANITIZER_WHITE_LIST]);
 $safeHtml = $sanitizer->sanitize($untrustedHtml);

After

$safeHtml = HtmlSanitizer\Sanitizer::sanitize(
    ['extensions' => ['basic', 'code', 'image', 'list', 'table', 'iframe', 'details', 'extra']],
    $untrustedHtml
);

WDYT ?

Extension allowing details/summary

It would be great to support the <details> and <summary> tags in an extension.

GitHub, for instance, allows them in its Markdown to create collapsible regions.

How can I allow "target=_blank" ?

Hi @tgalopin & all

This package is extremely useful, thanks for all the hard work.
I am currently using it on a site where I need to be able to set a target for certain links. Is there a simple way to configure this?

I'm already using an extension provided by @olegatro to allow relative URIs. I thought perhaps it would be possible to modify that extension?

Thanks for any help,
PhR

How to allow html5 data attribute for tags

Hey guys,
I have two questions:

  1. How can we configure tags to allow HTML5 data attributes?
    I know we can set them one by one, but I need a way to allow all attributes beginning with data-* (maybe a regex-based way of doing it exists).
  2. Is there a way to set global attributes that are allowed on all tags?
    Thanks
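
To illustrate what such a wildcard rule could look like, here is a minimal sketch in plain PHP. isAttributeAllowed and the trailing "*" convention are assumptions made up for this example; they are not an existing html-sanitizer option.

```php
<?php
declare(strict_types=1);

/**
 * Checks an attribute name against a whitelist whose entries are either
 * exact names ("class") or prefixes ending in "*" ("data-*").
 * Hypothetical helper for illustration; not part of html-sanitizer.
 *
 * @param string[] $allowed
 */
function isAttributeAllowed(string $name, array $allowed): bool
{
    $name = strtolower($name);

    foreach ($allowed as $pattern) {
        $pattern = strtolower($pattern);

        if (substr($pattern, -1) === '*') {
            $prefix = rtrim($pattern, '*');

            // A bare "*" is ignored on purpose: it would allow every attribute.
            // The name must also be longer than the prefix ("data-" alone fails).
            if ($prefix !== ''
                && strlen($name) > strlen($prefix)
                && strncmp($name, $prefix, strlen($prefix)) === 0
            ) {
                return true;
            }
        } elseif ($name === $pattern) {
            return true;
        }
    }

    return false;
}

var_dump(isAttributeAllowed('data-user-id', ['class', 'data-*'])); // bool(true)
var_dump(isAttributeAllowed('onclick', ['class', 'data-*']));      // bool(false)
```

The same list could be shared across all tags, which would cover the second question as well.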

Should empty tags be removed

If we provide the following HTML:

<p>Hello</p>
<img src="javascript:evil();" onload="evil();" />

Then pass it through html-sanitizer we get:

<p>Hello</p>
<img />

Should the <img> tag be taken out entirely now it has no attributes left?

HtmlSanitizer returns empty string

The sanitizer parses the input string and returns an empty string:

$sanitizerHtml = HtmlSanitizer\Sanitizer::create([ 'extensions' => ['basic']]);
var_dump($sanitizerHtml->sanitize("bold")); //returns string(0) ""

Using PHP 7.4.27.

PHP Notice: "Undefined offset: 2" if allowed_hosts contains subdomains

Hi,

I recently got the following error:

Notice: Undefined offset: 2

here:

if ($uriParts[$key] !== $trustedPart) {

I got this error after adding a subdomain such as sub.example.org to allowed_hosts. When a URL like https://example.org in an href attribute is sanitized, the function is then called with the following parameters:

$uriParts = ['org', 'example']; 
$trustedParts = ['org', 'example', 'sub'];

Is a subdomain not allowed in allowed_hosts or is this a bug?
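
For illustration, the comparison can be guarded with a length check so that a host with fewer labels than the trusted host is rejected before any index is read. This is a minimal sketch in plain PHP, not the library's actual code; hostMatches is a hypothetical name, and only the $uriParts/$trustedParts naming follows the snippet above.

```php
<?php
declare(strict_types=1);

/**
 * Returns true if $host equals $trustedHost or is a subdomain of it.
 * Hypothetical helper illustrating the bounds check; not html-sanitizer code.
 */
function hostMatches(string $host, string $trustedHost): bool
{
    // Compare labels from the right: "sub.example.org" => ['org', 'example', 'sub'].
    $uriParts = array_reverse(explode('.', strtolower($host)));
    $trustedParts = array_reverse(explode('.', strtolower($trustedHost)));

    // A host with fewer labels than the trusted host can never match it,
    // and returning early avoids reading an undefined offset below.
    if (count($uriParts) < count($trustedParts)) {
        return false;
    }

    foreach ($trustedParts as $key => $trustedPart) {
        if ($uriParts[$key] !== $trustedPart) {
            return false;
        }
    }

    return true;
}

var_dump(hostMatches('example.org', 'sub.example.org'));       // bool(false)
var_dump(hostMatches('a.sub.example.org', 'sub.example.org')); // bool(true)
```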

Unable to use case-sensitive attributes or keep empty-string attribute values (="")

Concern 01

Is it possible to allow case-sensitive attributes with this package, for example through a config setting?

I have HTML like the example below, and the sanitizer removes case-sensitive attributes even if I add the attribute named categoryType under allowed_attributes for the div tag.

Ex:

<div class="custom" categoryType="books"></div>

For the HTML above, the sanitizer returns:

<div class="custom"></div>

I debugged the library and found that it identifies the attribute as "categorytype" (all letters in lower case). Could we have case-sensitive attributes?

Concern 02

The sanitizer removes the empty-string value of an attribute (=""), as shown below.

Ex:

<a rel="" href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>

Sanitizer returns

<a rel href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>

Could we keep the empty string as the attribute value? Maybe this could also be enabled with a config setting.
<a rel="" href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>

Case-Sensitive Custom Tags

Hello,

Thank you for the amazing library. I had been looking for this kind of library for a long time, and I think I have finally found it. However, I run into a problem when I try to create my own tags. I'm trying to sanitize custom, non-HTML React components coming from the front end, but I couldn't get the library to accept a fake DOM node like <User id="5" />. So I need to define a PascalCase tag to stay compatible with my React nodes.

Is there any way to define case-sensitive custom tags? By the way, a workaround would be very welcome so that I don't have to wait for the next release.

Example Code

class UserNode extends AbstractTagNode
{
  use IsChildlessTrait;

  public function getTagName(): string
  {
    return 'User';
  }
}

class UserNodeVisitor extends AbstractNodeVisitor implements NamedNodeVisitorInterface
{
  use IsChildlessTagVisitorTrait;

  protected function getDomNodeName(): string
  {
    return 'User';
  }

  public function getDefaultAllowedAttributes(): array
  {
    return ['type', 'width', 'height'];
  }

  public function getDefaultConfiguration(): array
  {
    return ['custom_config' => NULL];
  }

  protected function createNode(\DOMNode $domNode, Cursor $cursor): NodeInterface
  {
    return new UserNode($cursor->node);
  }
}

class CustomExtension implements ExtensionInterface
{
  public function getName(): string
  {
    return 'custom';
  }

  public function createNodeVisitors(array $config = []): array
  {
    return [
      'User' => new UserNodeVisitor($config['tags']['User'] ?? []),
    ];
  }
}
$fakeHtml = <<<HTML
<div>
asd
<p>hmm</p> // p
<User type="test" /> // User type="test"
<user type="test" /> // user type="test"
<test /> // test
## Hmmm
** Yeah? **
<br> <br /> // br
<hr> <hr /> // hr
</div>
HTML;

$builder = new \HtmlSanitizer\SanitizerBuilder();
$builder->registerExtension(new CustomExtension());
$sanitizer = $builder->build(['extensions' => ['custom']]);

$safeHtml = $sanitizer->sanitize($fakeHtml);

echo trim($safeHtml);

Thanks,
Mustafa.

How to Sanitize with Unknown incoming HTML tags [Can't whitelist]?

Hi @tgalopin ,

Thanks for this great Sanitizer!

We are in a situation where we can't whitelist tags. We are using CKEditor 5, with which our users can generate any kind of content, and that content may contain a variety of custom tags. For example, have a look at this:

<p><math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac><mn>2</mn><mn>3</mn></mfrac><mo>+</mo><msqrt><mn>43</mn>
<mo>/</mo><mn>34</mn><mo></mo>
<mi mathvariant="normal">π</mi><mo></mo></msqrt></math>&nbsp;<math class="wrs_chemistry" xmlns="http://www.w3.org/1998/Math/MathML">
<msubsup><mi>mol</mi><mmultiscripts><msup><mmultiscripts><maction actiontype="argument">
<mrow>&nbsp;</mrow></maction><mprescripts>&nbsp;</mprescripts>
<mroot><msqrt>&nbsp;</msqrt><mrow>&nbsp;</mrow></mroot>
<none>&nbsp;</none></mmultiscripts>
<mrow>&nbsp;</mrow></msup><mrow>&nbsp;</mrow><none>&nbsp;</none><mprescripts>&nbsp;</mprescripts><mrow>&nbsp;</mrow><mrow>&nbsp;</mrow></mmultiscripts><mrow>&nbsp;</mrow></msubsup>
</math></p><p>&nbsp;</p><pre><code class="language-php">&lt;?php
echo "Hello world";
?&gt;</code></pre>

Now, I want to store this whole text in my DB, but I don't want any incoming XSS/SQLi scripts or tags. How can this be done? I went through the internal codebase of this project, and it seems I can add my own tags or introduce my own custom extension, but I would still need to whitelist tags and attributes. How can this be done without whitelisting such tags?

Add an `all` extension to allow all tags

Thank you @tgalopin for this great work!
I have a small idea: it could be helpful to have an all extension that allows all tags, instead of having to list every extension.
WDYT?

$sanitizer = Sanitizer::create(['extensions' => ['all']]);
would be a shortcut for
$sanitizer = Sanitizer::create(['extensions' => ['basic', 'code', 'image', 'list', 'table', 'iframe', 'extra']]);

Advanced filters / transforming nested node structures

Hi, thanks for developing this package. I'm currently working on a comments and webmention plugin and explored different possibilities for sanitizing HTML with PHP. After evaluating all available options, your project seems to be a great fit!

In my use case, I need not only to sanitize HTML but also to apply some aggressive filtering, e.g. removing all class attributes, to ensure that the HTML of comments does not interfere with the page styles of a blog. I would also like to add a few custom cleanup routines.

The most important things are:

  1. Make sure that every top-level node in the given HTML fragment is wrapped in a <p>, <blockquote> or other allowed top-level element.
  2. <a> elements without an href attribute should be removed.
  3. A <br> at the beginning of an inline element should be moved before the start of that element; otherwise it would break the formatting of external links, which get an icon prepended. <br> elements in the middle of the link text should be preserved. The same goes for <br> elements at the end of an inline element.
  4. <h1>Headline</h1> etc. should be transformed into something like <p><strong>Headline</strong></p> to keep some kind of formatting while preventing comments from messing with the outline of the containing document.

As my project also handles webmentions, it has to deal with any possible kind of HTML markup, so I cannot use a strict whitelist to handle direct user input and forbid elements like e.g. headings in the first place.

These transforms are relatively easy to implement with PHP's native DOM library, but after some hours of tinkering with html-sanitizer, I could not find a solution for these particular requirements. I understand how to transform a single node, but that's it. Can you please give me a hint as to where I could hook into the DOM tree to do this kind of transformation, or is there maybe a more elegant way?
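
For reference, requirements 2 and 4 might be sketched with PHP's DOM extension as a separate pass run before (or after) the sanitizer. This is only a rough illustration under several assumptions: normalizeCommentHtml is a hypothetical helper, it interprets "removed" as unwrapping the link while keeping its text, it does not sanitize anything itself, and it does not hook into html-sanitizer's own node visitors.

```php
<?php
declare(strict_types=1);

/**
 * Pre-processes an HTML fragment with PHP's DOM extension:
 *  - unwraps <a> elements that have no href (their content is kept),
 *  - turns <h1>..<h6> into <p><strong>..</strong></p>.
 * Hypothetical helper for illustration; it performs no sanitization.
 */
function normalizeCommentHtml(string $html): string
{
    $doc = new DOMDocument();

    // Silence warnings about loose markup; the XML prolog hints UTF-8 to libxml.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Node lists are live, so snapshot them before mutating the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        if (!$a->hasAttribute('href')) {
            while ($a->firstChild !== null) {
                $a->parentNode->insertBefore($a->firstChild, $a);
            }
            $a->parentNode->removeChild($a);
        }
    }

    foreach (['h1', 'h2', 'h3', 'h4', 'h5', 'h6'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $heading) {
            $p = $doc->createElement('p');
            $strong = $doc->createElement('strong');
            $p->appendChild($strong);
            while ($heading->firstChild !== null) {
                $strong->appendChild($heading->firstChild);
            }
            $heading->parentNode->replaceChild($p, $heading);
        }
    }

    // Serialize only the children of the wrapper <div> added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo normalizeCommentHtml('<h2>Title</h2><p><a>no link</a> <a href="https://example.org">ok</a></p>');
```

Wrapping stray top-level nodes and moving leading <br> elements could follow the same pattern of snapshotting a node list and then rearranging nodes.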

Allow email addresses

The standard way of presenting email addresses (as seen in e.g. Gmail) is "John Doe [email protected]".

However, custom tags like that can't be configured (the email changes with every user).

Our temporary solution is to use curly braces instead, but this would be a nice-to-have feature.

Thanks!
