tgalopin / html-sanitizer
Sanitize untrustworthy HTML user input
License: MIT License
Although the u (as in underline) tag was deprecated in HTML 4.01, the u tag has been redefined in HTML 5 ("u" as in unarticulated). Thus, I think it should be included in the basic extension.
This replacement is breaking URLs and replacing valid =
Hello, I want to disallow all HTML tags by default.
I have read the docs but could not find an option to use this library without an extension (each of which allows a lot of tags by default).
Is it possible to allow only a list of specific tags? I want to allow only:
p, br, strong, i, u
Thanks.
Hi!
Thanks for this great package!
A lot of our HTML has repeated <br> tags like crazy, and I would like to reduce that to just one (or maybe two) line breaks. Is there any way to achieve this result using this package?
Thanks!
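One possible way to approach the repeated <br> question is a small clean-up step applied to the HTML before or after sanitizing. The sketch below is only an illustration: collapseLineBreaks is a hypothetical helper, not part of html-sanitizer, and it assumes the line breaks appear as plain <br>, <br/> or <br /> tags without attributes.

```php
<?php
// Hypothetical helper (not part of html-sanitizer): collapses runs of <br>
// tags so that at most $max consecutive line breaks remain.
function collapseLineBreaks(string $html, int $max = 1): string
{
    // Match more than $max consecutive <br>, <br/> or <br /> tags,
    // including any whitespace that follows each of them.
    $pattern = '~(?:<br\s*/?>\s*){' . ($max + 1) . ',}~i';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, str_repeat('<br>', $max), $html) ?? $html;
}

echo collapseLineBreaks("Hello<br><br />\n<br/>world"), "\n";
```

Passing 2 as the second argument keeps up to two consecutive line breaks instead of one. Whitespace after a collapsed run is dropped, which may or may not be desirable for a given layout.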
Thanks for the great library! Given the intended use-case of passing in HTML from something untrusted and then being able to sanitize and display it securely, it would be really cool if we could get some options to force rel and target on a tags. While rel="nofollow" and target="_blank" are "nice to have", rel="noopener" is certainly important for security and would be great if we could force it.
Config options could look something like this:
$sanitizer = HtmlSanitizer\Sanitizer::create([
    'extensions' => ['basic', 'image', 'iframe'],
    'tags' => [
        'a' => [
            /*
             * If an array is provided, links targeting other hosts than one in this array
             * will be disabled (the `href` attribute will be blank). This can be useful if you want
             * to prevent links targeting external websites. Keep null to allow all hosts.
             * Any allowed domain also includes its subdomains.
             *
             * Example:
             * 'allowed_hosts' => ['trusted1.com', 'google.com'],
             */
            'allowed_hosts' => null,
            /*
             * If true, mailto links will be accepted.
             */
            'allow_mailto' => false,
            /*
             * Forces rel=nofollow in links.
             */
            'force_rel_nofollow' => false,
            /*
             * Forces rel=noopener in links.
             */
            'force_rel_noopener' => false,
            /*
             * Forces target=value unless set to false.
             */
            'force_target' => false,
        ],
        ...
I just quickly threw this issue together so if you would prefer something in the form of a PR I can look into doing that instead.
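Until options like these exist, the same effect could be approximated by post-processing the sanitized output. The following is a rough sketch under stated assumptions, not a feature of html-sanitizer: forceLinkAttributes is a hypothetical helper, it relies on PHP's DOM extension (ext-dom), and it should run on output that has already been sanitized.

```php
<?php
// Hypothetical post-processing step, not an html-sanitizer option:
// forces rel (and optionally target) on every <a> element of a fragment.
function forceLinkAttributes(string $html, string $rel = 'noopener', ?string $target = null): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings

    // Wrap the fragment so it has a single root element; the XML prolog
    // hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $link->setAttribute('rel', $rel);
        if ($target !== null) {
            $link->setAttribute('target', $target);
        }
    }

    // The wrapper is the first <div> in document order; serialize only
    // its children so the wrapper itself is not part of the result.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $result = '';
    foreach ($wrapper->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }

    return $result;
}

echo forceLinkAttributes('<p><a href="https://example.org">Example</a></p>', 'noopener', '_blank'), "\n";
```

Re-parsing the output costs an extra pass over the document, which is why a built-in option in the link visitor, as proposed above, would be the cleaner design.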
Hi,
I am using html/sanitizer version 1.4.0 in a Symfony project, and the Sanitizer filters out the <u> tags even though they are explicitly configured in the tags section, as in the configuration reference file.
$sanitizer = Sanitizer::create([
    'extensions' => ['basic', 'code', 'image', 'list', 'table', 'details', 'extra'],
    'tags' => [
        'u' => [
            'allowed_attributes' => [],
        ],
    ],
]);
I noticed there was a PR (#61) for adding <u> tags to the basic extension, but it has not been released yet. Are tags only available from within extensions? So until that commit is released, will there be no support for <u> tags?
Could someone please advise?
Thanks a lot!
Hi,
I'm trying to use html-sanitizer to allow users to create articles in a Blog style application I'm building.
I can't figure out why the sanitizer is removing the src attribute from image tags.
The config I'm using is this one
$this->sanitizerConfig = [
    'extensions' => ['basic', 'code', 'image', 'list', 'table'],
    'tags' => [
        'a' => [
            'allowed_hosts' => null,
            'allow_mailto' => true,
        ],
        'img' => [
            'allowed_attributes' => ['src', 'alt', 'title', 'width', 'height'],
            'allowed_hosts' => null,
            'allow_data_uri' => true,
            'force_https' => false,
        ],
        'div' => ['allowed_attributes' => ['class']],
        'span' => ['allowed_attributes' => ['class']],
        'table' => ['allowed_attributes' => ['class']],
        'p' => ['allowed_attributes' => ['class']],
        'h1' => ['allowed_attributes' => ['class']],
        'h2' => ['allowed_attributes' => ['class']],
        'h3' => ['allowed_attributes' => ['class']],
        'h4' => ['allowed_attributes' => ['class']],
    ],
];
this is the html before sanitizing
"<p><img src="/images/uploaded/articles/1b75dd06bf92c5e04e1491af441491fe9a7d7bab.png" alt="Test image" width="960" height="638" /></p>"
and this is what I get from sanitize method.
"<p><img alt="Test image" width="960" height="638" /></p>"
Thanks for your help.
With Travis dropping free CI for public repos, the CI is not running anymore.
I suggest migrating to GitHub Actions.
It would be great to have a comparison between this package and the HTMLPurifier library, which has been around for a long time (what are the differences in the features they support, etc.).
Which is not installed by default using apt on Ubuntu (can be installed using apt install php7.x-xml), and can lead to severe confusion because the error is silently caught there:
https://github.com/tgalopin/html-sanitizer/blob/master/src/Sanitizer.php#L96-L100
It would be useful to add an .editorconfig file for contributors.
Is there any way to remove the span tag when using the 'basic' extension?
Note: the Sanitizer does not allow relative URLs: they are always filtered out for security reasons.
Is there any chance to disable this setting?
According to tgalopin/html-sanitizer-bundle#26, I will allow a nullable value here too.
Hello,
I'm getting out-of-memory exceptions when using HTML sanitizer (version 1.3.0) on a malformed string.
php.CRITICAL: Fatal Error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 352256 bytes) {"exception":"[object] (Symfony\\Component\\ErrorHandler\\Error\\OutOfMemoryError(code: 0): Error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 352256 bytes) at /srv/web/vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:1054)"
Here is a snippet to reproduce the issue:
The message comes from a chat which truncates messages when they are too long, leading to some invalid html content. (Fun fact, this message comes from production).
<?php
/** @var SanitizerInterface $sanitizer */
$sanitizer->sanitize("<p>Apr\u00e8s s'il y a un gros bug et que tout le monde en profite, mon avis l\u00e0 dessus peut changer. Mais normalement non, pas de reset pour les joueurs arriv\u00e9s avec la beta publique.<\/p>\n\n<p>Par contre certains \u00e9quilibrages changeront, c&#");
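As a stopgap on the caller's side, the truncated entity at the end of the message could be removed before the string reaches the sanitizer. This is only a sketch: stripTruncatedEntity is a hypothetical helper, and it assumes that the dangling "&#" at the very end of the input is what triggers the problem, which the report suggests but does not prove.

```php
<?php
// Hypothetical pre-processing helper, not part of html-sanitizer: drops an
// unterminated numeric character reference ("&#", "&#12", "&#x1F") when it
// is the very last thing in the string.
function stripTruncatedEntity(string $html): string
{
    // \z anchors at the absolute end of the string. A complete reference
    // such as "&#233;" ends with ";" and therefore never matches.
    return preg_replace('/&#[xX]?[0-9a-fA-F]*\z/', '', $html) ?? $html;
}

echo stripTruncatedEntity('<p>Par contre certains équilibrages changeront, c&#'), "\n";
```

Only numeric references are handled, so text such as "AT&T" is left untouched.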
Thank you for the library.
I just have one question that I could not find an answer to anywhere (so far). Is there any way to prevent HTML encoding of characters (e.g. prevent < from becoming &lt;, > from becoming &gt;, @ from becoming &#64;, etc.)? I use the Symfony bundle (if it matters).
Thanks.
In my project at the moment I need to purify 2 properties in different entities with the same extensions, so I create a simple helper.
class SanitizerHelper
{
    private const SANITIZER_WHITE_LIST = ['basic', 'code', 'image', 'list', 'table', 'iframe', 'extra'];

    public static function sanitize(string $html)
    {
        $sanitizer = Sanitizer::create(['extensions' => self::SANITIZER_WHITE_LIST]);

        return $sanitizer->sanitize($html);
    }
}
The idea is to create a static method that takes the configuration as its first argument and the content as its second argument:
Current
$sanitizer = Sanitizer::create(['extensions' => self::SANITIZER_WHITE_LIST]);
$safeHtml = $sanitizer->sanitize($untrustedHtml);
After
$safeHtml = HtmlSanitizer\Sanitizer::sanitize(
    ['extensions' => ['basic', 'code', 'image', 'list', 'table', 'iframe', 'details', 'extra']],
    $untrustedHtml
);
WDYT ?
It would be great to support the <details> and <summary> tags in an extension.
GitHub allows using them in its Markdown, for instance, to create collapsible regions.
Hi @tgalopin & all
This package is extremely useful, thanks for all the hard work.
I am currently using it on a site where I need to be able to set a target for certain links. Is there a simple way to configure this?
I'm already using an extension provided by @olegatro to allow relative URIs. I thought perhaps it would be possible to modify that extension?
Thanks for any help,
PhR
Would it be possible to add support for HTML5 video and audio tags?
Hey guys
I have two questions
After sanitizing, C'est l'été becomes C'est l'été. How can I prevent this change, please?
I use Twig and the code is already escaped by default inside the view.
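The two strings in the report render identically here, so the following rests on an assumption: that the change being described is the apostrophe coming back as a character reference such as &#039; or &#39;. Under that assumption, a narrow post-processing step could turn those references back into literal apostrophes. decodeApostrophes is a hypothetical helper, not part of html-sanitizer.

```php
<?php
// Hypothetical helper, not part of html-sanitizer: turns the common
// character references for an apostrophe back into a literal "'".
// Assumption: the sanitized output encodes apostrophes in one of these forms.
function decodeApostrophes(string $html): string
{
    return str_replace(['&#39;', '&#039;', '&#x27;', '&apos;'], "'", $html);
}

echo decodeApostrophes('C&#039;est l&#039;été'), "\n";
```

This is reasonable for text content and double-quoted attribute values; inside single-quoted attribute values a literal apostrophe would end the value early. Other references such as &lt; are deliberately left alone, since decoding them would undo the sanitizer's work.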
If we provide the following HTML:
<p>Hello</p>
<img src="javascript:evil();" onload="evil();" />
Then pass it through html-sanitizer we get:
<p>Hello</p>
<img />
Should the <img> tag be taken out entirely now that it has no attributes left?
the object parses the input string, and returns an empty string
$sanitizerHtml = HtmlSanitizer\Sanitizer::create([ 'extensions' => ['basic']]);
var_dump($sanitizerHtml->sanitize("bold")); //returns string(0) ""
using php 7.4.27
Hi @tgalopin, I am using this sanitizer for sanitizing HTML text. I noticed it's converting
to 00a0
any idea how to fix this issue?
Can we have an option to remove tags with an empty body?
Hi,
I recently got the following error:
Notice: Undefined offset: 2
here:
I got this error after adding a subdomain to allowed_hosts, like sub.example.org. This causes the function to be called with the following parameters if a URL like https://example.org in an href attribute is sanitized:
$uriParts = ['org', 'example'];
$trustedParts = ['org', 'example', 'sub'];
Is a subdomain not allowed in allowed_hosts, or is this a bug?
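The parameters above suggest the comparison walks the trusted host's labels and indexes into the URL's labels without checking that the URL has as many. The sketch below shows one way such a check could guard against that. It is not the library's actual code: isTrustedHost is a hypothetical function written only to illustrate the bounds check.

```php
<?php
// Hypothetical re-implementation for illustration, not the library's code:
// returns true when $host equals $trustedHost or is one of its subdomains.
function isTrustedHost(string $host, string $trustedHost): bool
{
    // Compare labels from the top-level domain downwards,
    // e.g. "sub.example.org" becomes ['org', 'example', 'sub'].
    $hostParts = array_reverse(explode('.', strtolower($host)));
    $trustedParts = array_reverse(explode('.', strtolower($trustedHost)));

    // A host with fewer labels than the trusted entry can never match it.
    // Without this guard, $hostParts[2] below would be an undefined offset.
    if (count($hostParts) < count($trustedParts)) {
        return false;
    }

    foreach ($trustedParts as $i => $part) {
        if ($hostParts[$i] !== $part) {
            return false;
        }
    }

    return true;
}

var_dump(isTrustedHost('example.org', 'sub.example.org'));
var_dump(isTrustedHost('a.sub.example.org', 'sub.example.org'));
```

With the guard in place, trusting sub.example.org rejects example.org cleanly instead of raising a notice, while still accepting deeper subdomains.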
Concern 01
Is there any possibility of allowing case-sensitive attributes with this package, for example through a config setting?
I have HTML like the one below, and the sanitizer removes case-sensitive attributes even when I add the attribute name categoryType under allowed_attributes for the div tag.
Ex:
<div class="custom" categoryType="books"></div>
For the HTML above, the sanitizer returns the following.
<div class="custom"></div>
I debugged your library and found that it identifies the attribute as "categorytype" (all letters in lower case). Could we have case-sensitive attributes?
Concern 02
The sanitizer package removes the empty string value of an attribute, as shown below (="").
Ex:
<a rel="" href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>
Sanitizer returns
<a rel href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>
Could we keep the empty string as an attribute value? Maybe this could also be enabled with a config setting.
<a rel="" href="https://github.com/tgalopin/html-sanitizer/">HTML Sanitizer</a>
Hello,
Thank you for the amazing library. I have been looking for this kind of library for a long time, and I think I have finally found it. However, I am having a problem creating my own tags. I'm trying to sanitize my custom, non-HTML React components coming from the front end, but I couldn't get the library to accept a fake DOM node like <User id="5" />. So I need to create a PascalCase tag to stay compatible with my React nodes.
Is there any way to define case-sensitive custom tags? By the way, a workaround would be very welcome, rather than waiting for the next release.
class UserNode extends AbstractTagNode
{
    use IsChildlessTrait;

    public function getTagName(): string
    {
        return 'User';
    }
}

class UserNodeVisitor extends AbstractNodeVisitor implements NamedNodeVisitorInterface
{
    use IsChildlessTagVisitorTrait;

    protected function getDomNodeName(): string
    {
        return 'User';
    }

    public function getDefaultAllowedAttributes(): array
    {
        return ['type', 'width', 'height'];
    }

    public function getDefaultConfiguration(): array
    {
        return ['custom_config' => null];
    }

    protected function createNode(\DOMNode $domNode, Cursor $cursor): NodeInterface
    {
        return new UserNode($cursor->node);
    }
}

class CustomExtension implements ExtensionInterface
{
    public function getName(): string
    {
        return 'custom';
    }

    public function createNodeVisitors(array $config = []): array
    {
        return [
            'User' => new UserNodeVisitor($config['tags']['User'] ?? []),
        ];
    }
}
$fakeHtml = <<<HTML
<div>
asd
<p>hmm</p> // p
<User type="test" /> // User type="test"
<user type="test" /> // user type="test"
<test /> // test
## Hmmm
** Yeah? **
<br> <br /> // br
<hr> <hr /> // hr
</div>
HTML;
$builder = new \HtmlSanitizer\SanitizerBuilder();
$builder->registerExtension(new CustomExtension());
$sanitizer = $builder->build(['extensions' => ['custom']]);
$safeHtml = $sanitizer->sanitize($fakeHtml);
echo trim($safeHtml);
Thanks,
Mustafa.
Hi @tgalopin ,
Thanks for this great Sanitizer!
We are in a situation where we can't whitelist tags. We are using CKEditor 5, with which our users can generate any kind of content, and that content may contain a variety of custom tags. For example, have a look at this:
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><mfrac><mn>2</mn><mn>3</mn></mfrac><mo>+</mo><msqrt><mn>43</mn>
<mo>/</mo><mn>34</mn><mo>∞</mo>
<mi mathvariant="normal">π</mi><mo>∆</mo></msqrt></math> <math class="wrs_chemistry" xmlns="http://www.w3.org/1998/Math/MathML">
<msubsup><mi>mol</mi><mmultiscripts><msup><mmultiscripts><maction actiontype="argument">
<mrow> </mrow></maction><mprescripts> </mprescripts>
<mroot><msqrt> </msqrt><mrow> </mrow></mroot>
<none> </none></mmultiscripts>
<mrow> </mrow></msup><mrow> </mrow><none> </none><mprescripts> </mprescripts><mrow> </mrow><mrow> </mrow></mmultiscripts><mrow> </mrow></msubsup>
</math></p><p> </p><pre><code class="language-php"><?php
echo "Hello world";
?></code></pre>
Now, I want to store this whole text in my DB, but I don't want any incoming XSS/SQLi scripts or tags. How can this be done? I went through the internal codebase of this project, and it seems I can add my own tags or introduce my own custom extension, but I would need to whitelist tags and attributes. How can this be done without whitelisting such tags?
Thank you @tgalopin for this great job!
I have a small idea: it could be helpful to have an all extension that allows all tags, instead of having to list all of the extensions.
WDYT?
$sanitizer = Sanitizer::create(['extensions' => ['all']]);
would be a shortcut for
$sanitizer = Sanitizer::create(['extensions' => ['basic', 'code', 'image', 'list', 'table', 'iframe', 'extra']]);
Could the font tag be added to the basic list of supported tags?
<a href="/faq">Faq</a>
UPD:
'allowed_schemes' => ['http', 'https', null]
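The null entry in that update makes sense if the scheme of a URL is extracted with something like PHP's parse_url(), which returns null for a URL without a scheme. Whether the library works exactly this way is an assumption. The sketch below only illustrates the idea, and isSchemeAllowed is a hypothetical function.

```php
<?php
// Hypothetical illustration, not the library's code: checks a URL's scheme
// against an allow-list in which null stands for "no scheme" (relative URL).
function isSchemeAllowed(string $url, array $allowedSchemes): bool
{
    // parse_url() yields null when the component is absent
    // and false when the URL is seriously malformed.
    $scheme = parse_url($url, PHP_URL_SCHEME);

    if ($scheme === false) {
        return false;
    }

    if (is_string($scheme)) {
        $scheme = strtolower($scheme);
    }

    // Strict comparison, so null only matches an explicit null entry.
    return in_array($scheme, $allowedSchemes, true);
}

var_dump(isSchemeAllowed('/faq', ['http', 'https', null]));
var_dump(isSchemeAllowed('/faq', ['http', 'https']));
```

Note that protocol-relative URLs such as //example.org/faq also have no scheme, so a check of this shape would admit them along with ordinary relative paths.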
Hi, thanks for developing this package. I’m currently working on a comments and webmention plugin and explored different possibilities for sanitizing HTML with PHP. After evaluating all available options, your project seems to be a great fit!
In my use case, I need not only to sanitize HTML but also to apply some aggressive filtering, e.g. removing all class attributes, to ensure that the HTML of comments does not interfere with the page styles of a blog. I would also like to add a few custom cleanup routines.
The most important things are:
<p>, <blockquote> or other allowed top-level element.
<a> elements without an href attribute should be removed.
<br> at the beginning of an inline element should be moved before the start of that element; otherwise this would break the formatting of external links by prepending an icon. <br> elements in the middle of the link text should be preserved. The same goes for <br> elements at the end of an inline element.
<h1>Headline</h1> etc. should be transformed into something like <p><strong>Headline</strong></p> to keep some kind of formatting while preventing comments from messing with the document outline of the containing document.
As my project also handles webmentions, it has to deal with any possible kind of HTML markup, so I cannot use a strict whitelist to handle direct user input and forbid elements like headings in the first place.
These transforms are relatively easy to implement with PHP's native DOM library, but after some hours of tinkering with html-sanitizer, I could not find a solution for these particular requirements. I understand how to transform a single node, but that’s it. Can you please give me a hint as to where I could hook into the DOM tree to do these kinds of transformations, or is there maybe a more elegant way?
The standard way of presenting email addresses (as seen in e.g. Gmail) is "John Doe [email protected]".
However, custom tags like that can't be configured (the email changes with every user).
Our temporary solution is to use curly braces, but this would be a nice-to-have feature.
Thanks!