Voilab HTML cleaner

An HTML cleaner based on SimpleXML, fast and customizable.

Install

Via Composer

Create a composer.json file in your project root:

{
    "require": {
        "voilab/htmlcleaner": "0.*"
    }
}

Alternatively, let Composer add the dependency for you:

$ composer require voilab/htmlcleaner

Sample dataset

<p>
    Some paragraph with <strong>bold</strong> or
    <em><u><i>nested tags</i></u></em>.
</p>
<p>
    And a second paragraph (so two roots elements, here) with
    <a href="somesite.org">a cool link</a>,
    <a href="javascript:alert('BAM!');">a bad link</a>
    and some <span class="red">nice attributes to try to keep</span>.
</p>

Basic usage

All tags stripped

use \voilab\cleaner\HtmlCleaner;

$cleaner = new HtmlCleaner();
$raw_html = '...'; // take sample dataset above

echo $cleaner->clean($raw_html);

Allow some tags

// create cleaner...
$cleaner->addAllowedTags(['p', 'strong']);
// call clean method

Allow some tags and attributes (regardless of tags)

// create cleaner...
$cleaner
    ->addAllowedTags(['p', 'span'])
    ->addAllowedAttributes(['class']);
// call clean method

Allow some attributes only on certain tags

// create cleaner...
$cleaner
    ->addAllowedTags(['p', 'span'])
    ->addAllowedAttributes([
        // keep attribute "class" only for spans
        new \voilab\cleaner\attribute\Keep('class', 'span'),

        // you can use this shorthand too, as a string
        'style:span'
    ]);
// call clean method

Advanced usage

Processors

Processors prepare the HTML string before it is loaded into a new SimpleXMLElement, which is the basis of the cleaning process. They also format the HTML after it has been cleaned. In other words, a processor provides a pre-process step and a post-process step.

The pre-process step must remove every tag that is not allowed.

Standard processor

The standard processor uses strip_tags() to remove the tags that are not allowed. In the post-process step, it removes all carriage returns from the string.
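The two steps described above can be sketched in plain PHP. This is only an illustration, not the library's own code: the function names sketchPreProcess and sketchPostProcess are hypothetical, and the sketch assumes that "carriage returns" covers both \r and \n.

```php
<?php
// Hypothetical helpers for illustration; not part of voilab/htmlcleaner.

/** Pre-process: drop every tag that is not in the allow-list. */
function sketchPreProcess(string $html, array $allowedTags): string
{
    // strip_tags() accepts the allow-list as a string such as "<p><strong>".
    $allowed = '';
    foreach ($allowedTags as $tag) {
        $allowed .= '<' . $tag . '>';
    }
    return strip_tags($html, $allowed);
}

/** Post-process: remove line breaks (assumed here to mean \r and \n). */
function sketchPostProcess(string $html): string
{
    return str_replace(["\r", "\n"], '', $html);
}

$raw = "<p>Some <strong>bold</strong> and <em>italic</em>\r\n text</p>";
$pre = sketchPreProcess($raw, ['p', 'strong']);

echo sketchPostProcess($pre), PHP_EOL;
// <p>Some <strong>bold</strong> and italic text</p>
```

Note that strip_tags() removes only the tag itself and keeps its text content, which is why "italic" survives without its em element.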

Custom processor

You can create your own processor by implementing \voilab\cleaner\processor\Processor. Do not forget that the pre-process step is responsible for removing all tags that are not allowed.

Attributes

Attribute classes are used to validate attributes and their content. By default, an allowed attribute becomes a \voilab\cleaner\attribute\Keep, and every attribute that is not allowed becomes a \voilab\cleaner\attribute\Remove.

You don't need to instantiate these two attribute types yourself. Every attribute provided as a string in addAllowedAttributes() is converted to a Keep instance.

Js attribute

You may want to keep some attributes but still check their content. This is the case for the href attribute, which can contain either a valid URL or a JavaScript injection. An attribute validator already exists for that:

$cleaner
    ->addAllowedTags(['a'])
    ->addAllowedAttributes([
        new \voilab\cleaner\attribute\Js('href')
    ]);

Note that an allowed attribute may or may not be bound to a specific tag. In the example above, the href attribute is accepted on every HTML tag. To bind the attribute to a tag, pass the tag name as the second parameter.
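To give an idea of what checking an href value involves, here is a minimal sketch in plain PHP. The helper sketchIsScriptUrl is hypothetical, and the rules the library's Js attribute actually applies may differ. It is not a complete XSS filter.

```php
<?php
// Hypothetical helper for illustration; the library's Js class may
// apply different or stricter rules.
function sketchIsScriptUrl(string $value): bool
{
    // Whitespace and control characters can be used to disguise the
    // scheme, so strip them before comparing.
    $normalized = (string) preg_replace('/[\x00-\x20]+/', '', $value);

    // Case-insensitive check for a "javascript:" scheme at the start.
    return stripos($normalized, 'javascript:') === 0;
}

var_dump(sketchIsScriptUrl('somesite.org'));               // bool(false)
var_dump(sketchIsScriptUrl("javascript:alert('BAM!');"));  // bool(true)
var_dump(sketchIsScriptUrl(" JaVa\tScript:alert(1)"));     // bool(true)
```

The first two values come from the sample dataset: the "cool link" passes, while the "bad link" is flagged.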

Known limitations

Root mixed content

Text that sits outside of any tag is not allowed at root level.

<!-- not valid: parts "some root " and " special " will disappear -->
some root <strong>mixed</strong> special <em>content</em>

<!-- valid -->
<p>some root <strong>mixed</strong> special <em>content</em></p>
<!-- also valid -->
<p>some root element</p>
<p>and an other root element</p>
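One possible workaround, shown here as a sketch, is to wrap such a fragment in an element before passing it to the cleaner. The helper sketchWrapRoot is hypothetical and not part of the library. The example only uses simplexml_load_string() to show that the wrapped string has a single root element.

```php
<?php
// Hypothetical helper: wraps a fragment so that no text sits at root level.
function sketchWrapRoot(string $fragment, string $tag = 'p'): string
{
    return '<' . $tag . '>' . $fragment . '</' . $tag . '>';
}

$fragment = 'some root <strong>mixed</strong> special <em>content</em>';
$wrapped  = sketchWrapRoot($fragment);

// The wrapped fragment is well-formed XML with a single root element,
// so the former root-level text is now ordinary element content.
$xml = simplexml_load_string($wrapped);

echo $wrapped, PHP_EOL;
echo $xml->getName(), PHP_EOL; // p
```

The wrapper tag presumably has to be among the allowed tags too, since the pre-process step strips every tag that is not allowed.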

Bad HTML format with Standard processor

If the HTML is not well formed, the cleaner throws an \Exception. The string must be written flawlessly, because it is parsed by simplexml_load_string($html), which is very strict:

  • tags must be closed (<p></p> or <br />)
  • attribute values must be wrapped in single or double quotes (<hr class="test" />)
  • quotes are not allowed in attribute content; convert double quotes to &quot; before calling HtmlCleaner::clean()
  • the characters < and & are not allowed in content; convert them to &lt; and &amp; respectively before calling HtmlCleaner::clean()

These limitations will eventually be addressed in future releases.
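Until then, the escaping rules listed above can be applied with PHP's built-in htmlspecialchars(), and a string can be checked for well-formedness before it reaches the cleaner. The following is a sketch under those assumptions. Both helper names are hypothetical and not part of the library.

```php
<?php
// Hypothetical helpers for illustration; not part of voilab/htmlcleaner.

/** Escape plain text for use as element content or attribute value. */
function sketchEscape(string $text): string
{
    // Converts &, <, > and both quote characters to XML entities.
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

/** Return true when the string parses as well-formed XML. */
function sketchIsWellFormed(string $html): bool
{
    // Collect libxml errors silently instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml !== false;
}

$text = sketchEscape('Tom & "Jerry" <3');
$html = '<p title="' . $text . '">' . $text . '</p>';

var_dump(sketchIsWellFormed($html));                 // bool(true)
var_dump(sketchIsWellFormed('<p>Tom & Jerry</p>'));  // bool(false)
var_dump(sketchIsWellFormed('<p>unclosed'));         // bool(false)
```

Escape only the text and attribute values, never the whole HTML string, otherwise the tags themselves are turned into entities.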

Testing

$ vendor/bin/phpunit --bootstrap vendor/autoload.php tests/

Security

If you discover any security-related issues, please use the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.

Contributors

  • tafel
