mankyd / htmlmin

A configurable HTML Minifier with safety features

Home Page: https://htmlmin.readthedocs.org/en/latest/

License: Other

Python 21.35% HTML 78.65%

htmlmin's Issues

XHTML incompatible

I tried to run it with htmlmin -p img link --keep-optional-attribute-quotes file.xhtml file_minified.xhtml
but I get an error, because <link href="style.css" type="text/css" rel="stylesheet" /> is still converted to <link href="style.css" type="text/css" rel="stylesheet"> and <img src="image.gif" alt="" /> to <img src="image.gif" alt>

Inconsistent behaviour between Python 2 & 3 with unicode whitespaces

Python 3 sets the re.UNICODE flag by default, so \w, \s and so on behave differently than in Python 2. When htmlmin is used with Python 3, all non-breaking, thin and other Unicode whitespace characters (UTF-8 encoded, not HTML entities) are replaced with a regular space. A user would certainly not expect this behaviour.

I tried replacing \s in the whitespace_re definition with [ \t\n\r\f\v] (per the re module docs), and it seems to work well for my case (non-breaking spaces inside text, not between tags). It may not cover every case where the two versions differ, though.
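
A minimal Python 3 sketch of the difference described above. The names UNICODE_WS, ASCII_WS and collapse are illustrative only and are not htmlmin's actual code; the explicit character class is the one suggested in the issue.

```python
import re

# In Python 3, \s in a str pattern also matches Unicode whitespace such as
# U+00A0 (no-break space) and U+2009 (thin space).
UNICODE_WS = re.compile(r"\s+")
# Explicit ASCII-only class, as suggested in the issue.
ASCII_WS = re.compile(r"[ \t\n\r\f\v]+")


def collapse(text, pattern):
    """Replace every run of whitespace matched by `pattern` with one space."""
    return pattern.sub(" ", text)


sample = "10\u00a0km \n  away"
print(repr(collapse(sample, UNICODE_WS)))  # the no-break space is lost
print(repr(collapse(sample, ASCII_WS)))    # the no-break space survives
```

Passing re.ASCII to re.compile is another way to narrow \s in Python 3, but that flag does not exist in Python 2, so the explicit class is probably the more portable choice.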

Empty comments results in IndexError

A comment that is immediately opened and then closed results in IndexError.

>>> minify('<!---->')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/htmlmin/main.py", line 98, in minify
    minifier.feed(input)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 165, in goahead
    k = self.parse_comment(i)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 178, in parse_comment
    self.handle_comment(rawdata[i+4: j])
  File "/usr/local/lib/python2.7/site-packages/htmlmin/parser.py", line 281, in handle_comment
    data[1:] if data[0] == '!' else data))
IndexError: string index out of range

For now, I'm working around this issue by replacing those empty comments with  before passing it to minify.
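
The traceback points at the expression data[0] == '!', which cannot work when the comment body is the empty string. A guarded version of that check might look like the sketch below; strip_leading_bang is a hypothetical helper name, not a function in htmlmin.

```python
def strip_leading_bang(data):
    """Drop one leading '!' from comment text without indexing into it."""
    # str.startswith is safe on the empty string, unlike data[0].
    return data[1:] if data.startswith("!") else data


print(repr(strip_leading_bang("")))         # empty comment body, no IndexError
print(repr(strip_leading_bang("! keep")))   # leading '!' removed
print(repr(strip_leading_bang(" normal ")))
```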

Better Escaping of Attributes

Per #22, we currently escape < and other characters that don't require escaping.

According to the spec, only ", ', and & potentially require escaping: https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes
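
As a rough illustration of what narrower escaping could look like, here is a sketch that escapes only the ampersand and the quote character in use. escape_attr_value is a hypothetical name and this is not htmlmin's escape code; always escaping & is a conservative assumption, since the spec only forbids ambiguous ampersands.

```python
def escape_attr_value(value, quote='"'):
    """Escape only what a quoted attribute value needs."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quote character")
    # Escape & first so the entities added below are not escaped twice.
    value = value.replace("&", "&amp;")
    if quote == '"':
        return value.replace('"', "&quot;")
    return value.replace("'", "&#39;")


print(escape_attr_value('a < b & "c"'))       # '<' is left alone
print(escape_attr_value("it's", quote="'"))
```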

Control handling of escaped ampersands in URLs

Simple example:

import htmlmin
def web_compress_html(content):
     return htmlmin.minify(
             content,
             remove_comments = True,
             remove_empty_space = True,
             remove_all_empty_space = False,
             reduce_empty_attributes = True,
             reduce_boolean_attributes = False,
             remove_optional_attribute_quotes = False,
             keep_pre = True,
             pre_tags = ('pre', 'textarea', 'nomin'),
             pre_attr = 'pre'
             )

in[0]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></body></html>')
out[0]: '<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&b=2"></iframe></body></html>'
in[1]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></nomin></body></html>')
out[1]: '<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&b=2"></iframe></nomin></body></html>'

Is there a way to control whether ampersands in URLs like the ones above are escaped? The W3C validator is throwing tons of errors at me (error: “&” did not start a character reference), so I'd like to have a switch for this behaviour. However, no matter what I do, htmlmin turns &amp; into &, even within (custom) pre tags.

(I just googled through the standards and a number of related articles. Apparently, there is no consensus on how to use ampersands in URLs, not even in the HTML5 standardization community ... and I really do not want to discuss this nonsense. This is about best practices and reducing errors while debugging a website. Nevertheless, thanks a lot for this excellent tool.)
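
One possible explanation, offered as an assumption rather than something confirmed in this thread: the tracebacks elsewhere on this page show htmlmin building on Python's HTMLParser, and that parser hands attribute values to handle_starttag already unescaped. The sketch below uses only the standard library; AttrCollector is an illustrative name.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the attribute lists that the parser passes to handle_starttag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<iframe src="foo.com/bar?a=1&amp;b=2"></iframe>')
parser.close()
print(parser.seen)  # the src value arrives with a bare '&'
```

If that is what happens, the original &amp; is gone by the time the minifier sees the value, so a switch would have to re-escape on output rather than "leave the input alone".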

Release new version (live version has XSS!)

So I just discovered that my live site has had an XSS hole for a while due to my use of htmlmin. I see that the bug is fixed on GitHub in 697d4b0, but htmlmin on PyPI appears to still be exploitable (and dates from 2015, no less!)

Can you release a new pypi version, that doesn't leave all pypi users stuck with glaringly-obvious XSS holes on their site?

ship testsuite

Hi,

Could you please consider adding the tests/ subdirectory to the PyPI release? Distros could use it to check that the package works.

pre_attr does not work as expected

I am the maintainer of buccaneer, a static site generator. I have a "minify" plugin that uses htmlmin. During testing we discovered an issue.

Our expectation was that htmlmin would leave the href attribute intact (given appropriate config):

<a class="reference external" href="mailto:sam&#37;&#52;&#48;sample&#46;com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

...but it modifies the href attribute anyway:

<a class="reference external" href="mailto:sam%40sample.com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

Here is the code that calls into htmlmin:

            logger.debug('Minifying: %s' % filename)
            compressed = min(uncompressed, remove_comments=True,
                             remove_all_empty_space=True,
                             remove_empty_space=True,
                             keep_pre=True,
                             pre_attr='href')
            f.write(compressed)

Please let me know what you think. Thank you.

javascript in attribute

Logical AND (&&) in attributes is incorrectly minified:

>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo && bar'></div>")
u'<div ng-if=foo & bar></div>'

But it works with logical OR:

>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo || bar'></div>")
u'<div ng-if=foo || bar></div>'

build_tag error on Python 2.6; 2.7 is OK

.... /htmlmin/parser.py", line 112, in build_tag
result = '<{}'.format(escape(tag))
ValueError: zero length field name in format
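
The error comes from the auto-numbered '{}' placeholder, which str.format only accepts from Python 2.7 onward. A sketch of the portable spelling follows; open_tag_prefix is a hypothetical stand-in for the line in build_tag and leaves out the escape() call.

```python
def open_tag_prefix(tag):
    # '{0}' works on Python 2.6, 2.7 and 3.x; the bare '{}' form needs 2.7+.
    return "<{0}".format(tag)


print(open_tag_prefix("div"))
```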

weird issue with Minifier.minify

I ran into an odd issue here. Earlier, I used just the minify(html) function, and it worked great. Lately, I've changed to Minifier.minify and ran into problems. On pages involving OAuth from Google, I get OpenTagNotFoundError errors from the parser. I tried removing almost all of the code that I sent to minification, but it still raised an error. Switching back to using only minify(html) solves it.

I have tried to debug this, but I am unable to figure it out. I tried sending only very basic HTML that I am certain is correct, with all tags opened and closed in the correct order.

I see in the code that minify does not call anything in the parser, which Minifier.minify does. Is there a reason why there is a difference here?

Parsing error with string: H&M

Generates an error with the following string:

import htmlmin
htmlmin.minify('H&M')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/htmlmin/main.py", line 93, in minify
    minifier.close()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 118, in close
    self.goahead(1)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 213, in goahead
    self.error("EOF in middle of entity or char ref")
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 121, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: EOF in middle of entity or char ref, at line 1, column 2

">" incorrectly transformed as >

is transformed into <input value=>>, which is not valid HTML.

Spaces are stripped around html symbols but shouldn't

"Foo € bar" should remain with the spaces around it.
I've tested it in <title>, so it's surely broken there.

utf-8 support

It is not possible to parse this:

htmlmin.minify('<a href="http://example.com/?a=1&b=2" title="Tomé">Tomé</a>')

Incorrect text/html header detection in the middleware

Hello,
The middleware looks for a header Content-Type whose value is exactly text/html. However Flask’s default is text/html; charset=utf-8. Using this middleware as-is in a Flask app thus doesn’t work.
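
A more tolerant check would compare only the media type and ignore parameters such as charset. The sketch below is standalone; is_html_content_type is a hypothetical helper, not part of htmlmin's middleware.

```python
def is_html_content_type(header_value):
    """Return True if a Content-Type header value names the text/html media type."""
    if not header_value:
        return False
    # Everything after the first ';' is a parameter (for example charset).
    media_type = header_value.split(";", 1)[0]
    return media_type.strip().lower() == "text/html"


print(is_html_content_type("text/html; charset=utf-8"))
print(is_html_content_type("application/json"))
```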

missed brackets after string was minimized

When I minify this fragment of HTML:

'<div class="footer-shadow"></div>\n<footer class="footer">\n    <div class="d-flex container justify-content-end align-items-center">\n        \n\n        \n    </div>\n</footer>'

Then I get
'<div class=footer-shadow></div><footer class=footer><div class="d-flex container justify-content-end align-items-center"></div></footer>'
As you can see, the quotes around the values in class=footer-shadow and class=footer are gone.

Deprecation Warning: 'cgi' is deprecated and slated for removal in Python 3.13

It looks like cgi is deprecated and will be removed in Python 3.13:

INFO  -  DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
           File "C:\Python\Python311\Lib\warnings.py", line 514, in _deprecated
             warn(msg, DeprecationWarning, stacklevel=3)
           File "C:\Python\Python311\Lib\site-packages\htmlmin\main.py", line 28, in <module>
             import cgi

There are only two references to cgi within this package:

https://github.com/mankyd/htmlmin/blob/master/htmlmin/main.py#L28
https://github.com/mankyd/htmlmin/blob/master/htmlmin/escape.py#L33

Also see:
https://peps.python.org/pep-0594/#cgi
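
What the two references actually use cgi for is not stated in this issue. Assuming they only need HTML escaping, a compatibility import along these lines would avoid the deprecated module on Python 3. escape_text is a hypothetical wrapper name.

```python
try:
    # Python 3.2+: html.escape is the documented replacement for cgi.escape.
    from html import escape as _escape
except ImportError:  # Python 2 fallback
    from cgi import escape as _escape


def escape_text(text):
    """Escape &, < and > only, matching cgi.escape's default (quote=False)."""
    return _escape(text, quote=False)


print(escape_text('a < b & "c"'))
```

Note that html.escape defaults to quote=True while cgi.escape defaulted to not escaping quotes, so the flag is passed explicitly here to keep the behaviour the same on both.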

new pypi release

Can we have a new PyPI release?

missing git tags to corresponding pypi releases

Releases published on pypi should also have their corresponding commits tagged in git and published to github for transparency.

Please push the missing tags (especially the latest).

@mankyd: can you update the pip package? The last update was 18 months ago

I can see you manage the htmlmin package on PyPI.
https://pypi.python.org/pypi/htmlmin is now out of date and does not contain the latest fixes.
Thank you very much!

Allow custom handling of <script> and <style>

It would be very useful to allow some kind of callbacks that let the user handle CSS and JS in whatever way they want, instead of just leaving them unminified.

Issues with minifying br-tag

I just started using your plugin and noticed that it strips the / from <br/> tags. This is an issue for me, as I'm validating XHTML.

import htmlmin

test_string = """
<strong>hello world</strong>
<br />
<strong>hello earthling</string>
"""

print(htmlmin.minify(test_string))

$ python test.py 
 <strong>hello world</strong> <br> <strong>hello earthling</string>

I'm guessing/hoping it is not intended behaviour?

remove space between certain tags?

It would be great to be able to mark some tags, such as p tags, as safe to remove the space between. That way I could remove the space between these elements:

...good idea.</p> <p>Here is why:</p> <p>Because its the best.</p>

Add -i flag to cli for in-place modification

This is inspired by sed, e.g., sed -i ..., or sed --in-place.

Example

htmlmin -i index.html
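
One way such a flag could behave is sketched below, under the assumption that the file should be swapped in only after the new content is fully written. rewrite_in_place is a hypothetical helper, not an existing htmlmin function or CLI option; it takes any text-to-text callable.

```python
import os
import tempfile


def rewrite_in_place(path, transform, encoding="utf-8"):
    """Replace the file at `path` with transform(original_text)."""
    with open(path, "r", encoding=encoding) as src:
        result = transform(src.read())
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in, so a
    # failure part-way through never leaves a truncated file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as dst:
            dst.write(result)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With htmlmin installed, rewrite_in_place("index.html", htmlmin.minify) would be the equivalent of the proposed htmlmin -i index.html. A real implementation would probably also want to preserve the original file's permissions, which this sketch does not do.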

Regarding updates

Hi guys,

This is one of the few Python HTML minifiers. I see it has been close to 4 years since the last commit. I was wondering whether that's because there is not much left to create or update, or whether it has been abandoned.

If it is abandoned, is there any newer library you would recommend?

HTML Entities get decoded when minifying

Hi, I have a problem where htmlmin really can screw up a site. Consider this:

Python 3.5.0 (default, Sep 14 2015, 02:37:27)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from htmlmin import minify
>>> minify('<code>&lt;script&gt;</code>')
'<code><script></code>'

It's pretty dangerous to have the escaped, safe tag get unescaped, as the rest of the site will then be swallowed into the <script> tag, unless you have an end tag further down the page.

I don't know if this, like #17, is down to the Python HTML parser trying to be clever, and decodes it automatically without htmlmin knowing about it, or if it's something that can be solved.
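
The standard-library parser does decode character references in text by default in Python 3.5 and later (the convert_charrefs option). The sketch below shows the two modes with html.parser alone. Collector and events_for are illustrative names, and whether htmlmin relies on this default is an assumption.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Record the text and entity events the parser reports."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def events_for(markup, convert_charrefs):
    parser = Collector(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<code>&lt;script&gt;</code>"
print(events_for(markup, convert_charrefs=True))   # text arrives already decoded
print(events_for(markup, convert_charrefs=False))  # entities are reported separately
```

In the default mode, a minifier that writes handle_data text straight back out has to escape it again; otherwise the decoded < and > end up in the output as markup.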

`&` not handled correctly inside urls

Consider

<!DOCTYPE html>
<html lang="en">
        <head>
                <meta charset="utf-8">
                <title>Hello World</title>
        </head>
        <body>
                <p>Some text with an <a href="https://example.com/something&amp;search=test">url</a></p>
                <p>Some text with another <a href="https://example.com/something&search=test">url</a></p>
        </body>
</html>

htmlmin index.html generates

<!DOCTYPE html><html lang=en> <head><meta charset=utf-8><title>Hello World</title></head> <body> <p>Some text with an <a href="https://example.com/something&search=test">url</a></p> <p>Some text with another <a href="https://example.com/something&search=test">url</a></p> <p>And other text with another <a href="https://example.com/something%26search=test">url</a></p> </body> </html>

Notice how the first URL changed from https://example.com/something&amp;search=test to https://example.com/something&search=test.

This is technically not correct, as the ampersand character declares the beginning of an entity reference.
Also, the W3C guidelines suggest using &amp; instead of & in this context.

Validators like htacg/tidy-html5#1017 correctly complain about the invalid entity reference.

Removes quotes when attribute value contains =

Example:

In [3]: htmlmin.minify('<meta name="viewport" content="width=device-width">')
Out[3]: '<meta name=viewport content=width=device-width>'

The output is invalid since you can't have an = in an unquoted attribute value.

The fix is just to add or '=' in val to https://github.com/mankyd/htmlmin/blob/master/htmlmin/escape.py#L64

I'll create a PR with the fix + some tests.
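
For reference, a standalone sketch of the full check implied by the HTML syntax rules for unquoted attribute values. needs_quotes is a hypothetical name and this is not the code at the linked line of escape.py.

```python
import re

# Characters that may not appear in an unquoted attribute value according to
# the WHATWG HTML syntax: ASCII whitespace, ", ', =, <, > and `.
_UNSAFE_UNQUOTED = re.compile(r"""[ \t\n\f\r"'=<>`]""")


def needs_quotes(value):
    """Return True if `value` must stay quoted to remain valid HTML."""
    # An unquoted attribute value may not be the empty string either.
    return value == "" or _UNSAFE_UNQUOTED.search(value) is not None


print(needs_quotes("width=device-width"))  # contains '='
print(needs_quotes("viewport"))
```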
