mankyd / htmlmin
A configurable HTML Minifier with safety features
Home Page: https://htmlmin.readthedocs.org/en/latest/
License: Other
I tried to run it with htmlmin -p img link --keep-optional-attribute-quotes file.xhtml file_minified.xhtml
but I still get an error, because <link href="style.css" type="text/css" rel="stylesheet" />
gets converted to <link href="style.css" type="text/css" rel="stylesheet">
and <img src="image.gif" alt="" />
to <img src="image.gif" alt>
Python 3 sets the re.UNICODE flag by default, so \w, \s and so on behave differently than in Python 2. When htmlmin is used with Python 3, all non-breaking, thin and other Unicode whitespace characters (UTF-8 encoded, not HTML entities) are replaced with a regular space. This behaviour is certainly unexpected by users.
I tried replacing \s in the whitespace_re definition with [ \t\n\r\f\v] (following the re module docs), and it seems to work well for my case (non-breaking spaces inside text, not between tags). But it possibly doesn't handle all cases of such differences.
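The difference can be reproduced with the standard library alone. This is a minimal sketch, not htmlmin's actual code: the names unicode_ws and ascii_ws are made up for illustration, and only the character class [ \t\n\r\f\v] comes from the report.

```python
import re

# In Python 3, \s in a str pattern also matches Unicode whitespace such as
# U+00A0 (no-break space) and U+2009 (thin space).
unicode_ws = re.compile(r'\s+')
# Explicit ASCII-only whitespace class, as tried in the report.
ascii_ws = re.compile(r'[ \t\n\r\f\v]+')

text = 'price:\u00a0100\u2009USD  total'

# Every whitespace run, including the Unicode ones, becomes a plain space.
collapsed_unicode = unicode_ws.sub(' ', text)
# Only the ASCII run (the double space) is collapsed.
collapsed_ascii = ascii_ws.sub(' ', text)

print(repr(collapsed_unicode))
print(repr(collapsed_ascii))
```

Passing flags=re.ASCII to re.compile should have the same effect on \s for str patterns, at the cost of also making \w and \d ASCII-only.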
A comment that is immediately opened and then closed results in an IndexError.
>>> minify('<!---->')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/htmlmin/main.py", line 98, in minify
minifier.feed(input)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 117, in feed
self.goahead(0)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 165, in goahead
k = self.parse_comment(i)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 178, in parse_comment
self.handle_comment(rawdata[i+4: j])
File "/usr/local/lib/python2.7/site-packages/htmlmin/parser.py", line 281, in handle_comment
data[1:] if data[0] == '!' else data))
IndexError: string index out of range
For now, I'm working around this issue by replacing those empty comments with <!-- --> before passing the input to minify.
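A minimal sketch of that workaround, plus one possible guard for the failing expression shown in the traceback. Both function names are hypothetical and neither is part of htmlmin; the second only illustrates why data[0] fails on an empty string.

```python
def pad_empty_comments(html):
    """Workaround: make sure the parser never receives empty comment data."""
    return html.replace('<!---->', '<!-- -->')


def strip_leading_bang(data):
    """Possible guard for the expression in handle_comment (assumption)."""
    # str.startswith() is safe on an empty string, unlike data[0].
    return data[1:] if data.startswith('!') else data


print(pad_empty_comments('<p>a</p><!----><p>b</p>'))
print(repr(strip_leading_bang('')))
```

The first function would be applied to the input before calling minify; the second shows the shape of a fix inside the parser, assuming the surrounding code stays as in the traceback.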
Per #22, we currently escape < and other characters that don't require escaping.
According to the spec, only ", ', and & potentially require escaping: https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes
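A rough sketch of what minimal attribute escaping could look like. The function escape_attr_value_minimal and its quote parameter are assumptions for illustration, not htmlmin's API. It escapes every & (more conservative than the spec strictly requires) and only the quote character actually used as the delimiter.

```python
def escape_attr_value_minimal(value, quote='"'):
    """Escape only what a quoted attribute value needs (illustrative)."""
    # Ampersands first, so the entities added below are not double-escaped.
    value = value.replace('&', '&amp;')
    if quote == '"':
        value = value.replace('"', '&quot;')
    elif quote == "'":
        value = value.replace("'", '&#39;')
    return value


# '<' is left alone; only '&' and the delimiter quote are escaped.
print(escape_attr_value_minimal('a < b & "c"'))
```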
Simple example:
import htmlmin

def web_compress_html(content):
    return htmlmin.minify(
        content,
        remove_comments=True,
        remove_empty_space=True,
        remove_all_empty_space=False,
        reduce_empty_attributes=True,
        reduce_boolean_attributes=False,
        remove_optional_attribute_quotes=False,
        keep_pre=True,
        pre_tags=('pre', 'textarea', 'nomin'),
        pre_attr='pre'
    )
in[0]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></body></html>')
out[0]: '<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&b=2"></iframe></body></html>'
in[1]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></nomin></body></html>')
out[1]: '<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&b=2"></iframe></nomin></body></html>'
Is there a way to control whether (or not) ampersands in URLs like in the above example are escaped? Thanks to w3c's validator, which is throwing tons of errors at me (error: “&” did not start a character reference), I'd like to have a switch for this behaviour. However, no matter what I do, htmlmin will turn &amp; into &, even within (custom) pre-tags.
(I just googled through the standards and a number of related articles. Apparently, there is no consensus on how to use ampersands in URLs, not even in the HTML5 standardization community ... and I really do not want to discuss this nonsense. This is about best practices and reducing errors while debugging a website. Nevertheless, thanks a lot for this excellent tool.)
So I just discovered my live site has had XSS for a while due to my use of htmlmin. I see that the bug is fixed on GitHub with 697d4b0, but it appears that htmlmin on PyPI is still exploitable (and from 2015, no less!)
Can you release a new PyPI version that doesn't leave all PyPI users stuck with glaringly obvious XSS holes on their sites?
Hi,
could you please consider adding the tests/ subdirectory to the PyPI release? Distributions can use the tests to check whether the package is working.
I am the maintainer of buccaneer, a static site generator. I have a "minify" plugin that uses htmlmin. During testing we discovered an issue.
Our expectation was that htmlmin would leave the href attribute intact (given appropriate config):
<a class="reference external" href="mailto:sam%40sample.com">sam<span>@</span>sample<span>.</span>com</a>
...but it modifies the href attribute anyway:
<a class="reference external" href="mailto:sam%40sample.com">sam<span>@</span>sample<span>.</span>com</a>
Here is the code that calls into htmlmin:
logger.debug('Minifying: %s' % filename)
compressed = min(uncompressed, remove_comments=True,
remove_all_empty_space=True,
remove_empty_space=True,
keep_pre=True,
pre_attr='href')
f.write(compressed)
Please let me know what you think. Thank you.
Logical AND (&&) in attributes is incorrectly minified:
>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo && bar'></div>")
u'<div ng-if=foo & bar></div>'
But it works with logical OR:
>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo || bar'></div>")
u'<div ng-if=foo || bar></div>'
.... /htmlmin/parser.py", line 112, in build_tag
result = '<{}'.format(escape(tag))
ValueError: zero length field name in format
I ran into an odd issue here. Earlier, I used just the minify(html) function, and it worked great. Lately, I've changed to Minifier.minify, and ran into problems. On pages involving OAuth from Google, I get OpenTagNotFoundError errors from the parser. I tried removing almost all of the code that I sent to minification, but it still raised an error. Switching back to only using minify(html) solves it.
I have tried to debug this, but I am unable to figure it out. I tried sending only very basic HTML, which I am certain is correct, with all tags opened and closed in the correct order.
I see in the code that minify does not call anything in the parser, which Minifier.minify does. Is there a reason why there is a difference here?
Generates an error with the following string:
import htmlmin
htmlmin.minify('H&M')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/htmlmin/main.py", line 93, in minify
minifier.close()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 118, in close
self.goahead(1)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 213, in goahead
self.error("EOF in middle of entity or char ref")
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 121, in error
raise HTMLParseError(message, self.getpos())
HTMLParseError: EOF in middle of entity or char ref, at line 1, column 2
is transformed into <input value=>>, which is not good HTML.
"Foo € bar" should remain with the spaces around it.
I've tested it in <title>, so it's surely broken there.
It is not possible to parse this:
htmlmin.minify('<a href="http://example.com/?a=1&b=2" title="Tomé">Tomé</a>')
Hello,
The middleware looks for a Content-Type header whose value is exactly text/html. However, Flask’s default is text/html; charset=utf-8. Using this middleware as-is in a Flask app thus doesn’t work.
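One way such a check could tolerate parameters like charset is to compare only the media type. This is a sketch under that assumption; is_html_response is a hypothetical helper, not the middleware's actual code.

```python
def is_html_response(content_type):
    """Return True if the Content-Type media type is text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'


print(is_html_response('text/html; charset=utf-8'))
print(is_html_response('application/json'))
```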
When I minify this fragment of HTML:
'<div class="footer-shadow"></div>\n<footer class="footer">\n <div class="d-flex container justify-content-end align-items-center">\n \n\n \n </div>\n</footer>'
I get:
'<div class=footer-shadow></div><footer class=footer><div class="d-flex container justify-content-end align-items-center"></div></footer>'
As you can see, the quotes are missing in class=footer-shadow and class=footer.
It looks like cgi is deprecated and slated for removal in Python 3.13:
INFO - DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
File "C:\Python\Python311\Lib\warnings.py", line 514, in _deprecated
warn(msg, DeprecationWarning, stacklevel=3)
File "C:\Python\Python311\Lib\site-packages\htmlmin\main.py", line 28, in <module>
import cgi
There are only two references to cgi within this package:
https://github.com/mankyd/htmlmin/blob/master/htmlmin/main.py#L28
https://github.com/mankyd/htmlmin/blob/master/htmlmin/escape.py#L33
Also see:
https://peps.python.org/pep-0594/#cgi
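If the package uses cgi only for HTML escaping (an assumption, not verified against the two files linked above), a compatibility import along these lines might be enough. The wrapper name escape_text is hypothetical.

```python
try:
    from html import escape  # Python 3.2 and later
except ImportError:  # Python 2 fallback
    from cgi import escape


def escape_text(text):
    # quote=True is passed explicitly because the two functions have
    # different defaults (cgi.escape: False, html.escape: True).
    return escape(text, quote=True)


print(escape_text('a < b & "c"'))
```

If other cgi helpers are in use, they would need their own replacements; PEP 594 lists suggested alternatives.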
Can we have a new PyPI release?
Releases published on PyPI should also have their corresponding commits tagged in git and published to GitHub for transparency.
Please push the missing tags (especially the latest).
I can see you manage the htmlmin package on PyPI.
https://pypi.python.org/pypi/htmlmin is out of date now and does not contain the latest fixes.
Thank you very much!
It would be very useful to allow some kind of callbacks that let the user handle CSS and JS in whatever way they want, instead of just leaving them unminified.
I just started using your plugin, and noticed that it strips off the / in <br/> tags. This is an issue for me, as I'm validating XHTML.
import htmlmin
test_string = """
<strong>hello world</strong>
<br />
<strong>hello earthling</string>
"""
print(htmlmin.minify(test_string))
$ python test.py
<strong>hello world</strong> <br> <strong>hello earthling</string>
I'm guessing/hoping it is not intended behaviour?
It would be great to be able to mark some tags, such as p tags, as safe to remove whitespace between. That way I could remove the spaces between these elements:
...good idea.</p> <p>Here is why:</p> <p>Because its the best.</p>
This is inspired by sed, e.g., sed -i ..., or sed --in-place.
Example
htmlmin -i index.html
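As a rough illustration of what in-place behaviour would involve, here is a sketch that writes to a temporary file and then swaps it over the original. rewrite_in_place and its transform argument are hypothetical; in practice transform would be something like htmlmin.minify. Nothing here is an existing htmlmin option.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Apply transform(text) to a file and replace the file with the result."""
    with open(path, encoding='utf-8') as f:
        original = f.read()
    result = transform(original)
    # Write next to the target so os.replace() stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as tmp:
            tmp.write(result)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the temporary file.
        os.unlink(tmp_path)
        raise
```

Writing to a temporary file first means a failure partway through leaves the original file intact, which is roughly what sed -i does as well.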
Hi guys,
This is one of the few Python HTML minifiers. I see it has been close to 4 years since the last commit. I was wondering whether that is because there is not much left to create or update, or whether it has been abandoned.
If it has been abandoned, is there a newer library you would recommend?
Hi, I have a problem where htmlmin really can screw up a site. Consider this:
Python 3.5.0 (default, Sep 14 2015, 02:37:27)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from htmlmin import minify
>>> minify('<code>&lt;script&gt;</code>')
'<code><script></code>'
It's pretty dangerous to have the escaped, safe tag get unescaped, as the rest of the site will then be swallowed into the <script> tag, unless you have an end tag further down the page.
I don't know if this, like #17, is down to the Python HTML parser trying to be clever and decoding it automatically without htmlmin knowing about it, or if it's something that can be solved.
Consider
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Hello World</title>
</head>
<body>
<p>Some text with an <a href="https://example.com/something&amp;search=test">url</a></p>
<p>Some text with another <a href="https://example.com/something&search=test">url</a></p>
</body>
</html>
htmlmin index.html
generates
<!DOCTYPE html><html lang=en> <head><meta charset=utf-8><title>Hello World</title></head> <body> <p>Some text with an <a href="https://example.com/something&search=test">url</a></p> <p>Some text with another <a href="https://example.com/something&search=test">url</a></p> <p>And other text with another <a href="https://example.com/something%26search=test">url</a></p> </body> </html>
Notice how the first url changed from https://example.com/something&amp;search=test to https://example.com/something&search=test.
This is technically not correct, as the ampersand character declares the beginning of an entity reference.
Also, the w3 guidelines suggest using &amp; instead of & in this context.
Validators like htacg/tidy-html5#1017 correctly complain about the invalid entity reference.
Example:
In [3]: htmlmin.minify('<meta name="viewport" content="width=device-width">')
Out[3]: '<meta name=viewport content=width=device-width>'
The output is invalid since you can't have an = in an unquoted attribute value.
The fix is just to add or '=' in val to https://github.com/mankyd/htmlmin/blob/master/htmlmin/escape.py#L64
I'll create a PR with the fix + some tests.
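For reference, the HTML syntax rules say an unquoted attribute value must be non-empty and must not contain ASCII whitespace, ", ', =, <, > or `. A sketch of a check covering that whole set follows; can_omit_quotes is a hypothetical helper, not the code in escape.py.

```python
import re

# Characters that force an attribute value to be quoted: ASCII whitespace
# plus the double quote, single quote, =, <, > and backtick.
_NEEDS_QUOTES = re.compile(r'''[ \t\n\f\r"'=<>`]''')


def can_omit_quotes(value):
    """Return True if value is safe as an unquoted attribute value."""
    return bool(value) and _NEEDS_QUOTES.search(value) is None


print(can_omit_quotes('viewport'))
print(can_omit_quotes('width=device-width'))
```

Checking the full set in one place, rather than adding characters one at a time, would also cover values containing > or a backtick.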