htmlmin's People

Contributors

aabrahamowicz, alexwlchan, avacariu, epsirom, glench, iffy, jmthibault79, mankyd, mina86, mreinhardt, nvie, samupl, tasn, tenzer, trbs, zeebonk

htmlmin's Issues

XHTML incompatible

I tried to run it with htmlmin -p img link --keep-optional-attribute-quotes file.xhtml file_minified.xhtml
but the result is still invalid XHTML: <link href="style.css" type="text/css" rel="stylesheet" /> is converted to <link href="style.css" type="text/css" rel="stylesheet"> and <img src="image.gif" alt="" /> to <img src="image.gif" alt>

Inconsistent behaviour between Python 2 & 3 with unicode whitespaces

Python 3 sets the re.UNICODE flag by default, so \w, \s and so on behave differently than in Python 2. When htmlmin is used with Python 3, all non-breaking, thin and other Unicode whitespace characters (UTF-8 encoded, not HTML entities) are replaced with regular spaces. A user would certainly not expect this behaviour.

I tried replacing \s in the whitespace_re definition with [ \t\n\r\f\v] (following the re module docs), and it seems to work well for my case (non-breaking spaces inside text, not between tags). But it may not handle every such difference.
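The difference can be reproduced with the standard re module alone. The following is a minimal sketch, not htmlmin's actual code: the pattern and function names are made up for illustration, and re.ASCII is shown as a possible alternative to spelling out the character class.

```python
import re

# Hypothetical names for illustration; htmlmin's own pattern is whitespace_re.
unicode_ws_re = re.compile(r'\s+')               # Python 3 str patterns are Unicode-aware
explicit_ws_re = re.compile(r'[ \t\n\r\f\v]+')   # ASCII whitespace spelled out
ascii_flag_ws_re = re.compile(r'\s+', re.ASCII)  # same set, selected via re.ASCII


def collapse(pattern, text):
    """Replace every whitespace run matched by `pattern` with a single space."""
    return pattern.sub(' ', text)


sample = 'a\u00a0b \n c'  # U+00A0 is a non-breaking space

print(repr(collapse(unicode_ws_re, sample)))     # non-breaking space is lost
print(repr(collapse(explicit_ws_re, sample)))    # non-breaking space survives
print(repr(collapse(ascii_flag_ws_re, sample)))  # non-breaking space survives
```

Either of the last two patterns keeps the non-breaking space while still collapsing ordinary whitespace runs.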

Empty comments results in IndexError

A comment that is immediately opened and then closed results in IndexError.

>>> minify('<!---->')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/htmlmin/main.py", line 98, in minify
    minifier.feed(input)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 165, in goahead
    k = self.parse_comment(i)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 178, in parse_comment
    self.handle_comment(rawdata[i+4: j])
  File "/usr/local/lib/python2.7/site-packages/htmlmin/parser.py", line 281, in handle_comment
    data[1:] if data[0] == '!' else data))
IndexError: string index out of range

For now, I'm working around this issue by replacing those empty comments with <!-- --> before passing the input to minify.
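The workaround described above can be sketched as a small pre-processing step. The helper name pad_empty_comments is hypothetical; it only rewrites the exact string <!----> and does nothing else to the markup.

```python
def pad_empty_comments(markup):
    """Turn each empty comment '<!---->' into '<!-- -->'.

    Hypothetical helper: the padded comment has a non-empty body, so a
    handler that looks at the first character of the comment has
    something to index.
    """
    return markup.replace('<!---->', '<!-- -->')


print(pad_empty_comments('<p><!----></p>'))
# Intended use (requires htmlmin): minify(pad_empty_comments(html))
```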

Control handling of escaped ampersands in URLs

Simple example:

import htmlmin

def web_compress_html(content):
    return htmlmin.minify(
        content,
        remove_comments=True,
        remove_empty_space=True,
        remove_all_empty_space=False,
        reduce_empty_attributes=True,
        reduce_boolean_attributes=False,
        remove_optional_attribute_quotes=False,
        keep_pre=True,
        pre_tags=('pre', 'textarea', 'nomin'),
        pre_attr='pre',
    )

in[0]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></body></html>')
out[0]: '<!DOCTYPE html>\n<html><head><title></title></head><body><iframe src="foo.com/bar?a=1&b=2"></iframe></body></html>'
in[1]: web_compress_html('<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&amp;b=2"></iframe></nomin></body></html>')
out[1]: '<!DOCTYPE html>\n<html><head><title></title></head><body><nomin><iframe src="foo.com/bar?a=1&b=2"></iframe></nomin></body></html>'

Is there a way to control whether (or not) ampersands in URLs like in the above example are escaped? Thanks to w3c's validator, which is throwing tons of errors at me (error: “&” did not start a character reference), I'd like to have a switch for this behaviour. However, no matter what I do, htmlmin will turn &amp; into &, even within (custom) pre-tags.

(I just googled through the standards and a number of related articles. Apparently, there is no consensus on how to use ampersands in URLs, not even in the HTML5 standardization community ... and I really do not want to discuss this nonsense. This is about best practices and reducing errors while debugging a website. Nevertheless, thanks a lot for this excellent tool.)

Release new version (live version has XSS!)

So I just discovered my live site has had XSS for a while due to my use of htmlmin. I see that the bug is fixed on GitHub with 697d4b0, but it appears that htmlmin on PyPI is still exploitable (and from 2015, no less!)

Can you release a new PyPI version that doesn't leave all PyPI users stuck with glaringly obvious XSS holes on their sites?

ship testsuite

Hi,

Could you please consider adding the tests/ subdirectory to the PyPI release? Distributions can use the tests to check that the package works.

pre_attr does not work as expected

I am the maintainer of buccaneer, a static site generator. I have a "minify" plugin that uses htmlmin. During testing we discovered an issue.

Our expectation was that htmlmin would leave the href attribute intact (given appropriate config):

<a class="reference external" href="mailto:sam&#37;&#52;&#48;sample&#46;com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

...but it modifies the href attribute anyway:

<a class="reference external" href="mailto:sam%40sample.com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

Here is the code that calls into htmlmin:

            logger.debug('Minifying: %s' % filename)
            compressed = min(uncompressed, remove_comments=True,
                             remove_all_empty_space=True,
                             remove_empty_space=True,
                             keep_pre=True,
                             pre_attr='href')
            f.write(compressed)

Please let me know what you think. Thank you.

javascript in attribute

Logical AND (&&) in attributes is incorrectly minified:

>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo && bar'></div>")
u'<div ng-if=foo & bar></div>'

But it works with logical OR:

>>> import htmlmin
>>> htmlmin.minify(u"<div ng-if='foo || bar'></div>")
u'<div ng-if=foo || bar></div>'

Weird issue with Minifier.minify

I ran into an odd issue here. Earlier, I used just the minify(html) function, and it worked great. Lately, I've changed to Minifier.minify, and ran into problems. On pages involving OAuth from Google, I get OpenTagNotFoundError errors from the parser. I tried removing almost all of the markup that I sent to the minifier, but it still raised the error. Switching back to using only minify(html) solves it.

I have tried to debug this, but I am unable to figure it out. I tried sending only very basic HTML, which I am certain is correct, with all tags opened and closed in the correct order.

I see in the code that minify does not call anything in the parser, which Minifier.minify does. Is there a reason why there is a difference here?

Parsing error with string: H&M

Generates an error with the following string:

import htmlmin
htmlmin.minify('H&M')

Traceback (most recent call last):
  File "", line 1, in
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/htmlmin/main.py", line 93, in minify
    minifier.close()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 118, in close
    self.goahead(1)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 213, in goahead
    self.error("EOF in middle of entity or char ref")
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 121, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: EOF in middle of entity or char ref, at line 1, column 2
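One possible stop-gap is to escape bare ampersands before the text reaches the parser. This is only a sketch under assumptions: the helper name is hypothetical, it has not been checked against any particular htmlmin or Python version, and it would also rewrite ampersands inside script or style content, which may not be wanted.

```python
import re

# An '&' that does not start a named, decimal or hexadecimal character
# reference terminated by ';'.
BARE_AMP_RE = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)


def escape_bare_ampersands(text):
    """Hypothetical helper: replace each bare '&' with '&amp;'."""
    return BARE_AMP_RE.sub('&amp;', text)


print(escape_bare_ampersands('H&M'))
print(escape_bare_ampersands('a &amp; b &#38; c'))  # existing references untouched
```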

utf-8 support

It is not possible to parse this:

htmlmin.minify('<a href="http://example.com/?a=1&b=2" title="Tomé">Tomé</a>')

Missing quotes after string was minified

When I minify this part of my HTML:

'<div class="footer-shadow"></div>\n<footer class="footer">\n    <div class="d-flex container justify-content-end align-items-center">\n        \n\n        \n    </div>\n</footer>'

Then I get
'<div class=footer-shadow></div><footer class=footer><div class="d-flex container justify-content-end align-items-center"></div></footer>'
As you can see, the quotes are missing in class=footer-shadow and class=footer.

Deprecation Warning: 'cgi' is deprecated and slated for removal in Python 3.13

It looks like cgi is deprecated and will be removed in Python 3.13:

INFO  -  DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
           File "C:\Python\Python311\Lib\warnings.py", line 514, in _deprecated
             warn(msg, DeprecationWarning, stacklevel=3)
           File "C:\Python\Python311\Lib\site-packages\htmlmin\main.py", line 28, in
             import cgi

There are only two references to cgi within this package:

https://github.com/mankyd/htmlmin/blob/master/htmlmin/main.py#L28
https://github.com/mankyd/htmlmin/blob/master/htmlmin/escape.py#L33

Also see:
https://peps.python.org/pep-0594/#cgi
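If those references only exist to reach an escape function (an assumption; the linked lines are not reproduced here), a compatibility import along these lines might be enough. html.escape is part of the standard library; the wrapper name escape_attr is hypothetical.

```python
try:
    from html import escape  # Python 3.2 and later
except ImportError:
    from cgi import escape   # fallback for Python 2 only


def escape_attr(value):
    """Hypothetical wrapper: escape &, <, > and quotes in an attribute value."""
    return escape(value, quote=True)


print(escape_attr('a&b "c"'))
```

Note that html.escape with quote=True also escapes single quotes, which the old cgi.escape did not, so output may differ slightly for values containing an apostrophe.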

Issues with minifying br-tag

I just started using your plugin, and noticed that it strips the / from <br/> tags. This is an issue for me, as I'm validating XHTML.

import htmlmin

test_string = """
<strong>hello world</strong>
<br />
<strong>hello earthling</string>
"""

print(htmlmin.minify(test_string))
$ python test.py 
 <strong>hello world</strong> <br> <strong>hello earthling</string>

I'm guessing/hoping it is not intended behaviour?

remove space between certain tags?

It would be great to be able to mark some tags, such as p tags, as safe to remove space between. That way I could remove the space between these elements:

...good idea.</p> <p>Here is why:</p> <p>Because its the best.</p>
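Until such an option exists, the request could be approximated with a post-processing step on the minified output. This is a rough sketch, not an htmlmin feature: the function name and the tags parameter are invented here, and the regex only handles lowercase tag names exactly as listed.

```python
import re


def strip_space_between(markup, tags=('p',)):
    """Hypothetical helper: drop whitespace between a closing tag and the
    next opening tag when both tag names are listed in `tags`."""
    names = '|'.join(re.escape(tag) for tag in tags)
    # Group 1: a closing tag; group 2: the start of the next opening tag.
    # The trailing [\s>] keeps <p> from matching longer names such as <pre>.
    pattern = re.compile(r'(</(?:%s)>)\s+(<(?:%s)[\s>])' % (names, names))
    return pattern.sub(r'\1\2', markup)


text = '...good idea.</p> <p>Here is why:</p> <p>Because its the best.</p>'
print(strip_space_between(text))
```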

Regarding updates

Hi guys,

This is one of the few Python HTML minifiers. I see it has been close to 4 years since the last commit. I was wondering whether that is because there is not much left to create or update, or whether it has been abandoned.

If it has been abandoned, is there any newer library you would recommend?

HTML Entities get decoded when minifying

Hi, I have a problem where htmlmin really can screw up a site. Consider this:

Python 3.5.0 (default, Sep 14 2015, 02:37:27)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from htmlmin import minify
>>> minify('<code>&lt;script&gt;</code>')
'<code><script></code>'

It's pretty dangerous to have the escaped, safe tag get unescaped, as the rest of the site will then be swallowed into the <script> tag, unless you have an end tag further down the page.

I don't know if this, like #17, is down to the Python HTML parser trying to be clever and decoding it automatically without htmlmin knowing about it, or if it's something that can be solved.
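The standard library parser does decode character references in text by default on Python 3.5 and later, and this can be observed without htmlmin. The sketch below only illustrates the convert_charrefs switch of html.parser.HTMLParser; whether and how htmlmin sets it is not shown here, and the class and function names are made up.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper that records the parser callbacks it receives."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.events.append(('entityref', name))


def parse_events(markup, convert_charrefs):
    parser = EventCollector(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


snippet = '<code>&lt;script&gt;</code>'
print(parse_events(snippet, convert_charrefs=True))   # text arrives already decoded
print(parse_events(snippet, convert_charrefs=False))  # references arrive separately
```

With convert_charrefs=False the handler sees the reference names themselves, so a serializer can write them back out unchanged instead of emitting a literal <script>.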

`&` not handled correctly inside urls

Consider

<!DOCTYPE html>
<html lang="en">
        <head>
                <meta charset="utf-8">
                <title>Hello World</title>
        </head>
        <body>
                <p>Some text with an <a href="https://example.com/something&amp;search=test">url</a></p>
                <p>Some text with another <a href="https://example.com/something&search=test">url</a></p>
        </body>
</html>

htmlmin index.html generates

<!DOCTYPE html><html lang=en> <head><meta charset=utf-8><title>Hello World</title></head> <body> <p>Some text with an <a href="https://example.com/something&search=test">url</a></p> <p>Some text with another <a href="https://example.com/something&search=test">url</a></p> <p>And other text with another <a href="https://example.com/something%26search=test">url</a></p> </body> </html>

Notice how the first url changed from https://example.com/something&amp;search=test to https://example.com/something&search=test.

This is technically not correct, as the ampersand character marks the beginning of an entity reference.
Also, the W3C guidelines suggest using &amp; instead of & in this context.

Validators like htacg/tidy-html5#1017 correctly complain about the invalid entity reference.
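For illustration, the standard library parser hands attribute values to its callbacks already unescaped, so a tool that writes them back verbatim loses the &amp;. Below is a minimal sketch of re-escaping on output with html.escape; it is not htmlmin's code, and the class and function names are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class StartTagEcho(HTMLParser):
    """Hypothetical helper: re-serialises start tags, escaping attribute values."""

    def __init__(self):
        super().__init__()
        self.seen_values = []  # attribute values as delivered by the parser
        self.output = []       # re-serialised start tags

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. <input disabled>
                parts.append(name)
            else:
                self.seen_values.append(value)
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.output.append('<%s>' % ' '.join(parts))


def reserialise_start_tags(markup):
    parser = StartTagEcho()
    parser.feed(markup)
    parser.close()
    return parser.seen_values, ''.join(parser.output)


values, tags = reserialise_start_tags(
    '<a href="https://example.com/something&amp;search=test">url</a>'
)
print(values)  # the parser delivered a plain '&'
print(tags)    # escaping on output restores '&amp;'
```

A side effect of this approach is that a bare & in the input is also written out as &amp;, which matches what the validators expect.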
