rennat / pynliner

Python CSS-to-inline-styles conversion tool for HTML using BeautifulSoup and cssutils

Home Page: http://pythonhosted.org/pynliner/

Python 100.00%

pynliner's Issues

PynLiner no longer works with ID + class selection

https://github.com/rennat/pynliner/pull/31/files

aggressively converts html entities, not really usable for email templates

First off, thanks. This is a great package + well designed & documented.

I ran into some issues while trying to get this to work with Mako templates (which I use to generate email).

Mako templates don't work because the BeautifulSoup parser and formatter mangle things like:

<% value = 1 + 1 %>

and

% if loop.index < 6: 
% endif

becomes

% if loop.index &lt; 6: 
% endif

I tried forking and running on BS4, which allows a 'formatter' option to be used when printing (overriding your _get_output function). I couldn't find a formatter that produces reliable or desirable results.

I think the design of this package will make it hard to use on email templates, so I wanted to suggest a warning in the docs. If I have time, I might try to get this working later in the week, but I'm sadly not hopeful.

Anyways, great package.

[request] support of 'inherit'

/*----  a standard css template ----*/
.h3 {
    font-size: 20px;
}

span {
    font-size: 12px;
}

/*---- customization ----*/
.customized {
    font-size: inherit;
}

<!---- html ---->
<h3> blah blah ... <span class="customized"> more blah </span> </h3> 

<!---- desired inline result, as interpreted by chrome ---->
<h3 style="font-size: 20px;"> blah blah ... <span style="font-size: 20px"> more blah </span> </h3>

<!---- current inline result by pynliner ---->
<h3 style="font-size: 20px;"> blah blah ... <span style="font-size: 12px"> more blah </span> </h3>
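The requested behavior can be sketched roughly as follows. This is only an illustration, not pynliner code: it assumes the cascade has already been resolved per element (so the span's winning font-size declaration is `inherit`), and it uses a hypothetical dict-based node with `style` and `parent` keys in place of real BeautifulSoup elements.

```python
def resolve_inherit(node, prop):
    """Return the nearest concrete value of `prop`, walking up from `node`.

    `node` is a hypothetical dict: {"style": {...}, "parent": node_or_None}.
    """
    current = node
    while current is not None:
        value = current["style"].get(prop)
        # Skip elements that do not set the property or that defer to a parent.
        if value is not None and value != "inherit":
            return value
        current = current["parent"]
    return None  # no ancestor sets the property; the browser default applies


# Mirrors the example above: h3 is 20px, the customized span says "inherit".
h3 = {"style": {"font-size": "20px"}, "parent": None}
span = {"style": {"font-size": "inherit"}, "parent": h3}
print(resolve_inherit(span, "font-size"))
```

An inliner would write the resolved value into the span's style attribute instead of the literal `inherit` or the lower-priority `12px`.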

Unmatched css selector, and invalid regex in codebase

The exception message given is not very clear; I recommend making it clear.

    raise Exception("No match was found. We're done or something is broken")

I used the debugger to extract the failing css selector:

.my-class table tr:nth-child(2n) does not match the regex '([_0-9a-zA-Z-#.:*"\'\[\\]=]+)$'. (This expression doesn't compile in my regex tools. The closest working regexp I could come up with is: [_0-9a-zA-Z\-#.:*"\'\[\]=]+$)

I find that the regex is missing parenthesis captures.

With parentheses added, the working regex does match the selector.

Suggested change: update the regexp here to:
pynliner/soupselect.py:98 [_0-9a-zA-Z\-#.:*"\'\[\]=]+$
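A quick check of the report, as a sketch: the first pattern below is the tokenizer regex from the issue written as a raw string, and the second is an assumed variant with `(` and `)` added to the character class. That variant is a guess at what the reporter meant and is not taken from the pynliner source.

```python
import re

selector = ".my-class table tr:nth-child(2n)"

# Rightmost-token pattern as described in the issue (raw-string form).
original = re.compile(r'''([_0-9a-zA-Z\-#.:*"'\[\]=]+)$''')

# Assumed fix: also accept parentheses so functional pseudo-classes tokenize.
with_parens = re.compile(r'''([_0-9a-zA-Z\-#.:*"'\[\]=()]+)$''')

# The selector ends in ")", which the original character class rejects.
print(original.search(selector))

# With parentheses allowed, the rightmost token is found.
print(with_parens.search(selector).group(1))
```

Accepting the token is only the first step; the selector engine would still need to implement `:nth-child()` itself.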

Does not properly apply nested selectors

This HTML

<style>
ol ol { list-style-type: lower-alpha; }
ol ol ol { list-style-type: lower-roman; }
</style>

<ol><li><p>Sublist:</p>
  <ol><li>Item</li>
  </ol>
</ol>

results in the nested ol getting lower-roman applied instead of lower-alpha.

Pynliner Determines Absolute URLs by checking for http:// at the start, ignores https and protocol agnostic //server.com

Described it mostly in the title. The call to _get_external_styles determines if a url is absolute by checking for http:// at the start. It really should use urlparse and fill in the missing components (scheme, host, path, etc).

https://github.com/rennat/pynliner/blob/master/pynliner/__init__.py#L163
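A minimal sketch of the suggested approach, using Python 3's `urllib.parse` (the Python 2 equivalent lives in the `urlparse` module). The function names `is_absolute` and `resolve_stylesheet_url` and the example URLs are made up for illustration and are not part of pynliner.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(url):
    """True when the URL names a host (covers http, https and //host forms)."""
    return bool(urlparse(url).netloc)


def resolve_stylesheet_url(page_url, href):
    """Fill in whatever the href is missing (scheme, host, path) from page_url."""
    return urljoin(page_url, href)


page = "https://example.com/mail/index.html"
for href in ("style.css", "/static/style.css",
             "//cdn.example.com/style.css", "http://other.example/style.css"):
    print(href, "->", resolve_stylesheet_url(page, href))
```

Because `urljoin` leaves already-absolute URLs untouched, the explicit `http://` prefix check becomes unnecessary.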

fix error with non ascii

This patch worked for me

--- a/pynliner/__init__.py
+++ b/pynliner/__init__.py
@@ -208,7 +208,7 @@ class Pynliner(object):

         Returns self.output
         """
-        self.output = unicode(str(self.soup))
+        self.output = unicode(self.soup.renderContents(), 'utf-8')
         return self.output

def fromURL(url, log=None):

Extracting css contents from a local file produces error

When I tried to parse an HTML file that links to a local CSS file, it generated this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pynliner/__init__.py", line 112, in run
    self._get_styles()
  File "/usr/local/lib/python2.7/dist-packages/pynliner/__init__.py", line 141, in _get_styles
    self._get_external_styles()
  File "/usr/local/lib/python2.7/dist-packages/pynliner/__init__.py", line 164, in _get_external_styles
    url = self.relative_url + url
AttributeError: 'Pynliner' object has no attribute 'relative_url'

Looks like the attribute 'relative_url' is not defined when the call comes from a function other than from_url(). To be more clear, when one tries to do a Pynliner().from_string(html_string).with_cssString(css_string).run() where the html_string already contains a 'link' to a local stylesheet, it generates this error.

For starters, it might be confusing to see this error. A newbie, in the absence of guiding documentation, might want to pass the stringified css file to the stringified html file, where the html file already contained a link to that css file locally

Embed javascript in the HTML file

I'm slightly new to web technologies, so it might seem weird to you that I'm suggesting this. Maybe it is too easy, or maybe my understanding about how javascript-html-css work in tandem is shaky.

I'm encountering a use case where I have to generate one single-file html page, which contains all the css and javascript. If I can put all the 'required' JS functions inside <script> tags, either inside the tag, or just before the closing tag, that would be awesome. I am not perfectly sure if this comes under pynliner project, but posting it here if it is simple enough for more knowledgeable people to implement this feature.

Consider removing "The generated output of this software shall not be used in a mass marketing service." from license

Hi there!
I totally understand the reason for that clause and I'm sympathetic with the general goal of reducing spam.

I would just like to point out an effect of that line: due to this restriction of use, this module cannot be included in the 'main' section of the Debian repository (while it would still be available to Debian users via the non-free section).

Feel free to close this report straight away; I just want to maybe make you revisit your position (but I would package this module regardless of your decision here).

thanks for considering,
Sandro

Python 3

Let's get the convo started :)

It looks like for starters we'd need to upgrade to BeautifulSoup4 (the package is actually named differently on pypi it seems): https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Not PEP 8 compliant

Needs to be re-factored to conform to PEP 8 guidelines.

Chained CSS selectors get over-applied

CSS of the form selector, selector { style-rules; } does not seem to get applied correctly. See the example below, where both <p> tags render orange instead of just the one living inside the appropriate wrapper class.

This HTML:

<div class="orange-wrapper">
    <p>Orange.</p>
</div>
<div>
    <p>Black.</p>
</div>

Mixed with this CSS:

.orange-wrapper p, .super-orange-wrapper p {
    color:orange;
}

Results in this inlined HTML:

<div class="orange-wrapper">
    <p style="color: orange">Orange.</p>
</div>
<div>
    <p style="color: orange">Black.</p>
</div>
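One way to avoid this kind of over-matching is to split the selector group on commas and match each member against the document on its own. The sketch below shows only the splitting step; `split_selector_group` is a hypothetical helper, and the naive comma split would break on commas inside attribute values or `:not(a, b)`.

```python
def split_selector_group(selector_text):
    """Split 'a b, c d' into its member selectors.

    Naive: does not handle commas inside [attr="a,b"] or :not(a, b).
    """
    return [part.strip() for part in selector_text.split(",") if part.strip()]


group = ".orange-wrapper p, .super-orange-wrapper p"
for member in split_selector_group(group):
    # Each member would be matched from the document root independently,
    # and the rule's declarations applied to the union of the matches.
    print(member)
```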

Should not escape HTML entities

First off, thanks for the great library. Recently I upgraded the lib from 0.5.1 to 0.7.1 and now it breaks my emails.

How to reproduce

$ python -V
Python 2.7.6
$ pip freeze | grep pynliner
pynliner==0.7.1
$ python -c "import pynliner; print pynliner.fromString('<p>&nbsp;</p>')"

Expected

<p>&nbsp;</p>

Actual

<p> </p>
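A possible stopgap, sketched under the assumption that the "Actual" output contains the decoded character U+00A0 (a non-breaking space) instead of a plain space: re-encode non-ASCII characters as numeric character references after inlining. `reescape_non_ascii` is a hypothetical helper, and the snippet uses Python 3's `html` module only to simulate the decoding step.

```python
import html


def reescape_non_ascii(markup):
    """Turn every non-ASCII character back into a numeric character reference."""
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Simulate what an entity-decoding parser does to the input.
decoded = html.unescape("<p>&nbsp;</p>")
print(repr(decoded))

# Post-process the inlined output before sending the email.
print(reescape_non_ascii(decoded))
```

The result is `&#160;`, which is not the named entity `&nbsp;` but refers to the same character.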

please include tests in the pypi tarball

Hey there! I noticed setup.py requires mock for tests, but in the tarball released on PyPI the file tests.py is missing. Could you add it the next time you cut a release? Thanks!

BeautifulSoup version installed by pip is not compatible with pynliner

Hi,

pip installs BeautifulSoup 4.0b now by default:

pip install -U BeautifulSoup
Downloading/unpacking BeautifulSoup
  Downloading BeautifulSoup-4.0b.tar.gz (42Kb): 42Kb downloaded
  Running setup.py egg_info for package BeautifulSoup
Installing collected packages: BeautifulSoup
  Running setup.py install for BeautifulSoup
Successfully installed BeautifulSoup
Cleaning up...

BeautifulSoup 4.x is not backwards-compatible with BeautifulSoup 3.x (it doesn't even provide BeautifulSoup package anymore). So I think pynliner's setup.py should be fixed (it should require BeautifulSoup < 4.0a).

Doesn't handle multiple classes (or classes & ID) in a selector?

Pynliner appears to work in 95% of our cases, but doesn't appear to handle the case with selectors that combine multiple classes (or have a class plus an id). An example is:

<style>
p.colored.red { color: red; }
</style>

<p class="colored red"> Red? </p>

The paragraph should be colored red since it has both classes, but it appears that pynliner fails to recognize this. (Similarly for a style selector like "p#profile.red".)

Perhaps I'm missing a configuration knob? Or if not, are there any plans to support this? Thanks for the great tool!
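For reference, matching a compound selector such as `p.colored.red` or `p#profile.red` amounts to requiring the tag, the id, and every listed class at once. The sketch below illustrates that with stdlib regexes; `parse_compound` and `matches` are hypothetical helpers, not pynliner's implementation, and they ignore attribute selectors and pseudo-classes.

```python
import re


def parse_compound(selector):
    """Split 'tag#id.class1.class2' into (tag, id, set_of_classes)."""
    tag_match = re.match(r"[a-zA-Z][a-zA-Z0-9]*", selector)
    id_match = re.search(r"#([\w-]+)", selector)
    classes = set(re.findall(r"\.([\w-]+)", selector))
    return (tag_match.group(0) if tag_match else None,
            id_match.group(1) if id_match else None,
            classes)


def matches(selector, tag, element_id=None, class_attr=""):
    """True if an element with this tag, id and class attribute satisfies the selector."""
    want_tag, want_id, want_classes = parse_compound(selector)
    if want_tag is not None and want_tag != tag:
        return False
    if want_id is not None and want_id != element_id:
        return False
    # Every class named in the selector must be present on the element.
    return want_classes <= set(class_attr.split())


print(matches("p.colored.red", "p", class_attr="colored red"))
print(matches("p.colored.red", "p", class_attr="colored"))
```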

regression parsing .btn-large [class^="icon-"], results in exception raised

Version 0.7.0 raises an exception parsing .btn-large [class^="icon-"],
This did not occur in previous versions.

Test case:

# py.test pyliner_test.py

import pytest
import textwrap
from pynliner import fromString as inlinecss

def no_whitespace(text):
    return ''.join(text.split())

def test_email():
    expected = textwrap.dedent("""\
        tbd
        """)

    input = textwrap.dedent("""\
        <!DOCTYPE html>
            <html>
              <head>
                <style type="text/css">
            body {
              margin: 0;
              font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
              font-size: 14px;
              line-height: 20px;
              color: #333333;
              background-color: #ffffff;
              max-width: 600px;
            }
            .btn-large [class^="icon-"],
            .btn-large [class*=" icon-"] {
              margin-top: 4px;
            }
                </style>

              </head>
              <body>
                <p>hello world</p>
              </body>
            </html>""")

    result = inlinecss(input)

    assert no_whitespace(str(result)) == no_whitespace(str(expected))

Output:

$ py.test  pynliner_bug_test.py
============================================================== test session starts ===============================================================
platform linux2 -- Python 2.7.9, pytest-2.9.1, py-1.4.31, pluggy-0.3.1
rootdir: /home/vagrant/synced/symqe-bugfix-BIZ-493-semicolons-are-invading-car-ops-URLs, inifile:
collected 1 items

pynliner_bug_test.py F

==================================================================== FAILURES ====================================================================
___________________________________________________________________ test_email ___________________________________________________________________

    def test_email():
        expected = textwrap.dedent("""\

            """)

        input = textwrap.dedent("""\
            <!DOCTYPE html>
                <html>
                  <head>
                    <style type="text/css">
                body {
                  margin: 0;
                  font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
                  font-size: 14px;
                  line-height: 20px;
                  color: #333333;
                  background-color: #ffffff;
                  max-width: 600px;
                }
                .btn-large [class^="icon-"],
                .btn-large [class*=" icon-"] {
                  margin-top: 4px;
                }
                    </style>

                  </head>
                  <body>
                    <p>hello world</p>
                  </body>
                </html>""")

>       result = inlinecss(input)

pynliner_bug_test.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../env/lib/python2.7/site-packages/pynliner/__init__.py:272: in fromString
    return Pynliner(log).from_string(string).run()
../../env/lib/python2.7/site-packages/pynliner/__init__.py:126: in run
    self._apply_styles()
../../env/lib/python2.7/site-packages/pynliner/__init__.py:219: in _apply_styles
    for element in select(self.soup, selector.selectorText):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

soup = <!DOCTYPE html>\n\n<html>\n<head>\n\n</head>\n<body>\n<p>hello world</p>\n</body>\n</html>, selector = '.btn-large [class^'

    def select(soup, selector):
        """
        soup should be a BeautifulSoup instance; selector is a CSS selector
        specifying the elements you want to retrieve.
        """
        handle_token = True
        current_context = [(soup, [])]
        operator = None
        while selector:
            if handle_token:
                # Get the rightmost token
                handle_token = False
                match = re.search('([_0-9a-zA-Z-#.:*"\'\[\\]=]+)$', selector)
                if not match:
>                   raise Exception("No match was found. We're done or something is broken")
E                   Exception: No match was found. We're done or something is broken

../../env/lib/python2.7/site-packages/pynliner/soupselect.py:109: Exception
============================================================ 1 failed in 0.70 seconds ============================================================

please include documentation in pypi releases

Hello! it would be great if the pypi released tarballs could contain the documentation (docs/ dir in particular)

Strange behavior on URLs with query strings

Hello,

First of all, I would like to say "Thanks!" for making Pynliner. I started using it for converting email templates with embedded <style> tags to inline styles, since many email clients like to strip out embedded CSS. It's been very helpful overall.

However, one issue I've discovered is that visible URLs that contain query strings show up looking a bit odd. I've been able to reproduce the issue as follows:

import pynliner
# The original string- a link containing the text of a URL with query strings
>>> my_str = '<a href="http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl">http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl</a>'
# Convert with pynliner...
>>> pynliner.fromString(my_str)
u'<a href="http://www.example.com?utm_campaign=abcd&amp;utm_medium=efgh&amp;utm_source=ijkl">http://www.example.com?utm_campaign=abcd&utm;_medium=efgh&utm;_source=ijkl</a>'

As you can see, the ampersands in the href param get encoded into &amp;.

Additionally, if you look at the URL wrapped within the <a> tag, the last two underscores are prefixed with a semicolon, eg. utm_medium becomes utm;_medium.

Strangely enough, AFAICT, the issue with the underscores only occurs within a query string. For example, if you change all of the instances of example.com in the above python code to ex_ample.com, the underscores come out just as they went in.

Is this a known issue? Is there any kind of workaround available? Or am I crazy and missing something obvious...?

Thanks!
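A possible workaround, sketched on the assumption that the parser is reading `&utm` as the start of an entity reference: escape bare ampersands as `&amp;` before handing the markup to pynliner, which is also what valid HTML expects in text and attribute values. `escape_bare_ampersands` is a hypothetical helper.

```python
import re

# An "&" that is NOT already the start of a named or numeric entity.
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def escape_bare_ampersands(markup):
    """Replace stray '&' with '&amp;' while leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", markup)


url_text = "http://www.example.com?utm_campaign=abcd&utm_medium=efgh&utm_source=ijkl"
print(escape_bare_ampersands(url_text))
```

Whether this fully avoids the `utm;_medium` artifact depends on the parser version in use, so it is worth verifying against the actual output.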

media queries are removed

Given the following sample code:

<!DOCTYPE html>
<html>
    <head>
        <title>Example</title>
        <style type="text/css">
            @media screen and (min-device-width: 480)
            {
                #content
                {
                        width: 480px;
                }
            }
            #content
            {
                border: 1px solid black;
            }
        </style>
    </head>
    <body>
        <div id="content">
            <h1>Hello world</h1>
        </div>
    </body>
</html>

I would expect the default border rule to be applied, and the media query hunk to be left alone (since it will only be parsed and used on compatible devices).

As a potential quick fix, pynliner could skip style or link blocks that use media-query syntax, or that declare a media type other than screen.
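The "leave media queries alone" idea could be approximated by pulling `@media` blocks out of the stylesheet before inlining and re-emitting them in a `<style>` tag afterwards. The sketch below does the extraction with simple brace counting; `split_media_blocks` is a hypothetical helper, and it does not account for braces inside strings or comments.

```python
def split_media_blocks(css):
    """Return (css_without_media_blocks, list_of_media_blocks)."""
    plain, media = [], []
    i = 0
    while i < len(css):
        start = css.find("@media", i)
        if start == -1:
            plain.append(css[i:])
            break
        plain.append(css[i:start])
        open_brace = css.find("{", start)
        if open_brace == -1:
            # Malformed rule: keep the remainder with the ordinary CSS.
            plain.append(css[start:])
            break
        # Walk forward until the brace that opened the @media block is closed.
        depth = 0
        j = open_brace
        while j < len(css):
            if css[j] == "{":
                depth += 1
            elif css[j] == "}":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        media.append(css[start:j + 1])
        i = j + 1
    return "".join(plain), media


css = ("@media screen and (min-device-width: 480px) { #content { width: 480px; } } "
       "#content { border: 1px solid black; }")
plain_css, media_blocks = split_media_blocks(css)
print(plain_css.strip())
print(media_blocks)
```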

complex CSS selectors

Pynliner currently fails to apply the selector element#id.class.

Other selectors that fail include adjacent sibling (a + b) and child (a > b) selectors as well as pseudo selectors (not sure if pseudo selectors should be included... thoughts anyone?)

Need to exclude some rules from Pynliner

Related to #13

There are some rules we use as fixes for crappy web email clients. For example:

  .ExternalClass * {
    line-height: 100%;
  }

Targets the email when it's displayed within Hotmail/Outlook.com (see http://templates.mailchimp.com/development/css/client-specific-styles/ )

It would be useful to have a way to mark some CSS rules to be ignored by pynliner. For now I can mark that block myself - remove it pre-pynliner and then add it back in afterwards.

But seems like a general solution to this would be a good feature.

Fail to convert non ASCII html

pynliner fails to convert HTML when non-ASCII characters are combined with pseudo-classes.

Minimal code to reproduce

# -*- coding: utf-8 -*-
import pynliner

html = u'''
<style>
blockquote>:first-child { margin-top: 0; }
</style>
<ul>
   <li>
   あいうえお
   <ul>
      <li>
      </li>
   </ul>
   </li>
</ul>
'''

print pynliner.fromString(html)

Error outputs

Traceback (most recent call last):
  File "c:/central/md2email/poc.py", line 19, in <module>
    print pynliner.fromString(html)
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 318, in fromString
    return Pynliner(**kwargs).from_string(string).run()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 139, in run
    self._apply_styles()
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\__init__.py", line 262, in _apply_styles
    for element in select(self.soup, selector.selectorText):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 174, in select
    context_matches = [el for el in context[0].find_all(tag, find_dict) if checker(el)]
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 98, in checker
    if not func(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 82, in <lambda>
    'first-child': lambda el: is_first_content_node(getattr(el, 'previousSibling', None)),
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 70, in is_first_content_node
    if is_white_space(el):
  File "C:\Users\ITO_YU\Anaconda2\lib\site-packages\pynliner\soupselect.py", line 50, in is_white_space
    if isinstance(el, bs4.NavigableString) and str(el).strip() == '':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-8: ordinal not in range(128)
exit status 1

>>> pynliner.__version__
'0.8.0'

python --version
Python 2.7.13 :: Anaconda, Inc.
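The traceback points at `str(el).strip()`, which on Python 2 implicitly encodes unicode text as ASCII. A hedged sketch of a check that avoids the implicit encode on both Python 2 and 3 is shown below; `is_blank_text` is a hypothetical stand-in for the whitespace test, and the `isinstance(el, bs4.NavigableString)` guard from the original is omitted so the snippet has no third-party imports.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")  # unicode on Python 2, str on Python 3


def is_blank_text(value):
    """True if the text consists only of whitespace, without encoding to ASCII."""
    return text_type(value).strip() == u""


print(is_blank_text(u"  \n"))
print(is_blank_text(u"\n   あいうえお\n"))
```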

Style not being applied when multiple elements have same class/Id and same content

I am facing a weird issue when using pynliner.fromString(htmlString). When multiple elements have the same class/ID, it fails to apply CSS rules to all elements except the first one. Here is the sample code with expected output and actual output.

<style>
    .text-right {
        text-align: right;
    }
    .box {
        width:200px;
        border: 1px solid #000;
    }
</style>
<div class="box">
    <p>Hello World</p>
    <p class="text-right">Hello World on right</p>
    <p class="text-right">Hello World on right</p>
</div>

Expected Output:

Actual Output:

Can't use pynliner - some errors

I have some code:

from pynliner import Pynliner
p=Pynliner().from_url('https://saturn.pl/')
p.run()

In Python 3:

"venv3/lib/python3.5/site-packages/pynliner/__init__.py", line 179, in _get_external_styles
    self.style_string += self._get_url(url)
TypeError: Can't convert 'bytes' object to str implicitly

In Python 2:

"venv2/local/lib/python2.7/site-packages/pynliner/__init__.py", line 179, in _get_external_styles
    self.style_string += self._get_url(url)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

>>> pynliner.__version__
'0.7.1'

Please help me achieve result.
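Both tracebacks come from concatenating a downloaded stylesheet (bytes) onto a text string. Byte 0xef at position 0 is consistent with a UTF-8 byte order mark, though that is an inference from the error message. A minimal sketch of decoding before concatenation follows; `decode_stylesheet` is a hypothetical helper that assumes UTF-8, whereas a real fix should honor the charset declared by the server or stylesheet.

```python
def decode_stylesheet(raw, encoding="utf-8-sig"):
    """Return text for a fetched stylesheet, dropping a leading UTF-8 BOM if present."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already text


style_string = ""
# Simulated response body: BOM followed by CSS, as bytes.
style_string += decode_stylesheet(b"\xef\xbb\xbfbody { margin: 0; }")
print(style_string)
```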

Website for test_08_fromURL is unavailable

The URL specified in test_08_fromURL, http://media.tannern.com/pynliner/test.html, is unavailable, which causes this test to fail.

Doctype errors on Linux (Python 2.7.1+)

Everything works great on our local machines, however on the server we get the following behaviour:

pynliner.fromString("""""")
u'<!>'

note the double <!<! and >>

Simpler case:

pynliner.fromString("<!doctype html>")
u'<!<!doctype html>>'

I could not pinpoint where the error comes from.

I am using:
cssutils==0.9.8a2
pynliner==0.2.1

pynliner doesn't work with unicode (Mac OS X, BeautifulSoup 3.2.0)

It fails with UnicodeDecodeError ('ascii' codec can't decode byte 0xd0 in position 213: ordinal not in range(128)).

This is because of this line: https://github.com/rennat/pynliner/blob/master/pynliner/__init__.py#L211 - there is a string conversion without encoding specified so it fails for non-ascii strings.

Unicode Patches

Just wanted to say thank you for this very useful project. Pynliner has been working great for me, apart from crashes due to Unicode handling. There have been a couple of patches (#11 #16 #19) which seem to work for me, so it would be great to see some of these get merged to the main branch.