Add charset info to the clean html (python-readability issue, open)

rsuhada commented on August 19, 2024

Add charset info to the clean html


Comments (3)

buriy commented on August 19, 2024

Hi rsuhada,

How do you access that get_clean_html() method?

The Document class usually tries to guess the document encoding, but that might have failed for you.
In that case you need to specify it manually, either by subclassing or by using the encoding field.

I'm not sure how to implement encoding guessing transparently for those who need it while skipping it for those who already know their document encoding.
Any ideas?


rsuhada commented on August 19, 2024

Hi buriy,

Here is what I do:

import codecs

import requests
from readability import Document

res = requests.get(url)
article = Document(res.text)
article_clean_html = article.get_clean_html()

with codecs.open("test_clean.html", encoding="utf-8", mode="w") as f:
    f.write(article_clean_html)

When I open test_clean.html, I see garbled unicode characters.
Everything is fine if I include the charset info in the html:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

The encoding can be taken from res.encoding, which is set on the response returned by requests.get().

Alternatively, is this actually the correct way to use the readability package and save its output?

Thank you!


buriy commented on August 19, 2024

Ok, got you now.

You can do:

with codecs.open("test_clean.html", encoding="utf-8", mode="w") as f:
    f.write('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />')
    f.write(article_clean_html)

But I agree that this could be an enhancement: the library could add the tag to the HTML itself.
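
For anyone who wants this today without changing the library, here is a minimal sketch of a helper that adds the tag to an HTML string before saving. The name add_charset_meta is hypothetical and is not part of python-readability. The thread does not say whether the output of get_clean_html() contains a <head> element, so the sketch handles both cases and simply prepends the tag when it finds neither <head> nor <html>.

```python
import re


def add_charset_meta(html, encoding="utf-8"):
    """Return html with a Content-Type meta tag declaring `encoding`.

    Hypothetical helper for illustration; not part of python-readability.
    """
    meta = (
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=%s" />' % encoding
    )

    # Leave the document alone if a meta tag already declares a charset.
    if re.search(r"<meta[^>]+charset\s*=", html, flags=re.IGNORECASE):
        return html

    # Prefer placing the tag right after an existing opening <head> tag.
    head = re.search(r"<head\b[^>]*>", html, flags=re.IGNORECASE)
    if head:
        return html[:head.end()] + meta + html[head.end():]

    # No <head>: create one right after the opening <html> tag.
    root = re.search(r"<html\b[^>]*>", html, flags=re.IGNORECASE)
    if root:
        return (
            html[:root.end()] + "<head>" + meta + "</head>" + html[root.end():]
        )

    # Bare fragment: prepend the tag, as in the snippet above.
    return meta + html
```

The declared charset has to match the encoding actually used to write the file. With codecs.open(..., encoding="utf-8") that means calling add_charset_meta(article_clean_html, "utf-8") rather than passing res.encoding, which describes the downloaded page and not the saved file.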

