Comments (5)
It seems the stack trace is caused by html (the result of the call to urllib) being empty.
For some reason, this line does not work with the NY Times server (perhaps they block such requests to prevent web crawling):
html = urllib.urlopen(url).read()
If, however, I use pycurl to fetch those links instead, it works:
import pycurl
from io import BytesIO

def load_url(url, user_agent=None):
    """Attempt to load the url using pycurl and return the data (which is None if unsuccessful)"""
    databuffer = BytesIO()
    curl = pycurl.Curl()
    try:
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.FOLLOWLOCATION, 1)
        curl.setopt(pycurl.WRITEFUNCTION, databuffer.write)
        if user_agent:
            curl.setopt(pycurl.USERAGENT, user_agent)
        curl.perform()
        data = databuffer.getvalue()
    except pycurl.error:
        # Transfer failed: report it to the caller as None.
        data = None
    finally:
        # Always release the curl handle, even after a failure.
        curl.close()
    return data
Then I can do this successfully:
html = load_url(url)
readable_article = Document(html).summary()
readable_title = Document(html).short_title()
Unfortunately, the command-line version won't work, because it uses urllib, but at least I have a work-around.
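For anyone who would rather avoid the pycurl dependency, here is a rough standard-library sketch of the same idea. It is written for Python 3 (the snippets above are Python 2) and rests on an unconfirmed assumption: that the empty response comes from the server wanting cookies and/or a browser-like User-Agent. The names fetch_html and DEFAULT_USER_AGENT are hypothetical, not part of python-readability.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical default; any browser-like string could be substituted.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; readability-fetch-sketch)"


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=30):
    """Fetch url with a cookie-aware opener and an explicit User-Agent.

    Raises ValueError on an empty body, so the failure shows up here
    rather than as a stack trace inside the HTML parser.
    """
    # Assumption: the server may redirect and expect cookies to be kept.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=timeout) as response:
        body = response.read()
    if not body.strip():
        raise ValueError("empty response body from %s" % url)
    return body  # bytes, like the pycurl version
```

Whether this is enough for the NY Times server is untested; the main point is the explicit check for an empty body before the result is handed to Document.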
from python-readability.
Denis, thanks.
Also, note that python-readability consistently misses the first paragraph of NYT articles. This is due to the HTML tree structure of the NYT articles, where the first paragraph is isolated from the main body of the article.
For instance, the text extracted from http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=1& starts with "Not one of those portraits was of a Japanese."
The web version of Readability (readability.com) works fine though, as does Pocket (getpocket.com).
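A rough way to spot this kind of loss automatically is to compare the first paragraph of the original page with the paragraphs that survive in the summary. The sketch below uses only the standard library (Python 3); first_paragraph and summary_keeps_lead are hypothetical helper names, and treating the first <p> on the page as the article's lead is a heuristic that can be wrong on pages with navigation or teaser text.

```python
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of each <p>, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._chunks = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def _flush(self):
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False

    def close(self):
        super().close()
        self._flush()


def _paragraphs(html):
    parser = _ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def first_paragraph(html):
    """Return the text of the first non-empty <p>, or None if there is none."""
    paragraphs = _paragraphs(html)
    return paragraphs[0] if paragraphs else None


def summary_keeps_lead(original_html, summary_html):
    """True if the page's first paragraph also appears in the summary."""
    lead = first_paragraph(original_html)
    return lead is None or lead in _paragraphs(summary_html)
```

Both arguments are expected to be text (str), so fetched bytes would need decoding first; the second argument would typically be the result of Document(html).summary().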
Hi,
I can see that it works fine with the web readability.js, e.g. with the Readability Redux browser extension.
So some tag transform is probably missing in my fork.
I'll take a look.
You can try https://github.com/mitechie/breadability, which should work fine: it's another fork, better but less well known.
On Tue, Jul 16, 2013 at 2:44 PM, Lindemann wrote:
Also, note that python readability constantly misses the first paragraph of NYT articles. This is due to the HTML tree structure of the NYT articles, where the first paragraph is isolated from the main body of the article.
For instance, the text extracted from http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=1& starts with "Not one of those portraits was of a Japanese."
The web version of Readability (readability.com) works fine though, as does Pocket (getpocket.com).
Best regards, Yuri V. Baburov, Skype: yuri.baburov
Hi Yuri,
Thank you for recommending breadability; I didn't know about it. However, as far as I can see, it also fails to include the first paragraph. The issue has even been added to their GitHub page.
Also, I can't find any documentation on how to use it from Python (e.g. how do I access the title of the article?). Therefore, I'll keep using your version for a while :)