I've installed readability-lxml with pip on Ubuntu 12.10 (64-bit, AMD), and all my pac

Getting lxml stack traces with NY Times urls about python-readability HOT 5 CLOSED

buriy commented on August 19, 2024

Getting lxml stack traces with NY Times urls

from python-readability.

Comments (5)

dpapathanasiou commented on August 19, 2024

It seems the stack trace is caused by html (the result of the call to urllib) being empty.

For some reason, this line does not work with the NY Times server (perhaps they've disabled it to prevent web crawling):

html = urllib.urlopen(url).read()

If, however, I use pycurl to fetch those links instead, it works:

import pycurl
from cStringIO import StringIO

def load_url(url, user_agent=None):
    """Attempt to load the url using pycurl and return the data (None if unsuccessful)."""

    databuffer = StringIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, 1)  # follow HTTP redirects
    curl.setopt(pycurl.WRITEFUNCTION, databuffer.write)
    if user_agent:
        curl.setopt(pycurl.USERAGENT, user_agent)
    try:
        curl.perform()
        data = databuffer.getvalue()
    except pycurl.error:
        # transfer failed (network or protocol error): signal it with None
        data = None
    finally:
        # always release the handle, even if perform() raised
        curl.close()

    return data

Then I can do this successfully:

    from readability.readability import Document

    html = load_url(url)
    doc = Document(html)  # build the document once and reuse it
    readable_article = doc.summary()
    readable_title = doc.short_title()

Unfortunately, the command-line version won't work, because it uses urllib, but at least I have a work-around.
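
Since the stack trace appears to come from an empty html string, one option is to reject an empty body before it ever reaches Document. The following is only a minimal sketch, assuming Python 3 and the standard urllib.request module; fetch_html, DEFAULT_USER_AGENT, and the opener parameter are hypothetical names introduced for illustration, and nothing in this thread confirms that sending a User-Agent header changes what the NY Times server returns.

```python
import urllib.request

# Hypothetical value; any descriptive User-Agent string could be used here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def fetch_html(url, user_agent=DEFAULT_USER_AGENT,
               opener=urllib.request.urlopen, timeout=30):
    """Fetch url and return the raw body as bytes.

    Raises ValueError if the body is empty or whitespace only, so the
    caller gets a clear error instead of a stack trace from the parser.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with opener(request, timeout=timeout) as response:
        body = response.read()
    if not body.strip():
        raise ValueError("empty response body for %s" % url)
    return body
```

The opener parameter exists only so the function can be exercised without network access; in normal use it can be left at its default. The guard reports the empty response clearly, but it does not explain why the server sent one.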


buriy commented on August 19, 2024

Denis, thanks.


commented on August 19, 2024

Also, note that python readability constantly misses the first paragraph of NYT articles. This is due to the HTML tree structure of the NYT articles, where the first paragraph is isolated from the main body of the article.

For instance, the text extracted from http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=1& starts with "Not one of those portraits was of a Japanese."

The web version of Readability (readability.com) works fine though, as does Pocket (getpocket.com).
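
To check which paragraphs get dropped, the paragraphs of the original page can be compared with those in the extracted summary. This is a rough diagnostic sketch using only the Python 3 standard library; ParagraphCollector, paragraphs, and missing_paragraphs are hypothetical helpers, and it assumes that the summary is HTML in which the kept paragraphs still appear as p elements with unchanged text.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element, in order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        # Store the paragraph gathered so far (if any) and reset the state.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraphs(html):
    """Return the text of all <p> elements found in html."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep a trailing paragraph that was never closed
    return collector.paragraphs


def missing_paragraphs(original_html, summary_html):
    """Return paragraphs of original_html that are absent from summary_html."""
    kept = set(paragraphs(summary_html))
    return [p for p in paragraphs(original_html) if p not in kept]
```

Called with the fetched page and the result of summary(), missing_paragraphs lists every paragraph the extraction left out, including navigation or footer text, so the output still needs to be read by hand.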


buriy commented on August 19, 2024

Hi,

I can see that it works fine with the web readability.js, e.g. with the
Readability Redux browser extension.

So my fork is probably missing some tag transform.

I'll take a look.

You can try https://github.com/mitechie/breadability, which should work fine:
it's another fork, better but less well known.

On Tue, Jul 16, 2013 at 2:44 PM, Lindemann wrote:

Also, note that python readability constantly misses the first paragraph
of NYT articles. This is due to the HTML tree structure of the NYT
articles, where the first paragraph is isolated from the main body of the
article.

For instance, the text extracted from
http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=1& starts
with "Not one of those portraits was of a Japanese."

The web version of Readability (readability.com) works fine though, as
does Pocket (getpocket.com).


Best regards, Yuri V. Baburov, Skype: yuri.baburov


commented on August 19, 2024

Hi Yuri,

Thank you for recommending breadability; I didn't know about it. However, as far as I can see, it also fails to include the first paragraph. The issue has even been added to their GitHub page.

Also, I can't find any documentation on how to use it from Python (e.g. how do I access the title of the article?). Therefore, I'll keep using your version for a while :)

