Comments (5)

dpapathanasiou commented on August 19, 2024

It seems the stack trace is caused by html (the result of the call to urllib) being empty.

For some reason, this line does not work with the NY Times server (perhaps they've disabled it to prevent web crawling):

html = urllib.urlopen(url).read()

If, however, I use pycurl to fetch those links instead, it works:

import pycurl
from cStringIO import StringIO

def load_url(url, user_agent=None):
    """Attempt to load the url using pycurl and return the data (None if unsuccessful)."""

    databuffer = StringIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.FOLLOWLOCATION, 1)  # follow HTTP redirects
    curl.setopt(pycurl.WRITEFUNCTION, databuffer.write)  # collect the response body
    if user_agent:
        curl.setopt(pycurl.USERAGENT, user_agent)
    try:
        curl.perform()
        data = databuffer.getvalue()
    except pycurl.error:
        # treat any transfer failure as "no data"
        data = None
    finally:
        # always release the curl handle, even if the transfer failed
        curl.close()

    return data

Then I can do this successfully:

    html = load_url(url)
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()

Unfortunately, the command-line version won't work, because it uses urllib, but at least I have a work-around.
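
Since the failure comes from an empty response, one option is to check for empty data and fall back to a second fetcher before handing anything to Document. Below is a minimal sketch of that idea, written for Python 3. The name `fetch_first_nonempty` and the two stand-in fetchers are hypothetical and are not part of python-readability; in practice the list would hold a urllib-based fetcher and something like `load_url` above.

```python
def fetch_first_nonempty(url, fetchers):
    """Try each fetcher in order; return the first non-empty result, else None."""
    for fetch in fetchers:
        try:
            data = fetch(url)
        except Exception:
            # One transport failing should not stop us from trying the next one
            data = None
        if data:
            return data
    return None


# Hypothetical stand-ins so the sketch runs without network access
def empty_fetch(url):
    return ""  # mimics the empty response described above


def working_fetch(url):
    return "<html><body><p>article text</p></body></html>"


html = fetch_first_nonempty("http://example.com/article", [empty_fetch, working_fetch])
```

If every fetcher comes back empty, the function returns None, so the caller can report a fetch failure instead of passing an empty string to Document.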

from python-readability.

buriy commented on August 19, 2024

Denis, thanks.

 avatar commented on August 19, 2024

Also, note that python-readability consistently misses the first paragraph of NYT articles. This is due to the HTML tree structure of the NYT articles, where the first paragraph is isolated from the main body of the article.

For instance, the text extracted from http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=1& starts with "Not one of those portraits was of a Japanese."

The web version of Readability (readability.com) works fine though, as does Pocket (getpocket.com).
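
To illustrate how an isolated first paragraph can get dropped, here is a toy sketch in Python 3. Everything in it is an assumption for illustration: the markup is made up (it is not actual NYT HTML), and "keep the container with the most paragraph text" is a much simplified stand-in for content scoring, not python-readability's actual algorithm.

```python
from html.parser import HTMLParser

# Toy markup (made up): the lead paragraph sits in its own container,
# apart from the container that holds the rest of the article.
TOY_HTML = """
<div id="lead"><p>Opening paragraph of the story.</p></div>
<div id="body">
  <p>Second paragraph with a good deal more text in it.</p>
  <p>Third paragraph, also part of the main article body.</p>
</div>
"""


class ParagraphsByContainer(HTMLParser):
    """Collect <p> text grouped by the id of the enclosing <div>."""

    def __init__(self):
        super().__init__()
        self.current_div = None
        self.in_p = False
        self.groups = {}  # div id -> list of paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.current_div = dict(attrs).get("id")
            self.groups.setdefault(self.current_div, [])
        elif tag == "p" and self.current_div is not None:
            self.in_p = True
            self.groups[self.current_div].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag == "div":
            self.current_div = None

    def handle_data(self, data):
        if self.in_p:
            self.groups[self.current_div][-1] += data


def pick_largest_container(html):
    """Return (div id, paragraphs) for the container with the most paragraph text."""
    parser = ParagraphsByContainer()
    parser.feed(html)
    parser.close()
    best = max(parser.groups, key=lambda k: sum(len(p) for p in parser.groups[k]))
    return best, parser.groups[best]


best_id, paragraphs = pick_largest_container(TOY_HTML)
```

With this toy input the "body" container wins, so the paragraph inside "lead" never reaches the output. An extractor would need to also consider neighbouring containers to recover it.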

buriy commented on August 19, 2024

Hi,

I can see it works fine with the web readability.js, e.g. with the Readability Redux browser extension.

So some tag transform is probably missing in my fork.

I'll take a look.

You can try https://github.com/mitechie/breadability; it should work fine. It's another fork, better but less well known.

Best regards, Yuri V. Baburov, Skype: yuri.baburov

 avatar commented on August 19, 2024

Hi Yuri,

Thank you for recommending breadability; I didn't know about it. However, as far as I can see, it also fails to include the first paragraph. The issue has even been added to their GitHub page.

Also, I can't see any documentation on how to use it from Python (e.g. how do I access the title of the article?). Therefore, I'll keep using your version for a while :)
