Comments (9)

mitechie avatar mitechie commented on August 19, 2024

It looks like Goose is a much newer project than this. Care to flip the question the other way?

Honestly, Goose has never come up in a single search for this type of tool, and I've been looking for and using them for years. "readability", "readitlater", "readable parsing", etc. are all keywords Goose should try to build into its README to help it gain some visibility.

from python-readability.

0x0ece avatar 0x0ece commented on August 19, 2024

Looking at the code, and considering it's a port from Java, I was surprised to see that in Goose the "score" of each node is attached to the DOM object... so I believe that the origin of everything is readability.

Goose seems to focus on extracting just text, but it doesn't seem hard to patch it to have the content in html as an intermediary step.

This said, I have no affiliation nor gain, I was just scouting :)

 avatar commented on August 19, 2024

Looking at Goose, I do love the code. I've been playing around with a rewrite of python-readability to have cleaner and more well-documented code. Anyone want to start a project together?

I currently use python-readability with a bunch of customizations at my job. I'd love to contribute back, but I feel they are messy hacks, and it would be better to set up a project that is more easily customized with rules for parsing and some sort of testing library.

mitechie avatar mitechie commented on August 19, 2024

@mperdomo1 I'd suggest looking around before starting another one. I still feel bad that I went that route years ago, as you can tell from the list of alternatives: https://github.com/bookieio/breadability#alternatives

 avatar commented on August 19, 2024

I agree that there are a lot of different readability implementations. I try to keep track of them, as the algorithms are critical to my company's product. This project is one of the more reliable approaches around, but it's very difficult to make changes. The project I've been working on, on and off for a year, is yet another readability project that allows easy changes to parsing rules, custom rules, and some sort of confidence mechanism on the parsing result.

buriy avatar buriy commented on August 19, 2024

Hi everyone. A couple of unstructured thoughts from me.

First, thanks for raising this question.

Of course, I'd suggest starting a new project that does everything correctly
(including the architecture), rather than building on this messy code.

There are in fact 2 different needs:

  1. supervised content extraction, when someone wants it to do a perfect
    job and is willing to edit rules and add quirks.
  2. unsupervised content extraction, when you want it to "just work".

I needed only the second one.
I estimate the extraction quality of the current code at 95% on average.
With shorter articles, it could be down to 50% or less.

Recently I tried two approaches to improve it:

  1. Bayesian learning of the best extraction XPath -- I stored the original
    XPaths in the lxml DOM to reuse them later.
  2. an approach with "support text", which is a text snippet made from the
    article content -- a small summary -- which often appears in the article
    list (where you found the article URL) or in the RSS article feed.

After some testing, I started using readability with support texts, which
gets it to 99%-99.5% correctly identified article texts, and that suits my
needs.
Support texts worked better than every other approach I tried (including
manual tweaking), and were very easy to set up:
To implement it, you just subclass Document and implement your own
re-scoring of the found text nodes. I check whether 50% of the support text
comes from the node's content, and if so, I increase the node's score. The
original Document code then chooses the node with the bigger score.
(If you don't have a support text -- the title can already serve as an
indicator, helping to concentrate on document parts starting with the title,
so comments will rarely be matched).
But it reduces parsing predictability even more -- you can no longer be sure
why it didn't match or how to improve it.
(Bayesian learning of positive and negative keywords could also work very
well, but it needs a lot of already processed articles from each source --
which is fine for a large project, but worse for a small start-up).
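
For readers who want to see the shape of the support-text idea, here is a minimal, self-contained sketch. It is not the code from this project and does not use the readability `Document` class: `Candidate`, `support_overlap`, `rescore` and `best_candidate` are hypothetical names, the word-level overlap measure is an assumption, and the additive `boost` value is arbitrary. Only the 50% threshold comes from the description above.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical stand-in for a scored text node."""
    text: str
    score: float


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def support_overlap(support_text: str, candidate_text: str) -> float:
    """Fraction of support-text words that also occur in the candidate."""
    support_words = tokenize(support_text)
    if not support_words:
        return 0.0
    candidate_words = set(tokenize(candidate_text))
    hits = sum(1 for word in support_words if word in candidate_words)
    return hits / len(support_words)


def rescore(candidates, support_text, threshold=0.5, boost=25.0):
    """Raise the score of candidates that cover enough of the support text.

    threshold=0.5 mirrors the "50% of the support text" check;
    boost is an assumed, arbitrary increment.
    """
    rescored = []
    for cand in candidates:
        covered = support_overlap(support_text, cand.text) >= threshold
        rescored.append(Candidate(cand.text, cand.score + (boost if covered else 0.0)))
    return rescored


def best_candidate(candidates, support_text):
    """Pick the highest-scoring candidate after re-scoring."""
    return max(rescore(candidates, support_text), key=lambda c: c.score)


if __name__ == "__main__":
    nodes = [
        Candidate("Great post! I love reading comments from other people.", 12.0),
        Candidate("Goose extracts the main article text from web pages and ignores menus.", 10.0),
    ]
    summary = "Goose extracts article text from pages"  # e.g. an RSS description
    print(best_candidate(nodes, summary).text)
```

In a real setup the candidates would come from the extractor's own scored nodes, and the support text from the RSS description or the article list, as described above.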

I've wanted to publish that update for 2 months already, but I've been busy
with other stuff -- and this project wasn't a real priority for me; it was
just a useful tool, not a perfect one, and that was OK.

I think that supervised content extraction is a dead end -- you spend days,
and you can't improve it on one article without making it perform worse on
other articles.
That's why I don't develop its algorithms in that direction.

Rather than that, I would consider creating a public database of all major
world article sources and their extraction rules,
because ideally you'd also need to extract the date, category, keywords,
images, videos, and a better title -- not just the text.

And for (semi-)supervised content extraction I would recommend making
another package.

P.S. When I started maintaining this readability project, there were 3
packages doing the same job.
I reused those, and wrote maybe only 200 lines of code myself, mostly to
package and document it.
So I'm just a maintainer :)

So I would encourage someone else to improve and update the existing code
and make both existing and new users happy.
I'll give the necessary permissions to my branches.

On Tue, Dec 23, 2014 at 9:18 AM, mperdomo1 wrote:

I agree that there are a lot of different implementations regarding
readability. I try to keep track of them as the algorithms are critical for
use in my company's product. This project is one of the more reliable
approaches around, but it's very difficult to make changes. The project
i've been working on on-and-off for a year is a yet another readability
project that allows easy changes to parsing rules, custom rules, and some
sort of confidence mechanism on the parsing result.


Best regards, Yuri V. Baburov, Skype: yuri.baburov

0x0ece avatar 0x0ece commented on August 19, 2024

Thanks @buriy, and thanks all for the comments/hints. Your semi-supervised approach definitely looks interesting to me, and I'd love to learn more.

buriy avatar buriy commented on August 19, 2024

Then I'll try to publish v0.5 early next month, with improved debugging, XPath support and some minor improvements. I'll also provide an example with the support text right in the README and add the related command-line options.

0x0ece avatar 0x0ece commented on August 19, 2024

Super! I'll keep watching.
As for me, I think I'll go with python newspaper for my project, but the core is similar and I'm very interested in this approach. Moreover, I'm working on a pretty big test set built from RSS feeds (so I have URL, title, and description), which I guess could also be applied to this scenario. Let's keep in touch, but for now, happy new year!
