Comments (9)
It looks like Goose is a much newer project than this. Care to flip the question the other way?
Honestly, Goose has never come up in a single search for this type of tool, and I've been looking for and using these tools for years. "readability", "readitlater", "readable parsing", etc. are all keywords Goose should try to work into its README to gain some visibility.
from python-readability.
Looking at the code, and considering it's a port from Java, I was surprised to see that in Goose the "score" of each node is attached to the DOM object... so I believe that the origin of everything is readability.
Goose seems to focus on extracting just text, but it doesn't seem hard to patch it to have the content in html as an intermediary step.
That said, I have no affiliation and nothing to gain; I was just scouting :)
Looking at Goose, I do love the code. I've been playing around with a rewrite of python-readability to have cleaner and more well-documented code. Anyone want to start a project together?
I currently use python-readability with a bunch of customizations at my job. I'd love to contribute back, but they feel like messy hacks, and it would be better to set up a project that is more easily customized, with rules for parsing and some sort of testing library.
@mperdomo1 I'd suggest looking around before starting another one. I still feel bad that I went that route years ago, as you can tell from the list of alternatives: https://github.com/bookieio/breadability#alternatives
I agree that there are a lot of different implementations of readability. I try to keep track of them, as the algorithms are critical for my company's product. This project is one of the more reliable approaches around, but it's very difficult to make changes. The project I've been working on, on and off, for a year is yet another readability project, one that allows easy changes to parsing rules, custom rules, and some sort of confidence mechanism on the parsing result.
Hi everyone. A couple of unstructured thoughts from me.
First, thanks for raising this question.
Of course, I'd suggest starting a new project that does everything properly
(including the architecture), rather than building on this messy code.
There are in fact 2 different needs:
- supervised content extraction, when someone wants it to do a perfect
  job and wants to edit rules and add quirks.
- unsupervised content extraction, when you want it to "just work".
I only needed the second one.
I'd estimate the current code's extraction quality at about 95% on average.
With shorter articles, it can drop to 50% or less.
Recently I tried two approaches to improve it:
- Bayesian learning of the best extraction XPath -- I stored the original
  XPaths in the lxml DOM to reuse them later.
- an approach with "support text", which is a text snippet made from the
  article content -- a small summary -- which is often shown in the article
  list (where you found the article URL) or in the RSS article feed.
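As a rough illustration of the first idea, here is a minimal sketch that remembers which XPath held the article for each source and reuses the most likely one. It is a simplified stand-in (frequency counts with additive smoothing), not the code described above. The `XPathMemory` class, its method names, and the example XPaths are all hypothetical.

```python
from collections import Counter, defaultdict


class XPathMemory:
    """Hypothetical per-source tally of XPaths that held the article body."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant (an assumption)
        self.counts = defaultdict(Counter)

    def record(self, source, xpath):
        """Remember that `xpath` was the correct extraction for `source`."""
        self.counts[source][xpath] += 1

    def probability(self, source, xpath):
        """Smoothed estimate of how likely `xpath` is for `source`.

        One extra bucket is reserved for XPaths that were never seen.
        """
        seen = self.counts[source]
        total = sum(seen.values())
        return (seen[xpath] + self.alpha) / (total + self.alpha * (len(seen) + 1))

    def best(self, source):
        """Return the most frequently recorded XPath, or None if unknown."""
        seen = self.counts.get(source)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


memory = XPathMemory()
for _ in range(3):
    memory.record("example.com", "/html/body/div[2]/article")
memory.record("example.com", "/html/body/div[3]")
print(memory.best("example.com"))
```

With lxml, the XPath of a chosen element can be obtained from `element.getroottree().getpath(element)`; the sketch leaves that step out so it runs with the standard library only.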
After some testing, I started using readability with support texts, which
gets it to 99%-99.5% correctly identified article texts, and that suits my
needs.
Support texts worked better than every other approach I tried (including
manual tweaking), and were very easy to set up:
To implement it, you just subclass Document and implement your own
re-scoring of the found text nodes. I checked whether 50% of the support
text is found in the node's content and, if so, increased the node's score.
The original Document code then picks the node with the bigger score.
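The re-scoring step can be sketched independently of the library. The following is a standalone illustration, not the Document subclass itself: candidates are plain `(text, score)` pairs, and the helper names, the additive `bonus` of 25, and the word-level matching are all assumptions made for the example. Only the 50% threshold comes from the description above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def support_coverage(support_text, node_text):
    """Return the fraction of support-text tokens that also occur in node_text."""
    support_tokens = tokenize(support_text)
    if not support_tokens:
        return 0.0
    node_tokens = set(tokenize(node_text))
    hits = sum(1 for token in support_tokens if token in node_tokens)
    return hits / len(support_tokens)


def rescore(candidates, support_text, threshold=0.5, bonus=25.0):
    """Add a bonus to every (text, score) candidate that covers the support text.

    The bonus value is hypothetical; it only needs to outweigh typical
    score differences between candidates.
    """
    rescored = []
    for text, score in candidates:
        if support_coverage(support_text, text) >= threshold:
            score += bonus
        rescored.append((text, score))
    return rescored


def best_candidate(candidates, support_text):
    """Pick the highest-scoring candidate after support-text re-scoring."""
    return max(rescore(candidates, support_text), key=lambda pair: pair[1])


candidates = [
    ("Leave a comment below! Great post, thanks for sharing.", 30.0),
    ("The city council approved the new budget on Monday after a long debate.", 20.0),
]
text, score = best_candidate(candidates, "City council approved the new budget")
print(score, text)
```

In this example the comment block starts with the higher score, but the article paragraph wins once the support text is taken into account.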
(If you don't have a support text, the title can already serve as an
indicator, helping to concentrate on the document parts starting with the
title, so comments will rarely be matched.)
But it reduces parsing predictability even more -- you can no longer be
sure why it didn't match or how to improve it.
(Bayesian learning for positive and negative keywords could also work very
well, but it needs a lot of already processed articles from each source --
which is fine for a large project, but worse for a small start-up.)
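For the keyword idea, a minimal sketch could look like the following. It assumes you already have token lists (for example class/id names) from nodes labeled as content and as non-content. The function names, the `alpha` smoothing, and the `margin` cut-off are hypothetical, and this is a simple smoothed log-odds estimate rather than a full Bayesian model.

```python
import math
from collections import Counter


def learn_keyword_weights(positive_samples, negative_samples, alpha=1.0):
    """Return {token: log-odds} of a token appearing in content vs. non-content.

    Each sample is a list of tokens; alpha is an add-alpha smoothing constant.
    """
    pos = Counter(tok for sample in positive_samples for tok in sample)
    neg = Counter(tok for sample in negative_samples for tok in sample)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        tok: math.log((pos[tok] + alpha) / pos_total)
        - math.log((neg[tok] + alpha) / neg_total)
        for tok in vocab
    }


def split_keywords(weights, margin=0.5):
    """Split tokens into positive and negative keyword lists by log-odds."""
    positive = sorted(tok for tok, w in weights.items() if w > margin)
    negative = sorted(tok for tok, w in weights.items() if w < -margin)
    return positive, negative


content_nodes = [["article", "body"], ["article", "content"], ["post", "body"]]
other_nodes = [["sidebar", "comment"], ["footer", "comment"], ["comment", "body"]]
positive, negative = split_keywords(learn_keyword_weights(content_nodes, other_nodes))
print(positive, negative)
```

Tokens that show up on both sides (here `body`) stay out of both lists, which is why the approach needs many processed articles per source before the lists become reliable.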
I've been meaning to publish that update for 2 months already, but I've
been busy with other stuff -- and this project wasn't a real priority for
me; it was just a useful tool, not a perfect one, and that was OK.
I think that supervised content extraction is a dead end -- you spend days,
and you can't improve it on one article without making it perform worse on
other articles.
That's why I'm not developing its algorithms in that direction.
Instead, I would consider creating a public database of all major world
article sources and their extraction rules, because ideally you'd also need
to extract the date, category, keywords, images, videos, and a better
title -- not just the text.
And for (semi-)supervised content extraction I would recommend making
another package.
P.S. When I started supporting this readability project, there were 3
packages doing the same job.
I reused those, and I wrote maybe only 200 lines of code myself, mostly to
package and document it.
So I'm just a maintainer :)
So I would encourage someone else to improve and update the existing code
and make both existing and new users happy.
I'll give the necessary permissions to my branches.
On Tue, Dec 23, 2014 at 9:18 AM, mperdomo1 wrote:
> I agree that there are a lot of different implementations regarding
> readability. [...]
Best regards, Yuri V. Baburov, Skype: yuri.baburov
Thanks @buriy, and thanks all for the comments/hints. Your semi-supervised approach definitely looks interesting to me; I'd love to learn more.
Then I'll try to publish v0.5 early next month, with improved debugging, XPath support, and some minor improvements. I'll also put an example with the support text right in the README and add the related command-line options.
Super! I'll keep watching.
As for me, I think I'll go with python newspaper for my project, but the core is similar and I'm very interested in this approach. Moreover, I'm working on a pretty big test set built from RSS feeds (so I have url, title, desc), which I guess could also be applied to this scenario. Let's keep in touch, but for now, happy new year!