I have some URLs on the same site for which I can't get the correct content (the content is inside a table):


Can't get correct article content if content in table, bug? about python-readability HOT 5 CLOSED

buriy commented on July 19, 2024

Can't get correct article content if content in table, bug?

from python-readability.

Comments (5)

eromoe commented on July 19, 2024

I found the cause: many of the td elements have inner_text_len < MIN_LEN, so the table can't get a high score.
Also, I think the score_paragraphs function should score from the deepest elements up to the shallowest.
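
To illustrate the point about short cells, here is a minimal sketch of that kind of length filter. It is not the library's code: it uses only the Python standard library, `MIN_LEN = 25` is an assumed value, and `CellTextCollector`, `scoring_cells` and the sample table are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser

MIN_LEN = 25  # assumed threshold, for illustration only


class CellTextCollector(HTMLParser):
    """Collect the text of each top-level <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._depth = 0  # > 0 while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.cells[-1] += data


def scoring_cells(html, min_len=MIN_LEN):
    """Return the cell texts that are long enough to be scored at all."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    texts = [" ".join(cell.split()) for cell in parser.cells]
    return [text for text in texts if len(text) >= min_len]


# Hypothetical table whose cells each hold only a few words.
SAMPLE = """
<table>
  <tr><td>Project name</td><td>Substation upgrade</td></tr>
  <tr><td>Bid number</td><td>0711-16OTL</td></tr>
  <tr><td>Deadline</td><td>2016-05-30</td></tr>
</table>
"""

print(scoring_cells(SAMPLE))  # prints []: no cell reaches MIN_LEN
```

With a filter like this, a table made only of short cells contributes nothing to its parent's score, no matter how much text the table holds in total.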


buriy commented on July 19, 2024

The algorithm is probabilistic; it will never process 100% of sources correctly.
Advertising and input forms also use tables with short texts and images, so how could one assume that tables with short texts contain no advertising (and no user forms)?
I remind you that this algorithm was created for article processing, not for arbitrary text extraction.


eromoe commented on July 19, 2024

Could you explain this:

"I remind you that this algorithm was created for article processing, not for arbitrary text extraction."

I think article processing means getting the title and the main article content (that's what I want and why I use readability).
But I don't understand "arbitrary text extraction". Google Translate renders it into Chinese as "random text extraction". (Random? I doubt Google's translation.)

And about tr:
I don't think article content would ever be wrapped in a tr, so tr should at least be removed from the candidates.


buriy commented on July 19, 2024

http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
By "articles", "news articles" are usually meant. This is a page, but not a news article (guessing from how it looks, at least).
You might need a different set of rules for this kind of page, because the same set of rules won't work equally well for both.
But in 2016, a better approach would be to use machine learning (if you have a big corpus, of course): boosted trees (like the ones built by xgboost) should help.
Otherwise, if you have just a few sources, it's easier to set up the rules manually for each source: something like ".article" for source1, "div[1]/div[1]" for source2, etc.
The algorithm in this library is made only for quick news article extraction: it processes maybe 90% of sources automatically, and the rest have to be processed manually anyway.
If you change the rules, it will process a different 90% of articles, but with any single set of rules you will never process 100% of the articles correctly.
Manually or with machine learning methods, you can improve the success ratio, but only if you have a very big corpus.
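
The per-source rules suggested above could be sketched roughly like this. It is only an illustration under stated assumptions: the host names, the `RULES` table, the `extract` helper and the sample pages are all hypothetical. It uses the standard library's `xml.etree.ElementTree`, which only accepts well-formed markup and a limited XPath subset; real pages would likely need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-source rules: host name -> ElementTree path expression.
RULES = {
    "source1.example": ".//*[@class='article']",
    "source2.example": "./body/div[1]/div[1]",
}


def extract(host, markup):
    """Return the text selected by the rule for this host, or None."""
    rule = RULES.get(host)
    if rule is None:
        return None  # no manual rule for this source
    root = ET.fromstring(markup)  # requires well-formed markup
    node = root.find(rule)
    if node is None:
        return None  # rule did not match this page
    # Join all nested text and normalize whitespace.
    return " ".join("".join(node.itertext()).split())


# Hypothetical pages, one per source.
PAGE1 = (
    "<html><body><div class='nav'>Home</div>"
    "<div class='article'><p>First story.</p></div></body></html>"
)
PAGE2 = (
    "<html><body><div><div>Second story.</div>"
    "<div>Footer</div></div></body></html>"
)

print(extract("source1.example", PAGE1))  # prints: First story.
print(extract("source2.example", PAGE2))  # prints: Second story.
```

Keeping the rules in one table means a new source only needs one new entry, and a source with no entry can fall back to the generic algorithm.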


eromoe commented on July 19, 2024

OK, I understand, thank you!

