Comments (2)

buriy commented on August 19, 2024

Hi Abdul,

I ran it as follows, using the development version (just published as the dev branch):

python -m readability.readability -v -v -v -u http://ukbdnews.com/2014/09/%E0%A6%85%E0%A6%AC%E0%A6%B6%E0%A7%87%E0%A6%B7-%E0%A6%9A%E0%A6%B2%E0%A6%9A%E0%A7%8D%E0%A6%9A%E0%A6%BF%E0%A6%A4%E0%A7%8D%E0%A6%B0%E0%A6%95%E0%A7%87-%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A6%BE%E0%A6%AF%E0%A6%BC/ 2>&1 | grep content

and the top lines were:

2014-09-22 15:10:53,035: DEBUG: Branch 59.000 /#content3/#post-7216.post-7216.post.type-post.status-publish.format-standard.has-post-thumbnail.hentry.category-entertainment/.entry.entry-content link density 0.029 -> 57.280 (at readability.py: 306)
2014-09-22 15:10:53,036: DEBUG: Branch 69.000 /#main.clear/#content3/#post-7216.post-7216.post.type-post.status-publish.format-standard.has-post-thumbnail.hentry.category-entertainment link density 0.040 -> 66.264 (at readability.py: 306)

2014-09-22 15:04:00,780: INFO: Cleaned /div{01}/#post-7216.post-7216.post.type-post.status-publish.format-standard.has-post-thumbnail.hentry.category-entertainment/.entry.entry-content (score=57.280, weight=50) cause it has less than 3x <p>s than <input>s:  বিনোদন ডেস্ক :: অবশেষে পূর্ণিমা চলচ্চিত... (at readability.py: 536)

This means it removed the .entry-content block because it found many <input> elements inside the text block and assumed the block was a form.
However, given their type, these inputs should be ignored when calculating this heuristic.
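
To illustrate the heuristic described above, here is a minimal sketch, not the library's actual code. It assumes the rule is "flag the block when <input>s outnumber one third of the <p>s" (inferred from the log message) and that the inputs to ignore are the type="hidden" ones (an assumption; the comment does not name the type). The helper name looks_like_form and the sample markup are hypothetical, and the standard library's ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET


def looks_like_form(block, ignore_hidden=True):
    """Hypothetical helper: True if the block would be cleaned as a form.

    Mirrors the log message "less than 3x <p>s than <input>s",
    i.e. the number of inputs exceeds one third of the paragraphs.
    """
    p_count = len(block.findall(".//p"))
    inputs = block.findall(".//input")
    if ignore_hidden:
        # Assumption: hidden inputs are the ones that should not count.
        inputs = [i for i in inputs if i.get("type", "").lower() != "hidden"]
    return len(inputs) > p_count / 3


# Hypothetical article block: two paragraphs plus two hidden inputs.
SAMPLE = """<div class="entry-content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
<input type="hidden" name="a" value="1"/>
<input type="hidden" name="b" value="2"/>
</div>"""

block = ET.fromstring(SAMPLE)
print(looks_like_form(block, ignore_hidden=False))  # counts every input
print(looks_like_form(block, ignore_hidden=True))   # skips hidden inputs
```

With every input counted, the sample block is treated as a form and would be dropped; with hidden inputs skipped, it is kept as content.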

I made a fix and published it as version readability-lxml==0.3.0.5.
It now works correctly when run like this:

python -m readability.readability -p content -u http://ukbdnews.com/2014/09/%E0%A6%85%E0%A6%AC%E0%A6%B6%E0%A7%87%E0%A6%B7-%E0%A6%9A%E0%A6%B2%E0%A6%9A%E0%A7%8D%E0%A6%9A%E0%A6%BF%E0%A6%A4%E0%A7%8D%E0%A6%B0%E0%A6%95%E0%A7%87-%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A6%BE%E0%A6%AF%E0%A6%BC/

appscluster commented on August 19, 2024

Excellent. Yes, I can confirm it works now. Thank you.

