
newsworker's Introduction

newsworker -- advanced automatic news extractor using HTML scraping


newsworker is a Python 3 library that extracts feeds from HTML pages. It is useful when you need to subscribe to news from a website that does not publish RSS/Atom feeds and you do not want to rely on page-change monitoring tools, which tend to be inaccurate.

The idea behind the algorithm is simple. Most news pages contain date and time information for each news item. These dates can look like "2017-09-27", "1 jul 2016", or many other ways. The first step is to find all dates; the second is to decide whether a date is just the date of the webpage itself or belongs to an area of the page dedicated to news.

This tool finds news by locating news blocks on an HTML page and parsing them for further use.
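The date-based filtering idea can be sketched in plain Python. This is an illustration only, not the library's implementation: newsworker uses qddate/pyparsing with ~350 date patterns, while here a single regex stands in for date detection.

```python
import re

# Illustrative stand-in for date detection -- the real library supports
# ~350 patterns across many languages via qddate/pyparsing.
DATE_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"
    r"|\b\d{1,2} (?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec) \d{4}\b",
    re.IGNORECASE,
)

def blocks_with_dates(blocks):
    """Keep only the text blocks that contain a date-like string --
    a rough stand-in for locating the news area of a page."""
    return [b for b in blocks if DATE_RE.search(b)]

blocks = [
    "About us | Contact",                      # navigation, no date
    "2017-09-27 New funding round announced",  # news item
    "1 jul 2016 Annual report published",      # news item
]
print(blocks_with_dates(blocks))  # keeps only the two dated blocks
```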

Usage examples

Extract news from HTML pages on the EIB website and the Bulgarian government website

>>> from newsworker.extractor import FeedExtractor
>>> f = FeedExtractor(filtered_text_length=150)
>>> feed, session = f.get_feed(url="http://government.bg/bg/prestsentar/novini")
>>> feed
...
>>> feed, session = f.get_feed(url="http://www.eib.org/en/index.htm?lang=en")
>>> feed
{'title': 'European Investment Bank (EIB)', 'language': 'en', 'link': 'http://www.eib.org/en/index.htm?lang=en', 'description': 'European Investment Bank (EIB)', 'items': [{'title': 'Blockchain Challenge: coders at the EIB', 'description': 'Blockchain Challenge: coders at the EIB', 'pubdate': datetime.datetime(2018, 6, 18, 0, 0), 'unique_id': 'f9d359f76118076c5331ffec3cdb82eb', 'raw_html': b'<div class="first-column col-xs-12 col-sm-12 col-md-8 col-lg-8 no-padding-left-right"><div class="video-box no-padding-left-right"><a class="video-youtube" href="https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1"><div class="img-item-1" style="background-image:url(\'/img/movies/blockchain-video-hp.png\');"><span class="video-icon"><img height="100" src="/img/site/play.png" width="100"/></span><div class="video-container"><div class="left-box col-lg-8 col-xs-12"><div class="video-date-time"><small>18/06/2018</small><span class="space-separator"> | </span><small>02:12</small></div><div class="video-title col-xs-12 col-lg-12 no-padding-left-right">Blockchain Challenge: coders at the EIB</div></div></div></div></a></div></div>', 'extra': {'links': ['https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'}, {'title': 'A brighter life for Kenyan women', 'description': 'Jujuy Verde – new horizons for women waste-pickers in Argentina', 'pubdate': datetime.datetime(2018, 6, 5, 0, 0), 'unique_id': '9caef61535352d2734d122c0e092b011', 'raw_html': b'<div class="second-column col-xs-12 col-sm-12 col-md-4 col-lg-4 no-padding-left-right"><div class="video-box no-padding-left-right"><a class="video-youtube  fancybox.iframe" href="https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1"><div class="img-item-2" style="background-image:url(\'/img/kenya-dlight-video-hp.png\');"><span class="video-icon"><img height="100" src="/img/site/play.png" width="100"/></span><div 
class="video-container"><div class="left-box col-lg-8 col-xs-12"><div class="video-date-time"><small>04/06/2018</small><span class="space-separator"> | </span><small>01:32</small></div><div class="video-title col-xs-12 col-lg-12 no-padding-left-right">A brighter life for Kenyan women</div></div></div></div></a></div><div class="video-box no-padding-left-right"><a class="video-youtube fancybox.iframe" href="https://www.youtube.com/watch?v=d-btxsYT9hI?autoplay=1"><div class="img-item-3" style="background-image:url(\'/img/jujuy-video-hp.png\');"><span class="video-icon"><img height="100" src="/img/site/play.png" width="100"/></span><div class="video-container"><div class="left-box col-lg-8 col-xs-12"><div class="video-date-time"><small>05/06/2018</small><span class="space-separator"> | </span><small>03:12</small></div><div class="video-title col-xs-12 col-lg-12 no-padding-left-right">Jujuy Verde \xc3\xa2\xe2\x82\xac\xe2\x80\x9c new horizons for women waste-pickers in Argentina</div></div></div></div></a></div></div>', 'extra': {'links': ['https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1', 'https://www.youtube.com/watch?v=d-btxsYT9hI?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1'}], 'cache': {'pats': ['dt:date:date_1']}}

Reuse cached patterns to speed up further news extraction. This can greatly improve page-parsing speed, since it reduces the number of date comparisons by up to 100x (2-3 date patterns instead of ~350)

>>> pats = feed['cache']['pats']
>>> feed, session = f.get_feed(url="http://www.eib.org/en/index.htm?lang=en", cached_p=pats)
Change the user agent if needed
>>> USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
>>> feed, session = f.get_feed(url="http://www.eib.org/en/index.htm?lang=en", user_agent=USER_AGENT)
Initialize the feed finder on a webpage
>>> from newsworker.finder import FeedsFinder
>>> f = FeedsFinder()
If the page publishes no feeds, find_feeds returns an empty item list
>>> feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini')
>>> feeds
{'url': 'http://government.bg/bg/prestsentar/novini', 'items': []}
Adding the "extractrss" parameter launches FeedExtractor on pages without feeds
>>> feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini', extractrss=True)
>>> feeds
{'url': 'http://government.bg/bg/prestsentar/novini', 'items': [{'feedtype': 'html', 'title': 'Министерски съвет :: Новини', 'num_entries': 12, 'url': 'http://government.bg/bg/prestsentar/novini'}]}
Find all feeds along with extra feed information. With "noverify=False" each feed is parsed
>>> feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False)
>>> feeds
{'url': 'https://www.dta.gov.au/news/', 'items': [{'title': 'Digital Transformation Agency', 'url': 'https://www.dta.gov.au/feed.xml', 'feedtype': 'rss', 'num_entries': 10}]}
Adding "include_entries=True" returns the feeds together with all parsed feed entries
>>> feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False, include_entries=True)
>>> feeds

Documentation

Documentation is built automatically and can be found on Read the Docs.

Features

  • Identifies news blocks on webpages using date patterns; more than 348 date patterns are supported via qddate
  • Extremely fast: uses pyparsing to identify dates on webpages
  • Includes a function that finds existing feeds on an HTML page and, if none is found, extracts the news directly

Limitations

  • Not all language-specific dates are supported
  • Right-aligned dates like "Published - 27-01-2018" are not supported. Adding them is not hard, but it greatly increases the false-acceptance rate.
  • Some news pages have no dates next to URLs or texts. These pages are not supported yet.

Speed optimization

  • The qddate date-parsing library was created for this algorithm; pattern matching is now really fast
  • Date patterns can be cached to speed up repeated parsing of the same website
  • The feed finder is fast without feed verification; with verification enabled it slows down

TODO

  • Support more date formats and improve qddate lib
  • Support news pages without dates

Usage

The easiest way is to use the newsworker.FeedExtractor class and its get_feed function.

.. automodule:: newsworker.extractor
   :members: FeedExtractor
.. automodule:: newsworker.finder
   :members: FeedsFinder


Dependencies

newsworker relies on the following libraries:

  • qddate is a module for date parsing.
  • pyparsing is a module for advanced text processing.
  • lxml is a module for XML and HTML parsing.

Supported language-specific dates

  • Bulgarian
  • Czech
  • English
  • French
  • German
  • Portuguese
  • Russian
  • Spanish

Thanks

I wrote this news-extraction code back in 2008 and have only updated it a few times since, migrating from regular expressions to pyparsing. The initial project was split into the qddate date-parsing library and newsworker, which is dedicated to identifying news on HTML pages.

Feel free to ask questions: [email protected]

Join the chat at https://gitter.im/newsworker/Lobby


newsworker's Issues

Add structure detection of news objects

Add structure detection and xpath reconstruction.
Instead of dynamic news detection build pseudo-code to extract news from the page.

It should implement analysis logic that detects:

  • the news-list block container
  • the type of news list: sub-blocks or a mixed list
  • the headline tag
  • the text tag or tag block
  • the date tag, if it exists
  • links, if they exist
  • images, if they exist
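A minimal heuristic for the first detection step above can be sketched with the standard library alone (no lxml), assuming the container of a news list is often the element with the most same-tagged children. Everything here, including the sample markup, is hypothetical illustration of the issue, not implemented behavior.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, well-formed sample page: one nav block and one news list.
HTML = """<div>
  <div class="nav"><a href="/">home</a></div>
  <ul class="news">
    <li><span>2018-06-18</span><a href="/a">Item A</a></li>
    <li><span>2018-06-05</span><a href="/b">Item B</a></li>
    <li><span>2018-06-01</span><a href="/c">Item C</a></li>
  </ul>
</div>"""

def likely_list_container(root):
    """Return the element with the largest run of same-tagged children --
    a crude guess at the news-list block container."""
    best, best_count = None, 0
    for el in root.iter():
        tags = Counter(child.tag for child in el)
        if tags:
            _, count = tags.most_common(1)[0]
            if count > best_count:
                best, best_count = el, count
    return best

root = ET.fromstring(HTML)
container = likely_list_container(root)
print(container.tag, container.get("class"))  # → ul news
```

Real pages are rarely well-formed XML, so an actual implementation would parse with lxml and add date/headline checks per child.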

Is something missing?

Add project like behavior

Consider adding configuration files for data aggregation. This includes:

  • Adding a command feedcmd init to initialize extraction with a --url <url> option. It should generate a .newsworker.yaml (or TOML) file
  • Adding support for project YAML or TOML files
  • Adding a command feedcmd run to launch data extraction
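A generated .newsworker.yaml might look like the following. Every key here is an assumption sketching the idea in the issue, not an implemented format:

```yaml
# Hypothetical .newsworker.yaml -- all keys are illustrative assumptions.
url: http://government.bg/bg/prestsentar/novini
extractor:
  filtered_text_length: 150   # mirrors the FeedExtractor parameter above
output: feed.json
```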

Add server with most commands

Add a webserver exposing most commands via a REST API, for example:

  • analyze - analyze url
  • init - initialize project
  • run - run project
  • extract - extract feed from URL HTML
  • scan - scan URL for RSS/ATOM feeds
  • feed - get feed data

Add page analysis command to feedcmd

Add a page analysis command.
It should be feedcmd analysis <url>, outputting the possible feeds found on the page, their feed types, and example feed entries

Rename command `find` to `scan` and improve its behavior

Rename the find command to scan and improve its behavior: return non-zero status codes when nothing is found, and write JSON output to a file instead of stdout

  • Rename find to scan
  • Add --output <filename> option
  • Add non zero status codes if nothing found
  • Add URL data validation

Write tests

There are no tests in this project. Add some!

Template generation after analysis

Instead of dynamic page-structure identification, generate a template with a number of options that should simplify data parsing afterward.

It should include:

  • location of the container tag
  • datetime pattern(-s)
  • URL pattern
  • title tag path

Review Telegram InstantView templates https://instantview.telegram.org/
