kotartemiy / newscatcher

Programmatically collect normalized news from (almost) any website.

Home Page: https://newscatcherapi.com/

License: MIT License

Python 100.00%

newscatcher's Issues

Get text content from specific article URL?

Is there a way to get the article text content from a specific news article URL?

I don't want to crawl the home page, RSS feeds, or categories; I just want the text for specific article URLs.

Any ideas @kotartemiy @dwardu89?
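
The thread does not say whether newscatcher can do this. As a rough, library-independent sketch, the paragraph text of a page you have already downloaded can be pulled out with Python's standard html.parser module. The names ParagraphExtractor and extract_paragraphs are made up for this example, and treating every <p> element as article text is an assumption that will not hold on every site.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._chunks.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def extract_paragraphs(html):
    """Return a list of paragraph strings found in an HTML document."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep text from a trailing, unclosed <p>
    return parser.paragraphs
```

Fetching the page itself is left out on purpose; any HTTP client can supply the html string.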

Error message

Am I doing something wrong to receive this error?

Article body text

I'm wondering if you plan to provide article body text in the future. Looking at a handful of publishers, I wasn't able to find it in their news collections.

Exception raised on 'news.ycombinator.com'

Cool project, everything seems to work as intended, but for some reason trying to use news.ycombinator.com as the news_source raises Exception: check internet connection / website is not supported.

It's odd, because other sources with subdomains work without issue, such as wired.co.uk.

[QUESTION] Rate Limiting?

When attempting to retrieve a large number of items, is there any type of rate limiting or request cooldown?
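
Whatever the answer on the library side, a caller can space out its own requests. The sketch below is a generic client-side throttle, not part of newscatcher: throttled is a hypothetical helper, and fetch stands for whatever function performs one request.

```python
import time


def throttled(items, fetch, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Yield fetch(item) for each item, keeping consecutive calls
    at least min_interval seconds apart.

    sleep and clock are injectable so the behaviour can be tested
    without actually waiting.
    """
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch(item)
```

For example, list(throttled(sites, fetch_one, min_interval=2.0)) would call a user-supplied fetch_one on each site with at least two seconds between calls.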

Cannot pull data.

I am unable to get a news feed, although my internet connection is working fine. My requests are below.
Python version: 3.7.3

from newscatcher import Newscatcher
nc = Newscatcher(website = 'nytimes.com')
results = nc.get_news()

No results found check internet connection or query parameters

An additional sample request to check my internet connection.

import requests
r = requests.get('https://github.com/timeline.json')
r.json()
{'message': 'Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.', 'documentation_url': 'https://developer.github.com/v3/activity/events/#list-public-events'}

[REQUEST] Topic specific news without specifying a website?

Would it be possible to get the tech headlines without specifying a website?

Add a .gitignore

Things like .DS_Store and __pycache__ don't belong in a repo. A .gitignore file is needed. I'd recommend this one from http://gitignore.io.

RSS feeds

Would it be possible to store the URLs in a compressed flat file? These SQL things are just convoluted and hard to work with. I would like to just open the list of feeds and see what you have, but it takes like 20 steps to get inside...
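
As an illustration of what such a flat file could look like, here is a minimal sketch that dumps one SQLite table to a gzip-compressed CSV using only the standard library. The function name export_table_to_csv_gz is hypothetical, and the table name you pass in is an assumption; the thread does not describe the package's actual database schema.

```python
import csv
import gzip
import sqlite3


def export_table_to_csv_gz(conn, table, out_path):
    """Write every row of `table` to a gzip-compressed CSV file.

    Returns the number of data rows written (the header is not counted).
    """
    # Table names cannot be bound as SQL parameters, so check the name
    # against the database's own catalogue before interpolating it.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")

    cursor = conn.execute(f'SELECT * FROM "{table}"')
    header = [col[0] for col in cursor.description]
    count = 0
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            count += 1
    return count
```

The resulting file can be inspected with zcat or opened with gzip.open and csv.reader, and unlike a binary database it can be regenerated and compared between versions.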

News not working

Not compatible with Python 3.9

Looks like the dependencies in this project need to be updated. Feedparser was patched for Python 3.9, but the patch hasn't made it into this project yet, so it throws a base64 error.

Breitbart news not working

Hacker News not working

I can't seem to get YC or YC news to work, even though the former is in urls and the latter is on the frontpage readme. Thank you for this amazing open source project!

from newscatcher import Newscatcher, describe_url, urls

url = 'news.ycombinator.com'
url2 = 'ycombinator.com'
# All English-language sites known to the package
eng_lnx = urls(language='en')

nc = Newscatcher(website=url)
try:
    print("looking for " + url + "...")
    nc.get_news()
except Exception as e:
    print(repr(e))

# Show what the package knows about the first URL
print(describe_url(url))
print(url + ' in urls: ' + str(url in eng_lnx))
print(url2 + ' in urls: ' + str(url2 in eng_lnx))

nc2 = Newscatcher(website=url2)
try:
    print("looking for " + url2 + "...")
    nc2.get_news()
except Exception as e:
    print(repr(e))

Support various TLDs for Google News in the database

It would be helpful to support some local Google News URLs, e.g. news.google.co.uk, news.google.com.au

The full list of country-code TLDs follows, though I am not sure whether every country has its own Google News TLD.

.ac
.ad
.ae
.af
.ag
.ai
.al
.am
.ao
.aq
.ar
.as
.at
.au
.aw
.ax
.az
.ba
.bb
.bd
.be
.bf
.bg
.bh
.bi
.bj
.bm
.bn
.bo
.br
.bs
.bt
.bw
.by
.bz
.ca
.cc
.cd
.cf
.cg
.ch
.ci
.ck
.cl
.cm
.cn
.co
.cr
.cu
.cv
.cw
.cx
.cy
.cz
.de
.dj
.dk
.dm
.do
.dz
.ec
.ee
.eg
.er
.es
.et
.eu
.fi
.fj
.fk
.fm
.fo
.fr
.ga
.gd
.ge
.gf
.gg
.gh
.gi
.gl
.gm
.gn
.gp
.gq
.gr
.gs
.gt
.gu
.gw
.gy
.hk
.hm
.hn
.hr
.ht
.hu
.id
.ie
.il
.im
.in
.io
.iq
.ir
.is
.it
.je
.jm
.jo
.jp
.ke
.kg
.kh
.ki
.km
.kn
.kp
.kr
.kw
.ky
.kz
.la
.lb
.lc
.li
.lk
.lr
.ls
.lt
.lu
.lv
.ly
.ma
.mc
.md
.me
.mg
.mh
.mk
.ml
.mm
.mn
.mo
.mp
.mq
.mr
.ms
.mt
.mu
.mv
.mw
.mx
.my
.mz
.na
.nc
.ne
.nf
.ng
.ni
.nl
.no
.np
.nr
.nu
.nz
.om
.pa
.pe
.pf
.pg
.ph
.pk
.pl
.pm
.pn
.pr
.ps
.pt
.pw
.py
.qa
.re
.ro
.rs
.ru
.rw
.sa
.sb
.sc
.sd
.se
.sg
.sh
.si
.sk
.sl
.sm
.sn
.so
.sr
.ss
.st
.su
.sv
.sx
.sy
.sz
.tc
.td
.tf
.tg
.th
.tj
.tk
.tl
.tm
.tn
.to
.tr
.tt
.tv
.tw
.tz
.ua
.ug
.uk
.us
.uy
.uz
.va
.vc
.ve
.vg
.vi
.vn
.vu
.wf
.ws
.ye
.yt
.za
.zm
.zw

Requesting more sites

Would it be possible to add country-specific sites?

Offer language option

Since some news sources have different language options (e.g., spiegel.de), there should be an option to choose a language.

Several topics are missing for cnn.com

from newscatcher import describe_url

describe = describe_url('cnn.com')
print(describe['topics'])

['world', 'travel', 'tech', 'politics', 'news', 'entertainment', 'business']

Keeping addresses of RSS feeds up-to-date

Thanks for this great package and for the big collection of RSS feeds for so many news sites.

But how and when did you collect the addresses? The two German sites I tried both had problems: one linked to a broken website (the catching did not work at all), and the other pointed to the wrong feed (an outdated podcast feed). Perhaps it would be a good idea to have native speakers review the corresponding feeds. I would volunteer for the German ones.

The correct addresses are:

sueddeutsche.de: https://rss.sueddeutsche.de/app/service/rss/alles/index.rss?output=rss
heute.de: https://www.zdf.de/rss/zdf/nachrichten

This leads to a second issue: using a SQLite database might be convenient, but it is not practical to track in git, as mentioned in another issue. Therefore, I could not contribute a pull request.

from newscatcher import Newscatcher

ImportError: cannot import name 'Newscatcher' from partially initialized module 'newscatcher'

Thank you.
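
A frequent cause of "partially initialized module" errors in general is a local file or folder with the same name as the package (for example a script saved as newscatcher.py), which Python imports instead of the installed library. Whether that applies to this report is not stated. The sketch below, with the hypothetical helper find_shadowing_files, only checks a directory for such name clashes.

```python
from pathlib import Path


def find_shadowing_files(package, directory):
    """Return paths in `directory` that Python would pick up
    before an installed package named `package`."""
    base = Path(directory)
    candidates = [
        base / f"{package}.py",           # a script with the package's name
        base / package / "__init__.py",   # a local folder acting as a package
    ]
    return [path for path in candidates if path.is_file()]
```

Running find_shadowing_files("newscatcher", ".") from the folder that holds the failing script lists any offending files; renaming them is the usual fix when this is the cause.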