kotartemiy / newscatcher
Programmatically collect normalized news from (almost) any website.
Home Page: https://newscatcherapi.com/
License: MIT License
Is there a way to get the article text content from a specific news article URL?
I don't want to crawl the home page, RSS feeds, or categories; I want the text for specific article URLs.
Any ideas @kotartemiy @dwardu89?
Hi
I'm wondering if you plan to provide article body text in the future. Looking at a handful of publishers, I wasn't able to find it in their news collections.
Cool project, everything seems to work as intended, but for some reason trying to use news.ycombinator.com as the news_source raises Exception: check internet connection / website is not supported.
It's odd, because other sources with subdomains work without issue, such as wired.co.uk.
When attempting to retrieve a large number of items, is there any type of rate limiting or request cooldown?
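Whether the library or the feeds behind it throttle requests is not stated in this thread. As a cautious client-side option, the sketch below pauses between consecutive calls. The helper name fetch_with_cooldown and its parameters are hypothetical and not part of newscatcher.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fetch_with_cooldown(
    fetch: Callable[[T], R],
    items: Iterable[T],
    cooldown_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call fetch(item) for each item, pausing between consecutive calls.

    This is a hypothetical helper, not a newscatcher API. The sleep
    function is injectable so the pause can be stubbed out in tests.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Wait before every call except the first one.
            sleep(cooldown_seconds)
        results.append(fetch(item))
    return results
```

With the package installed, something like fetch_with_cooldown(lambda site: Newscatcher(website=site).get_news(), ['nytimes.com', 'wired.co.uk']) would space out the requests. The right cooldown value is a guess until the maintainers document a limit.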
I am unable to get the news feed. My internet connection is working fine. My requests are below.
Python version 3.7.3
from newscatcher import Newscatcher
nc = Newscatcher(website = 'nytimes.com')
results = nc.get_news()
No results found check internet connection or query parameters
An additional sample request to check my internet connection.
import requests
r = requests.get('https://github.com/timeline.json')
r.json()
{'message': 'Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.', 'documentation_url': 'https://developer.github.com/v3/activity/events/#list-public-events'}
Would it be possible to get the tech headlines without specifying a website?
Things like .DS_Store and __pycache__ don't belong in a repo. A .gitignore file is needed. I'd recommend this one from http://gitignore.io.
Would it be possible to store the URLs in a compressed flat file? These SQL things are just convoluted and hard to work with. I would like to just open the list of feeds and see what you have, but it takes like 20 steps to get inside...
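As a rough illustration of the flat-file idea, the sketch below dumps one SQLite table into a gzip-compressed CSV using only the standard library. The function name is hypothetical, and the table and column names of the package's actual database are not given in this thread, so the caller has to supply them.

```python
import csv
import gzip
import sqlite3


def export_table_to_csv_gz(conn: sqlite3.Connection, table: str, out_path: str) -> int:
    """Write every row of `table` to a gzip-compressed CSV file.

    Hypothetical helper: the table name is whatever the caller passes in.
    Returns the number of data rows written (the header is not counted).
    """
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    header = [column[0] for column in cursor.description]

    count = 0
    with gzip.open(out_path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            count += 1
    return count
```

The resulting file can be inspected without any database tooling, for example with zcat or by opening it through gzip.open in text mode.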
Looks like the dependencies in this project need to be updated. Feedparser was patched for Python 3.9, but the patch hasn't made it into this project yet, so it throws a base64 error.
I can't seem to get YC or YC news to work, even though the former is in urls and the latter is on the frontpage readme. Thank you for this amazing open source project!
from newscatcher import Newscatcher, describe_url, urls

url = 'news.ycombinator.com'
url2 = 'ycombinator.com'
eng_lnx = urls(language='en')  # supported English-language sites

# First attempt: the subdomain form
nc = Newscatcher(website=url)
try:
    print("looking for " + url + "...")
    nc.get_news()
except Exception as e:
    print(repr(e))

describe_url(url)
print(url + ' in urls: ' + str(url in eng_lnx))
print(url2 + ' in urls: ' + str(url2 in eng_lnx))

# Second attempt: the bare domain
nc2 = Newscatcher(website='ycombinator.com')
try:
    print("looking for " + url2 + "...")
    nc2.get_news()
except Exception as e:
    print(repr(e))
It would be helpful to support some local URLs of Google News, e.g. news.google.com.uk, news.google.com.au.
The full list of country TLDs is here, though I am not sure if all countries have a Google News TLD.
.ac
.ad
.ae
.af
.ag
.ai
.al
.am
.ao
.aq
.ar
.as
.at
.au
.aw
.ax
.az
.ba
.bb
.bd
.be
.bf
.bg
.bh
.bi
.bj
.bm
.bn
.bo
.br
.bs
.bt
.bw
.by
.bz
.ca
.cc
.cd
.cf
.cg
.ch
.ci
.ck
.cl
.cm
.cn
.co
.cr
.cu
.cv
.cw
.cx
.cy
.cz
.de
.dj
.dk
.dm
.do
.dz
.ec
.ee
.eg
.er
.es
.et
.eu
.fi
.fj
.fk
.fm
.fo
.fr
.ga
.gd
.ge
.gf
.gg
.gh
.gi
.gl
.gm
.gn
.gp
.gq
.gr
.gs
.gt
.gu
.gw
.gy
.hk
.hm
.hn
.hr
.ht
.hu
.id
.ie
.il
.im
.in
.io
.iq
.ir
.is
.it
.je
.jm
.jo
.jp
.ke
.kg
.kh
.ki
.km
.kn
.kp
.kr
.kw
.ky
.kz
.la
.lb
.lc
.li
.lk
.lr
.ls
.lt
.lu
.lv
.ly
.ma
.mc
.md
.me
.mg
.mh
.mk
.ml
.mm
.mn
.mo
.mp
.mq
.mr
.ms
.mt
.mu
.mv
.mw
.mx
.my
.mz
.na
.nc
.ne
.nf
.ng
.ni
.nl
.no
.np
.nr
.nu
.nz
.om
.pa
.pe
.pf
.pg
.ph
.pk
.pl
.pm
.pn
.pr
.ps
.pt
.pw
.py
.qa
.re
.ro
.rs
.ru
.rw
.sa
.sb
.sc
.sd
.se
.sg
.sh
.si
.sk
.sl
.sm
.sn
.so
.sr
.ss
.st
.su
.sv
.sx
.sy
.sz
.tc
.td
.tf
.tg
.th
.tj
.tk
.tl
.tm
.tn
.to
.tr
.tt
.tv
.tw
.tz
.ua
.ug
.uk
.us
.uy
.uz
.va
.vc
.ve
.vg
.vi
.vn
.vu
.wf
.ws
.ye
.yt
.za
.zm
.zw
Would it be possible to add country-specific sites?
Since some news sources have different language options (e.g., spiegel.de), there should be an option to choose a language.
Thanks for this great package and for the big collection of RSS feeds for so many news sites.
But how and when did you collect the addresses? The two sites from Germany I tried both had problems: one linked to a broken website (the catching did not work at all), and the other is not the feed you want (an outdated podcast feed). Perhaps it would be a good idea to have native speakers review the corresponding feeds. I would volunteer for the German ones.
The correct addresses are:
This leads to a second issue: using a SQLite database might be convenient, but it is not practical to track in git, as mentioned in another issue. Therefore, I could not contribute a pull request.
Hello, I would like a feature that adds a filter for specifying a range of dates. Would this be possible?
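Until such a filter exists, a date range could be applied after the articles are fetched. The sketch below assumes each article is a feedparser-style mapping with a published_parsed field holding a UTC time.struct_time. Whether get_news() returns entries in exactly that shape is an assumption, and filter_by_date is a hypothetical helper, not a library function.

```python
import calendar
from datetime import datetime, timezone


def filter_by_date(articles, start, end):
    """Keep articles whose publication time lies in [start, end].

    Assumes each article is a mapping with a `published_parsed`
    time.struct_time in UTC (feedparser's convention). `start` and `end`
    must be timezone-aware datetimes. Articles without a date are dropped.
    """
    kept = []
    for article in articles:
        parsed = article.get("published_parsed")
        if parsed is None:
            continue  # no usable date, so it cannot be placed in the range
        # calendar.timegm interprets the struct_time as UTC.
        published = datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)
        if start <= published <= end:
            kept.append(article)
    return kept
```

Dropping undated entries is a design choice; a caller who prefers to keep them could collect them in a separate list instead.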
This is awesome, but it would make even more sense if the date were scraped along with the data. Is it already there and I missed it?
Hi team,
I ran into this issue after running pip install newscatcher. I use Python 3.8.10. Would you please release an update to fix this?
from newscatcher import Newscatcher
ImportError: cannot import name 'Newscatcher' from partially initialized module 'newscatcher'
Thank you.
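One common cause of a "partially initialized module" ImportError in Python is a local file or folder with the same name as the package (for example a script saved as newscatcher.py) shadowing the installed one. Whether that applies to this report is not known. The sketch below only checks a directory for such name clashes; find_shadowing_paths is a hypothetical helper.

```python
import os


def find_shadowing_paths(module_name, directory):
    """Return paths in `directory` that would shadow an installed module.

    Checks for `<module_name>.py` and for a folder named `<module_name>`.
    This is only a diagnostic for one possible cause of the error.
    """
    found = []
    script_path = os.path.join(directory, module_name + ".py")
    if os.path.isfile(script_path):
        found.append(script_path)
    package_path = os.path.join(directory, module_name)
    if os.path.isdir(package_path):
        found.append(package_path)
    return found
```

Calling find_shadowing_paths('newscatcher', os.getcwd()) from the folder where the script runs returns an empty list when no clash exists. If it returns a path, renaming that file or folder is the usual remedy.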