
Handle RSS and Atom feeds from R

feeder's Introduction

feedeR

Feed Reader Package for R

A package for reading RSS and Atom feeds.

Installation

Easy to install.

devtools::install_github("datawookie/feedeR")

Usage

For an RSS feed:

feed.extract("https://feeds.feedburner.com/RBloggers")

For an Atom feed:

feed.extract("http://journal.r-project.org/rss.atom")

Similar Projects

The {scifetch} package has getrss().
The {tidyRSS} package.

feeder's Issues

NIH Reporter RSS feed error

Hi DataWookie,

I'm new to R, so I apologize if I'm repeating an issue that has already been addressed. I'm trying to parse info from the NIH Reporter RSS feed, which is in an XML format. Here's the code I'm trying to use:
library(feedeR)
library(XML)
library(tidyverse)
Test <- feed.extract("https://projectreporter.nih.gov")

I've tried without loading in XML and tidyverse (they're for other functions I'm hoping to do later), and I'm still getting the same error messages. It's a rather extensive list, but here's a short subset:

attributes construct error
Couldn't find end of Start Tag a line 2103
Opening and ending tag mismatch: u line 2103 and a
Opening and ending tag mismatch: b line 2103 and u
Opening and ending tag mismatch: li line 2103 and b
Opening and ending tag mismatch: head line 8 and li
AttValue: " or ' expected
attributes construct error
Couldn't find end of Start Tag a line 2103
Opening and ending tag mismatch: u line 2103 and a
Opening and ending tag mismatch: b line 2103 and u
Opening and ending tag mismatch: li line 2103 and b
Opening and ending tag mismatch: html line 7 and li
Extra content at the end of the document
Error: 1: Opening and ending tag mismatch: meta line 19 and head
2: Opening and ending tag mismatch: img line 73 and div
3: Entity 'nbsp' not defined
4: xmlParseEntityRef: no name
5: Entity 'nbsp' not defined
6: Opening and ending tag mismatch: input line 127 and legend
7: Opening and ending tag mismatch: legend line 124 and fieldset
8: EntityRef: expecting ';'
9: Opening and ending tag mismatch: input line 137 and form
10: Opening and ending tag mismatch: fieldset line 123 and li
11: Opening and ending tag mismatch: form line 122 and ul
12: Opening and ending tag mismatch: img line 267 and a
13: Opening and ending tag mismatch: img line 268 and a
14: Opening and ending tag mismatch: img line 269 and a
15: Opening and ending tag mismatch: a line 269 and div
16: Opening and ending tag mismatch: a line 268 and div
17: Opening and ending tag mismatch: a line 267 and li
18: Opening and ending tag mismatch: div line 263 and ul
19: Opening and ending tag mismatch: div line 252 and li
20: En

Any chance you can help to sort this out? I'm not really sure where to begin; all I can tell is that the different columns being pulled may not be matching up properly.
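
The messages above (mismatched `head`, `html` and `li` tags, undefined `nbsp` entity) are typical of an HTML page being handed to an XML parser, which suggests that the URL points at a web page rather than at a feed. Below is a minimal base R sketch for checking a downloaded body before parsing it. `looks_like_feed()` is a hypothetical helper, not part of feedeR, and the check is only a heuristic on the first tag.

```r
# Hypothetical helper (not part of feedeR): guess whether a response body
# is an RSS/Atom/RDF feed by looking at the first real tag.
looks_like_feed <- function(body) {
  head <- substr(body, 1, 2000)
  # Drop the XML declaration, comments and DOCTYPE that may precede the root tag.
  head <- gsub("<\\?xml[^>]*\\?>", "", head, perl = TRUE)
  head <- gsub("(?s)<!--.*?-->", "", head, perl = TRUE)
  head <- gsub("<!DOCTYPE[^>]*>", "", head, perl = TRUE, ignore.case = TRUE)
  # RSS 2.0 uses <rss>, Atom uses <feed>, RSS 1.0 uses <rdf:RDF>.
  grepl("^\\s*<(rss|feed|rdf:RDF)\\b", head, perl = TRUE, ignore.case = TRUE)
}

rss_body  <- '<?xml version="1.0"?>\n<rss version="2.0"><channel></channel></rss>'
html_body <- '<!DOCTYPE html>\n<html><head><title>Home</title></head></html>'

looks_like_feed(rss_body)
looks_like_feed(html_body)
```

If the check fails, the page usually links to the real feed URL somewhere, and that URL is the one to pass to `feed.extract()`.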

Unable to parse date

> url <- "http://feeds.feedburner.com/analisemacro?format=xml"
> x <- feed.extract(url)
Error: Unable to parse date.
In addition: Warning message:
All formats failed to parse. No formats found.

Thanks a lot for your package. I've been using it to create a Twitter bot. However, I'm getting an error with that specific blog feed.
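
"Unable to parse date" means that none of the date formats tried matched the string in the feed. A rough sketch of the general technique, trying a list of candidate formats in turn, is shown below. `try_formats()` and its format list are assumptions for illustration; the date format used by this particular feed is not shown in the report, and this is not feedeR's own parser.

```r
# Hypothetical helper: try several candidate formats on a single date string
# and return the first successful parse, or NA if none match.
try_formats <- function(x,
                        formats = c("%a, %d %b %Y %H:%M:%S %z",  # RFC 822 style (RSS)
                                    "%Y-%m-%dT%H:%M:%SZ",         # ISO 8601 style (Atom)
                                    "%Y-%m-%d")) {
  for (fmt in formats) {
    # With an explicit format, as.POSIXct() returns NA instead of failing.
    parsed <- as.POSIXct(x, format = fmt, tz = "UTC")
    if (!is.na(parsed)) return(parsed)
  }
  as.POSIXct(NA)
}

try_formats("2013-06-01T12:30:00Z")
try_formats("not a date")
```

Note that `%a` and `%b` match day and month names in the current `LC_TIME` locale, so the first format can fail on English dates under a non-English locale.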

RSS without date

Hi,
I am playing with the feedeR package, and I encountered the following error:

a <- feed.extract("http://feeds.nature.com/nplants/rss/current")

Error in if (is.na(parsed)) stop("Unable to parse date.", call. = FALSE) : argument is of length zero

I think the reason is that some RSS feeds do not provide a date. I have found that many RSS feeds for scientific journals do not contain dates.

Are there any suggestions for dealing with this problem? Thanks a lot.
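
The "argument is of length zero" message arises when `if ()` is given a zero-length condition, which is what `is.na()` returns for a missing (zero-length) date. A minimal sketch of guarding against that is shown below. `parse_or_na()` is a hypothetical helper and the default format is an assumption; it is not the package's code.

```r
# Hypothetical helper: return NA for a missing or empty date instead of
# letting a zero-length value reach an if () condition.
parse_or_na <- function(x, format = "%Y-%m-%d %H:%M:%S") {
  if (is.null(x) || length(x) == 0 || is.na(x[[1]]) || !nzchar(x[[1]])) {
    return(as.POSIXct(NA))
  }
  as.POSIXct(x[[1]], format = format, tz = "UTC")
}

parse_or_na(NULL)                    # item without a date element
parse_or_na(character(0))            # empty date element
parse_or_na("2013-06-01 00:00:00")   # well-formed date
```

Items with an `NA` date can then be kept or dropped explicitly, rather than aborting the whole feed.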

Feature request

Thanks for the nice package. The current feedeR version exports titles, dates and links; it would be great if it could export (partial) contents as well.

Space required after the Public Identifier error

When I try these feeds:

cbc_rss1 <- feed.extract("http://rss.cbc.ca/lineup/politics.xml")

cbc_rss2 <- feed.extract("http://rss.cbc.ca/lineup/technology.xml")

cbc_rss3 <- feed.extract("http://rss.cbc.ca/lineup/world.xml")

I get this error:

Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Error: 1: Space required after the Public Identifier
2: SystemLiteral " or ' expected
3: SYSTEM or PUBLIC, the URI is missing

Unable to parse date

http://www.valor.com.br/financas/mercados/rss

Tag mismatches from Glassdoor feed

Using code:

devtools::install_github("DataWookie/feedeR")
library(feedeR)

philip_morris <- feed.extract("https://www.glassdoor.com/rss/reviews.rss?id=7745")

I get output:

`> philip_morris <- feed.extract("https://www.glassdoor.com/rss/reviews.rss?id=7745")

Opening and ending tag mismatch: img line 11 and div
Opening and ending tag mismatch: img line 21 and p
Opening and ending tag mismatch: p line 17 and div
Opening and ending tag mismatch: img line 34 and p
Opening and ending tag mismatch: p line 30 and div
Opening and ending tag mismatch: img line 48 and p
Opening and ending tag mismatch: p line 43 and div
Opening and ending tag mismatch: img line 61 and p
Opening and ending tag mismatch: p line 57 and div
Specification mandates value for attribute async
attributes construct error
Couldn't find end of Start Tag script line 71
Opening and ending tag mismatch: form line 70 and script
Opening and ending tag mismatch: p line 69 and form
Opening and ending tag mismatch: div line 68 and p
Opening and ending tag mismatch: div line 28 and body
Opening and ending tag mismatch: div line 15 and html
Premature end of data in tag div line 14
Premature end of data in tag div line 9
Premature end of data in tag body line 8
Premature end of data in tag html line 2
Error: 1: Opening and ending tag mismatch: img line 11 and div
2: Opening and ending tag mismatch: img line 21 and p
3: Opening and ending tag mismatch: p line 17 and div
4: Opening and ending tag mismatch: img line 34 and p
5: Opening and ending tag mismatch: p line 30 and div
6: Opening and ending tag mismatch: img line 48 and p
7: Opening and ending tag mismatch: p line 43 and div
8: Opening and ending tag mismatch: img line 61 and p
9: Opening and ending tag mismatch: p line 57 and div
10: Specification mandates value for attribute async
11: attributes construct error
12: Couldn't find end of Start Tag script line 71
13: Opening and ending tag mismatch: form line 70 and script
14: Opening and ending tag mismatch: p line 69 and form
15: Opening and ending tag mismatch: div line 68 and p
16: Opening and ending tag mismatch: div line 28 and body
17: Opening and ending tag mismatch: div line 15 and html
18: Premature end of data in tag div line 14
19: Premature end of data in tag div `

The feed loads appropriately in NetNewsWire. I'm not that savvy, but happy to provide any other information to help debug. Thanks!

Google-News: feed.extract returns error

Dear Data Wookie,

Thank you for your work!
I've tried to retrieve RSS feeds from Google News, but feed.extract returns an error:

feed.extract("https://news.google.com/news?cf=all&hl=de&pz=1&ned=de&output=rss")

returns:

Error in UseMethod("xmlAttrs", node) :
no applicable method for 'xmlAttrs' applied to an object of class "NULL"

Best regards,

Volker

Empty description in RSS

Feed example: https://www.wjakethompson.com/publication/index.xml

> feedeR::feed.extract("https://www.wjakethompson.com/publication/index.xml")
Error in item$description[[1]] : subscript out of bounds
Calls: <Anonymous> ... lapply -> FUN -> tibble -> tibble_quos -> eval_tidy
Execution halted
> traceback()
9: eval_tidy(xs[[j]], mask)
8: tibble_quos(xs[!is.null], .rows, .name_repair)
7: tibble(title = item$title[[1]], date = date, link = if (is.null(item$origLink)) item$link[[1]] else item$origLink[[1]], 
       description = item$description[[1]])
6: FUN(X[[i]], ...)
5: lapply(feed[names(feed) == "item"], function(item) {
       if (is.null(item$title)) 
           return(NULL)
       date = if (is.null(item$pubDate)) 
           NA
       else parse.date(item$pubDate)
       if (is.na(suppressWarnings(as.integer(date)))) 
           return(NULL)
       tibble(title = item$title[[1]], date = date, link = if (is.null(item$origLink)) 
           item$link[[1]]
       else item$origLink[[1]], description = item$description[[1]])
   })
4: list2(...)
3: bind_rows(lapply(feed[names(feed) == "item"], function(item) {
       if (is.null(item$title)) 
           return(NULL)
       date = if (is.null(item$pubDate)) 
           NA
       else parse.date(item$pubDate)
       if (is.na(suppressWarnings(as.integer(date)))) 
           return(NULL)
       tibble(title = item$title[[1]], date = date, link = if (is.null(item$origLink)) 
           item$link[[1]]
       else item$origLink[[1]], description = item$description[[1]])
   }))
2: parse.rss(feed)
1: feedeR::feed.extract("https://www.wjakethompson.com/publication/index.xml")

Feed part:

> curl -s https://www.wjakethompson.com/publication/index.xml | tail
      <title>Transcranial direct current stimulation as a possible intervention tool for emotion regulation in depression</title>
      <link>https://wjakethompson.com/publication/2013-frontiers-tdcs/</link>
      <pubDate>Sat, 01 Jun 2013 00:00:00 +0000</pubDate>
      
      <guid>https://wjakethompson.com/publication/2013-frontiers-tdcs/</guid>
      <description></description>
    </item>
    
  </channel>
</rss>
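
The traceback points at `item$description[[1]]`, which fails with "subscript out of bounds" when the element has no content. The sketch below assumes that an empty `<description></description>` is parsed into a zero-length list, and shows one way to fall back to `NA`. `first_or_na()` is a hypothetical helper, not part of feedeR.

```r
# Hypothetical helper: take the first element if there is one, otherwise NA.
first_or_na <- function(x) {
  if (is.null(x) || length(x) == 0) NA_character_ else as.character(x[[1]])
}

# An item shaped like the failing one: description is present but empty
# (assumed to be parsed as a zero-length list).
item <- list(title = list("A title"), description = list())

first_or_na(item$title)
first_or_na(item$description)
```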

Error while extracting the Reuters RSS feed

Hi Andrew,

I am a postgraduate research intern at the Indian Institute of Management Calcutta, in India.

I am trying to use your package "feedeR" to extract RSS feeds. It works most of the time, but it often fails repeatedly on feeds that it handled successfully before.

Here is the error:

feed.extract("http://feeds.reuters.com/reuters/BusinessNews")

Opening and ending tag mismatch: BR line 7 and FONT
AttValue: " or ' expected
attributes construct error
Couldn't find end of Start Tag TABLE line 10
Opening and ending tag mismatch: BR line 15 and FONT
Opening and ending tag mismatch: BR line 14 and TD
Opening and ending tag mismatch: FONT line 12 and TR
AttValue: " or ' expected
attributes construct error
Couldn't find end of Start Tag FONT line 29
Opening and ending tag mismatch: BR line 31 and FONT
Opening and ending tag mismatch: BR line 31 and TD
Opening and ending tag mismatch: BR line 31 and TR
Opening and ending tag mismatch: BR line 31 and TABLE
Opening and ending tag mismatch: BR line 31 and blockquote
Opening and ending tag mismatch: BR line 31 and FONT
Opening and ending tag mismatch: BR line 31 and BODY
Opening and ending tag mismatch: BR line 30 and HTML
Premature end of data in tag TD line 28
Premature end of data in tag TR line 28
Premature end of data in tag TD line 11
Premature end of data in tag TR line 11
Premature end of data in tag blockquote line 9
Premature end of data in tag FONT line 6
Premature end of data in tag BR line 5
Premature end of data in tag BR line 5
Premature end of data in tag IMG line 5
Premature end of data in tag BODY line 4
Premature end of data in tag HTML line 1
Error: 1: Opening and ending tag mismatch: BR line 7 and FONT
2: AttValue: " or ' expected
3: attributes construct error
4: Couldn't find end of Start Tag TABLE line 10
5: Opening and ending tag mismatch: BR line 15 and FONT
6: Opening and ending tag mismatch: BR line 14 and TD
7: Opening and ending tag mismatch: FONT line 12 and TR
8: AttValue: " or ' expected
9: attributes construct error
10: Couldn't find end of Start Tag FONT line 29
11: Opening and ending tag mismatch: BR line 31 and FONT
12: Opening and ending tag mismatch: BR line 31 and TD
13: Opening and ending tag mismatch: BR line 31 and TR
14: Opening and ending tag mismatch: BR line 31 and TABLE
15: Opening and ending tag mismatch: BR line 31 and blockquote
16: Opening and ending tag mismatch: BR line 31 and FONT
17: Opening and ending tag mismatch: BR line 31 and BODY
18: Opening and ending tag mismatch: BR line 30 and HTML
19: Premature end of data in tag TD line 28
20: Premature end of data in tag TR line 28
21: Premat

This URL opens perfectly in a browser, so it looks fine as a web page. The same code often works, but it fails intermittently, and quite frequently. Could you please help me find a way to deal with this error?

Reproducibility: intermittent (it does not always happen, but it happens a lot). When it happens, I am still able to open the link in my browser.
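
For a failure that comes and goes like this one, a common workaround is to retry the call a few times before giving up. The base R sketch below illustrates the idea. `with_retry()` is a hypothetical helper, and the attempt count and wait time are arbitrary; it does not address why the server sometimes returns HTML instead of the feed.

```r
# Hypothetical helper: call f() up to `attempts` times, pausing `wait`
# seconds between tries, and re-raise the last error if all tries fail.
with_retry <- function(f, attempts = 3, wait = 1) {
  stopifnot(attempts >= 1)
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < attempts) Sys.sleep(wait)
  }
  stop(result)
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

with_retry(flaky, attempts = 3, wait = 0)
```

With feedeR installed, the same wrapper could be used as `with_retry(function() feed.extract(url))`.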

feed.extract() fails with the Reuters finance RSS feed

Hi,

feed.extract() currently fails with the URL below. Please help.

feed.extract("http://feeds.reuters.com/reuters/financialsNews")
Entity 'ldquo' not defined
Entity 'rdquo' not defined
Opening and ending tag mismatch: div line 60 and body
Opening and ending tag mismatch: body line 59 and html
Premature end of data in tag html line 1
Error: 1: Entity 'ldquo' not defined
2: Entity 'rdquo' not defined
3: Opening and ending tag mismatch: div line 60 and body
4: Opening and ending tag mismatch: body line 59 and html
5: Premature end of data in tag html line 1

Unable to feed.extract slashdot.com feed

I get an error with this Slashdot (/.) feed:

> library(feedeR)
> feed.extract("http://rss.slashdot.org/Slashdot/slashdot")
Error in UseMethod("xmlAttrs", node) : 
  no applicable method for 'xmlAttrs' applied to an object of class "NULL"

R version:

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

While trying to debug, I copied and pasted your package code into an R script, and it works properly when I issue the same command.

locale de_DE.UTF-8: feed.extract parse date error

Dear Andrew,

Thank you for working so fast! feed.extract() works fine on my server now.

However, when I'm working on my Desktop (with Locale de_DE.UTF-8), feed.extract() throws an error:

Fehler: Unable to parse date

If I manually set
Sys.setlocale("LC_TIME", "C")

feed.extract("https://news.google.com/news?cf=all&hl=de&pz=1&ned=de&output=rss&q=buergerbegehren")

works fine.

With the desktop locale it won't work.

Sys.getlocale("LC_ALL")

[1] "LC_CTYPE=de_DE.UTF-8;LC_NUMERIC=C;LC_TIME=de_DE.UTF-8;LC_COLLATE=de_DE.UTF-8;LC_MONETARY=de_DE.UTF-8;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C"

I think your package could be very useful for many people if it handled i18n issues.

All the best

Volker
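
RSS dates use English day and month names (for example "Sat, 01 Jun 2013"), while `%a` and `%b` in R match names in the current `LC_TIME` locale, which would explain the failure under de_DE.UTF-8. The sketch below wraps the manual workaround above in a function that switches `LC_TIME` to "C" only for the duration of the parse and then restores it. `parse_rfc822()` is a hypothetical helper and the format string is an assumption about the feed's dates.

```r
# Hypothetical helper: parse an RFC 822 style date regardless of the
# session locale by temporarily setting LC_TIME to "C".
parse_rfc822 <- function(x) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the caller's locale even if parsing fails.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.POSIXct(x, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")
}

parse_rfc822("Sat, 01 Jun 2013 00:00:00 +0000")
```

Restoring the locale in `on.exit()` keeps the rest of the session (for example German date formatting in output) unaffected.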

Encoding issues

I have problems with the string encoding. It turns out to be unknown, as the function does not allow an encoding to be specified. What is the best way to get this right?

url <- "http://www.catastro.minhap.es/INSPIRE/buildings/ES.SDGC.bu.atom.xml"
prov_links <- feed.extract(url)
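
One possible workaround is to convert the returned strings after extraction. The sketch below uses base R's `iconv()` and assumes, purely for illustration, that the text arrived as Latin-1 bytes; the actual encoding of this feed is not stated in the report. `to_utf8()` is a hypothetical helper.

```r
# Hypothetical helper: convert strings from an assumed source encoding to UTF-8.
to_utf8 <- function(x, from = "latin1") {
  iconv(x, from = from, to = "UTF-8")
}

# "café" as Latin-1 bytes (0xe9 is e-acute in Latin-1).
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

fixed <- to_utf8(latin1_text)
Encoding(fixed)
```

If the bytes are already UTF-8 and only the encoding mark is missing, setting `Encoding(x) <- "UTF-8"` declares the encoding without converting anything.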

Cannot parse Google Alerts RSS

This is an example of an RSS feed I cannot parse.

https://www.google.nl/alerts/feeds/13944359149642504817/1580247262290707610

Thanks in advance for your help.

Unable to parse date from CNBC RSS feeds

Hi,

I am having trouble parsing the timestamps of CNBC RSS feeds; please see below. I am also attaching the result of getURL in a file, as you advised for my earlier issue.
cnbc_business-news.txt

The code below fails:

feed.extract("http://www.cnbc.com/id/10001147/device/rss/rss.html")
Error: Unable to parse date.
In addition: Warning message:
All formats failed to parse. No formats found.

I am attaching the output from the code below.

library(RCurl)
xml = getURL("http://www.cnbc.com/id/10001147/device/rss/rss.html")
cat(xml, file = "cnbc_business-news.xml")
