SlyMarbo / rss
A Go library for fetching, parsing, and updating RSS feeds.
License: Other
In an Item for the feed that I'm pulling from...
<pubDate>Tue, 19 Jul 2016 13:14:13 PDT</pubDate>
On my local laptop with local time set to PDT, I don't have a problem:
2016-07-19 13:14:13 -0700 PDT
On my server with local time set to UTC, I have this odd timezone problem:
2016-07-19 13:14:13 +0000 PDT
I'm trying to fetch the feed ("http://mtgjson.com/atom.xml") and get the following error:
parsing time "2016-02-28T00:00:00" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "" as "Z07:00"
Currently ISO-8859-1 is supported. Do you know how to extend CharsetReader to other charsets, for example GBK?
I would like the HTTP request to come from a specific User-Agent. Is this possible?
package main

import (
	"github.com/SlyMarbo/rss"
)

func main() {
	feed, err := rss.Fetch("http://www.ruanyifeng.com/blog/atom.xml")
	if err != nil {
		// handle error.
	}

	// ... Some time later ...

	err = feed.Update()
	if err != nil {
		// handle error.
	}
}
Related issue on other library: mmcdole/gofeed#98
http://cyber.law.harvard.edu/rss/rss.html#ltenclosuregtSubelementOfLtitemgt
According to the spec, the enclosure tag has 3 attributes. The Rss2_0Enclosure is set up to parse those attributes as sub-elements.
My code no longer works. I assume this was changed recently. How do we reproduce this behaviour in the new versions?
See: https://gowalker.org/github.com/SlyMarbo/rss#CacheParsedItemIDs
I only remember that I had to use this function to work around a bug I was encountering. I can't remember exactly why I was using it; it was a while ago now. It was likely to do either with running out of memory in a long-running process, or with how updates to feeds were coming through.
Consider an RSS feed with no items:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Willkommen auf unserem Blog</title>
<link>https://www.bitkom.org//bitkom/org/Presse/Blog/index.jsp</link>
<description>Der RSS Feed mit den aktuellsten Blogbeiträgen</description>
<language>de</language>
</channel>
</rss>
For such a feed, parseRSS2 returns an error (line 65 in 6288663). But such a feed is valid. Wouldn't it be more consistent to return a Feed with no items instead of an error?
I suggest adding support for HTTP Basic Authentication. This would allow access to password-protected feeds. net/http.Request provides SetBasicAuth, which could be used.
Do you have plans and/or time to implement this? If not, I'd try to add this and then submit a PR.
Hi, I was wondering about supporting media content like -- are there any plans to add such support?
I'm trying to parse this feed: http://www.juliabloggers.com/feed/, which is invalid: it defines a length to be an empty string. See here: http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.juliabloggers.com%2Ffeed%2F#l147
Perhaps there's some graceful way for this lib to deal with such a case?
Hi,
Could you make Enclosure.Length a uint?
I think it cannot be negative.
Thanks a lot
The feed item id is only locally unique to its feed, but the database relies on the id to detect previously seen feed items. This can lead to false positives if two items from different feeds happen to have the same id.
Two feeds make this package output this error. I'm attaching the offending feeds.
Is there a workaround for this issue?
What do you think about supporting HTTP Conditional GET? As far as I understand the current code, the HTTP GET requests made don't use the If-None-Match or If-Modified-Since HTTP headers. This results in the full feed XML always being downloaded.
Since many servers support HTTP Conditional GET, passing not only the URL but also values for these headers would be a way to reduce network load. rss.Fetch would have to return the values of the ETag and Last-Modified headers, if present on the response. The code calling rss.Fetch could then remember those values and pass them in later requests.
What do you think?
The docs say that 10 minutes is the default refresh period, and for Atom feeds it looks like the only possible refresh period (because Atom doesn't have a ttl element).
This is very aggressive: if your library is used in a popular application and it polls a popular feed (especially an Atom feed), it would create a lot of load and traffic.
Can I suggest that the default be changed to something like 12 or 24 hours? That fits the use case for most feeds much better.
Also, the feed response can have an explicit freshness lifetime; if it's there, it'd be good to use it. E.g., if the response says Cache-Control: max-age=3600, there's no reason to poll the feed for the next hour (taking into account the Age header, in case it was cached upstream).
I can successfully parse the http://www.ft.com/rss/home/europe feed, except that the Feed.Link attribute is the same as the RSS feed address. So:
The RSS address is http://www.ft.com/rss/home/europe
The <link> attribute in the RSS feed is http://www.ft.com/home/europe
The Feed.Link value is http://www.ft.com/rss/home/europe (same as the RSS address, not the <link> attribute in the RSS feed)
I had the same issue with some other feeds as well.
How are you building an RSS reader? For education purposes I want to build one myself, but I have no idea how. Are there any resources you could share to help?
Thanks a lot
https://indieweb.org/payment#Implementations
This would also parse atom:link(s) within items:
<author>
<name>Mark Pilgrim</name>
<email>[email protected]</email>
<uri>https://mysite.com</uri>
<atom:link rel="payment" type="application/bitcoin-paymentrequest" href="bitcoin:abc7askjdfg"/>
</author>
<entry>
<id>abc</id>
<title>Iabc</title>
<link href="https://siasky.net/CADcPfMxnOgtgwllK9-kp12sIy9L8De7br9nvNFcslCKRg" rel="alternate"/>
<summary>
The story of abc
</summary>
<atom:link rel="payment" type="application/bitcoin-paymentrequest" href="bitcoin:abc7askjdfg"/>
</entry>
Is it possible to add a timeout when fetching a URL?
Like in:
http://stackoverflow.com/questions/16895294/how-to-set-timeout-for-http-get-requests-in-golang
I'm trying to read http://www.freenas.org/whats-new/feed, but it seems like the item <description> ends up as item.Content and item.Summary is empty.
I would think the expected behavior is <description> as item.Summary and <content:encoded> as item.Content.
Hey there,
is it a good idea (or is it even possible without breaking the program because of error catching) to return an error when rss.Fetch() or feed.Update() fails because of the feed.Refresh time?
It's kind of confusing to receive a completely empty list after recalling the function, without an error.
I think this commit broke some rss feeds
70c0278
e.g. build debug.go and run it:
./debug http://rideapart.com/articles.rss
Right now I'm testing with an RSS feed which returns 100 posts. If the feed updates and the Update() function is called, new posts are appended to the Feed struct, resulting in a huge struct over time. How can I make sure it never contains more than 100 posts?
I'm parsing an RSS feed for URLs, then doing stuff with the URLs later on. Let's say the feed updates every 5 minutes and I'm parsing it every 30 minutes. How would I make sure it doesn't parse the previously read items again?
package main

import (
	"fmt"

	"github.com/SlyMarbo/rss"
)

func main() {
	url := "http://URL"
	feed, err := rss.Fetch(url)
	if err != nil {
		// handle error.
		return
	}
	fmt.Printf("Sent fetch for %s\n", url)
	fmt.Printf("There are %d items in %s\n\n", len(feed.Items), url)
	for key, value := range feed.Items {
		fmt.Println(key, value.Link)
	}
}
Example
http://feeds.feedburner.com/ImgurGallery?format=xml
I also created a custom FeedBurner for this same feed and attempted to convert it into several different formats.
I am running inside of Google App Engine and have substituted http.Client with urlfetch.Client in rss.go. There have been a handful of other feeds that I have experienced trouble with, but so far your library has worked great for 95% of everything I've hit. Great work!
I need to display only the titles of each post in the RSS feed. How do I do that?
I would also like to note: I'm not talking about Feed.Title.
So my code failed when using your package because you removed the cache function. Can you specify what the default behavior is now? Does it cache items by default or not? Thanks.
hi! i wrote https://vore.website, which uses this library internally to fetch rss/atom feeds.
i ran into an issue recently where certain feeds containing escaped HTML causes the following failure: panic: XML syntax error on line 4: invalid character entity –
here's a minimal reproducible example:
package main

import (
	"github.com/SlyMarbo/rss"
)

func main() {
	_, err := rss.Fetch("https://trash.j3s.sh/bad-feed.xml")
	if err != nil {
		panic(err)
	}
}
note that this is triggered by the following XML:
<title>– feed with html escaped stuff</title>
i'm wondering if it might make sense to unescape the HTML prior to processing to avoid this? unfortunately i don't think i can do that kind of pre-processing using FetchByFunc, because i need to modify the returned Body.
As a feature request I'd like to see the ability to edit the user agent. A few sites block requests from the default golang user-agent.
JSON Feed is becoming a more popular format for feed publication. It has similarities to RSS/Atom (see https://jsonfeed.org/mappingrssandatom for more details) and it would be really useful if this library supported it too. The v1 spec is available at https://jsonfeed.org/version/1 and looks relatively straightforward with support for top-level metadata, items, and enclosures (called "attachments").
If you agree, I can try and put together a PR adding basic support for it!
phase `build' succeeded after 1.0 seconds
starting phase `check'
--- FAIL: TestParseItemDateOK (0.00s)
rss_2.0_test.go:84: testdata/rss_2.0: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
rss_2.0_test.go:84: testdata/rss_2.0_content_encoded: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
rss_2.0_test.go:84: testdata/rss_2.0_enclosure: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
FAIL
FAIL github.com/SlyMarbo/rss 0.044s
FAIL
hi! i recently attempted to fetch a certain feed, and slymarbo/rss blocks indefinitely - here's a reproducible example:
package main

import (
	"log"

	"github.com/SlyMarbo/rss"
)

func main() {
	feed, err := rss.Fetch("https://www.idealista.pt/news/rss/v2/latest-news.xml")
	if err != nil {
		log.Println(err)
		return
	}

	err = feed.Update()
	if err != nil {
		log.Println(err)
	}
}
i've attached the XML of that feed as it exists today, in case the example i've provided stops reproducing:
bad-xml.txt
While reading:
http://rss.wn.com/English/top-stories
The following error is returned:
parsing time "Mon, 29 Aug 2016 02:52 GMT" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "Mon, 29 Aug 2016 02:52 GMT" as "2006"
PS: I know providers should use the standard date format, but sometimes this is out of our control.
I can't fully parse YouTube RSS feeds (atom) - e.g.
https://www.youtube.com/feeds/videos.xml?channel_id=UCUJeW9pnxhDZ5GA0TNRl4zg
I get title, link, timestamp and guid, but importantly not the summary, or any images. Here's how an Item looks:
2017/11/14 15:45:42 Item "Updated 2018 Yamaha MT-07 First Look"
"https://www.youtube.com/watch?v=J1uq9oBm7B8"
23:44:37 +0000 14/11/2017
"yt:video:J1uq9oBm7B8"
Read: false
""
When an Item has no ID element in the XML, it is entered into the database as an empty string. Any other items with no ID will then be treated as known items.
In rss_2.0.go, a missing ID is handled around line 86, but by that time the entry has already been skipped as a known item.
Example Feed:
http://www.capitalonecup.co.uk/common/rss/news-rss-feed.xml
github.com/gorilla/feeds is no longer active; it'd be great if this library could support feed generation.
func (f *Feed) WriteAtom(w io.Writer) error
func (f *Feed) WriteRSS(w io.Writer) error
func (f *Feed) WriteJSON(w io.Writer) error
Based on the spec, atom:link elements with a rel attribute of alternate or a missing rel attribute should be considered links.
Currently, line 54 in atom.go correctly sets the link for the latter case, but not the former.

if link.Rel == "" {
	next.Link = link.Href
}

should probably be

if link.Rel == "alternate" || link.Rel == "" {
	next.Link = link.Href
}
This small change fixes a few feeds that were parsing incorrectly for me.
Hi, I've successfully fetched other feeds, but these Fox News feeds are not populating any of the items. Any ideas?
http://feeds.foxnews.com/podcasts/TalkingPoints?format=xml
http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml
This is all it populates in my *feed:
%+v
"FOX News Sunday Video"
"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml"
Image ""
Refresh at Wed 22 Oct 2014 04:41:00 UTC
Unread: 0
Items:
%#v
&rss.Feed{Nickname:"", Title:"FOX News Sunday", Description:"FOX News Sunday Video", Link:"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml", UpdateURL:"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml", Image:(*rss.Image)(0xc21012be40), Items:[]*rss.Item{}, ItemMap:map[string]struct {}{}, Refresh:time.Time{sec:63549549818, nsec:0x26182591, loc:(*time.Location)(0xa61fe0)}, Unread:0x0}
edit: it works fine with the CNN feedburner feed: http://rss.cnn.com/services/podcasting/ac360/rss.xml
(but none of the foxnews ones)
Please support parsing the RSS Content Module as item.Content (and using "description" as item.Summary).
Example RSS Feed: The Points Guy RSS
In atom.go:85,

Content string `xml:"summary"`

should be

Content string `xml:"content"`
I would like to get the value of the field <newznab:attr name="group" value="alt.binaries.teevee"/>, ending up with the value alt.binaries.teevee. How do I do so?
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/" encoding="utf-8">
<channel>
<atom:link href="https://REMOVED.com/api" rel="self" type="application/rss+xml"/>
<title>REMOVED</title>
<description>API Details</description>
<link>https://REMOVED.com/</link>
<language>en-gb</language>
<webMaster>[email protected]</webMaster>
<category>Stuff</category>
<generator>Me</generator>
<ttl>10</ttl>
<docs>https://removed.com/apihelp/</docs>
<image url="https://removed.com/themes/shared/img/logo.png" title="REMOVED" link="https://removed.com/" description="Visit REMOVED"/>
<newznab:response offset="0" total="125000"/>
<item>
<title>Fair.Go.2017.09.18.HDTV.x264-FiHTV </title>
<guid isPermaLink="true">https://REMOVED.com/details/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d</guid>
<link>https://REMOVED.com/getnzb/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d.nzb&i=1&r=3bc4e94ef14337e4e2b490a3897c48f6</link>
<comments>https://REMOVED.com/details/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d#comments</comments>
<pubDate>Tue, 19 Sep 2017 10:18:21 +0200</pubDate>
<category>TV > SD</category>
<description>Fair.Go.2017.09.18.HDTV.x264-FiHTV </description>
<enclosure url="https://REMOVED.com/getnzb/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d.nzb&i=1&r=3bc4e94ef14337e4e2b490a3897c48f6" length="168013625" type="application/x-nzb"/>
<newznab:attr name="category" value="5030"/>
<newznab:attr name="size" value="168013625"/>
<newznab:attr name="files" value="17"/>
<newznab:attr name="poster" value="[email protected] (yeahsure)"/>
<newznab:attr name="prematch" value="1"/>
<newznab:attr name="info" value="https://REMOVED.com/api?t=info&id=427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d&r=3bc4e94ef14337e4e2b490a3897c48f6"/>
<newznab:attr name="grabs" value="0"/>
<newznab:attr name="comments" value="0"/>
<newznab:attr name="password" value="0"/>
<newznab:attr name="usenetdate" value="Tue, 19 Sep 2017 10:07:47 +0200"/>
<newznab:attr name="group" value="alt.binaries.teevee"/>
</item>
</channel>
</rss>
As far as I can see, the package tries to parse times with the default string formats provided by the time package. Providing a way to set the format string for feeds that don't use an orthodox (from time's point of view) display format would be helpful.