SlyMarbo / rss
A Go library for fetching, parsing, and updating RSS feeds.
License: Other
In an Item for the feed that I'm pulling from...
<pubDate>Tue, 19 Jul 2016 13:14:13 PDT</pubDate>
On my local laptop with local time set to PDT, I don't have a problem:
2016-07-19 13:14:13 -0700 PDT
On my server with local time set to UTC, I have this odd timezone problem:
2016-07-19 13:14:13 +0000 PDT
I'm trying to fetch the feed ("http://mtgjson.com/atom.xml") and get the following error:
parsing time "2016-02-28T00:00:00" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "" as "Z07:00"
Currently ISO-8859-1 is supported. Do you know how to extend CharsetReader to other charsets, for example GBK?
I would like the HTTP request to come from a specific User-Agent. Is this possible?
package main

import (
	"github.com/SlyMarbo/rss"
)

func main() {
	feed, err := rss.Fetch("http://www.ruanyifeng.com/blog/atom.xml")
	if err != nil {
		// handle error.
	}

	// ... Some time later ...

	err = feed.Update()
	if err != nil {
		// handle error.
	}
}
Related issue on other library: mmcdole/gofeed#98
http://cyber.law.harvard.edu/rss/rss.html#ltenclosuregtSubelementOfLtitemgt
According to the spec, the enclosure tag has 3 attributes. The Rss2_0Enclosure is set up to parse those attributes as sub-elements.
My code no longer works. I assume this was changed recently. How do we reproduce this behaviour in the new versions?
See: https://gowalker.org/github.com/SlyMarbo/rss#CacheParsedItemIDs
I only remember that I had to use this function to work around a bug I was encountering. I can't remember exactly why I was using it; it was a while ago now. It was likely to do either with running out of memory in a long-running process, or with how updates to feeds were coming through.
Consider an RSS feed with no items:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Willkommen auf unserem Blog</title>
<link>https://www.bitkom.org//bitkom/org/Presse/Blog/index.jsp</link>
<description>Der RSS Feed mit den aktuellsten Blogbeiträgen</description>
<language>de</language>
</channel>
</rss>
For such a feed, parseRSS2 returns an error (line 65 in 6288663). But such a feed is valid. Wouldn't it be more consistent to return a Feed with no items instead of an error?
I suggest adding support for HTTP Basic Authentication. This would allow access to password-protected feeds. net/http.Request provides SetBasicAuth, which could be used.
Do you have plans and/or time to implement this? If not, I'd try to add this and then submit a PR.
Hi, I was wondering about supporting media content like -- are there any plans to add such support?
I'm trying to parse this feed: http://www.juliabloggers.com/feed/, which is invalid: it defines a length to be an empty string. See here: http://www.feedvalidator.org/check.cgi?url=http%3A%2F%2Fwww.juliabloggers.com%2Ffeed%2F#l147
Perhaps there's some graceful way for this lib to deal with such a case?
Hi,
Could you make Enclosure.Length a uint?
I think it cannot be negative.
Thanks a lot
The feed item id is only locally unique to its feed, but the database relies on the id to detect previously seen feed items. This can lead to false positives if two items from different feeds happen to have the same id.
Two feeds make this package output this error. I'm attaching the offending feeds.
Is there a workaround for this issue?
What do you think about supporting HTTP Conditional GET? As far as I understand the current code, the HTTP GET requests made don't use the If-None-Match or If-Modified-Since HTTP headers. This results in the full feed XML always being downloaded.
Since many servers support HTTP Conditional GET, passing not only the URL but also values for these headers would be a way to reduce network load. rss.Fetch would have to return the values of the ETag and Last-Modified headers, if present on the response. The code calling rss.Fetch could then remember those values and pass them in later requests.
What do you think?
The docs say that 10 minutes is the default refresh period, and for Atom feeds it looks like the only possible refresh period (because Atom doesn't have a ttl element).
This is very aggressive: if your library is used in a popular application and it polls a popular feed (especially an Atom feed), it would create a lot of load and traffic.
Can I suggest that the default be changed to something like 12 or 24 hours? That fits the use case for most feeds much better.
Also, the feed response can have an explicit freshness lifetime; if it's there, it'd be good to use it. E.g., if the response says Cache-Control: max-age=3600, there's no reason to poll the feed for the next hour (taking into account the Age header, in case it was cached upstream).
I can successfully parse the http://www.ft.com/rss/home/europe feed, except that the Feed.Link attribute is the same as the RSS feed address. So:
The RSS address is http://www.ft.com/rss/home/europe
The <link> attribute in the RSS feed is http://www.ft.com/home/europe
The Feed.Link value is http://www.ft.com/rss/home/europe (same as the RSS address, not the <link> attribute in the RSS feed)
I had the same issue with some other feeds as well.
How are you building an RSS reader? For education purposes I want to build one myself, but I have no idea how. Are there any resources you could share to help?
Thanks a lot
https://indieweb.org/payment#Implementations
This would also parse atom:link(s) within items:
<author>
<name>Mark Pilgrim</name>
<email>[email protected]</email>
<uri>https://mysite.com</uri>
<atom:link rel="payment" type="application/bitcoin-paymentrequest" href="bitcoin:abc7askjdfg"/>
</author>
<entry>
<id>abc</id>
<title>Iabc</title>
<link href="https://siasky.net/CADcPfMxnOgtgwllK9-kp12sIy9L8De7br9nvNFcslCKRg" rel="alternate"/>
<summary>
The story of abc
</summary>
<atom:link rel="payment" type="application/bitcoin-paymentrequest" href="bitcoin:abc7askjdfg"/>
</entry>
Is it possible to add a timeout when fetching a URL?
Like in:
http://stackoverflow.com/questions/16895294/how-to-set-timeout-for-http-get-requests-in-golang
I'm trying to read http://www.freenas.org/whats-new/feed, but it seems like the item <description> ends up as item.Content and item.Summary is empty.
I would think the expected behavior is <description> as item.Summary and <content:encoded> as item.Content.
Hey there,
is it a good idea (or is it even possible without breaking the program because of error catching) to return an error when rss.Fetch() or feed.Update() fails because of the feed.Refresh time?
It's kind of confusing to receive a completely empty list after recalling the function, without an error.
I think this commit broke some rss feeds
70c0278
e.g. build debug.go and run it:
./debug http://rideapart.com/articles.rss
Right now I'm testing with an RSS feed which returns 100 posts. If the feed updates and the Update() function is called, new posts are appended to the Feed struct, resulting in a huge struct over time. How can I make sure it never contains more than 100 posts?
I'm parsing an RSS feed for URLs, then doing stuff with the URLs later on. Let's say the feed updates every 5 minutes and I'm parsing it every 30 minutes. How would I make sure it doesn't parse the previously read items again?
package main

import (
	"fmt"

	"github.com/SlyMarbo/rss"
)

func main() {
	url := "http://URL"
	feed, err := rss.Fetch(url)
	if err != nil {
		// handle error.
		return
	}
	fmt.Printf("Sent fetch for %s\n", url)
	fmt.Printf("There are %d items in %s\n\n", len(feed.Items), url)
	for key, value := range feed.Items {
		fmt.Println(key, value.Link)
	}
}
Example
http://feeds.feedburner.com/ImgurGallery?format=xml
I also created a custom FeedBurner for this same feed and attempted to convert it into several different formats.
I am running inside of Google App Engine and have substituted http.Client with urlfetch.Client in rss.go. There have been a handful of other feeds that I have experienced trouble with, but so far your library has worked great for 95% of everything I've hit. Great work!
I need to display only the titles of each post in the RSS feed. How do I do that?
I would also like to note: I'm not talking about Feed.Title.
So my code failed when using your package because you removed the cache function. Can you specify what the default behavior is now? Does it cache items by default or not? Thanks.
hi! i wrote https://vore.website, which uses this library internally to fetch rss/atom feeds.
i ran into an issue recently where certain feeds containing escaped HTML causes the following failure: panic: XML syntax error on line 4: invalid character entity –
here's a minimal reproducible example:
package main

import (
	"github.com/SlyMarbo/rss"
)

func main() {
	_, err := rss.Fetch("https://trash.j3s.sh/bad-feed.xml")
	if err != nil {
		panic(err)
	}
}
note that this is triggered by the following XML:
<title>– feed with html escaped stuff</title>
i'm wondering if it might make sense to unescape the HTML prior to processing to avoid this? unfortunately i don't think i can do that kind of pre-processing using FetchByFunc, because i need to modify the returned Body.
As a feature request I'd like to see the ability to edit the user agent. A few sites block requests from the default golang user-agent.
JSON Feed is becoming a more popular format for feed publication. It has similarities to RSS/Atom (see https://jsonfeed.org/mappingrssandatom for more details) and it would be really useful if this library supported it too. The v1 spec is available at https://jsonfeed.org/version/1 and looks relatively straightforward with support for top-level metadata, items, and enclosures (called "attachments").
If you agree, I can try and put together a PR adding basic support for it!
phase `build' succeeded after 1.0 seconds
starting phase `check'
--- FAIL: TestParseItemDateOK (0.00s)
rss_2.0_test.go:84: testdata/rss_2.0: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
rss_2.0_test.go:84: testdata/rss_2.0_content_encoded: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
rss_2.0_test.go:84: testdata/rss_2.0_enclosure: got "2009-09-06 16:45:00 +0000 UTC", want "2009-09-06 16:45:00 +0000 +0000"
FAIL
FAIL github.com/SlyMarbo/rss 0.044s
FAIL
hi! i recently attempted to fetch a certain feed, and slymarbo/rss blocks indefinitely - here's a reproducible example:
package main

import (
	"log"

	"github.com/SlyMarbo/rss"
)

func main() {
	feed, err := rss.Fetch("https://www.idealista.pt/news/rss/v2/latest-news.xml")
	if err != nil {
		log.Println(err)
		return
	}

	err = feed.Update()
	if err != nil {
		log.Println(err)
	}
}
i've attached the XML of that feed as it exists today, in case the example i've provided stops reproducing:
bad-xml.txt
While reading:
http://rss.wn.com/English/top-stories
The following error is returned:
parsing time "Mon, 29 Aug 2016 02:52 GMT" as "2006-01-02T15:04:05.999999999Z07:00": cannot parse "Mon, 29 Aug 2016 02:52 GMT" as "2006"
PS: I know providers should use the standard date format, but sometimes this is out of our control.
I can't fully parse YouTube RSS feeds (atom) - e.g.
https://www.youtube.com/feeds/videos.xml?channel_id=UCUJeW9pnxhDZ5GA0TNRl4zg
I get title, link, timestamp and guid, but importantly not the summary, or any images. Here's how an Item looks:
2017/11/14 15:45:42 Item "Updated 2018 Yamaha MT-07 First Look"
"https://www.youtube.com/watch?v=J1uq9oBm7B8"
23:44:37 +0000 14/11/2017
"yt:video:J1uq9oBm7B8"
Read: false
""
When an Item has no ID element in the XML, it is entered into the database as an empty string. Any other items with no ID will then be treated as known items.
In rss_2.0.go, a missing ID is handled around line 86, but by that time the entry has already been skipped as a known item.
Example Feed:
http://www.capitalonecup.co.uk/common/rss/news-rss-feed.xml
github.com/gorilla/feeds is no longer active; it'd be great if this library could support feed generation.
func (f *Feed) WriteAtom(w io.Writer) error
func (f *Feed) WriteRSS(w io.Writer) error
func (f *Feed) WriteJSON(w io.Writer) error
Based on the spec, atom:link elements with a rel attribute of alternate or a missing rel attribute should be considered links.
Currently, line 54 in atom.go correctly sets the link for the latter case, but not the former.

if link.Rel == "" {
	next.Link = link.Href
}

should probably be

if link.Rel == "alternate" || link.Rel == "" {
	next.Link = link.Href
}
This small change fixes a few feeds that were parsing incorrectly for me.
Hi, I've successfully fetched other feeds, but these Fox News feeds are not populating any of the items. Any ideas?
http://feeds.foxnews.com/podcasts/TalkingPoints?format=xml
http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml
This is all it populates in my *feed:
%+v
"FOX News Sunday Video"
"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml"
Image ""
Refresh at Wed 22 Oct 2014 04:41:00 UTC
Unread: 0
Items:
%#v
&rss.Feed{Nickname:"", Title:"FOX News Sunday", Description:"FOX News Sunday Video", Link:"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml", UpdateURL:"http://feeds.feedburner.com/foxnews/podcasts/FoxNewsSundayVideo?format=xml", Image:(*rss.Image)(0xc21012be40), Items:[]*rss.Item{}, ItemMap:map[string]struct {}{}, Refresh:time.Time{sec:63549549818, nsec:0x26182591, loc:(*time.Location)(0xa61fe0)}, Unread:0x0}
edit: it works fine with the CNN feedburner feed: http://rss.cnn.com/services/podcasting/ac360/rss.xml
(but none of the foxnews ones)
Please support parsing the RSS Content Module as item.Content (and using "description" as item.Summary).
Example RSS Feed: The Points Guy RSS
In atom.go:85,

Content string `xml:"summary"`

should be

Content string `xml:"content"`
I would like to get the value of the field <newznab:attr name="group" value="alt.binaries.teevee"/>, ending up with the value alt.binaries.teevee. How do I do so?
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/" encoding="utf-8">
<channel>
<atom:link href="https://REMOVED.com/api" rel="self" type="application/rss+xml"/>
<title>REMOVED</title>
<description>API Details</description>
<link>https://REMOVED.com/</link>
<language>en-gb</language>
<webMaster>[email protected]</webMaster>
<category>Stuff</category>
<generator>Me</generator>
<ttl>10</ttl>
<docs>https://removed.com/apihelp/</docs>
<image url="https://removed.com/themes/shared/img/logo.png" title="REMOVED" link="https://removed.com/" description="Visit REMOVED"/>
<newznab:response offset="0" total="125000"/>
<item>
<title>Fair.Go.2017.09.18.HDTV.x264-FiHTV </title>
<guid isPermaLink="true">https://REMOVED.com/details/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d</guid>
<link>https://REMOVED.com/getnzb/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d.nzb&i=1&r=3bc4e94ef14337e4e2b490a3897c48f6</link>
<comments>https://REMOVED.com/details/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d#comments</comments>
<pubDate>Tue, 19 Sep 2017 10:18:21 +0200</pubDate>
<category>TV > SD</category>
<description>Fair.Go.2017.09.18.HDTV.x264-FiHTV </description>
<enclosure url="https://REMOVED.com/getnzb/427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d.nzb&i=1&r=3bc4e94ef14337e4e2b490a3897c48f6" length="168013625" type="application/x-nzb"/>
<newznab:attr name="category" value="5030"/>
<newznab:attr name="size" value="168013625"/>
<newznab:attr name="files" value="17"/>
<newznab:attr name="poster" value="[email protected] (yeahsure)"/>
<newznab:attr name="prematch" value="1"/>
<newznab:attr name="info" value="https://REMOVED.com/api?t=info&id=427d2b6c5fb3a0f73bd43be4bb8cff955700fd4d&r=3bc4e94ef14337e4e2b490a3897c48f6"/>
<newznab:attr name="grabs" value="0"/>
<newznab:attr name="comments" value="0"/>
<newznab:attr name="password" value="0"/>
<newznab:attr name="usenetdate" value="Tue, 19 Sep 2017 10:07:47 +0200"/>
<newznab:attr name="group" value="alt.binaries.teevee"/>
</item>
</channel>
</rss>
As far as I can see, the package tries to parse times with the default string formats provided by the time package. Providing a way to set the format string for feeds that don't use an orthodox (from time's point of view) display format would be helpful.