shouya / rss-funnel
An RSS multi-tool
Home Page: https://rss-funnel-demo.fly.dev
License: GNU General Public License v3.0
Provide a subcommand like rss-funnel -c <config> test <endpoint> -l <limit> to test the output of an endpoint.
The output XML should be pretty printed.
The -l option should allow the user to run through only the first <limit> filters.
Or perhaps open a browser with a local webpage showing the filtered feed after each step? (Plus WebSocket live updates?)
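The -l behaviour described above could be sketched roughly as follows. This is only an illustration: Filter, run_filters, upper and exclaim are hypothetical names, and a filter is modelled as a plain function over the feed text rather than whatever type rss-funnel actually uses.

```rust
/// Hypothetical filter type: any function that transforms the feed text.
type Filter = fn(String) -> String;

/// Example filters, present only so the sketch can be exercised.
fn upper(feed: String) -> String {
    feed.to_uppercase()
}

fn exclaim(feed: String) -> String {
    format!("{}!", feed)
}

/// Run the filters in order, stopping after the first `limit` of them.
/// `None` means "run every filter".
fn run_filters(input: String, filters: &[Filter], limit: Option<usize>) -> String {
    let count = limit.unwrap_or(filters.len());
    filters.iter().take(count).fold(input, |feed, filter| filter(feed))
}
```

With the two example filters, a limit of 1 applies only upper, which makes it easy to inspect the feed after each individual step.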
Is it possible to reduce the number of posts in the original feed, perhaps as an option in the filters section?
I could implement more dom navigation operators:
And also add some useful operations:
Currently, for Atom feeds, only the content field is displayed in the UI. We could also display the summary if it is present.
First of all, this is a great project and I'm excited to use it more. I use Miniflux and it has some similar features but I like the ability to apply the same rules to multiple sites and store config in source control.
A filter that would be nice to have is one that can filter out posts that match a regex string (or list of regexes).
From the user's perspective, there is a post's title and a post's body.
However, from the perspective of this program, there is no single "body" field for a post. Instead there are:
Item.content and Item.description fields
Entry.summary and Entry.content fields
The question is how to let the user run filters on any of these fields, for example to redact certain words. One way I can think of is to add a modify_body function that accepts a callback Fn(&mut String), which is called for each body-like value.
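A minimal sketch of that modify_body idea, assuming a simplified Post type that is hypothetical and only stands in for the real RSS Item and Atom Entry types:

```rust
/// Hypothetical post type holding the body-like fields mentioned above.
#[derive(Debug, Default)]
struct Post {
    title: String,
    /// Stands in for Item.content / Entry.content.
    content: Option<String>,
    /// Stands in for Item.description / Entry.summary.
    description: Option<String>,
}

impl Post {
    /// Call `f` once for every body-like field that is present.
    fn modify_body(&mut self, f: impl Fn(&mut String)) {
        for body in [&mut self.content, &mut self.description] {
            if let Some(text) = body {
                f(text);
            }
        }
    }
}
```

A redaction filter then becomes a one-liner such as post.modify_body(|s| *s = s.replace("secret", "***")), and it does not need to know which fields a given feed format uses.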
If a hot reload fails due to a configuration error, the error is silent in the web UI. It would be good if the inspector UI could get a status indicating that the config is broken, along with the error message.
This can be done by making the config endpoint report extra information about the broken config after a failed reload.
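One possible shape for this, as a sketch only: keep the last good config active and remember the error from the most recent failed reload. ConfigState, reload and status are hypothetical names, and the parser is passed in as a closure so the example stays self-contained.

```rust
/// Hypothetical state kept by the server across reloads.
#[derive(Debug, Default)]
struct ConfigState {
    /// The last configuration that loaded successfully.
    active: String,
    /// Error from the most recent failed reload, if any.
    last_error: Option<String>,
}

impl ConfigState {
    /// Try to reload; on failure keep the old config and record the error.
    fn reload(&mut self, text: &str, parse: impl Fn(&str) -> Result<String, String>) {
        match parse(text) {
            Ok(config) => {
                self.active = config;
                self.last_error = None;
            }
            Err(error) => self.last_error = Some(error),
        }
    }

    /// What a config endpoint could report to the inspector UI.
    fn status(&self) -> String {
        match &self.last_error {
            None => "ok".to_string(),
            Some(error) => format!("broken: {}", error),
        }
    }
}
```

The inspector could poll this status and show a banner with the error text while the previous config keeps serving.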
Some websites provide hundreds of feeds, so utilizing full_text causes excessive load. There's a concern about potential rate limiting and IP address bans due to the increased feed generation.
This can be implemented as a filter - merge the posts from another feed to the current feed.
The feature was brought up in #8.
It would be nice to offer a demo where we prepend an image proxy prefix to every img[src].
Based on https://github.com/shouya/rss-funnel/wiki/Filter-config#keep_only--discard, I'd expect this yaml to be valid configuration:
- path: /app-sales.xml
  source: https://www.app-sales.net/highlights/
  note: App Sales Highlights
  filters:
    - split:
        title_selector: ".sale-list-item p.app-name"
        link_selector: ".sale-list-item .sale-list-action > a"
        description_selector: ".sale-list-item .pricing"
    - discard: field: title contains:
        - 'icon pack'
        - 'watch face'
      case_sensitive: false
However, I get the following error on startup:
rssfunnel-1 | Error: Config(Yaml(Error { kind: SCANNER, problem: "mapping values are not allowed in this context", problem_mark: Mark { line: 36, column: 23 } }))
(The line number above refers to my entire config; line 36 corresponds to the discard key in the above snippet.)
If I change the config to this:
- path: /app-sales.xml
  source: https://www.app-sales.net/highlights/
  note: App Sales Highlights
  filters:
    - split:
        title_selector: ".sale-list-item p.app-name"
        link_selector: ".sale-list-item .sale-list-action > a"
        description_selector: ".sale-list-item .pricing"
    - discard:
        - 'icon pack'
        - 'watch face'
rss_funnel will start up without error and the discard configuration will show up for this endpoint within the inspector. However, I'm still seeing entries in the RSS output that I would have expected to be discarded (specifically, ones with "Icon Pack" in the title).
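The expectation in this report, that 'icon pack' should also match "Icon Pack" when case_sensitive is false, can be illustrated with a small sketch. It only shows the intended matching rule; title_matches and its parameters are hypothetical and say nothing about how rss-funnel implements discard.

```rust
/// Return true if `title` contains any of `keywords`.
/// When `case_sensitive` is false, both sides are lowercased first.
fn title_matches(title: &str, keywords: &[&str], case_sensitive: bool) -> bool {
    if case_sensitive {
        keywords.iter().any(|k| title.contains(*k))
    } else {
        let title = title.to_lowercase();
        keywords
            .iter()
            .any(|k| title.contains(k.to_lowercase().as_str()))
    }
}
```

Under this rule a post titled "Neon Icon Pack" is discarded by the keyword 'icon pack' only in the case-insensitive mode, which matches the behaviour seen in the second config.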
error[E0635]: unknown feature `stdsimd`
--> /home/yonas/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ahash-0.8.6/src/lib.rs:99:42
|
99 | #![cfg_attr(feature = "stdsimd", feature(stdsimd))]
$ cargo build --release
Compiling rss-funnel v0.1.0 (/git/rss-funnel)
error: #[derive(RustEmbed)] folder '/git/rss-funnel/inspector/dist/' does not exist. cwd: '/git/rss-funnel'
--> src/server/inspector.rs:13:1
|
13 | / #[folder = "inspector/dist/"]
14 | | struct Asset;
| |_____________^
error[E0599]: no function or associated item named `get` found for struct `Asset` in the current scope
--> src/server/inspector.rs:47:24
|
14 | struct Asset;
| ------------ function or associated item `get` not found for this struct
...
47 | let content = Asset::get(path.as_str()).map(|content| content.data);
| ^^^ function or associated item not found in `Asset`
|
= help: items from traits can only be used if the trait is implemented and in scope
= note: the following traits define an item `get`, perhaps you need to implement one of them:
candidate #1: `SliceIndex`
candidate #2: `rustls::server::server_conn::StoresServerSessions`
candidate #3: `string_cache::static_sets::StaticAtomSet`
candidate #4: `RustEmbed`
For more information about this error, try `rustc --explain E0599`.
error: could not compile `rss-funnel` (bin "rss-funnel") due to 2 previous errors
OS: FreeBSD 14 and Ubuntu 23
Rust: rustc 1.78.0-nightly (6672c16af 2024-02-17)
We can use GREsau/schemars to generate a JSON schema from the type automatically and somehow render it in the inspector UI.
It seems one quite frequently needs to use JS to find or modify specific DOM elements in the feed's description. Right now this kind of manipulation is only doable with regexps or simple string matches. It would be nice to offer a set of DOM manipulation features.
For the first stage, I want to only implement data extraction APIs. This means the DOM is only parsed and read - never modified. I am proposing the following interfaces:
interface DOM {
  constructor(fragment: string): DOM;
  querySelector(css_selector: string): Node | null;
  querySelectors(css_selector: string): Node[];
}
interface Node {
  outerHTML(): string;
  innerHTML(): string;
  innerText(): string;
  attr(field: string): string | null;
  children(): Node[];
  querySelector(css_selector: string): Node | null;
  querySelectors(css_selector: string): Node[];
}
Mixing sources works perfectly, but would it be possible to sort the mixed feed by publication date after it has been created?
Currently, what I have observed is that it downloads the first feed, then the second, and so on, and appends them in that order; it is the client that sorts the entries by publication date. This becomes a problem when you trim the feed length. I have created a feed from six podcasts. Since there are many entries, the feed grows to 16 MB. If I limit it to ten entries, only the first ten are kept, so if the first feed has more than ten entries, only entries from the first feed are shown. I think it would be best to build the merged feed as it is done now and then sort it by publication date.
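The requested order of operations (merge first, then sort by date, then trim) could look roughly like this. Post and merge_latest are hypothetical names, and the publication date is simplified to a Unix timestamp.

```rust
/// Hypothetical post with the publication date as a Unix timestamp.
#[derive(Debug, Clone, PartialEq)]
struct Post {
    title: String,
    published: u64,
}

/// Concatenate all feeds, sort newest first, then keep only `limit` posts.
fn merge_latest(feeds: Vec<Vec<Post>>, limit: usize) -> Vec<Post> {
    let mut all: Vec<Post> = feeds.into_iter().flatten().collect();
    // Sort before truncating so the limit applies across all feeds,
    // not just to whichever feed happened to be fetched first.
    all.sort_by(|a, b| b.published.cmp(&a.published));
    all.truncate(limit);
    all
}
```

Because the sort happens before the truncation, a long first feed can no longer crowd out newer posts from the other feeds.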
How would you simplify the feed as much as possible? My intention is to minimize the feed in order to keep data usage as low as possible. For example:
<item>
  <title>No actualices tu Fire TV Stick: un problema te impide usar algunas aplicaciones</title>
  <link>https://www.adslzone.net/noticias/streaming-tv/no-actualices-fire-tv-stick-problema-aplicaciones/</link>
</item>
<item>
  <title>Libera espacio de tu cuenta de Google en menos de 5 minutos con estos trucos</title>
  <link>https://www.genbeta.com/paso-a-paso/libera-espacio-tu-cuenta-google-5-minutos-estos-trucos-1</link>
</item>
As you can see here it only consists of the title and URL of the post.
Could you add arm64 compatibility?
Thanks
! rss-funnel The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
When creating an RSS feed from multiple sources, it always adopts the name and image of the first feed. Could an option be added to specify the name of the new generated feed and the image to avoid confusion?
DeArrow is a crowdsourced service providing de-clickbait-ed video titles and thumbnail data. API doc: https://wiki.sponsor.ajay.app/w/API_Docs/DeArrow#GET_/api/branding/:sha256HashPrefix
I'm considering a general way to support it: transform a YouTube RSS feed by replacing the title and thumbnail using the API data.
The best way to do this may be to provide an API in the JS runtime that supports async requests for remote data. I should probably also add some security measures to limit (whitelist) the domains that can be requested from JavaScript.
Add support for multiple sources in the same merge filter that looks like:
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge:
        - http://my-domain/generate.xml?source=https://source1/page.html
        - http://my-domain/generate.xml?source=https://source2/page.html
        - http://my-domain/generate.xml?source=https://source3/page.html
which is (mostly) equivalent to:
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: http://my-domain/generate.xml?source=https://source1/page.html
    - merge: http://my-domain/generate.xml?source=https://source2/page.html
    - merge: http://my-domain/generate.xml?source=https://source3/page.html
As well as the syntax like this:
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge:
        client: <client config>
        filters: <filters to run on each merged feed>
        source:
          - http://my-domain/generate.xml?source=https://source1/page.html
          - http://my-domain/generate.xml?source=https://source2/page.html
          - http://my-domain/generate.xml?source=https://source3/page.html
This would help speed up the fetch. If we have multiple merge filters, we have to execute each one in sequence, but inside a single merge filter we can parallelize the fetches.
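To illustrate the parallel fetch inside one merge filter, here is a sketch using scoped threads from the standard library. It is an assumption-laden example: fetch_all is a hypothetical name, the fetch function is injected so no network access is needed, and the real project may well use async tasks instead of threads.

```rust
use std::thread;

/// Fetch every source concurrently and return the results in source order.
/// `fetch` stands in for whatever actually downloads a feed.
fn fetch_all<F>(sources: &[&str], fetch: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    thread::scope(|scope| {
        // Spawn one thread per source before joining any of them.
        let handles: Vec<_> = sources
            .iter()
            .map(|&source| {
                let fetch = &fetch;
                scope.spawn(move || fetch(source))
            })
            .collect();

        // Joining in spawn order keeps the output order deterministic.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("fetch thread panicked"))
            .collect()
    })
}
```

The total time is then bounded by the slowest source rather than the sum of all sources, which is the speed-up this proposal is after.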
We should offer a keyword highlighting filter. Technically, it scans all text nodes in the body (and perhaps also the title?) and wraps each keyword occurrence in a span with a conspicuous background color.
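A deliberately naive sketch of the highlighting step, assuming well-formed markup with no ">" inside attribute values. The function name highlight and the yellow background are placeholders; a real filter would walk a parsed DOM instead of scanning the string.

```rust
/// Wrap every occurrence of `keyword` that appears in text (not inside a tag)
/// in a span with a background color. Matching is case-sensitive.
fn highlight(html: &str, keyword: &str) -> String {
    if keyword.is_empty() {
        return html.to_string();
    }
    let wrapped = format!(
        "<span style=\"background-color: yellow\">{}</span>",
        keyword
    );
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while !rest.is_empty() {
        if rest.starts_with('<') {
            // Copy the whole tag verbatim so attributes are never touched.
            let end = rest.find('>').map(|i| i + 1).unwrap_or(rest.len());
            out.push_str(&rest[..end]);
            rest = &rest[end..];
        } else {
            // Text run up to the next tag: highlight matches inside it.
            let end = rest.find('<').unwrap_or(rest.len());
            out.push_str(&rest[..end].replace(keyword, &wrapped));
            rest = &rest[end..];
        }
    }
    out
}
```

Skipping tags matters: a keyword that also appears in a class name or URL must not be rewritten, or the markup would break.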
Could you please assist me? I'm attempting to remove some text, but it's not functioning as anticipated.
source: https://www.thenewsminute.com/api/v1/collections/tamil-nadu.rss
- path: /newsminute.xml
  filters:
    - sanitize:
        - remove_regex: '<p><strong>Read.*?<\/strong><\/p>'
This would be helpful in cases where you want to merge a lot of feeds together. Then you can just make a new feed as:
- path: /my-lovely-feed.xml
  source:
    from_scratch:
      title: My Lovely Feed Aggregation
      link: http://not-available.com
  filters:
    - merge: http://feed1.xml
    - merge: http://feed2.xml
Add a -w/--watch option to server that reloads the server on config file changes.
Currently, feeds with an HTML source are converted into a singleton RSS feed with the full HTML markup included in the article's description. However, this practice may break the preview in feed readers because the "<html>" tag is unexpected in the article's description.
A more proper way to deal with it may be to cleanse the markup a bit: only include the body's content (not including the outermost "<body>" tag).
This also applies to the full_text filter.
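The cleanup step could be approximated as below. This is a string-based sketch under the assumption that the page has at most one body element; body_inner_html is a hypothetical name, and a real implementation would more likely use an HTML parser.

```rust
/// Return the markup between the opening and closing body tags,
/// or the input unchanged if no body tag is found.
fn body_inner_html(html: &str) -> &str {
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let open = match lower.find("<body") {
        Some(index) => index,
        None => return html,
    };
    // Skip past the end of the opening tag, including any attributes.
    let start = match lower[open..].find('>') {
        Some(index) => open + index + 1,
        None => return html,
    };
    // A missing closing tag is tolerated: take everything to the end.
    let end = match lower[start..].rfind("</body") {
        Some(index) => start + index,
        None => html.len(),
    };
    html[start..end].trim()
}
```

Only the inner content ends up in the description, so feed readers no longer see a nested html or body tag.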
Hi.
When I launch docker compose up with the example docker-compose.yaml:
version: "3.8"
services:
  rss-funnel:
    image: ghcr.io/shouya/rss-funnel:latest
    ports:
      - 4080:4080
    volumes:
      - ./funnel.yaml:/funnel.yaml
    environment:
      RSS_FUNNEL_CONFIG: /funnel.yaml
      RSS_FUNNEL_BIND: 0.0.0.0:4080
✔ Network rss-funnel_default Created 0.1s
✔ Container rss-funnel-rss-funnel-1 Created 0.0s
Attaching to rss-funnel-1
rss-funnel-1 | Error: Config(Yaml(Error("Is a directory (os error 21)")))
rss-funnel-1 exited with code 1
I get this error.
What can I do?
Add a filter to convert RSS to Atom and vice versa.
Thank you for the software.
I attempted to aggregate content from multiple YouTube channels using the provided code. However, upon inspection in the UI, it appears that there is no rendered content, although RAW and JSON data can be generated successfully.
When using the feed reader Yarr, only the content is empty. Additionally, when trying the WebExtension rsspreview to preview RSS feeds in the browser, it throws the error XML Parsing Error: prefix not bound to a namespace.
endpoints:
  - path: /youtube.xml
    source:
      format: atom
      title: YouTube
      link: https://www.youtube.com
      description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
                - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg
When deploying a Docker container without specifying environment variables, the container should utilize default values for configuration parameters, ensuring smooth operation without requiring additional user input.
Hi, while wanting to try rss-funnel I saw that the Docker image you mention in your docker-compose example (ghcr.io/shouya/rss-funnel:0.0.1) is not available.
I think GitHub registries are set to private by default, so this may be the problem? Or should I just build it with my own Dockerfile?
Would it be possible to use JS to extract the original source URL of articles from feeds such as Google News or Meneame? Normally the URL provided by the feed points to their own servers, which then give you the original URL. Could rss-funnel extract the original URL and return the RSS feed with the original title and source?
Here are some examples of feeds
Google News: https://news.google.com/rss/search?q=openai&hl=es&gl=ES&ceid=ES:es
Meneame is a well-known social network in Spain and acts in a similar way: https://www.meneame.net/rss2.php
Currently, there is an issue with the discard/keep_only filter not matching inside non-standard fields, specifically the media:description tag within the media namespace.
Currently, if I want to remove short videos from a YouTube channel feed and do not want to merge, I have to write specific code for each individual feed. This process seems inefficient. Instead of this, I suggest allowing the use of ?url= in the query parameter for the source field.
For instance, instead of:
- path: /youtube-kurzgesagt.xml
  source: https://www.youtube.com/feeds/videos.xml?channel_id=UCsXVk37bltHxD1rDPwtNM8Q
  filters:
    - discard:
        field: title
        matches: shorts
We could have a more generic endpoint like:
- path: /remove-shorts.xml
  filters:
    - discard:
        field: title
        matches: shorts
Then, we could utilize the ?url= parameter and call it to perform an action on the feed URL, like so:
http://127.0.0.1:4080/remove-shorts.xml?url=https://www.youtube.com/feeds/videos.xml?channel_id=UCsXVk37bltHxD1rDPwtNM8Q
This would streamline the process.
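On the server side, picking the source out of such a request could be sketched like this. It assumes, as in the example URL above, that url is the only query parameter and that its value is passed through unencoded; source_from_request is a hypothetical helper name.

```rust
/// Extract the dynamic source from a request such as
/// `/remove-shorts.xml?url=https://example.com/feed.xml?id=1`.
/// Everything after `url=` is treated as the source URL verbatim.
fn source_from_request(request_uri: &str) -> Option<&str> {
    // Split at the first '?', so any '?' inside the source URL is preserved.
    let (_path, query) = request_uri.split_once('?')?;
    let source = query.strip_prefix("url=")?;
    if source.is_empty() {
        None
    } else {
        Some(source)
    }
}
```

A request without the parameter yields None, in which case the endpoint would have no source to work on.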
I'm trying to build the Docker image on arm64 and I get an error with rquickjs:
error: couldn't read /home/xshyne/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rquickjs-sys-0.4.0/src/bindings/aarch64-unknown-linux-gnu.rs: No such file or directory (os error 2)
--> /home/xshyne/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rquickjs-sys-0.4.0/src/lib.rs:16:1
|
16 | include!(concat!("bindings/", bindings_env!("TARGET"), ".rs"));
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: this error originates in the macro `include` (in Nightly builds, run with -Z macro-backtrace for more info)
The following warnings were emitted during compilation:
warning: rquickjs probably doesn't ship bindings for platform `aarch64-unknown-linux-gnu`. try the `bindgen` feature instead.
error: could not compile `rquickjs-sys` (lib) due to previous error
It seems that version 0.4.0 of rquickjs cannot be built on arm64. It's fixed in version 0.5.0.
The file aarch64-unknown-linux-gnu.rs was deleted during the 0.4.0 beta and was added back in 0.4.3.
It would be easier for debugging as one doesn't have to constantly look at the console output.
Could you add an icon variable to the configuration file? Currently there is the title, the description, and even the RSS type, but a feed generated with the merge option has no image when you subscribe to it.
I'm running rss_funnel via a docker compose file with RSS_FUNNEL_WATCH: true inside the environment: block.
I start up rss_funnel with a valid funnel.yaml file and the appropriate RSS_FUNNEL_CONFIG environment setting. The logs report:
rssfunnel-1 | 2024-03-12T14:35:04.534164Z INFO rss_funnel::server: watching config file for changes
...<snipped for brevity>...
rssfunnel-1 | 2024-03-12T14:35:04.674727Z INFO rss_funnel::server: starting server
Then I edit my funnel.yaml to completely remove one of the endpoints. The logs then report:
rssfunnel-1 | 2024-03-12T14:42:14.539908Z INFO rss_funnel::server: config updated, reloading service
After that I go to my browser, reload the inspector page and also click the "Reload Config" button. I would expect the endpoint I just deleted to be absent from the inspector UI, but it is still present.
I can mostly get the logs to report a successful reload if I change a config from invalid to valid. However even then it doesn't appear to always work. (Sorry I don't have more concrete reproduction steps).
I have currently set up the service behind a reverse proxy and have exposed both the feeds and the administration UI, which is fantastic (it was a great idea). I think the UI should support setting a username and password, while the feeds on the same server remain openly accessible from the network.
Could you implement this feature?
Currently, at a bare minimum, one needs to write something like:
function modify_post(feed, post) {
  return post;
}
to make the js filter work. Any modification made to post needs to be assigned back and returned. This is quite cumbersome and error-prone. Also, in reality, I expect most modifications to be small, and many can perhaps be done in one or two lines. It would be nice to have a shorter syntax for short JS code.
I'm considering syntax like this:
filters:
  - modify_post: post.title = `${post.title} - ${feed.title}`
  - modify_feed: feed.items = feed.items.slice(0, 5)
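One way such a shorthand might be implemented is to expand the snippet into the full function before handing it to the JS runtime. The sketch below is hypothetical (expand_modify_post is not an existing function) and assumes the snippet is a single statement or a sequence of statements without a trailing semicolon.

```rust
/// Expand a short `modify_post` snippet into the full JS function
/// that the existing js filter expects.
fn expand_modify_post(snippet: &str) -> String {
    format!(
        "function modify_post(feed, post) {{\n  {};\n  return post;\n}}",
        snippet.trim()
    )
}
```

The user then only writes the assignment, and the boilerplate of returning post is added automatically.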
Now if we want to refer to an endpoint inside another endpoint (e.g. in a merge filter), we have to write out the absolute URL explicitly.
- path: /generate.xml
  filters: <some filters>
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: http://my-domain/generate.xml?source=https://source1/page.html
    - merge: http://my-domain/generate.xml?source=https://source2/page.html
    - merge: http://my-domain/generate.xml?source=https://source3/page.html
It would be nice if we could omit the host, like this:
- path: /generate.xml
  filters: <some filters>
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: /generate.xml?source=https://source1/page.html
    - merge: /generate.xml?source=https://source2/page.html
    - merge: /generate.xml?source=https://source3/page.html
It would be better if we could detect cycles and stop the infinite chain. But this shouldn't be much of an issue, given that the timeout setting is there to avoid leaking.
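Cycle detection over endpoint references could be done with a depth-first search, sketched below. The map from an endpoint path to the local paths it merges is a hypothetical representation; has_cycle and visit are illustrative names only.

```rust
use std::collections::{HashMap, HashSet};

/// Depth-first walk; `path` holds the endpoints on the current chain.
fn visit<'a>(
    refs: &HashMap<&'a str, Vec<&'a str>>,
    node: &'a str,
    path: &mut HashSet<&'a str>,
) -> bool {
    // Seeing a node that is already on the current chain means a cycle.
    if !path.insert(node) {
        return true;
    }
    let found = refs
        .get(node)
        .map_or(false, |next| next.iter().any(|&n| visit(refs, n, path)));
    path.remove(node);
    found
}

/// Return true if following references from `start` ever loops back
/// onto the chain being followed.
fn has_cycle<'a>(refs: &HashMap<&'a str, Vec<&'a str>>, start: &'a str) -> bool {
    visit(refs, start, &mut HashSet::new())
}
```

This check could run once when the config is loaded, so a self-referencing endpoint is rejected up front instead of relying on the timeout.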
The host name can either be inferred from the Host field, or the request can be converted into an internal call.
The problem was due to an excessive conversion to UTF-8. The first conversion to UTF-8 happens during feed fetch (due to content-type: text/xml; charset=iso-8859-1), and then the content gets converted again in the feed parser (due to <?xml version="1.0" encoding="ISO-8859-1"?>).
I'm able to fix the problem by skipping one of the conversion steps.
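The effect of converting twice can be reproduced in a few lines. The helper latin1_to_utf8 is a hypothetical name; it relies on the fact that every ISO-8859-1 byte maps to the Unicode code point with the same value.

```rust
/// Decode ISO-8859-1 bytes into a UTF-8 `String`.
/// Each Latin-1 byte corresponds to the code point of the same value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&byte| byte as char).collect()
}
```

Decoding the Latin-1 bytes 63 61 66 E9 once yields "café". Feeding the resulting UTF-8 bytes through the same conversion again treats the two-byte encoding of "é" as two separate Latin-1 characters and produces "cafÃ©", which is the kind of garbling a double conversion causes.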
An error occurs when trying to use the attribute in the configuration of the RSS funnel, as documented in the wiki.
Error 1:
Error: Config(Yaml(Error("endpoints[0]: data did not match any variant of untagged enum SourceConfig", line: 2, column: 5)))
endpoints:
  - path: /youtube.xml
    source:
      from_scratch:
        format: atom
        title: YouTube
        link: https://www.youtube.com
        description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
                - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg
Error 2:
Error: Config(Yaml(Error("endpoints[0]: data did not match any variant of untagged enum MergeConfig", line: 2, column: 5)))
endpoints:
  - path: /youtube.xml
    source:
      format: atom
      title: YouTube
      link: https://www.youtube.com
      description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
                field: title
                matches:
                  - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg
I have the ability to edit the wiki without requiring permission, but I am unable to revert changes. Could you please verify this as well? Thank you.
It would be nice to be able to place comments in the filter list.