shouya / rss-funnel Goto Github PK

90.0 4.0 3.0 478 KB

An RSS multi-tool

Home Page: https://rss-funnel-demo.fly.dev

License: GNU General Public License v3.0

Makefile 1.16% Rust 82.79% Shell 0.20% JavaScript 8.83% HTML 2.96% CSS 4.07%

rss rss-aggregator rss-feed yahoo-pipes

rss-funnel's Issues

Feed filter debugging tool

Provide a subcommand like rss-funnel -c <config> test <endpoint> -l <limit> to test the output of an endpoint.

The output XML should be pretty printed.

The -l option should let the user run only the first <limit> filters.

Or perhaps open a browser with a local webpage showing the filtered feed after each step? (Plus WebSocket live updates?)
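
To make the -l idea concrete, here is a minimal sketch of running only the first <limit> filters. The Feed and Filter types and the two toy filters are assumptions for illustration, not rss-funnel's real types.

```rust
// Hypothetical stand-ins: a feed is just a list of post titles here.
type Feed = Vec<String>;
type Filter = fn(Feed) -> Feed;

/// Toy filter: drop every post whose title contains "[ad]".
fn drop_ads(feed: Feed) -> Feed {
    feed.into_iter().filter(|t| !t.contains("[ad]")).collect()
}

/// Toy filter: upper-case every title.
fn shout(feed: Feed) -> Feed {
    feed.into_iter().map(|t| t.to_uppercase()).collect()
}

/// Apply only the first `limit` filters; `None` applies all of them.
fn run_filters(feed: Feed, filters: &[Filter], limit: Option<usize>) -> Feed {
    let n = limit.unwrap_or(filters.len());
    filters.iter().take(n).fold(feed, |acc, f| f(acc))
}
```

With a limit of 0 the feed comes back untouched, which would give a convenient baseline when comparing steps.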

Limit number of posts

Is it possible to reduce the number of posts taken from the original feed, perhaps as an option in the filters section?

Better DOM navigation support

I could implement more DOM navigation operators:

Node.previous_sibling
Node.next_sibling
Node.parent

And also add some useful operations:

Node.type (e.g. element, text)
Node.tag_name
Node.destroy (equivalent to Node.outerHTML = '')

Show Atom summary on Inspector UI

Currently, for Atom feeds, only the content field is displayed on the UI. We could also display the summary when it is present.

Feature: Filter articles by title

First of all, this is a great project and I'm excited to use it more. I use Miniflux and it has some similar features but I like the ability to apply the same rules to multiple sites and store config in source control.

A filter that would be nice to have is one that can filter out posts that match a regex string (or list of regexes).

Handle post body correctly

From the user's perspective, there is a post's title and a post's body.

However, from the perspective of this program, there is no single "body" field for a post.

For RSS, there are Item.content and Item.description fields
For Atom, there are Entry.summary and Entry.content fields
In addition, for YouTube feeds, the description is under a media extension.

The question is how to let the user run filters on any of these fields, for example to redact certain words. One way I can think of is to add a modify_body function that accepts a callback Fn(&mut String) and calls it for each body-like value.
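
A rough sketch of what such a modify_body-style helper could look like. The Post struct and its field names are assumptions made for this example; the real types in rss-funnel may differ.

```rust
/// Hypothetical post type holding every "body-like" field mentioned above.
#[derive(Debug, Default, Clone, PartialEq)]
struct Post {
    title: String,
    description: Option<String>,       // RSS Item.description or Atom Entry.summary
    content: Option<String>,           // RSS Item.content or Atom Entry.content
    media_description: Option<String>, // e.g. the YouTube media extension
}

impl Post {
    /// Call `f` once for each body-like field that is present.
    fn modify_bodies(&mut self, f: impl Fn(&mut String)) {
        for field in [&mut self.description, &mut self.content, &mut self.media_description] {
            if let Some(body) = field {
                f(body);
            }
        }
    }
}

/// Example callback: redact a word in every body field.
fn redact(post: &mut Post, word: &str) {
    post.modify_bodies(|body| *body = body.replace(word, "[redacted]"));
}
```

The caller never needs to know which fields a given feed format uses; absent fields are simply skipped.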

Indicate config error on Inspector UI

If a hot reload fails due to a configuration error, the failure is silent on the web UI. It would be good if the inspector UI could show a status indicating that the config is broken, along with the error message.

This can be done by making the config endpoint report extra information about the broken config after a failed reload.

Feature Request: Option to Limit Number of Feeds

Some websites provide hundreds of feeds, so utilizing full_text causes excessive load. There's a concern about potential rate limiting and IP address bans due to the increased feed generation.

Merge extra feeds

This can be implemented as a filter - merge the posts from another feed to the current feed.

The feature was brought up in #8.

Proxy support for individual requests

Image proxy demo

It would be nice to offer a demo where we rewrite every img[src] with an image proxy prefix.

Problems with `discard`

Based on https://github.com/shouya/rss-funnel/wiki/Filter-config#keep_only--discard, I'd expect this yaml to be valid configuration:

  - path: /app-sales.xml
    source: https://www.app-sales.net/highlights/
    note: App Sales Highlights
    filters:
      - split:
          title_selector: ".sale-list-item p.app-name"
          link_selector: ".sale-list-item .sale-list-action > a"
          description_selector: ".sale-list-item .pricing"
      - discard: field: title contains:
        - 'icon pack'
        - 'watch face'
        case_sensitive: false

However, I get the following error on startup:

rssfunnel-1  | Error: Config(Yaml(Error { kind: SCANNER, problem: "mapping values are not allowed in this context", problem_mark: Mark { line: 36, column: 23 } }))

(The line number above refers to my entire config. Line 36 corresponds to the discard key in the above snippet.)

If I change the config to this:

  - path: /app-sales.xml
    source: https://www.app-sales.net/highlights/
    note: App Sales Highlights
    filters:
      - split:
          title_selector: ".sale-list-item p.app-name"
          link_selector: ".sale-list-item .sale-list-action > a"
          description_selector: ".sale-list-item .pricing"
      - discard:
        - 'icon pack'
        - 'watch face'

rss_funnel will start up without error and the discard configuration will show up for this endpoint within the inspector. However, I'm still seeing entries in the rss output that I would have expected to be discarded. (Specifically ones with "Icon Pack" in the title).
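
For reference, this is the behaviour I expected from case_sensitive: false, sketched as plain substring matching. The function names are made up for this example and this is not a description of how rss-funnel actually matches.

```rust
/// Returns true if `title` contains any of `needles`.
/// When `case_sensitive` is false, both sides are lower-cased first.
fn title_matches(title: &str, needles: &[&str], case_sensitive: bool) -> bool {
    if case_sensitive {
        needles.iter().any(|n| title.contains(*n))
    } else {
        let title = title.to_lowercase();
        needles.iter().any(|n| title.contains(n.to_lowercase().as_str()))
    }
}

/// Keep only the titles that match none of the needles.
fn discard(titles: Vec<String>, needles: &[&str], case_sensitive: bool) -> Vec<String> {
    titles
        .into_iter()
        .filter(|t| !title_matches(t, needles, case_sensitive))
        .collect()
}
```

With case-sensitive matching, 'icon pack' does not match "Icon Pack", which would explain the entries still showing up.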

Preview the JSON representation of the feed

Build errors

Before running cargo update

error[E0635]: unknown feature `stdsimd`
  --> /home/yonas/.cargo/registry/src/index.crates.io-6f17d22bba15001f/ahash-0.8.6/src/lib.rs:99:42
   |
99 | #![cfg_attr(feature = "stdsimd", feature(stdsimd))]

After running cargo update

$ cargo build --release
   Compiling rss-funnel v0.1.0 (/git/rss-funnel)
error: #[derive(RustEmbed)] folder '/git/rss-funnel/inspector/dist/' does not exist. cwd: '/git/rss-funnel'
  --> src/server/inspector.rs:13:1
   |
13 | / #[folder = "inspector/dist/"]
14 | | struct Asset;
   | |_____________^

error[E0599]: no function or associated item named `get` found for struct `Asset` in the current scope
  --> src/server/inspector.rs:47:24
   |
14 | struct Asset;
   | ------------ function or associated item `get` not found for this struct
...
47 |   let content = Asset::get(path.as_str()).map(|content| content.data);
   |                        ^^^ function or associated item not found in `Asset`
   |
   = help: items from traits can only be used if the trait is implemented and in scope
   = note: the following traits define an item `get`, perhaps you need to implement one of them:
           candidate #1: `SliceIndex`
           candidate #2: `rustls::server::server_conn::StoresServerSessions`
           candidate #3: `string_cache::static_sets::StaticAtomSet`
           candidate #4: `RustEmbed`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `rss-funnel` (bin "rss-funnel") due to 2 previous errors

OS: FreeBSD 14 and Ubuntu 23
Rust: rustc 1.78.0-nightly (6672c16af 2024-02-17)

Add an option to disable inspector UI

Show documentation on InspectUI

We can use GREsau/schemars to generate a JSON schema from the type automatically and somewhat render it on inspector UI.

DOM manipulation in JS runtime

Quite frequently, one needs to use JS to find or modify specific DOM elements in the feed's description. Right now this kind of manipulation is only doable with regexps or simple string matches. It would be nice to offer a set of DOM manipulation features.

For the first stage, I want to only implement data extraction APIs. This means the DOM is only parsed and read - never modified. I am proposing the following interfaces:

interface DOM {
  constructor(fragment: string): DOM;
  querySelector(css_selector: string): Node | null;
  querySelectors(css_selector: string): Node[];
}

interface Node {
  outerHTML(): string;
  innerHTML(): string;
  innerText(): string;
  attr(field: string): string | null;
  children(): Node[];
  querySelector(css_selector: string): Node | null;
  querySelectors(css_selector: string): Node[];
}

Sort by date of publication

Mixing sources works perfectly, but would it be possible to sort the mixed feed by publication date after it has been created?

From what I have observed, rss-funnel downloads the first feed, then the second, and so on, and concatenates them; the client then sorts the entries by publication date. This becomes a problem when you trim the feed length. I have created a feed from six podcasts, and because there are many entries, the feed grows to 16 MB. If I keep only ten entries, the first ten of the concatenated feed are kept, so if the first source has more than ten entries, only entries from that source appear. I think it would be best to build the feed as it does now and then sort it by publication date.
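
A small sketch of the order of operations being asked for: merge everything, sort newest first, and only then trim. The Post struct and the use of a Unix timestamp for the publication date are assumptions for illustration.

```rust
/// Hypothetical post: `published` is a Unix timestamp in seconds.
#[derive(Debug, Clone, PartialEq)]
struct Post {
    title: String,
    published: u64,
}

/// Concatenate all feeds, sort newest first, then keep at most `limit` posts.
fn merge_sort_limit(feeds: Vec<Vec<Post>>, limit: usize) -> Vec<Post> {
    let mut posts: Vec<Post> = feeds.into_iter().flatten().collect();
    posts.sort_by(|a, b| b.published.cmp(&a.published));
    posts.truncate(limit);
    posts
}
```

Trimming after the sort means the newest entries survive regardless of which source they came from.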

Simplify the Feed as much as possible

How would you simplify the Feed as much as possible? My intention, for the minimum data usage expense, is to minimize the Feed as much as possible. For example:

<item>
      <title>
        No actualices tu Fire TV Stick: un problema te impide usar algunas aplicaciones
      </title>
      <link>
        https://www.adslzone.net/noticias/streaming-tv/no-actualices-fire-tv-stick-problema-aplicaciones/
      </link>
    </item>
    <item>
      <title>
        Libera espacio de tu cuenta de Google en menos de 5 minutos con estos trucos
      </title>
      <link>
        https://www.genbeta.com/paso-a-paso/libera-espacio-tu-cuenta-google-5-minutos-estos-trucos-1
      </link>
    </item>

As you can see here it only consists of the title and URL of the post.

ARM compatibility

Could you add arm64 compatibility?

Thanks

 ! rss-funnel The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

Personalize Name and Image in RSS Feeds from Multiple Sources

When creating an RSS feed from multiple sources, it always adopts the name and image of the first feed. Could an option be added to specify the name of the new generated feed and the image to avoid confusion?

Support DeArrow for YouTube

DeArrow is a crowdsourced service providing de-clickbait-ed video titles and thumbnail data. API doc: https://wiki.sponsor.ajay.app/w/API_Docs/DeArrow#GET_/api/branding/:sha256HashPrefix

I'm considering a general way to support it: transform a YouTube RSS feed by replacing the title and thumbnail using the API data.

The best way to do this may be to provide an API in the JS runtime that supports async requests for remote data. I should probably also add some security measures to limit (whitelist) the domains that can be requested from JavaScript.

Merge multiple sources in one filter

Add support for multiple sources in the same merge filter that looks like:

- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge:
      - http://my-domain/generate.xml?source=https://source1/page.html
      - http://my-domain/generate.xml?source=https://source2/page.html
      - http://my-domain/generate.xml?source=https://source3/page.html

which is (mostly) equivalent to:

- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: http://my-domain/generate.xml?source=https://source1/page.html
    - merge: http://my-domain/generate.xml?source=https://source2/page.html
    - merge: http://my-domain/generate.xml?source=https://source3/page.html

As well as the syntax like this:

- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge:
        client: <client config>
        filters: <filters to run on each merged feed>
        source:
          - http://my-domain/generate.xml?source=https://source1/page.html
          - http://my-domain/generate.xml?source=https://source2/page.html
          - http://my-domain/generate.xml?source=https://source3/page.html

This would help speed up the fetch. With multiple merge filters, each one has to execute in sequence, but inside a single merge filter we can parallelize the fetches.
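
As an illustration of fetching in parallel inside one merge filter, here is a sketch using plain OS threads. The fake_fetch function is a stand-in for a real HTTP request, and threads are only an assumption for the example; the actual implementation may well use async tasks instead.

```rust
use std::thread;

/// Stand-in for an HTTP fetch; a real implementation would do network I/O.
fn fake_fetch(url: &str) -> String {
    format!("<feed from {}>", url)
}

/// Fetch every source on its own thread and return the results in source order.
fn fetch_all(sources: &[&str]) -> Vec<String> {
    let handles: Vec<thread::JoinHandle<String>> = sources
        .iter()
        .map(|s| {
            let url = s.to_string();
            thread::spawn(move || fake_fetch(&url))
        })
        .collect();
    // Joining in spawn order keeps the output order deterministic.
    handles
        .into_iter()
        .map(|h| h.join().expect("fetch thread panicked"))
        .collect()
}
```

The total wait is then roughly that of the slowest source rather than the sum of all of them.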

Highlight keywords

We should offer a keyword highlighting filter. Technically, it scans all text nodes in the body (and perhaps also the title?) and wraps each keyword in a span with a conspicuous background color.
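
A deliberately naive sketch of the wrapping step, applied to the text of a single text node. Walking the DOM to find text nodes is left out, the match is case-sensitive, and the yellow inline style is just a placeholder choice.

```rust
/// Wrap every occurrence of `keyword` in `text` with a highlighted span.
/// `text` is assumed to be the content of one text node, not raw HTML.
fn highlight(text: &str, keyword: &str) -> String {
    if keyword.is_empty() {
        return text.to_string();
    }
    let wrapped = format!(
        "<span style=\"background-color: yellow\">{}</span>",
        keyword
    );
    text.replace(keyword, &wrapped)
}
```

Running this over raw HTML instead of text nodes could corrupt tags and attributes that happen to contain the keyword, hence the text-node restriction.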

Help is needed as `remove_regex` is not working as expected

Could you please assist me? I'm attempting to remove some text, but it's not functioning as anticipated.

source: https://www.thenewsminute.com/api/v1/collections/tamil-nadu.rss

  - path: /newsminute.xml
    filters:
      - sanitize:
          - remove_regex: '<p><strong>Read.*?<\/strong><\/p>'

Allow creating new feed from scratch

This would be helpful in cases where you want to merge a lot of feeds together. Then you can just make a new feed as:

- path: /my-lovely-feed.xml
  source:
    from_scratch:
      title: My Lovely Feed Aggregation
      link: http://not-available.com
  filters:
    - merge: http://feed1.xml
    - merge: http://feed2.xml

Reload on config file change

Add a -w/--watch option to server that reloads the server on config file changes.

Check for duplicated endpoints

Only include the body for html source

Currently, feeds with an HTML source are converted into a singleton RSS feed with the full HTML markup included in the article's description. However, this may break the preview in feed readers because the "<html>" tag is unexpected in the article's description.

A better way to deal with it may be to cleanse the markup a bit: only include the body's content (not including the outermost "<body>" tag).

This also applies to the full_text filter.
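
A naive string-based sketch of "only include the body's content". A real implementation would more likely use an HTML parser; body_inner is a made-up name, and the function simply returns the input unchanged when no <body> tag is found.

```rust
/// Return the markup between the opening <body ...> tag and the last </body>.
/// Falls back to the whole input when there is no <body> tag.
fn body_inner(html: &str) -> &str {
    // ASCII lower-casing keeps byte offsets identical to the original string.
    let lower = html.to_ascii_lowercase();
    let start = match lower.find("<body") {
        Some(open) => match lower[open..].find('>') {
            Some(gt) => open + gt + 1,
            None => return html,
        },
        None => return html,
    };
    let end = lower[start..]
        .rfind("</body")
        .map(|i| start + i)
        .unwrap_or(html.len());
    html[start..end].trim()
}
```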

Error on docker compose

Hi.
When I launch docker compose up with the example docker-compose.yaml

version: "3.8"
services:
  rss-funnel:
    image: ghcr.io/shouya/rss-funnel:latest
    ports:
      - 4080:4080
    volumes:
      - ./funnel.yaml:/funnel.yaml
    environment:
      RSS_FUNNEL_CONFIG: /funnel.yaml
      RSS_FUNNEL_BIND: 0.0.0.0:4080

✔ Network rss-funnel_default Created 0.1s
✔ Container rss-funnel-rss-funnel-1 Created 0.0s
Attaching to rss-funnel-1
rss-funnel-1 | Error: Config(Yaml(Error("Is a directory (os error 21)")))
rss-funnel-1 exited with code 1

I get this error.
What can I do?

Convert feed format

Add a filter to convert RSS to Atom and vice versa.

Empty Content and Parsing Error When Aggregating YouTube Channels

Thank you for the software.

I attempted to aggregate content from multiple YouTube channels using the provided code. However, upon inspection in the UI, it appears that there is no rendered content, although RAW and JSON data can be generated successfully.

When using the feed reader Yarr, only the content is empty. Additionally, when trying with the WebExtension rsspreview to preview RSS feeds in the browser, it throws an error XML Parsing Error: prefix not bound to a namespace.

endpoints:
  - path: /youtube.xml
    source:
      format: atom
      title: YouTube
      link: https://www.youtube.com
      description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
              - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg

Feature Request: Docker Should Work Using Default Values Without Requiring Environment Variables

When deploying a Docker container without specifying environment variables, the container should utilize default values for configuration parameters, ensuring smooth operation without requiring additional user input.

Docker image not accessible

Hi, while wanting to try rss-funnel I saw that the docker image you mention in your docker-compose example (ghcr.io/shouya/rss-funnel:0.0.1) is not available.

I think Github registries are set to private by default, so this may be the problem? Or should I just build it with my own Dockerfile?

Bug: blank `client` field in `merge` filter

Google News, Meneame,...Original sources

Would it be possible, with JS, to extract the original source URL of articles from the feeds of Google News, Meneame, and similar sites? Normally the URL provided by the feed points to their servers, which then provide the original URL. Could rss-funnel extract the original URL and return the feed with the original title and source?

Here are some examples of feeds
Google News: https://news.google.com/rss/search?q=openai&hl=es&gl=ES&ceid=ES:es

Meneame is a well-known social network in Spain and acts in a similar way: https://www.meneame.net/rss2.php

Issue with Discard/Keep_Only Filter Not Matching Inside Non-Standard Fields

Currently, there is an issue with the discard/keep_only filter not matching inside non-standard fields, specifically the media:description tag within the media namespace.

Feature Request: Simplified Feed Filtering with Expanded Query Parameter Support

Currently, if I want to remove short videos from a YouTube channel feed and do not want to merge, I have to write specific code for each individual feed. This process seems inefficient. Instead of this, I suggest allowing the use of ?url= in the query parameter for the source field.

For instance, instead of:

  - path: /youtube-kurzgesagt.xml
    source: https://www.youtube.com/feeds/videos.xml?channel_id=UCsXVk37bltHxD1rDPwtNM8Q
    filters:
      - discard:
          field: title
          matches: shorts

We could have a more generic endpoint like:

  - path: /remove-shorts.xml
    filters:
      - discard:
          field: title
          matches: shorts

Then, we could utilize the ?url= parameter and call it to perform an action on the feed URL, like so:

http://127.0.0.1:4080/remove-shorts.xml?url=https://www.youtube.com/feeds/videos.xml?channel_id=UCsXVk37bltHxD1rDPwtNM8Q

This would streamline the process.

Error building on arm64

I'm trying to build the Docker image on arm64 and I get an error with rquickjs:

error: couldn't read /home/xshyne/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rquickjs-sys-0.4.0/src/bindings/aarch64-unknown-linux-gnu.rs: No such file or directory (os error 2)
  --> /home/xshyne/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rquickjs-sys-0.4.0/src/lib.rs:16:1
   |
16 | include!(concat!("bindings/", bindings_env!("TARGET"), ".rs"));
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in the macro `include` (in Nightly builds, run with -Z macro-backtrace for more info)

The following warnings were emitted during compilation:

warning: rquickjs probably doesn't ship bindings for platform `aarch64-unknown-linux-gnu`. try the `bindgen` feature instead.

error: could not compile `rquickjs-sys` (lib) due to previous error

It seems that it is not possible to build on arm64 with version 0.4.0 of rquickjs. It's fixed in version 0.5.0.
The file aarch64-unknown-linux-gnu.rs got deleted during the 0.4.0 beta and was added back in 0.4.3.

Capture `console.log` outputs on Inspector UI

It would be easier for debugging as one doesn't have to constantly look at the console output.

Feed icon

Could you add an icon option to the configuration file? Currently there are options for the title, the description, and even the feed type, but a feed generated with the merge option has no image when you subscribe to it.

RSS_FUNNEL_WATCH=true is unreliable

I'm running rss_funnel via a docker compose file with the RSS_FUNNEL_WATCH: true inside the environment: block.

I start up rss_funnel with a valid funnel.yaml file and the appropriate RSS_FUNNEL_CONFIG environment setting. The logs report:

rssfunnel-1  | 2024-03-12T14:35:04.534164Z  INFO rss_funnel::server: watching config file for changes
...<snipped for brevity>...
rssfunnel-1  | 2024-03-12T14:35:04.674727Z  INFO rss_funnel::server: starting server

Then I edit my funnel.yaml to completely remove one of the endpoints. The logs then report:

rssfunnel-1  | 2024-03-12T14:42:14.539908Z  INFO rss_funnel::server: config updated, reloading service

After that I go to my browser, reload the inspector page and also click the "Reload Config" button. I would expect the endpoint I just deleted to be absent from the inspector UI, but it is still present.

I can mostly get the logs to report a successful reload if I change a config from invalid to valid. However even then it doesn't appear to always work. (Sorry I don't have more concrete reproduction steps).

Authentication for UI

I have set up the service behind a reverse proxy and exposed both the feeds and the administration UI (which is fantastic, a great idea). I think the UI should support setting a username and password, while the feeds on the same server stay openly accessible over the network.

Could you implement this feature?

A less verbose js filter syntax

Currently, at a bare minimum, one needs to write something like:

          function modify_post(feed, post) {
            return post;
          }

to make the js filter work. Any modification made to post needs to be assigned back and returned. This is quite cumbersome and error-prone. Also, in practice, I expect most modifications to be small, and many can probably be done in one or two lines. It would be nice to have a shorter syntax for short JS code.

I'm considering syntax like this:

filters:
  - modify_post: post.title = `${post.title} - ${feed.title}`
  - modify_feed: feed.items = feed.items.slice(0, 5)

Allow calling existing endpoints recursively

Currently, if we want to refer to an endpoint inside another endpoint (e.g. in a merge filter), we have to write out the absolute URL explicitly.

- path: /generate.xml
  filters: <some filters>
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: http://my-domain/generate.xml?source=https://source1/page.html
    - merge: http://my-domain/generate.xml?source=https://source2/page.html
    - merge: http://my-domain/generate.xml?source=https://source3/page.html

It would be nice if we can omit that like this:

- path: /generate.xml
  filters: <some filters>
- path: /my-feed.xml
  source: <some source feed>
  filters:
    - merge: /generate.xml?source=https://source1/page.html
    - merge: /generate.xml?source=https://source2/page.html
    - merge: /generate.xml?source=https://source3/page.html

It would be better if we could detect cycles and stop the infinite chain. But this shouldn't be much of an issue, given that the timeout setting is there to avoid leaking.

The host name can either be inferred from the Host header, or the request can be converted into an internal call.
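
Cycle detection could be sketched as a depth-first walk that remembers the chain of endpoints currently being resolved. The Config map below (endpoint path to the local paths it merges) is a simplification invented for this example.

```rust
use std::collections::HashMap;

/// Hypothetical config: each endpoint path maps to the local paths it merges.
type Config = HashMap<String, Vec<String>>;

/// Walk the merge graph depth-first; `stack` holds the chain being resolved.
/// Returns an error describing the chain as soon as a path repeats.
fn check_cycles(config: &Config, path: &str, stack: &mut Vec<String>) -> Result<(), String> {
    if stack.iter().any(|p| p == path) {
        stack.push(path.to_string());
        return Err(format!("cycle detected: {}", stack.join(" -> ")));
    }
    stack.push(path.to_string());
    if let Some(children) = config.get(path) {
        for child in children {
            check_cycles(config, child, stack)?;
        }
    }
    stack.pop();
    Ok(())
}
```

Such a check could run once at config load time, so a cyclic config is rejected up front instead of relying on the timeout.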

Make the Inspector UI more smooth

Blocking on fetch: rss-funnel is essentially a reverse proxy. There is little we can do to make the upstream respond faster. But we can certainly further avoid blocking the ui and perhaps show a nicer loading animation.
Another source of choppiness is the occasional GC pauses. I need to look into the root cause of this.
Add debouncing logic for fetching to avoid issuing repeated heavy actions.
Make sure only one fetch request is handled at a time. If a new one comes in before the old one finishes, cancel the old one.
~~Cache more aggressively, add the cache for 5xx and 4xx errors as well.~~

Broken encoding for non-UTF-8 feed

Reported by @uGeek in #59.

The problem was due to an excessive conversion to utf-8. The first conversion to utf-8 happens during feed fetch (due to content-type: text/xml; charset=iso-8859-1), and then it gets converted again in the feed parser (due to <?xml version="1.0" encoding="ISO-8859-1"?>).

I'm able to fix the problem by skipping one of the conversion steps.
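
To show why converting twice garbles the text, here is a tiny sketch of an ISO-8859-1 decoder applied once and then (wrongly) a second time. The function names are made up for the example and are unrelated to the actual fix.

```rust
/// Decode ISO-8859-1 bytes: each byte maps to the code point of the same value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Reproduce the bug: treat already-decoded UTF-8 as ISO-8859-1 again.
fn decode_twice(bytes: &[u8]) -> String {
    let once = latin1_to_utf8(bytes);
    latin1_to_utf8(once.as_bytes())
}
```

For the ISO-8859-1 bytes of "café" (0x63 0x61 0x66 0xE9), a single decode yields "café", while the double decode yields "cafÃ©", because the two UTF-8 bytes of "é" are each read as a separate Latin-1 character.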

Publish date selector for `split` filter

config error

An error occurs when trying to use the attribute in the configuration of the RSS funnel, as documented in the wiki.

Error 1:

Error: Config(Yaml(Error("endpoints[0]: data did not match any variant of untagged enum SourceConfig", line: 2, column: 5)))

endpoints:
  - path: /youtube.xml
    source:
      from_scratch:
        format: atom
        title: YouTube
        link: https://www.youtube.com
        description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
              - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg

Error 2:

Error: Config(Yaml(Error("endpoints[0]: data did not match any variant of untagged enum MergeConfig", line: 2, column: 5)))

endpoints:
  - path: /youtube.xml
    source:
      format: atom
      title: YouTube
      link: https://www.youtube.com
      description: Aggregate content from multiple YouTube channels and exclude YouTube shorts
    filters:
      - merge:
          filters:
            - discard:
              field: title
              matches:
                - shorts
          source:
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCiPmhfdCL06cSVTXKabF0Zg
            - https://www.youtube.com/feeds/videos.xml?channel_id=UCueYcgdqos0_PzNOq81zAFg

I have the ability to edit the wiki without requiring permission, but I am unable to revert changes. Could you please verify this as well? Thank you.

A comment filter

It would be nice to be able to place comments in the filter list.
