html2rss / html2rss-web

🕸 Create custom RSS feeds from any website with ease! Quick setup with Docker. Use built-in configs or tailor your own. Stay updated effortlessly.

Home Page: https://html2rss.github.io/components/html2rss-web

License: MIT License


html2rss-web's Introduction


html2rss-web

This web application scrapes websites to build and deliver RSS 2.0 feeds.

Features:

The functionality of scraping websites and building the RSS feeds is provided by the Ruby gem html2rss.

Get started

This application should be used with Docker. It is designed to require as little maintenance as possible. See Versioning and Releases and consider automatic updates.

With Docker

docker run -p 3000:3000 gilcreator/html2rss-web

Then open http://127.0.0.1:3000/ in your browser and click the example feed link.

This is the quickest way to get started. However, it's also the option with the least flexibility: it doesn't allow you to use custom feed configs and doesn't update automatically.

If you want more flexibility and automatic updates, read on to get started with docker compose.

With docker compose

Create a docker-compose.yml file and paste the following into it:

services:
  html2rss-web:
    image: gilcreator/html2rss-web
    ports:
      - "3000:3000"
    volumes:
      - type: bind
        source: ./feeds.yml
        target: /app/config/feeds.yml
        read_only: true
    environment:
      - RACK_ENV=production
      - HEALTH_CHECK_USERNAME=health
      - HEALTH_CHECK_PASSWORD=please-set-YOUR-OWN-veeeeeery-l0ng-aNd-h4rd-to-gue55-Passw0rd!
  watchtower:
    image: containrrr/watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - "~/.docker/config.json:/config.json"
    command: --cleanup --interval 7200

Start it up with: docker compose up.

If you have not created your feeds.yml yet, download this feeds.yml as a blueprint into the directory containing the docker-compose.yml.

Docker: Automatically keep the html2rss-web image up-to-date

The watchtower service automatically pulls running Docker images and checks for updates. If an update is available, it will automatically start the updated image with the same configuration as the running one. Please read its manual.

The docker-compose.yml above contains a service description for watchtower.

How to use the included configs

html2rss-web comes with many feed configs out of the box. See the file list of all configs.

To use a config from there, build the URL like this:

lib/html2rss/configs/domainname.tld/whatever.yml
Would become this URL:
http://localhost:3000/domainname.tld/whatever.rss
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
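
The mapping from a config file path to its feed URL can be sketched in a few lines of Ruby. This is only an illustration: `feed_url_for` and `CONFIG_ROOT` are hypothetical names, not part of html2rss-web, and the base URL assumes the default local port.

```ruby
# Hypothetical helper for illustration; html2rss-web does not ship this.
CONFIG_ROOT = 'lib/html2rss/configs/'

# Turns a path from the configs file list into the URL of the feed.
def feed_url_for(config_path, base_url: 'http://localhost:3000')
  relative = config_path.delete_prefix(CONFIG_ROOT) # domainname.tld/whatever.yml
  feed_path = relative.sub(/\.yml\z/, '.rss')       # swap the file extension
  "#{base_url.chomp('/')}/#{feed_path}"
end

puts feed_url_for('lib/html2rss/configs/domainname.tld/whatever.yml')
# prints http://localhost:3000/domainname.tld/whatever.rss
```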

How to build your RSS feeds

To build your own RSS feed, you need to create a feed config.
That feed config goes into the file feeds.yml.
Check out the example feed config.

Please refer to html2rss' README for a description of the feed config and its options. html2rss-web is just a small web application that depends on html2rss.

Versioning and releases

This web application is distributed in a rolling release fashion from the master branch.

For the latest commit passing GitHub CI/CD on the master branch, an updated Docker image will be pushed to Docker Hub: gilcreator/html2rss-web.

GitHub's @dependabot is enabled for dependency updates and they are automatically merged to the master branch when the CI gives the green light.

If you use Docker, you should update to the latest image automatically by setting up watchtower as described.

Use in production

This app is published on Docker Hub and therefore easy to use with Docker.
The above docker-compose.yml is a good starting point.

If you're going to host a public instance, please, please, please:

Supported ENV variables

Name                          Description
PORT                          default: 3000
RACK_ENV                      default: 'development'
RACK_TIMEOUT_SERVICE_TIMEOUT  default: 15
WEB_CONCURRENCY               default: 2
WEB_MAX_THREADS               default: 5
HEALTH_CHECK_USERNAME         default: auto-generated on start
HEALTH_CHECK_PASSWORD         default: auto-generated on start

Runtime monitoring via GET /health_check.txt

It is recommended to set up monitoring of the /health_check.txt endpoint. With that, you can find out when one of your own configs breaks. The endpoint uses HTTP Basic authentication.

First, set the username and password via these environment variables: HEALTH_CHECK_USERNAME and HEALTH_CHECK_PASSWORD. If these are not set, html2rss-web will generate a new random username and password on each start.

An authenticated GET /health_check.txt request will respond with:

  • If the feeds are generatable: success.
  • Otherwise: the names of the broken configs.
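
For a self-made monitor, the request and the interpretation of the response could look roughly like the following Ruby sketch. All method names are hypothetical. The source only says that the body is either success or the names of the broken configs, so splitting those names one per line is an assumption.

```ruby
require 'net/http'
require 'uri'

# Builds a GET request that carries HTTP Basic credentials.
def health_check_request(uri, username, password)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  request
end

# True only when the endpoint reports that all feeds are generatable.
def healthy?(body)
  body.to_s.strip == 'success'
end

# Names of the broken configs. Assumption: one name per line.
def broken_configs(body)
  return [] if healthy?(body)

  body.to_s.split(/\r?\n/).map(&:strip).reject(&:empty?)
end

# Performs the request and returns the response body.
def fetch_health_body(url, username, password)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(health_check_request(uri, username, password))
  end
  response.body
end
```

A call such as `fetch_health_body('http://localhost:3000/health_check.txt', 'health', password)` would then feed `healthy?` or `broken_configs`.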

To get notified when one of your configs breaks, set up monitoring of this endpoint.

UptimeRobot's free plan is sufficient for basic monitoring (every 5 minutes).
Create a monitor of type Keyword with this information and make it aware of your username and password:

A screenshot showing the Keyword Monitor: a name, the instance's URL to /health_check.txt, and an interval.

Setup for development

Check out the git repository and…

Using Docker

This approach allows you to experiment without installing Ruby on your machine. All you need to do is install and run Docker.

# Build image from Dockerfile and name/tag it as html2rss-web:
docker build -t html2rss-web -f Dockerfile .

# Run the image and name it html2rss-web-dev:
docker run \
  --detach \
  --mount type=bind,source=$(pwd)/config,target=/app/config \
  --name html2rss-web-dev \
  html2rss-web

# Open an interactive TTY with the shell `sh`:
docker exec -ti html2rss-web-dev sh

# Stop and clean up the container
docker stop html2rss-web-dev
docker rm html2rss-web-dev

# Remove the image
docker rmi html2rss-web

Using installed Ruby

If you're comfortable with installing Ruby directly on your machine, follow these instructions:

  1. Install Ruby >= 3.2
  2. gem install bundler foreman
  3. bundle
  4. foreman start

html2rss-web now listens on port 3000 for requests.

Contribute

Contributions are welcome!

Open a pull request with your changes,
open an issue, or
join discussions on html2rss.

html2rss-web's People

Contributors

dependabot-preview[bot], dependabot[bot], gil-robot, gildesmarais, github-actions[bot], mabeett, renovate-bot, renovate[bot], snyk-bot, vvzvlad


html2rss-web's Issues

Make HealthCheck checks concurrently

The HealthChecks run sequentially. That works fine for one or two custom configs, but could become slow with more feeds.

HealthCheck.run should request feeds concurrently (e.g. using Ruby's Fiber or Ractor).
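
A minimal sketch of the idea, using plain threads rather than the Fiber or Ractor mentioned above because they need no extra setup. `broken_feeds` and `checker` are hypothetical; `checker` stands in for whatever builds a single feed and raises when that fails. It is not the actual HealthCheck code.

```ruby
# Checks every feed in its own thread and returns the names whose
# check raised an error, in the original order.
def broken_feeds(feed_names, checker)
  threads = feed_names.map do |name|
    Thread.new do
      checker.call(name)
      nil   # the check passed
    rescue StandardError
      name  # the check failed; report this feed
    end
  end

  # Thread#value waits for the thread to finish and returns its result.
  threads.map(&:value).compact
end
```

Since the checks are mostly waiting on network responses, threads let the requests overlap even under Ruby's global VM lock.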

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses.

This repository currently has no open or pending branches.

Wrong result for different dynamic parameter value

I am using the latest Docker image with dynamic parameters for a feed, and I sometimes see the result for one parameter value returned for another.

Here I attach a bash shell script with

  • a docker-compose sample project with a Flask web application that returns a predictable result.
  • a loop requesting different predictable results from html2rss-web.

Briefly, when I run it, the queries from the second one onward return the wrong result.
The output is just the first result repeated.

+ curl -s 'http://localhost:3001/failer.rss?id=1otherstring'
      <description>2023-04-16 15:34:23.958242 - foo random description</description>

while the output should be:

+ curl -s 'http://localhost:3001/failer.rss?id=1otherstring'
      <description>2023-04-16 15:34:23.958242 - 1otherstring random description</description>

Let me know if you need further information,
thanks in advance,

[ edit: WEB_MAX_THREADS=1, WEB_CONCURRENCY=1 ]

script launch

.  ./the_script_from_details.sh

Test Script:

#!/bin/bash
set -x

cd `mktemp -d`

mkdir -p templates/
cat <<EOF >app.py
from flask import Flask
from flask import render_template
from datetime import datetime
app = Flask(__name__)

@app.route('/')
@app.route('/<name>')
def hello(name=None):
    return render_template('hello.html', name=name, date=datetime.now())

if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8000)
EOF

cat <<EOF >templates/hello.html
<!DOCTYPE html>
<html>
<head>
{% if name %}
<title>Title for {{ name }}</title>
{% else %}
<title>Title {{ name }}</title>
{% endif %}
</head>
<body>
  <div class="foo">
{% if name %}
  <h1><a href="https://google.com/search?q={{ date}}">Hello {{ name }}! at {{ date }}</a></h1>
  <div class="item">{{ date }} - {{ name }} random description</div>
{% else %}
  <h1>Hello, World!</h1>
  <div class="item">nada - nada random description</div>
{% endif %}
  </div>
</body>
</html>
EOF

cat <<'EOF' >Dockerfile 
FROM python:3.10-alpine AS builder
WORKDIR /app
RUN pip3 install Flask==2.2.3
RUN mkdir -p /app/templates
COPY . /app
ENTRYPOINT ["python3"]
CMD ["app.py"]
FROM builder as dev-envs
RUN \
  apk update && \
  apk add git
EOF

cat <<'EOF' >docker-compose.yml
version: "3.2"
services:
  web: 
    build:
      context: ./
    image: local/someflask
    stop_signal: SIGINT
    ports:
      - '8000:8000'
  html2rss-web:
    image: gilcreator/html2rss-web
    ports:
      - "3001:3000"
    volumes:
      - type: bind
        source: ./feeds.yml
        target: /app/config/feeds.yml
        read_only: true
    environment:
      - RACK_ENV=production
      - HEALTH_CHECK_USERNAME=health
      - HEALTH_CHECK_PASSWORD=please-set-YOUR-OWN-veeeeeery-l0ng-aNd-h4rd-to-gue55-Passw0rd 
      - WEB_MAX_THREADS=1
      - WEB_CONCURRENCY=1
EOF

cat <<'EOF' >feeds.yml
stylesheets:
  - href: "/rss.xsl"
    media: "all"
    type: "text/xsl"
headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
feeds:
  failer:
    channel:
      url: http://web:8000/%<id>s
      title: Test web - %<id>s
      ttl: 120
    selectors:
      items:
        selector: "div.foo"
      title:
        selector: "h1"
      link:
        selector: "h1 > a"
        extractor: "href"
      description: 
        selector: "div.item"
EOF

docker-compose up --build -d 

while sleep 1; do curl -s http://localhost:3001/ >/dev/null && break ; done 

sleep 3 
# Quote the URLs so the shell does not treat `?` as a glob character.
curl -s 'http://localhost:3001/failer.rss?id=foo' | grep 'description' | tail -n 1
curl -s 'http://localhost:3001/failer.rss?id=foo' | grep 'description' | tail -n 1
for item in {1..10}
do
  curl -s "http://localhost:3001/failer.rss?id=${item}otherstring" | grep 'description' | tail -n 1
  sleep 10
done
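
The `%<id>s` placeholders in the feeds.yml above have the same shape as named references in Ruby's `Kernel#format`. Assuming html2rss expands them in that way (this is an assumption, not confirmed here), the following sketch shows the URL each request would be expected to scrape. `expand` is a hypothetical helper.

```ruby
URL_TEMPLATE = 'http://web:8000/%<id>s'

# Expands named references such as %<id>s with the given query parameters.
def expand(template, params)
  format(template, params.transform_keys(&:to_sym))
end

%w[foo 1otherstring].each do |id|
  puts expand(URL_TEMPLATE, 'id' => id)
end
# prints:
# http://web:8000/foo
# http://web:8000/1otherstring
```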

Add to snapcraft store

  • add .snapcraft.yml which runs this project as daemon
  • allow usage of a custom feeds.yml
  • run snap on ci and request a feed (similar to docker)
  • on master branch, push new snap version
  • add instructions to readme

html2rss-web with traefik: not proxying, container stuck in "starting"

Hello,

I am trying to use html2rss-web with Docker and Traefik.

My docker-compose.yml:

version: '3.3'
services:
  html2rss:
    image: gilcreator/html2rss-web
    container_name: html2rss
    volumes:
      - type: bind
        source: ./feeds.yml
        target: /app/config/feeds.yml
        read_only: true
    environment:
      - PORT=80
      - RACK_ENV=production
      - HEALTH_CHECK_USERNAME=health
      - HEALTH_CHECK_PASSWORD=xxx
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.html2rss.entrypoints=http"
      - "traefik.http.routers.html2rss.rule=Host(`html2rss.xxx.de`)"
      - "traefik.http.middlewares.html2rss-https-redirect.redirectscheme.scheme=https"
      - "traefik.http.routers.html2rss.middlewares=html2rss-https-redirect"
      - "traefik.http.routers.html2rss-secure.entrypoints=https"
      - "traefik.http.routers.html2rss-secure.rule=Host(`html2rss.xxx.de`)"
      - "traefik.http.routers.html2rss-secure.tls=true"
      - "traefik.http.routers.html2rss-secure.tls.certresolver=http"
      - "traefik.http.routers.html2rss-secure.service=html2rss"
      - "traefik.http.services.html2rss.loadbalancer.server.port=80"
      - "traefik.docker.network=proxy"
    networks:
      - proxy

networks:
  proxy:
    external: true

In Traefik's log I find

time="2023-12-22T20:29:48+01:00" level=debug msg="Filtering unhealthy or starting container" container=html2rss-html2rss-web-7e3d7a19bf11ae0876fd5848850fa3ddc638c59f11ccbaf8072734db5cbaa012 providerName=docker

Traefik's view is correct:

root# docker ps 
CONTAINER ID   IMAGE                             COMMAND                  CREATED         STATUS                            PORTS                                                                      NAMES
7e3d7a19bf11   gilcreator/html2rss-web           "bundle exec 'puma -…"   7 minutes ago   Up 7 minutes (health: starting)   3000/tcp                                                                   html2rss

How can I change the setup to have html2rss-web signal it is healthy?
I can access the container's web interface via the container IP on port 80.

Thanks
M

Request proxy support

Some websites have strict anti-crawler policies and require the use of a proxy to access them normally.

Respond to handled errors using the accepted content-type (i.e. in RSS)

Goal: Respond to handled errors using the accepted content-type (i.e. in RSS).

Example Scenario: when an invalid feed is requested, return a 404 with a response body rendered as RSS/XML.

Implementation idea:
hook into error handler, or investigate the use of https://roda.jeremyevans.net/rdoc/classes/Roda/RodaPlugins/TypeRouting.html

  • the response should contain helpful information, but not reveal 'internals'
  • The channel language must be changed to English
  • if html2rss-config: provide description containing link to edit the config on Github
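
One possible shape for such an error body, as a sketch only: `error_feed` and `xml_escape` are hypothetical helpers, and how the result would be wired into the Roda error handling is left open. The channel carries the three elements RSS 2.0 requires (title, link, description) and an English language tag.

```ruby
XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&apos;'
}.freeze

# Escapes the characters that are special in XML text.
def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Renders a minimal RSS 2.0 document describing a handled error.
# Pass only messages that are safe to show to the requester.
def error_feed(title:, message:, link:)
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>#{xml_escape(title)}</title>
        <link>#{xml_escape(link)}</link>
        <description>#{xml_escape(message)}</description>
        <language>en</language>
      </channel>
    </rss>
  XML
end
```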

500 Internal Error // Faraday::ConnectionFailed

Hello, I recently came across your project. While testing it out, I'm having some issues. I created a folder with two files:

docker-compose.yml

version: "3"
services:
  html2rss-web:
    image: gilcreator/html2rss-web
    ports:
      - "3000:3000"
    volumes:
      - type: bind
        source: ./feeds.yml
        target: /app/config/feeds.yml
        read_only: true
    environment:
      - RACK_ENV=production
      - HEALTH_CHECK_USERNAME=health
      - HEALTH_CHECK_PASSWORD=please-set-YOUR-OWN-veeeeeery-l0ng-aNd-h4rd-to-gue55-Passw0rd!
     

and

feeds.yml

stylesheets:
  - href: "/rss.xsl"
    media: "all"
    type: "text/xsl"
headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
feeds:
  # your custom feeds go here:
  example:
    channel:
      url: https://www.reuters.com/technology/
      language: en
      ttl: 360
      time_zone: UTC
    selectors:
      items:
        selector: '[class^="story-collection"] > li'
      title:
        selector: h3
      link:
        selector: a:first
        extractor: href
      description:
        selector: p

No matter which config example I use, it returns the same error (this is a screenshot when using 120 commits config):

[screenshot]

These are the logs:

[+] Building 0.0s (0/0)
[+] Running 1/0
 ✔ Container feeds-html2rss-web-1  Created  0.0s
Attaching to feeds-html2rss-web-1
feeds-html2rss-web-1  | [1] Puma starting in cluster mode...
feeds-html2rss-web-1  | [1] * Puma version: 6.4.0 (ruby 3.2.2-p53) ("The Eagle of Durango")
feeds-html2rss-web-1  | [1] *  Min threads: 5
feeds-html2rss-web-1  | [1] *  Max threads: 5
feeds-html2rss-web-1  | [1] *  Environment: production
feeds-html2rss-web-1  | [1] *   Master PID: 1
feeds-html2rss-web-1  | [1] *      Workers: 2
feeds-html2rss-web-1  | [1] *     Restarts: (✔) hot (✖) phased
feeds-html2rss-web-1  | [1] * Preloading application
feeds-html2rss-web-1  | [1] * Listening on http://0.0.0.0:3000
feeds-html2rss-web-1  | [1] Use Ctrl-C to stop
feeds-html2rss-web-1  | [1] - Worker 0 (PID: 7) booted in 0.0s, phase: 0
feeds-html2rss-web-1  | [1] - Worker 1 (PID: 9) booted in 0.0s, phase: 0
feeds-html2rss-web-1  | source=rack-timeout id=80dcab76-8dda-4ea9-9e18-82e536d4da82 timeout=15000ms state=ready at=info
feeds-html2rss-web-1  | source=rack-timeout id=80dcab76-8dda-4ea9-9e18-82e536d4da82 timeout=15000ms service=5605ms state=completed at=info

What could be the issue?

Possible issue around caching on apnews.com

When attempting to use the new configuration in the following PR: html2rss/html2rss-configs#176, I seem to have hit an issue around the caching of dynamic config files. I have yet to figure out a good set of steps to reproduce it or to find the root cause. What I have done is this:

Try to load section=trending-news, then shortly after attempt to load another, for example section=ukraine. It will either mix the stories from both trending-news and ukraine, or only load the previous trending-news stories under ukraine.

My feeling is that it might have to do with these pages taking a while to load, and there seem to be errors around timeouts. But this might be the wrong path to go down. You can see the logs below:

[1] Puma starting in cluster mode...
[1] * Puma version: 5.6.2 (ruby 3.1.1-p18) ("Birdie's Version")
[1] *  Min threads: 5
[1] *  Max threads: 5
[1] *  Environment: production
[1] *   Master PID: 1
[1] *      Workers: 2
[1] *     Restarts: (✔) hot (✖) phased
[1] * Preloading application
[1] * Listening on http://0.0.0.0:3000
[1] Use Ctrl-C to stop
[1] - Worker 0 (PID: 3) booted in 0.0s, phase: 0
[1] - Worker 1 (PID: 4) booted in 0.0s, phase: 0
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=0894f721-5348-4ba2-9fac-0fffcabb1078 timeout=15000ms service=1462ms state=completed at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms state=ready at=info
source=rack-timeout id=45b4009a-7b73-4652-b7d2-fc2e84178fe1 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms state=ready at=info
source=rack-timeout id=a2bd4d83-a7b9-420e-b131-668b5394abc3 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms state=ready at=info
source=rack-timeout id=1e97365c-5228-4d84-a3b2-c3b401cd06d5 timeout=15000ms service=4ms state=completed at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms state=ready at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms state=ready at=info
source=rack-timeout id=a2720804-beca-41de-9003-194396ef1a69 timeout=15000ms service=2ms state=completed at=info
source=rack-timeout id=9f3a171e-3956-47fc-8eed-cd3fe10a51a0 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms state=ready at=info
source=rack-timeout id=ad960dbb-d315-4752-8229-fc42159466d8 timeout=15000ms service=1ms state=completed at=info
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms state=ready at=info
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/href.rb:26: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/html.rb:25: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/static.rb:16: warning: redefining constant Struct::Options
/usr/local/bundle/bundler/gems/html2rss-a898923f92b2/lib/html2rss/item_extractors/text.rb:23: warning: redefining constant Struct::Options
source=rack-timeout id=806abc6b-43c8-4479-ba37-0ad93475f3a3 timeout=15000ms service=4094ms state=completed at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms state=ready at=info
source=rack-timeout id=2c4a2607-64e9-44ca-9732-bbaace89398b timeout=15000ms service=634ms state=completed at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms state=ready at=info
source=rack-timeout id=de17384a-22f5-4aac-8295-f79e2c6c5204 timeout=15000ms service=364ms state=completed at=info

Fetch full text by following the item URL?

When no description or text is available (only a heading and/or an image), is html2rss able to fetch the full text of an item by following the URL (loading a new page), then selecting the elements that contain the text, date, author etc – and adding them to the generated RSS feed?

I have tried to read the available documentation but could not find such a feature described. It would be quite handy to have such a feature 😏 One example of a site where this would be handy: https://www.op.no/debatt/

ArgumentError when setting RACK_TIMEOUT_SERVICE_TIMEOUT envvar

If the environment variable RACK_TIMEOUT_SERVICE_TIMEOUT is set to any value, html2rss-web will fail with the following error:

Puma caught this error: value "0" should be false, zero, or a positive number. (ArgumentError)
/html2rss/vendor/bundle/ruby/3.1.0/gems/rack-timeout-0.6.3/lib/rack/timeout/core.rb:57:in `read_timeout_property'
/html2rss/vendor/bundle/ruby/3.1.0/gems/rack-timeout-0.6.3/lib/rack/timeout/core.rb:71:in `initialize'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:393:in `new'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:393:in `block in build_rack_app'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:391:in `reverse_each'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:391:in `build_rack_app'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:34:in `app'
/html2rss/vendor/bundle/ruby/3.1.0/gems/roda-3.57.0/lib/roda.rb:53:in `call'
/html2rss/vendor/bundle/ruby/3.1.0/gems/rack-unreloader-2.0.0/lib/rack/unreloader.rb:87:in `call'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/configuration.rb:252:in `call'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/request.rb:77:in `block in handle_request'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/thread_pool.rb:340:in `with_force_shutdown'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/request.rb:76:in `handle_request'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/server.rb:441:in `process_client'
/html2rss/vendor/bundle/ruby/3.1.0/gems/puma-5.6.4/lib/puma/thread_pool.rb:147:in `block in spawn_thread'

This issue is caused by line 19 of the app.rb retrieving the environment variable without converting it to integer or boolean, and then passing it over to Rack::Timeout:

use Rack::Timeout, service_timeout: ENV.fetch('RACK_TIMEOUT_SERVICE_TIMEOUT', 15)

A simple fix would be to replace line 19 with
use Rack::Timeout
This would leave the envvar parsing up to rack-timeout's initialization function. Also, the default timeout used by rack-timeout (15s) is the same as the one specified in line 19, so there's no need to redefine it.
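
An alternative to dropping the option, sketched here under the assumption that the app wants to keep its own default, is to convert the value before handing it over. Environment variables always arrive as strings, which is why the string "0" was rejected. `service_timeout` is a hypothetical helper, not code from app.rb.

```ruby
DEFAULT_SERVICE_TIMEOUT = 15

# Reads the timeout from the environment and converts it to an Integer.
# Integer() raises ArgumentError for values such as "abc", so a typo
# fails loudly at boot instead of on the first request.
def service_timeout(env = ENV)
  Integer(env.fetch('RACK_TIMEOUT_SERVICE_TIMEOUT', DEFAULT_SERVICE_TIMEOUT))
end
```

Line 19 could then read `use Rack::Timeout, service_timeout: service_timeout`.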

feeds.yml not loaded using docker-compose.yml

I'm trying to run html2rss-web from docker using the docker-compose.yml provided, but when I open the web interface I just see the default example.rss, not mine.
[screenshot]

There's not a lot of logging telling me what's going on, so this is also a feature request: display on startup the config that is loaded and how it is interpreted.

This is my feeds.yml, which is mounted to /app/config/feeds.yml. Could it be that this is not the correct location?

stylesheets:
  - href: "/rss.xsl"
    media: "all"
    type: "text/xsl"
headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
feeds:
  # your custom feeds go here:
  samengezondindeventer:
    channel:
      url: https://samengezondindeventer.nl/nieuws
      title: "Samen gezond in Deventer"
      ttl: 120
    selectors:
      items:
        selector: "article"
      title:
        selector: "h2 a"
      link:
        selector: "a"
        extractor: "href"

Support for sub-path reverse proxy

Hi,

Is there a way to host this site under a sub-path reverse proxy? I mean something like https://nosuchdomain.invalid/html2rss, with an Nginx config like this:

upstream html2rss {
  server 127.0.0.1:3000;
}

server {
  # listen block and other sub-path sites would go here
  location /html2rss/ {
    proxy_pass http://html2rss/;
  }
}

When I use the above Nginx site config, I'm able to download the XML files for the RSS feeds, but the browser preview does not work, since the html2rss-web site assumes that its asset files (like styles.css) are located at the root URL and does not add the /html2rss prefix.

Is there a configuration option for adding a prefix to the URLs of the site's assets? I could not find any myself...

In any case, thanks for the help and especially for this project, it's extremely useful! :)

Selectors

First, my compliments for the interesting project!

What I did:

  1. cloned html2rss-web repo
  2. built and ran container
  3. accessed rss feeds from html2rss-configs
  4. tried to generate my own feeds from .yml files and added a couple into config.yml

Basically, there are 4 main aspects of the problem:

  1. It seems that some selectors (e.g. "description") cause an "Internal Server Problem". Try "https://some.domain/webentwickler-jobs.de/in.rss"

  2. Other selectors, e.g. "update", are ignored. Basically, only "title" and "link" are displayed in the RSS feeds.

  3. If you "docker exec -it ..." into the container to generate feeds from .yml, you face the following problem:
    [screenshot]

gem env:
[screenshot]

bundler -v:
Bundler version 2.2.18

Gemfile.lock:
... BUNDLED WITH 2.2.17 ...

  4. The /health_check.txt endpoint doesn't work: it returns "success" alongside "Internal Server Problem".
