
readablewebproxy's Introduction

Readable-Web Proxy

Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.

This is a rewriting proxy. In other words, it proxies arbitrary web content, while allowing the remote content to be rewritten as driven by a set of rule-files. The goal is to allow the complete customization of any existing website, driven by predefined rules.

Functionally, it's used for extracting just the actual content body of a site and reproducing it in a clean layout. It also rewrites all links on the page to point to internal addresses, so following a link leads to the proxied version of the page, rather than the original.
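The link-rewriting step can be sketched as follows. The `/render?url=` route and the `rewrite_links` helper are illustrative names only, not the project's actual API:

```python
import re
import urllib.parse

PROXY_PREFIX = "/render?url="  # assumed internal proxy route


def rewrite_links(html: str, base_url: str) -> str:
    """Rewrite every href so it points back through the proxy."""
    def repl(match):
        # Resolve relative links against the page being proxied.
        target = urllib.parse.urljoin(base_url, match.group(1))
        return 'href="%s%s"' % (PROXY_PREFIX, urllib.parse.quote(target, safe=""))
    return re.sub(r'href="([^"]+)"', repl, html)


print(rewrite_links('<a href="/ch2">Next</a>', "https://example.com/ch1"))
# → <a href="/render?url=https%3A%2F%2Fexample.com%2Fch2">Next</a>
```

A real implementation would parse the DOM rather than use a regex, but the principle is the same: every outbound link is resolved and re-pointed at an internal address.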


While the above was the original scope, the project has mutated heavily. At this point, it has a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (configurable on a per-domain or global basis).
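The refresh scheduling can be sketched as follows; `DOMAIN_REFRESH` and `GLOBAL_REFRESH` are hypothetical names for illustration (the real configuration lives in settings.py):

```python
import datetime
import urllib.parse

# Assumed settings: global default, with optional per-domain overrides.
GLOBAL_REFRESH = datetime.timedelta(days=30)
DOMAIN_REFRESH = {
    "example.com": datetime.timedelta(days=7),
}


def needs_refresh(url: str, last_fetched: datetime.datetime,
                  now: datetime.datetime) -> bool:
    """A page is due for a re-fetch when its newest stored version is
    older than the refresh interval for its domain."""
    domain = urllib.parse.urlsplit(url).netloc
    interval = DOMAIN_REFRESH.get(domain, GLOBAL_REFRESH)
    return now - last_fetched > interval
```

The scheduler would walk the archive, re-fetch any page for which `needs_refresh()` is true, and store the fetched content as a new version alongside the old ones.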

There are also a lot of facilities responsible for feeding the releases/RSS views as part of wlnupdates.com.


Quick installation overview:

  • Install Redis
  • (optional) install InfluxDB
  • (optional) install Graphite
  • Install Postgresql >= 10.
  • Build the community extensions for Postgresql.
  • Create a database for the project.
  • In the project database, install the pg_trgm and citext extensions from the community extensions modules.
  • Copy settings.example.py to settings.py.
  • Fill in all settings in settings.py
  • Set up the virtualenv by running build-venv.sh
  • Activate the virtualenv: source flask/bin/activate
  • Bootstrap the DB: alembic upgrade head
  • (on another machine/session) Run the local fetch RPC server (run_local.sh) from https://github.com/fake-name/AutoTriever
  • Run server: python3 run.py
  • If you want to run the spider, there are a LOT more components involved:
    • Main scraper is started by python runScrape.py
    • Raw scraper is started by python runScrape.py raw
    • Scraper periodic scheduler is started by python runScrape.py scheduler
    • The scraper requires substantial RPC infrastructure. You will need:
      • A RabbitMQ instance with a public DNS address
      • A machine running saltstack + salt-master with a public DNS address. On the salt machine, run https://github.com/fake-name/AutoTriever/tree/master/marshaller/salt_scheduler.py
      • A variable number of RPC workers to execute fetch tasks. The AutoTriever project can be used to manage these.
      • A machine to run the RPC local demultiplexing agent (run_agent.sh). The RPC agent allows multiple projects to use the RPC system simultaneously. Since the RPC system basically allows executing either predefined jobs or arbitrary code on the worker swarm, it's fairly useful in general, so I've implemented it as a service that multiple of my projects then use.
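The demultiplexing idea can be sketched like this (illustrative only; AutoTriever's actual agent is more involved and speaks AMQP):

```python
import queue


class DemuxAgent:
    """Route RPC responses to the project that issued the job.

    Each response is assumed to carry a 'project' envelope field, so one
    shared worker swarm can serve several local projects at once.
    """

    def __init__(self):
        self.project_queues = {}

    def register(self, project: str) -> "queue.Queue":
        # Each local project gets its own response queue.
        q = queue.Queue()
        self.project_queues[project] = q
        return q

    def dispatch(self, response: dict):
        # Deliver the response to whichever project asked for it.
        self.project_queues[response["project"]].put(response)
```

A project registers once, submits fetch jobs tagged with its name, and then only ever sees responses destined for it, even though all projects share the one RabbitMQ connection.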

Ubuntu dependencies

  • postgresql-common libpq-dev libenchant-dev
  • probably more I've forgotten

readablewebproxy's People

Contributors

dependabot[bot], fake-name, pyup-bot


readablewebproxy's Issues

Docker image

Can you provide a Docker image for this? It seems hard to set up. I want all the features, including AutoTriever, and I don't know what AMQP or saltstack are.

How exactly do you handle duplicate update?

Hi, would you mind explaining in a few sentences how you handle updates and duplicate checking?
I read on HN that you are using triggers for versioning; where exactly is that in the code?
Thank you very much for your attention!

What happened to your xA-Scraper project?

Hi,

Sorry, this isn't exactly an appropriate place to ask this, but I can't find your email address anywhere, so this is the only recourse I have. I was going to start using your xA-Scraper project to archive some Patreon posts from one of my supported content creators. I actually discussed this with you in a ticket I filed over there, and you even made some improvements to the program. But, just today, I discovered xA-Scraper is totally gone. I know you deleted your even older Patreon-Archiver project, but now you've deleted xA-Scraper too? Was it incorporated into this project? Please help me out here, as this deletion of repositories is more than a little jarring.

Thanks.
