muna

Clean a series of links, resolving redirects and finding Wayback results if page is gone

Contents

  1. About
  2. License
  3. Prerequisites
  4. Installation
  5. Usage
  6. TODO

1. About

Originally, muna was the unredirector for my program agaetr, but it grew to be quite a bit more multipurpose. (The script in agaetr has since been enhanced to be identical to muna in all but name.)

I ended up writing this because of ArchiveBox. It's a great self-hosted archiving system, but when you throw a random list of URLs (or worse, different types of RSS feeds) at it, you get... mixed results. It does not handle redirects too well, and if something is just a 404, you're out of luck. So I wrote feeds-in to preprocess inputs from both persnickety RSS feeds and plain lists of URLs. It's included here as an example of how to use muna and the bash function unredirector.

muna is an Old Norse word meaning "call to mind, remember".

2. License

This project is licensed under the Apache License. For the full license, see LICENSE.

3. Prerequisites

  • bash
  • awk
  • curl
  • wget
  • sed

On many Linux installations these are already installed; if not, they're in your package manager. (If you have to build these from source, you don't need me telling you how to do that!)
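A quick way to confirm the tools above are present is a `command -v` loop. This is a minimal sketch and not part of muna; the function name `missing_tools` is made up for illustration.

```bash
# Hypothetical helper (not part of muna): print each named tool that is
# not found on $PATH, and return non-zero if any are missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Check the prerequisites listed above.
missing_tools bash awk curl wget sed || echo "Install the tools printed above first." >&2
```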

4. Installation

muna

Clone or download this repository. Put muna.sh somewhere in your $PATH or call/source it explicitly.

feeds-in.sh

While this script is included here as an example, it is a fully functional DEATH ST... script. It's a functional script, appropriate to put in a cronjob to preprocess sources of URLs for ArchiveBox. Or use it as the base of a script to meet your needs.

One important and super useful note for someone who already has a big list of URLs from some other program: All you have to do is put that text file, one URL per line, in RAWDIR (which you'll configure here in a second) and that list will be pulled seamlessly into the workflow.

If you are using feeds-in.sh with ArchiveBox, you will need to edit these lines as appropriate for you:

APPDIR="/home/www-data/apps/ArchiveBox-Docker"
RAWDIR="$APPDIR/rawdata"
DATADIR="$APPDIR/data"
source "$APPDIR/muna.sh"

APPDIR should point to your ArchiveBox installation. RAWDIR is a work directory where you can also put any text file with a plain list of URLs. DATADIR should be the data directory of your ArchiveBox installation.

There are several example feeds (starting around line 50). Each strips that particular RSS feed (or XML sitemap) down to a series of URLs, one per line, written in a text file.

The sed and awk strings are left here as examples for these particular kinds of feeds. Feel free to use them as a starting point, but I won't guarantee they work for your feeds; they just work for these feeds.
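To give a feel for what "stripping a feed down to URLs" involves, here is a minimal sketch for a plain RSS feed. It is not one of the strings in feeds-in.sh: the function name `extract_links` is hypothetical, and it uses `tr`, which is not in the prerequisites list but ships with essentially every system that has sed.

```bash
# Hypothetical helper, not the sed/awk used in feeds-in.sh.
# Reads RSS on stdin and prints the text of every <link>...</link>
# element, one per line.
extract_links() {
    # Break the input at every "<" so each tag starts its own line,
    # then keep only lines that begin with "link>" and strip that prefix.
    tr '<' '\n' | sed -n 's/^link>//p'
}

# Example use (the feed URL is a placeholder):
#   curl -s "https://example.com/feed.xml" | extract_links >> "$OUTFILE"
```

Feeds that wrap links in CDATA, or Atom feeds that put the URL in an href attribute, will need a different pattern, which is exactly why each feed tends to get its own string.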

The console output here is a progress bar unless there are errors. The text file is time-date stamped to avoid collisions and overwrites.
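A time-date stamped filename of the kind described above can be built with `date`. This is only a sketch; the function name `stamped_name` and the exact stamp format are assumptions, not necessarily what feeds-in.sh uses.

```bash
# Hypothetical helper: build "<prefix>_YYYYMMDD_HHMMSS.txt" so that runs
# started in different seconds write to different files.
stamped_name() {
    printf '%s_%s.txt\n' "$1" "$(date +%Y%m%d_%H%M%S)"
}

OUTFILE="$(stamped_name urls)"
echo "Writing cleaned URLs to $OUTFILE"
```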

Then feeds-in.sh calls ArchiveBox to import that list of URLs. Uncomment the appropriate line in this section for your style of installation. Note that the docker-compose and standalone docker commands are quite different; don't confuse them! (I won't tell you how I know... sigh.)

###############################START PARTS TO EDIT########################

# Uncomment the next line for non-docker installations
#./archive /"$OUTFILESHORT"    

# Uncomment the next line for docker-compose installations   
docker-compose exec archivebox /bin/archive /"$OUTFILESHORT"

# Uncomment the next line for docker *NOT DOCKER COMPOSE* installations
#cat "$OUTFILE" | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
    

5. Usage

muna

If there's a redirect, whether from a shortener or, say, a redirect to HTTPS, muna will follow it and set the variable "$url" to the final URL (or print it to STDOUT when run standalone). If there is any other error (including the page being gone or the server having disappeared), it checks whether the page is saved at the Internet Archive and returns the latest capture instead. If it cannot find a copy anywhere, it sets "$url" to an empty string and returns nothing, exiting with exit code 99.
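The flow described above can be sketched roughly as follows. This is not the code in muna.sh: the function names are hypothetical, and it assumes the lookup goes through the Wayback Machine availability endpoint at archive.org/wayback/available, which may differ from what muna actually does.

```bash
# Hypothetical sketch of the resolve-or-fall-back flow; not muna.sh itself.

# Follow redirects and print "<http_code> <final_url>".
probe_url() {
    curl -s -L -o /dev/null --max-time 20 \
        -w '%{http_code} %{url_effective}' "$1"
}

# Ask the Wayback Machine availability API about a URL (prints raw JSON).
wayback_lookup() {
    curl -s --max-time 20 -G --data-urlencode "url=$1" \
        "https://archive.org/wayback/available"
}

# Pull the first web.archive.org capture URL out of the JSON on stdin.
wayback_capture_url() {
    sed -n 's#.*"\(https\{0,1\}://web\.archive\.org/web/[^"]*\)".*#\1#p' | head -n 1
}

# Print the best URL for $1, or print nothing and return 99.
resolve_url() {
    local reply code final
    reply=$(probe_url "$1")
    code=${reply%% *}     # text before the first space
    final=${reply#* }     # text after the first space
    case "$code" in
        2??) printf '%s\n' "$final"; return 0 ;;
    esac
    # Anything other than a 2xx answer: look for an archived copy instead.
    final=$(wayback_lookup "$1" | wayback_capture_url)
    if [ -n "$final" ]; then
        printf '%s\n' "$final"
        return 0
    fi
    return 99
}
```

Keeping the two network calls in their own small functions makes the decision logic easy to exercise without touching the network.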

Standalone

muna.sh [-q] URL

  • -q : If run standalone, it will return nothing to STDOUT except for the unredirected URL. Some error messages may print to STDERR.
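For a plain list of URLs, the standalone form can be wrapped in a loop. A minimal sketch, assuming muna.sh is on your $PATH and that it prints nothing and exits non-zero when no copy is found, as described above; `clean_list` and the `MUNA` override are hypothetical names.

```bash
# Hypothetical wrapper around the standalone script.
# MUNA can be overridden to point at wherever muna.sh lives.
MUNA="${MUNA:-muna.sh}"

# clean_list INFILE OUTFILE
# Reads one URL per line from INFILE and appends each cleaned URL to OUTFILE.
clean_list() {
    local infile=$1 outfile=$2 line cleaned
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue          # skip blank lines
        # </dev/null keeps the called script from eating the rest of INFILE.
        if cleaned=$("$MUNA" -q "$line" </dev/null) && [ -n "$cleaned" ]; then
            printf '%s\n' "$cleaned" >> "$outfile"
        else
            printf 'no live or archived copy: %s\n' "$line" >&2
        fi
    done < "$infile"
}

# Example use:
#   clean_list urls.txt cleaned.txt
```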

As a function

Put this line at the top of your script.

source path/to/muna.sh

In your script, the variable $url must be set before calling the function unredirector. Afterward, if a successful match was made, $url will hold the resolved URL. If no match was made, $url will be an empty string, as in this example from feeds-in.sh:

url=$(printf "%s" "$line")
unredirector
if [ -n "$url" ]; then  #yup, that url exists
    echo "$url" >> "$OUTFILE"
fi

feeds-in.sh

bash ./feeds-in.sh

Seriously, that's it. If you edited things in the script to meet your system, then you should be done.

6. TODO

Roadmap:
