WebMirror

A simple website archiver written in Java.

Usage

Usage:
        java -jar WebMirror.jar <destination> [OPTIONS]
Flags:
                         (default)
        --strict          (true)      include subdomains during recursion check?
        --recurse         (true)      enable recursion?
Params:
                         [one of]
        --url           [required]    base URL to archive (e.g. index.html)
        --file          [required]    file of URLs to batch archive (one link per line)

Downloading a site

This will take a LONG time, and create a LOT of files. Make sure the destination folder is empty! (it will create a new directory if necessary)

Also, please note that this will only work for relatively simple sites. Sites that make heavy use of JavaScript (e.g. Instagram) won't archive properly!

java -jar WebMirror.jar "Destination folder" --url="Link to archive"

This will spit out a TON of messages. Don't worry! Many sites have invalid links. Also, it will show the current status with each new link (sorry, idk how to use fancy floating lines in the console).

Resuming a stopped archive

When you restart, it re-scrapes the entire site but reuses the files already downloaded, so it is very fast (minutes) compared to the first run (hours/days).

On my M1 Max, I was able to archive a 38GB site in a few days; re-scraping all 374,564 links, however, only takes about 20 minutes.
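
To illustrate why a second run is so much faster, here is a minimal sketch of a "reuse the file if it is already on disk" step. It is not WebMirror's actual code: `ResumeSketch`, `loadOrFetch`, and the `fetcher` parameter are hypothetical names, and the real download logic may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical helper, not part of WebMirror: reuse a file that is already on disk.
final class ResumeSketch {
    static byte[] loadOrFetch(Path target, Supplier<byte[]> fetcher) {
        try {
            if (Files.isRegularFile(target)) {
                // Already archived on a previous run: no network request needed.
                return Files.readAllBytes(target);
            }
            // First run: this is the slow part (the fetcher stands in for the download).
            byte[] body = fetcher.get();
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(target, body);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `loadOrFetch` twice for the same target invokes the fetcher only once; the second call just reads the file back so its links can be scraped again.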

Browsing the mirrored site

This archiver intends to create an exact copy of the site, as a browser sees it.

Quickstart

  • run your webserver of choice, e.g. Python:
python3 -m http.server 8000 # (hint: try `python` if `python3` fails)
  • then, navigate to:

http://localhost:8000

Additional info

If you look in the destination folder, you will notice a bunch of domains outside of your target site. These are external assets that it tries to download. Currently there is no way to include these while browsing the local site (your browser still downloads these assets remotely); however, I intend to create a Python webserver that serves these kinds of archives properly (soon™️). Meanwhile, you can browse the mirrored site with the default http.server module.
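
As a rough picture of how one folder per domain can come about, the sketch below maps a URL to a path under the destination folder. The layout is an assumption, not taken from WebMirror's source: `LocalPathSketch` and `localPath` are hypothetical names, and storing directory URLs as `index.html` is a guess.

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical mapping, not taken from WebMirror's source.
final class LocalPathSketch {
    static Path localPath(Path destination, String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.isEmpty() || path.endsWith("/")) {
            // Assumption: directory-style URLs are stored as index.html.
            path = path + "index.html";
        }
        // Strip leading slashes so resolve() treats the path as relative.
        while (path.startsWith("/")) {
            path = path.substring(1);
        }
        // One top-level folder per host, then the URL path below it.
        return destination.resolve(uri.getHost()).resolve(path);
    }
}
```

Under this assumed layout, an external asset ends up in a sibling folder named after its own host, next to the folder of the target site.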

Note that it guesses file types based on the extension in the URL, so for example Wikipedia File:name.jpg links will get resolved as images and downloaded, even though they are webpages. The only solution I can think of is guessing the file type from a stream, but that would take up bandwidth and has diminishing returns, so I've left the behaviour as is.
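
The extension guess can be sketched like this. It is only an illustration of the idea, with hypothetical names (`ExtensionGuess`, `extensionOf`); the project's own rules may differ.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch of guessing a file type from the URL alone.
final class ExtensionGuess {
    static String extensionOf(String url) {
        // getPath() excludes the query string and fragment.
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        // Only the last path segment counts, so dots in folder names are ignored.
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        return dot < 0 ? "" : lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

For `https://en.wikipedia.org/wiki/File:name.jpg` this yields `jpg`, which shows the misclassification described above: nothing in the URL says the response is actually an HTML page.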

Building

git clone https://github.com/vhagedorn/WebMirror
cd WebMirror
./gradlew build
java -jar target/WebMirror.jar "dst" --url="url"

Words

Backstory

The concept here is very simple: recursively download an entire website by scraping the links from each page. To my utter disbelief, nothing like this exists (AFAIK). Wayback Machine backups don't get downloaded properly when files are SHIFT-JIS ... and people recommending wget --mirror are actual psychopaths. Thus, I made this project.

The Process

Initially, I just hacked together some Selenium code in an afternoon, which basically just cached everything Chrome received. However, this approach had many synchronization challenges to overcome and eventually failed with inexplicable errors. Plus, it was really slow. The way I'm doing it now is much faster, and involves downloading each file to the disk then using Jsoup to scrape every document for links using a list of every URL attribute.
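
The overall loop can be sketched as follows. This is a simplified stand-in, not the project's code: it uses a queue instead of actual recursion, and the `linksOf` function is a placeholder for "download the page and scrape its links" (where the project uses Jsoup). `CrawlSketch` and `crawl` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the crawl loop; linksOf stands in for download + link scraping.
final class CrawlSketch {
    static Set<String> crawl(String start, Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>(); // every URL discovered so far, in order
        Deque<String> queue = new ArrayDeque<>(); // pages still waiting to be scraped
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            for (String link : linksOf.apply(page)) {
                // add() returns false for URLs seen before, so cycles end here.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return seen;
    }
}
```

The `seen` set is what keeps pages that link back to each other from being visited forever.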

The sites I tend to archive sometimes have a lot of broken links, so I tried my best to discard 404s and generally just swallow any errors while printing a cool backtrace to see exactly where the invalid link came from. Note that this spits out a LOT of log messages, and they're all saved to the ./logs folder, so it's possible to comb for errors later.
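
One possible way to produce such a backtrace is to remember, for each link, the page it was first found on, then walk that chain when a download fails. The sketch below is an assumption about how this could work, not the project's implementation; `LinkTrace`, `record`, and `trace` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: remember where each link was first seen, then walk back.
final class LinkTrace {
    private final Map<String, String> foundOn = new HashMap<>();

    // Remember the first page on which a link appeared.
    void record(String link, String page) {
        foundOn.putIfAbsent(link, page);
    }

    // Chain from the failing link back towards the start page.
    List<String> trace(String link) {
        Set<String> chain = new LinkedHashSet<>();
        String current = link;
        // Stop at a URL with no recorded parent, or when the chain would loop.
        while (current != null && chain.add(current)) {
            current = foundOn.get(current);
        }
        return new ArrayList<>(chain);
    }
}
```

On a 404, printing `trace(brokenUrl)` lists the broken link first, followed by each page that led to it.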

Maybe in the future I'll revisit Chrome with fake scrolling and JS support and better caching, but I don't think it'll happen anytime soon as I don't foresee myself archiving any sites like that.
