pywb's Introduction

PyWb 0.6.6

pywb is a Python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.

pywb allows high-quality replay (browsing) of archived web data stored in the standardized ARC and WARC formats.

pywb can be used as a traditional web application or an HTTP or HTTPS proxy server.

pywb is also fully compliant with the Memento protocol (RFC-7089).

Public Projects Using Pywb

Several organizations run public services which use pywb that you may explore directly:

Usage Examples

This README contains a basic overview of using pywb. After reading this intro, consider also taking a look at these separate projects:

  • pywb-webrecorder demonstrates a way to use pywb and warcprox to record web content while browsing.
  • pywb-samples provides additional archive samples with difficult-to-replay content.
  • pywb-proxy-demo showcases the revamped HTTP/S proxy replay system (available from pywb 0.6.0)

pywb Tools Overview

In addition to the standard Wayback Machine (explained further below), the pywb tool suite includes a number of useful command-line and web server tools. The tools should be available to run after running python setup.py install:

  • live-rewrite-server -- a demo live rewriting web server which accepts requests using the Wayback Machine url format at the /rewrite/ path, e.g. /rewrite/http://example.com/, and applies the same url rewriting rules as are used for archived content. This is useful for checking how live content will appear when archived before actually creating any archive files, or for recording data. The webrecorder.io service is built using this tool.
  • cdx-indexer -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and non-SURT based cdx files and optional sorting. See cdx-indexer -h for all options.
  • cdx-server -- a CDX API-only server which returns responses about CDX captures in bulk. It includes most of the features of the original cdx server implementation; updated documentation is coming soon.
  • proxy-cert-auth -- a utility to support proxy mode. It can be used to create a CA root certificate, or a per-host certificate with an existing root cert.
  • wayback -- The full Wayback Machine application, further explained below.

Latest Changes

See CHANGES.rst for the up-to-date change list.

Quick Install & Run Samples

  1. git clone https://github.com/ikreymer/pywb.git, then cd pywb
  2. python setup.py install
  3. Run wayback to start the server with the bundled samples
  4. Browse to http://localhost:8080/pywb/*/example.com to see captures of http://example.com

(The installation page contains additional installation and testing examples.)

Running in Proxy Mode

pywb can also be used as an HTTP and/or HTTPS proxy server. See pywb Proxy Mode Usage for more details on configuring proxy mode. The pywb-proxy-demo project also contains a working configuration of proxy mode deployment.
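
From the client side, using proxy mode means sending ordinary requests through pywb instead of rewriting urls. The following is a minimal sketch using only the Python standard library. The function name make_proxy_opener is hypothetical, and the proxy address http://localhost:8080 is an assumption carried over from the quick-start port; use whatever host and port your proxy mode configuration actually listens on.

```python
import urllib.request


def make_proxy_opener(proxy_url="http://localhost:8080"):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url.

    proxy_url is an assumed address for a pywb instance running in proxy mode.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # Passing our own ProxyHandler replaces the default, environment-based one.
    return urllib.request.build_opener(handler)
```

With pywb running in proxy mode, make_proxy_opener().open("http://example.com/") would request the original url directly and receive the archived capture. HTTPS urls would additionally require the client to trust the root certificate used by the proxy.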

Configure with Archived Content

If you have existing WARC or ARC files (.warc, .warc.gz, .arc, .arc.gz), you should be able to view their contents in pywb after creating sorted .cdx index files of their contents. This process can be done by running the cdx-indexer script and only needs to be done once.

(See the note below if you already have .cdx files for your archives)

Given an archive of WARCs at myarchive/warcs:

  1. Create a directory for the indexes, e.g. myarchive/cdx
  2. Run cdx-indexer --sort myarchive/cdx myarchive/warcs to generate .cdx files for each WARC/ARC file in myarchive/warcs
  3. Edit config.yaml to contain the following. You may replace pywb with a name of your choice; it will be the path to your collection. (Multiple collections can be added for different sets of .cdx files as well.)

         collections:
            pywb: ./myarchive/cdx/

         archive_paths: ./myarchive/warcs/

  4. Run wayback to start a session. If your archives contain http://my-archive-page.example.com, all captures should be accessible by browsing to http://localhost:8080/pywb/*/my-archive-page.example.com

    (You can also use run-uwsgi.sh or run-gunicorn.sh to launch using those WSGI containers)
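
The indexing and configuration steps above can also be scripted. The sketch below is one way to do that, not part of pywb itself: the helper names index_command, config_text and index_archive are hypothetical, and it assumes pywb is installed so that cdx-indexer is on the PATH. It only returns the config fragment instead of writing config.yaml, so that existing settings are not overwritten.

```python
import subprocess
from pathlib import Path


def index_command(cdx_dir, warc_dir):
    """Build the cdx-indexer call from step 2 as an argument list."""
    return ["cdx-indexer", "--sort", str(cdx_dir), str(warc_dir)]


def config_text(collection, cdx_dir, warc_dir):
    """Render the config.yaml fragment from step 3 for a single collection."""
    return (
        "collections:\n"
        f"   {collection}: ./{cdx_dir}/\n"
        "\n"
        f"archive_paths: ./{warc_dir}/\n"
    )


def index_archive(root="myarchive", collection="pywb"):
    """Run steps 1 and 2, then return the fragment to paste into config.yaml."""
    cdx_dir, warc_dir = f"{root}/cdx", f"{root}/warcs"
    Path(cdx_dir).mkdir(parents=True, exist_ok=True)  # step 1
    # Step 2: fails with CalledProcessError if the indexer exits non-zero.
    subprocess.run(index_command(cdx_dir, warc_dir), check=True)
    return config_text(collection, cdx_dir, warc_dir)  # step 3
```

Calling print(index_archive()) from the directory that contains myarchive would index the archive and print the fragment for the pywb collection.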

See INSTALL.rst for additional installation info.

Use existing .cdx index files

If you already have .cdx files for your archive, you can skip the first two steps above.

pywb recommends using SURT (Sort-friendly URI Reordering Transform) sorted urls, and the cdx-indexer automatically generates indexes in this format.

However, pywb is compatible with regular url keyed indexes also. If you would like to use non-SURT ordered .cdx files, simply add this field to the config:

surt_ordered: false
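
To illustrate what a SURT-ordered key looks like, here is a deliberately simplified sketch. It is not pywb's canonicalizer: the function simple_surt is hypothetical, it assumes an absolute http or https url, and it only reverses the host labels and appends the path and query. Real SURT canonicalization applies further normalization rules that are omitted here.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key for an absolute url.

    Host labels are reversed and comma-joined, then ')' and the path follow,
    so that captures from related hosts sort next to each other.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


urls = ["http://www.example.com/", "http://archive.org/", "http://example.com/about"]
keys = sorted(simple_surt(u) for u in urls)
```

After sorting, the keys for example.com and www.example.com are adjacent, which is the property a sorted SURT index relies on for efficient lookups across a domain.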

UI Customization

pywb makes it easy to customize most aspects of the UI around archived content, including a custom banner insert, query calendar, search and home pages, via HTML Jinja2 templates. See the config file for commented examples or read more about UI Customization.

About Wayback Machine

pywb is compatible with the standard Wayback Machine url format:

http://<host>/<collection>/<timestamp>/<original url>

Some examples of this url from other wayback machines (not implemented via pywb):

http://web.archive.org/web/20140312103519/http://www.example.com

http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/

A listing of archived content, often in calendar form, is available when a * is used instead of timestamp.
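
The url format above can be taken apart or assembled with a few lines of Python. This is an illustrative sketch rather than pywb's own url parser: the names WaybackUrl, parse_wayback_url and build_wayback_url are hypothetical, and it assumes the timestamp is either * or 1 to 14 digits and that the collection is everything between the host and the timestamp.

```python
import re
from typing import NamedTuple


class WaybackUrl(NamedTuple):
    host: str
    collection: str
    timestamp: str  # digits, or "*" for the capture listing
    original_url: str


# http://<host>/<collection>/<timestamp>/<original url>
_WAYBACK_RE = re.compile(r"^https?://([^/]+)/(.+?)/(\d{1,14}|\*)/(.+)$")


def parse_wayback_url(url):
    """Split a Wayback-style url into its four components."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a Wayback-style url: {url!r}")
    return WaybackUrl(*match.groups())


def build_wayback_url(host, collection, timestamp, original_url):
    """Assemble a Wayback-style url; pass "*" as timestamp for the listing."""
    return f"http://{host}/{collection}/{timestamp}/{original_url}"
```

For example, parse_wayback_url("http://localhost:8080/pywb/*/example.com") yields the host localhost:8080, the collection pywb, the timestamp * and the original url example.com.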

The Wayback Machine often uses an html parser to rewrite relative and absolute links, as well as absolute links found in javascript, css and some xml.

pywb provides these features as a starting point.

Additional Documentation

  • For additional/up-to-date configuration details, consult the current config.yaml
  • The wiki will have additional technical documentation about various aspects of pywb

Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay!

Please take a look at the list of current issues and feel free to open new ones.

Contributors

ikreymer, kngenie, jcushman, rajbot, nlevitt
