
webis-web-archiver's Introduction

Note: development continues as scriptor.

webis-web-archiver

Source code and scripts for the Webis Web Archiver.

If you use the archiver, please cite the paper that describes it in detail.

Quickstart

You need to have Docker installed.

Then, on a Unix machine:

  • Run src-bash/archive.sh to archive web pages. It will display usage hints.
  • Run src-bash/reproduce.sh to reproduce pages from an archive. It will display usage hints.

The scripts will automatically download and run the image (2GB+ due to all the fonts).

For other OSes, have a look at the shell scripts and adjust the call to docker run accordingly.

Custom user simulation scripts

  • Write a class that extends InteractionScript.

  • You can use the ScrollDownScript as an example, or extend it.

  • The utility class Windows offers static helper methods for frequently used interactions.

  • Compile your script with the binaries in the class path and create a JAR from it.

  • Place the JAR into a directory named "scriptname-1.0.0", where you replace "scriptname" with the name of your script.

  • Create a file "script.conf" with the following content and put it into the same directory:

    script = packages.of.your.ScriptClass
    environment.name = de.webis.java
    environment.version = 1.0.0
    

    where you replace "packages.of.your.ScriptClass" with the fully qualified name of your script class. For the example ScrollDownScript, that would be:

    script = de.webis.webarchive.environment.scripts.ScrollDownScript
    
  • The script src-bash/compile-scroll-down-script.sh illustrates the complete compilation process for the ScrollDownScript. Adapt it for your own script.

  • When running archive.sh or reproduce.sh, specify the directory that contains the new directory with "--scriptsdirectory" and give the script name (as in the new directory) with "--script".
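The directory layout described in the steps above can be sketched as a small shell helper. This is only an illustration, not part of the archiver: the function name make_script_dir, the script name "myscript", the class "org.example.MyScript", and the JAR file name are hypothetical placeholders. Only the directory naming scheme and the three script.conf keys come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper, not shipped with the archiver):
# creates <scripts_dir>/<script_name>-1.0.0/ and writes script.conf into it.
# Arguments: scripts directory, script name, fully qualified script class,
# and optionally the path to the compiled JAR.
make_script_dir() {
  local scripts_dir="$1" script_name="$2" script_class="$3" jar_file="${4:-}"
  local script_dir="${scripts_dir}/${script_name}-1.0.0"

  mkdir -p "$script_dir"

  # Copy the JAR next to script.conf if one was given.
  if [ -n "$jar_file" ]; then
    cp "$jar_file" "$script_dir/"
  fi

  # The three keys listed in the steps above.
  printf '%s\n' \
    "script = ${script_class}" \
    "environment.name = de.webis.java" \
    "environment.version = 1.0.0" \
    > "${script_dir}/script.conf"
}
```

With these placeholder names, calling make_script_dir ./scripts myscript org.example.MyScript myscript.jar would produce ./scripts/myscript-1.0.0/, and you would then pass "--scriptsdirectory ./scripts --script myscript" to archive.sh or reproduce.sh in addition to the arguments listed in their usage hints.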

webis-web-archiver's People

Contributors

johanneskiesel, theelstner, mam10eks

webis-web-archiver's Issues

SELinux interference with starting docker container

Not sure how to address this generically, but thought it is good to report this.

I get the following errors when running the docker container using src-bash/archive.sh with an output directory /my/path/to/output:

SEVERE: Fail during environment setup
java.nio.file.AccessDeniedException: /output/logs

The problem can be traced back to the access rights on the directory that is mapped with the --volume option of the docker run command.

The error was resolved after assigning the SELinux type container_file_t to the output directory:

sudo semanage fcontext -a -t container_file_t '/my/path/to/output'
sudo restorecon -v '/my/path/to/output'

Not sure if this could be part of the default script settings, but we should document the access rights issue for the directory that is mapped into the Docker container.
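One possible way to handle this in a script, sketched here as an assumption and not as what archive.sh does: Docker accepts a ":Z" suffix on a volume mapping, which asks it to relabel the host directory for private use by the container. The helper below (hypothetical name volume_arg) adds that suffix only when getenforce reports that SELinux is enforcing. The container path /output is inferred from the error message above.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print a volume mapping for docker run,
# with the SELinux relabel suffix ":Z" when SELinux is enforcing.
volume_arg() {
  local host_dir="$1" container_dir="$2"
  if command -v getenforce >/dev/null 2>&1 \
      && [ "$(getenforce)" = "Enforcing" ]; then
    # ":Z" makes Docker apply a private container label to host_dir.
    printf '%s:%s:Z\n' "$host_dir" "$container_dir"
  else
    printf '%s:%s\n' "$host_dir" "$container_dir"
  fi
}
```

It could be used as docker run --volume "$(volume_arg /my/path/to/output /output)" followed by the remaining arguments. Note that ":Z" changes the label of the host directory itself, so it should only be applied to a dedicated output directory and never to system directories or a home directory.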
