
suckit's Introduction


SuckIT

SuckIT allows you to recursively visit and download a website's content to your disk.

SuckIT Logo

Features

  • Vacuums the entirety of a website recursively
  • Uses multithreading
  • Writes the website's content to your disk
  • Enables offline navigation
  • Offers random delays to avoid IP banning
  • Saves application state on CTRL-C for later pickup

Options

USAGE:
    suckit [FLAGS] [OPTIONS] <url>

FLAGS:
    -c, --continue-on-error                  Continue scraping instead of exiting when an error occurs
        --disable-certs-checks               Disable SSL certificate verification
        --dry-run                            Do everything without saving the files to the disk
    -h, --help                               Prints help information
    -V, --version                            Prints version information
    -v, --verbose                            Enable more information regarding the scraping process
        --visit-filter-is-download-filter    Use the download filter in/exclude regexes for visiting as well

OPTIONS:
    -a, --auth <auth>...
            HTTP basic authentication credentials space-separated as "username password host". Can be repeated for
            multiple credentials as "u1 p1 h1 u2 p2 h2"
        --cookie <cookie>
            Cookie to send with each request, format: key1=value1;key2=value2 [default: ]

        --delay <delay>
            Add a delay in seconds between downloads to reduce the likelihood of getting banned [default: 0]

    -d, --depth <depth>
            Maximum recursion depth to reach when visiting. Default is -1 (infinity) [default: -1]

    -e, --exclude-download <exclude-download>
            Regex filter to exclude saving pages that match this expression [default: $^]

        --exclude-visit <exclude-visit>
            Regex filter to exclude visiting pages that match this expression [default: $^]

        --ext-depth <ext-depth>
            Maximum recursion depth to reach when visiting external domains. Default is 0. -1 means infinity [default:
            0]
    -i, --include-download <include-download>
            Regex filter to limit to only saving pages that match this expression [default: .*]

        --include-visit <include-visit>
            Regex filter to limit to only visiting pages that match this expression [default: .*]

    -j, --jobs <jobs>                            Maximum number of threads to use concurrently [default: 1]
    -o, --output <output>                        Output directory
        --random-range <random-range>
            Generate an extra random delay between downloads, from 0 to this number. This is added to the base delay
            seconds [default: 0]
    -t, --tries <tries>                          Maximum amount of retries on download failure [default: 20]
    -u, --user-agent <user-agent>                User agent to be used for sending requests [default: suckit]

ARGS:
    <url>    Entry point of the scraping

Example

A common use case could be the following:

suckit http://books.toscrape.com -j 8 -o /path/to/downloaded/pages/
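
A larger job might combine the rate-limiting and filtering options documented above; the output path and include regex below are only illustrative:

suckit http://books.toscrape.com -j 4 -o /tmp/books --delay 1 --random-range 2 -i '.*\.html' -t 5

This asks for four worker threads, a one-to-three second pause between downloads, up to five retries per file, and saves only pages whose URLs match the include regex.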

asciicast

Installation

As of right now, SuckIT does not work on Windows.

To install it, you need to have Rust installed.

  • Check out this link for instructions on how to install Rust.

  • If you just want to install the suckit executable, you can simply run cargo install --git https://github.com/skallwar/suckit

  • Now, run it from anywhere with the suckit command.

Arch Linux

suckit can be installed from available AUR packages using an AUR helper. For example,

yay -S suckit

Want to contribute? Feel free to open an issue or submit a PR!

License

SuckIT is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.

suckit's People

Contributors

aox0, atul9, bastiengermond, cohenarthur, creativcoder, dependabot[bot], jaslogic, lhvy, like0x, marchellodev, mr-bo-jangles, orhun, pinkforest, pjsier, raphcode, skallwar, spyrosroum, untitaker, waxymocha


suckit's Issues

Ignore pages that have a 404 status code

Currently suckit will save pages even if they are indicated as not found by the webserver. I think this is erroneous behaviour.

E.g. this page on my site that 404s was saved to disk.

Screenshots from Chrome dev tools and the file explorer show the 404 page being saved to disk.

Save files with extension

Awesome project :)

Haven't seen anyone else mention it, so I might be using it wrong, but I am getting all downloaded files without an extension. Here is a screenshot from downloading a small site.

My assumption would be that it would save files as about.html, favicon.png, global.css, etc. It looks like they're mostly there, but with an underscore instead of a period.

On macOS (10.15.4), if that helps; installed from cargo --git: Installed package suckit v0.1.0 (https://github.com/skallwar/suckit#16a5ed9c) (executable suckit)


Support CSS scraping

Usually, a website will use url(...) in the CSS to point to another resource such as a background image, logo, sprite, fonts, or even another CSS file. Would love for these resources to get downloaded too.
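
One possible starting point, sketched here with the regex crate and a made-up helper name rather than anything suckit currently ships, would be to pull the url(...) targets out of each downloaded stylesheet:

    use regex::Regex;

    /// Hypothetical helper: collect the targets of url(...) references in a CSS body.
    fn css_urls(css: &str) -> Vec<String> {
        // Matches url("..."), url('...') and bare url(...), capturing the inner reference.
        let re = Regex::new(r#"url\(\s*['"]?([^'")\s]+)['"]?\s*\)"#).unwrap();
        re.captures_iter(css).map(|cap| cap[1].to_string()).collect()
    }

    fn main() {
        let css = ".logo { background: url('/img/logo.png'); }";
        assert_eq!(css_urls(css), vec!["/img/logo.png".to_string()]);
    }

Each extracted reference could then be queued like any other discovered link.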

memory allocation of 2147483648 bytes failed (that's 2GB!)

I tried running it on a directory posted on reddit yesterday:

$ suckit https://icsarchive.org/icsarchive-org/paperback/cookbooks/

memory allocation of 2147483648 bytes failed
zsh: abort      suckit https://icsarchive.org/icsarchive-org/paperback/cookbooks/

I don't understand how it would ever come to try to allocate 2GB of memory. None of the files is that big (and even if they were, they shouldn't be buffered to memory).

I left it running for some time, so it didn't happen until quite late in the game when I was AFK. I didn't run with RUST_BACKTRACE, so I don't have more details at this time, but I can try fetching again to see if I can replicate it.

Remove fragment hash from scraped URLs

Scraping a web page with multiple links to specific sections on another page (e.g. "/#section") results in duplicate downloads of the same page because the URL fragment is different. According to the fragment method docs, this portion of the URL isn't typically sent to the server, and in my understanding it would only make a difference in client-side updates that wouldn't be tracked here anyway.

Should the URL fragment be removed from the URL to avoid duplicates? If so I can put in a PR for that
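
For reference, the url crate already makes this a one-liner; a minimal sketch of the idea (not suckit's actual code):

    use url::Url;

    fn main() {
        let mut url = Url::parse("https://example.com/page#section").unwrap();
        // Dropping the fragment makes "/page#a" and "/page#b" identical,
        // so the page would only be downloaded once.
        url.set_fragment(None);
        assert_eq!(url.as_str(), "https://example.com/page");
    }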

What type of regex works?

What are the wildcards for the include/exclude regex? I tried %, *, and . and I don't think any of them worked. The script eventually runs into some errors and saves nothing. I want to find a specific .aspx filename and only save those, because this website probably has at least 10 billion unique pages.
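
The include/exclude options appear to take ordinary Rust regex patterns, as the defaults shown in the help suggest (.* and $^ are regex, not shell globs), so % and a bare * won't work as wildcards. Something along these lines, with a placeholder URL, should restrict saving to .aspx pages:

suckit https://example.com -i '.*\.aspx'

Adding --visit-filter-is-download-filter would apply the same regex to which pages get visited, though that can also stop the crawler from reaching matching pages through non-matching links.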

Dry run

It would be nice to run suckit without saving anything to disk
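
The --dry-run flag listed in the options above covers this; something like the following crawls the site and reports what it finds without writing anything to disk:

suckit http://books.toscrape.com --dry-run -v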

Panic with EmptyHost

I'm getting the following panics:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: EmptyHost', src/scraper.rs:97:37
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', src/scraper.rs:150:9

Could it be a link without a host?

Disk: Stop flat archive creation

hi :)

Loving this program so far, but I just had one feature idea/request: is it possible to have something like a -mirror flag or a sitemap output flag, so the downloader can output with the same folder structure as the site, based on the URLs or folders it's downloading? The reason I ask is that when trying this on a few public open directories, it makes several folders with the site name and then with the file name, and creates folders for each file/page.

here is an example

https://releases.ubuntu.com/19.10/
https://releases.ubuntu.com/20.04/

So we could set up suckit to output to releases.ubuntu.com/19.10/, releases.ubuntu.com/20.04/, etc.

Thanks

Handle subdomains

Handle subdomains such as https://static.lwm.net for https://lwm.net
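
A rough sketch of the kind of check this implies, with a hypothetical helper rather than suckit's real logic (a robust version would probably consult the public suffix list):

    /// Hypothetical check: treat static.lwm.net as belonging to lwm.net.
    fn same_site(host: &str, base: &str) -> bool {
        host == base || host.ends_with(&format!(".{}", base))
    }

    fn main() {
        assert!(same_site("static.lwm.net", "lwm.net"));
        assert!(!same_site("notlwm.net", "lwm.net"));
    }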

Handle incomplete content

Fix crashes on hyper::Error(IncompleteMessage), for example by retrying.

Here's an example often producing it:

./suckit http://books.toscrape.com 
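
A retry wrapper around the download call is one way to absorb transient IncompleteMessage failures; the sketch below is generic, with fetch standing in for whatever HTTP call suckit actually makes:

    use std::{thread, time::Duration};

    // Hypothetical stand-in for the real download call.
    fn fetch(url: &str) -> Result<Vec<u8>, String> {
        Err(format!("IncompleteMessage while fetching {}", url))
    }

    fn fetch_with_retries(url: &str, tries: u64) -> Result<Vec<u8>, String> {
        let mut last_err = String::new();
        for attempt in 1..=tries {
            match fetch(url) {
                Ok(body) => return Ok(body),
                Err(e) => {
                    last_err = e;
                    // Back off a little before retrying the same URL.
                    thread::sleep(Duration::from_millis(200 * attempt));
                }
            }
        }
        Err(last_err)
    }

    fn main() {
        let _ = fetch_with_retries("http://books.toscrape.com", 3);
    }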

os error 22 thread '<unnamed>' panicked when I attempt a crawl

Seemed to work for a while, but then I was suddenly hit with this error. The -c switch doesn't seem to help:
[ERROR] Couldn't create /run/media/path/to/website/"f.txt": Invalid argument (os error 22)
thread '' panicked at 'Couldn't create /run/media/path/to/website/"f.txt": Invalid argument (os error 22)', src/logger.rs:42:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at 'called Result::unwrap() on an Err value: Any', src/scraper.rs:208:10

Sites with foreign characters are shown as gibberish.

Some sites are shown with gibberish characters instead of the correct ones; however, this doesn't seem to happen consistently.


NRK.no shows the correct characters (æøå) but gamlegjerpen.no does not.
Seems like gamlegjerpen.no does not contain: <meta charset="utf-8">
Not sure if that's why it gets displayed as gibberish on the downloaded copy but it's something.

Depth counter

We need a CLI option to stop the scraping at a given depth level.
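
The -d/--depth option shown in the help above addresses this; for example, stopping two levels below the entry point:

suckit http://books.toscrape.com -d 2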

Fix duplicate downloadings

suckit downloads some pages multiple times, for example:

suckit https://escapefromtarkov.gamepedia.com/Category:Link_templates | sort | uniq -c | sort

returns the following:

      ...
      2 https://help.gamepedia.com/How_to_contact_Gamepedia has been downloaded
      2 https://support.gamepedia.com/ has been downloaded
      2 https://twitter.com/CurseGamepedia has been downloaded
      2 https://www.facebook.com/CurseGamepedia has been downloaded
      2 https://www.fandom.com/ has been downloaded
      3 https://creativecommons.org/licenses/by-nc-sa/3.0/ has been downloaded
      3 https://escapefromtarkov.gamepedia.com/index.php?title=Category:Link_templates&action=edit has been downloaded
      3 https://www.gamepedia.com/ has been downloaded
      5 https://escapefromtarkov.gamepedia.com/Category:Link_templates has been downloaded

Improve speed regression testing

  • Add commit hashes to .csv
  • Add computer identifier to rerun all tests on different computers
  • Show improvements as a percentage (%) instead of unitless numbers
  • Self hosted server to avoid network delays (#108)
