
suckit's Introduction


SuckIT

SuckIT allows you to recursively visit and download a website's content to your disk.

SuckIT Logo

Features

  • Vacuums the entirety of a website recursively
  • Uses multithreading
  • Writes the website's content to your disk
  • Enables offline navigation
  • Offers random delays to avoid IP banning
  • Saves application state on CTRL-C for later pickup

Options

USAGE:
    suckit [FLAGS] [OPTIONS] <url>

FLAGS:
    -c, --continue-on-error                  Continue scraping instead of exiting when an error occurs
        --disable-certs-checks               Disable SSL certificate verification
        --dry-run                            Do everything without saving the files to the disk
    -h, --help                               Prints help information
    -V, --version                            Prints version information
    -v, --verbose                            Enable more information regarding the scraping process
        --visit-filter-is-download-filter    Use the download filter in/exclude regexes for visiting as well

OPTIONS:
    -a, --auth <auth>...
            HTTP basic authentication credentials space-separated as "username password host". Can be repeated for
            multiple credentials as "u1 p1 h1 u2 p2 h2"
        --cookie <cookie>
            Cookie to send with each request, format: key1=value1;key2=value2 [default: ]

        --delay <delay>
            Add a delay in seconds between downloads to reduce the likelihood of getting banned [default: 0]

    -d, --depth <depth>
            Maximum recursion depth to reach when visiting. Default is -1 (infinity) [default: -1]

    -e, --exclude-download <exclude-download>
            Regex filter to exclude saving pages that match this expression [default: $^]

        --exclude-visit <exclude-visit>
            Regex filter to exclude visiting pages that match this expression [default: $^]

        --ext-depth <ext-depth>
            Maximum recursion depth to reach when visiting external domains. Default is 0. -1 means infinity [default:
            0]
    -i, --include-download <include-download>
            Regex filter to limit to only saving pages that match this expression [default: .*]

        --include-visit <include-visit>
            Regex filter to limit to only visiting pages that match this expression [default: .*]

    -j, --jobs <jobs>                            Maximum number of threads to use concurrently [default: 1]
    -o, --output <output>                        Output directory
        --random-range <random-range>
            Generate an extra random delay between downloads, from 0 to this number. This is added to the base delay
            seconds [default: 0]
    -t, --tries <tries>                          Maximum amount of retries on download failure [default: 20]
    -u, --user-agent <user-agent>                User agent to be used for sending requests [default: suckit]

ARGS:
    <url>    Entry point of the scraping

Example

A common use case could be the following:

suckit http://books.toscrape.com -j 8 -o /path/to/downloaded/pages/
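
A larger job might combine the rate-limiting and filtering options documented above; the output path and include regex below are only illustrative:

suckit http://books.toscrape.com -j 4 -o /tmp/books --delay 1 --random-range 2 -i '.*\.html' -t 5

This asks for four worker threads, a one-to-three second pause between downloads, up to five retries per file, and saves only pages whose URLs match the include regex.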

asciicast

Installation

As of right now, SuckIT does not work on Windows.

To install it, you need to have Rust installed.

  • Check out this link for instructions on how to install Rust.

  • If you just want to install the suckit executable, you can simply run cargo install --git https://github.com/skallwar/suckit

  • Now, run it from anywhere with the suckit command.

Arch Linux

suckit can be installed from available AUR packages using an AUR helper. For example,

yay -S suckit

Want to contribute? Feel free to open an issue or submit a PR!

License

SuckIT is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.

suckit's People

Contributors

aox0, atul9, bastiengermond, cohenarthur, creativcoder, dependabot[bot], jaslogic, lhvy, like0x, marchellodev, mr-bo-jangles, orhun, pinkforest, pjsier, raphcode, skallwar, spyrosroum, untitaker, waxymocha


suckit's Issues

Ignore pages that have a 404 status code

Currently suckit will save pages even if they are indicated as not found by the webserver. I think this is erroneous behaviour.

E.g. this page on my site that 404s was saved to disk.

Screenshots from Chrome dev tools and the file explorer show the 404 page being saved to disk.

Save files with extension

Awesome project :)

Haven't seen anyone else mention it, so I might be using it wrong, but I am getting all downloaded files without an extension. Here is a screenshot from downloading a small site.

My assumption would be that it would save files as about.html, favicon.png, global.css, etc. It looks like they're mostly there, but with an underscore instead of a period.

On macOS (10.15.4), if that helps; installed from cargo --git: Installed package suckit v0.1.0 (https://github.com/skallwar/suckit#16a5ed9c) (executable suckit)


Support CSS scraping

Usually, a website will use url(...) in the CSS to point to another resource such as a background image, logo, sprite, fonts, or even another CSS file. Would love for these resources to get downloaded too.
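
One possible starting point, sketched here with the regex crate and a made-up helper name rather than anything suckit currently ships, would be to pull the url(...) targets out of each downloaded stylesheet:

    use regex::Regex;

    /// Hypothetical helper: collect the targets of url(...) references in a CSS body.
    fn css_urls(css: &str) -> Vec<String> {
        // Matches url("..."), url('...') and bare url(...), capturing the inner reference.
        let re = Regex::new(r#"url\(\s*['"]?([^'")\s]+)['"]?\s*\)"#).unwrap();
        re.captures_iter(css).map(|cap| cap[1].to_string()).collect()
    }

    fn main() {
        let css = ".logo { background: url('/img/logo.png'); }";
        assert_eq!(css_urls(css), vec!["/img/logo.png".to_string()]);
    }

Each extracted reference could then be queued like any other discovered link.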

memory allocation of 2147483648 bytes failed (that's 2GB!)

I tried running it on a directory posted on reddit yesterday:

$ suckit https://icsarchive.org/icsarchive-org/paperback/cookbooks/

memory allocation of 2147483648 bytes failed
zsh: abort      suckit https://icsarchive.org/icsarchive-org/paperback/cookbooks/

I don't understand how it would ever come to try to allocate 2GB of memory. None of the files is that big (and even if they were, they shouldn't be buffered to memory).

I left it running for some time, so it didn't happen until quite late in the game when I was AFK. I didn't run with RUST_BACKTRACE, so I don't have more details at this time, but I can try fetching again to see if I can replicate it.

Remove fragment hash from scraped URLs

Scraping a web page with multiple links to specific sections on another page (e.g. "/#section") results in duplicate downloads of the same page because the URL fragment is different. According to the fragment method docs, this portion of the URL isn't typically sent to the server, and in my understanding it would only make a difference in client-side updates that wouldn't be tracked here anyway.

Should the URL fragment be removed from the URL to avoid duplicates? If so I can put in a PR for that
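
For reference, the url crate already makes this a one-liner; a minimal sketch of the idea (not suckit's actual code):

    use url::Url;

    fn main() {
        let mut url = Url::parse("https://example.com/page#section").unwrap();
        // Dropping the fragment makes "/page#a" and "/page#b" identical,
        // so the page would only be downloaded once.
        url.set_fragment(None);
        assert_eq!(url.as_str(), "https://example.com/page");
    }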

What type of regex works?

What are the wildcards for the include/exclude regex? I tried %, *, and . and I don't think any of them worked. The script eventually runs into some errors and saves nothing. I want to find a specific .aspx filename and only save those, because this website probably has at least 10 billion unique pages.
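
The include/exclude options appear to take ordinary Rust regex patterns, as the defaults shown in the help suggest (.* and $^ are regex, not shell globs), so % and a bare * won't work as wildcards. Something along these lines, with a placeholder URL, should restrict saving to .aspx pages:

suckit https://example.com -i '.*\.aspx'

Adding --visit-filter-is-download-filter would apply the same regex to which pages get visited, though that can also stop the crawler from reaching matching pages through non-matching links.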

Dry run

It would be nice to run suckit without saving anything to disk
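
The --dry-run flag listed in the options above covers this; something like the following crawls the site and reports what it finds without writing anything to disk:

suckit http://books.toscrape.com --dry-run -v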

Panic with EmptyHost

I'm getting the following panics:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: EmptyHost', src/scraper.rs:97:37
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any', src/scraper.rs:150:9

Could it be a link without a host?

Disk: Stop flat archive creation

hi :)

Loving this program so far, but I just had one feature idea/request: is it possible to have something like a -mirror flag or a sitemap output flag, so the downloader can output with the same folder structure as the site, based on the URLs or folders it's downloading? The reason I ask is that when trying this on a few public open directories, it makes several folders with the site name and then with the file name, and creates folders for each file/page.

here is an example

https://releases.ubuntu.com/19.10/
https://releases.ubuntu.com/20.04/

So we could set up suckit to output to releases.ubuntu.com/19.10/, releases.ubuntu.com/20.04/, etc.

Thanks

Handle subdomains

Handle subdomains such as https://static.lwm.net for https://lwm.net
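
A rough sketch of the kind of check this implies, with a hypothetical helper rather than suckit's real logic (a robust version would probably consult the public suffix list):

    /// Hypothetical check: treat static.lwm.net as belonging to lwm.net.
    fn same_site(host: &str, base: &str) -> bool {
        host == base || host.ends_with(&format!(".{}", base))
    }

    fn main() {
        assert!(same_site("static.lwm.net", "lwm.net"));
        assert!(!same_site("notlwm.net", "lwm.net"));
    }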

Handle incomplete content

Fix crashes on hyper::Error(IncompleteMessage), for example by retrying.

Here's an example often producing it:

./suckit http://books.toscrape.com 
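
A retry wrapper around the download call is one way to absorb transient IncompleteMessage failures; the sketch below is generic, with fetch standing in for whatever HTTP call suckit actually makes:

    use std::{thread, time::Duration};

    // Hypothetical stand-in for the real download call.
    fn fetch(url: &str) -> Result<Vec<u8>, String> {
        Err(format!("IncompleteMessage while fetching {}", url))
    }

    fn fetch_with_retries(url: &str, tries: u64) -> Result<Vec<u8>, String> {
        let mut last_err = String::new();
        for attempt in 1..=tries {
            match fetch(url) {
                Ok(body) => return Ok(body),
                Err(e) => {
                    last_err = e;
                    // Back off a little before retrying the same URL.
                    thread::sleep(Duration::from_millis(200 * attempt));
                }
            }
        }
        Err(last_err)
    }

    fn main() {
        let _ = fetch_with_retries("http://books.toscrape.com", 3);
    }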

os error 22 thread '<unnamed>' panicked when I attempt a crawl

Seemed to work for a while, but then I was suddenly hit with this error. The -c switch doesn't seem to help:
[ERROR] Couldn't create /run/media/path/to/website/"f.txt": Invalid argument (os error 22)
thread '' panicked at 'Couldn't create /run/media/path/to/website/"f.txt": Invalid argument (os error 22)', src/logger.rs:42:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at 'called Result::unwrap() on an Err value: Any', src/scraper.rs:208:10

Sites with foreign characters are shown as gibberish.

Some sites are shown with gibberish characters instead of the correct ones; however, this doesn't seem to happen consistently.


NRK.no shows the correct characters (æøå) but gamlegjerpen.no does not.
Seems like gamlegjerpen.no does not contain: <meta charset="utf-8">
Not sure if that's why it gets displayed as gibberish on the downloaded copy but it's something.

Depth counter

We need a CLI option to stop the scraping at a given depth level.
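
The -d/--depth option shown in the help above addresses this; for example, stopping two levels below the entry point:

suckit http://books.toscrape.com -d 2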

Fix duplicate downloadings

suckit downloads some pages multiple times, for example:

suckit https://escapefromtarkov.gamepedia.com/Category:Link_templates | sort | uniq -c | sort

returns the following:

      ...
      2 https://help.gamepedia.com/How_to_contact_Gamepedia has been downloaded
      2 https://support.gamepedia.com/ has been downloaded
      2 https://twitter.com/CurseGamepedia has been downloaded
      2 https://www.facebook.com/CurseGamepedia has been downloaded
      2 https://www.fandom.com/ has been downloaded
      3 https://creativecommons.org/licenses/by-nc-sa/3.0/ has been downloaded
      3 https://escapefromtarkov.gamepedia.com/index.php?title=Category:Link_templates&action=edit has been downloaded
      3 https://www.gamepedia.com/ has been downloaded
      5 https://escapefromtarkov.gamepedia.com/Category:Link_templates has been downloaded

Improve speed regression testing

  • Add commit hashes to .csv
  • Add computer identifier to rerun all tests on different computers
  • Show improvements as a percentage (%) instead of unitless numbers
  • Self hosted server to avoid network delays (#108)
