Crawls web pages and prints any link it can find.
- fast html SAX-parser (powered by `golang.org/x/net/html`)
- small (<3000 SLOC), idiomatic, 100% test-covered codebase
- grabs most useful resource urls (pics, videos, audios, forms, etc.)
- found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
- scan depth (limited by starting host and path, 0 by default) can be configured
- can crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans html comments for urls (this can lead to bogus results)
- makes use of `HTTP_PROXY` / `HTTPS_PROXY` environment values
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format, i.e. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file` (see the example after this list)
- binaries for Linux, FreeBSD, macOS and Windows
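
For instance, the proxy environment variables and the curl-style cookie/header flags can be combined in a single run. A minimal sketch; the proxy address, cookie value, header name and target url below are placeholders:

```sh
# crawl through a local proxy, sending a session cookie and an extra header
HTTP_PROXY=http://127.0.0.1:8080 \
crawley -cookie "SESSION=abc123" -header "X-Api-Key: secret" https://example.com
```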
Crawley is available in the AUR, so Linux distributions with access to it can install the package from there. You can also use your favourite AUR helper, e.g. `paru -S crawley-bin`.
```
crawley [flags] url

possible flags:

-brute
      scan html comments
-cookie value
      extra cookies for request, can be used multiple times, accept files with '@'-prefix
-delay duration
      per-request delay (0 - disable) (default 150ms)
-depth int
      scan depth (-1 - unlimited)
-dirs string
      policy for non-resource urls: show / hide / only (default "show")
-header value
      extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
      disable pre-flight HEAD requests
-help
      this flags (and their defaults) description
-robots string
      policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
      suppress info and error messages in stderr
-skip-ssl
      skip ssl verification
-user-agent string
      user-agent string
-version
      show version
-workers int
      number of workers (default - number of CPU cores)
```
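
A couple of typical invocations, combining only the flags documented above (the target urls are placeholders):

```sh
# unlimited depth, respect robots.txt rules, show only directory-like urls
crawley -depth -1 -robots respect -dirs only https://example.com

# brute mode (scan html comments) with a slower request rate, messages silenced
crawley -brute -delay 500ms -silent https://example.com > urls.txt
```

Since found urls go to stdout and diagnostics go to stderr, the output can be redirected or piped to other tools as in the second example.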