adamdehaven / fetchurls

A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

Home Page: https://www.adamdehaven.com/blog/easily-crawl-a-website-and-fetch-all-urls-with-a-shell-script/

License: MIT License

wget crawl spider bash-scripting shell-script website urls

fetchurls's Introduction

fetchurls

A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

Usage

  1. Download the script and save to the desired location on your machine.

  2. You'll need wget installed on your machine.

    To check if it is already installed, try running the command wget by itself.

    If you are on a Mac or running Linux, chances are you already have wget installed; however, if the wget command is not working, it may not be properly added to your PATH variable.

    If you are running Windows:

    1. Download the latest wget binary for Windows from https://eternallybored.org/misc/wget/

      The download is available as a zip with documentation, or just an exe. I'd recommend just the exe.

    2. If you downloaded the zip, extract all files (if the Windows built-in zip utility gives an error, use 7-Zip). In addition, if you downloaded the 64-bit version, rename the wget64.exe file to wget.exe

    3. Move wget.exe to C:\Windows\System32\

  3. Ensure the version of grep on your computer supports -E, --extended-regexp. To check for support, run grep --help and look for the flag. To check the installed version, run grep -V.

  4. Open Git Bash, Terminal, etc. and set execute permissions for the fetchurls.sh script:

    chmod +x /path/to/script/fetchurls.sh
  5. Enter the following to run the script:

    ./fetchurls.sh [OPTIONS]...

    Alternatively, you may execute with either of the following:

    sh ./fetchurls.sh [OPTIONS]...
    
    # -- OR -- #
    
    bash ./fetchurls.sh [OPTIONS]...

If you do not pass any options, the script will run in interactive mode.

If the domain URL requires authentication, you must pass the username and password as flags; you are not prompted for these values in interactive mode.
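
For example, a run that sets several of the options described below might look like this (the domain and values are illustrative):

    ./fetchurls.sh \
      --domain https://example.com \
      --location ~/Desktop \
      --filename example-com \
      --exclude "css|js|png" \
      --sleep 1 \
      --non-interactive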

Options

You may pass options (as flags) directly to the script, or pass nothing to run the script in interactive mode.

domain

  • Usage: -d, --domain
  • Example: https://example.com

The fully qualified domain URL (with protocol) you would like to crawl.

Ensure that you enter the correct protocol (e.g. https) and subdomain for the URL or the generated file may be empty or incomplete. The script will automatically attempt to follow the first HTTP redirect, if found. For example, if you enter the incorrect protocol (http://...) for https://www.adamdehaven.com, the script will automatically follow the redirect and fetch all URLs for the correct HTTPS protocol.

The domain's URLs will be successfully spidered as long as the target URL (or the first redirect) returns a status of HTTP 200 OK.
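
For example, to crawl a site by passing only the domain (example.com is a placeholder):

    ./fetchurls.sh --domain https://example.com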

location

  • Usage: -l, --location
  • Default: ~/Desktop
  • Example: /c/Users/username/Desktop

The location (directory) where you would like to save the generated results.

If the directory does not exist at the specified location, as long as the rest of the path is valid, the new directory will automatically be created.
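
For example, to save the results to a specific directory (the path is illustrative):

    ./fetchurls.sh -d https://example.com -l /c/Users/username/Desktop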

filename

  • Usage: -f, --filename
  • Default: domain-topleveldomain
  • Example: example-com

The desired name of the generated file, without spaces or file extension.
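
For example, to name the generated file example-com.txt:

    ./fetchurls.sh -d https://example.com -f example-com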

exclude

  • Usage: -e, --exclude
  • Default: bmp|css|doc|docx|gif|jpeg|jpg|JPG|js|map|pdf|PDF|png|ppt|pptx|svg|ts|txt|xls|xlsx|xml

Pipe-delimited list of file extensions to exclude from results.

To prevent excluding files matching the default list of file extensions, simply pass an empty string ("").
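
For example (the extension lists shown are illustrative):

    # Exclude a custom set of extensions
    ./fetchurls.sh -d https://example.com -e "css|js|png"

    # Disable the exclusion filter entirely
    ./fetchurls.sh -d https://example.com -e ""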

sleep

  • Usage: -s, --sleep
  • Default: 0
  • Example: 2

The number of seconds to wait between retrievals.
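
For example, to wait two seconds between retrievals:

    ./fetchurls.sh -d https://example.com -s 2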

username

  • Usage: -u, --username
  • Example: marty_mcfly

If the domain URL requires authentication, the username to pass to the wget command.

If the username contains space characters, you must pass it inside quotes. This value may only be set with a flag; there is no prompt in interactive mode.

password

  • Usage: -p, --password
  • Example: thats_heavy

If the domain URL requires authentication, the password to pass to the wget command.

If the password contains space characters, you must pass it inside quotes. This value may only be set with a flag; there is no prompt in interactive mode.
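
For example, to crawl a password-protected site (the credentials are illustrative; note the quotes around the value containing a space):

    ./fetchurls.sh -d https://example.com -u marty_mcfly -p "thats heavy"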

non-interactive

  • Usage: -n, --non-interactive

Allows the script to run successfully in a non-interactive shell.

The script will utilize the default --location and --filename settings unless the respective flags are explicitly set.
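
For example, a minimal non-interactive run that accepts the default location and filename:

    ./fetchurls.sh --non-interactive --domain https://example.com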

ignore-robots

  • Usage: -i, --ignore-robots

Ignore robots.txt for the domain.
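
For example:

    ./fetchurls.sh -d https://example.com --ignore-robots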

wget

  • Usage: -w, --wget

Show wget install instructions. The installation instructions may vary depending on your computer's configuration.

version

  • Usage: -v, -V, --version

Show version information.

troubleshooting

  • Usage: -t, --troubleshooting

Outputs received option flags with their associated values at runtime for troubleshooting.
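
For example, to inspect which values the script received for a given invocation (the values are illustrative):

    ./fetchurls.sh -t -d https://example.com -l /tmp -f example-com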

help

  • Usage: -h, -?, --help

Show the help content.

Interactive Mode

If you do not pass the --domain flag, the script will run in interactive mode and you will be prompted for the unset options.

First, you will be prompted to enter the full URL (including HTTPS/HTTP protocol) of the site you would like to crawl:

Fetch a list of unique URLs for a domain.

Enter the full domain URL ( http://example.com )
Domain URL:

You will then be prompted to enter the location (directory) of where you would like the generated results to be saved (defaults to Desktop on Windows):

Save file to directory
Directory: /c/Users/username/Desktop

Next, you will be prompted to change/accept the name of the generated file (simply press enter to accept the default filename):

Save file as
Filename (no file extension, and no spaces): example-com

Finally, you will be prompted to change/accept the default list of excluded file extensions (press enter to accept the default list):

Exclude files with matching extensions
Excluded extensions: bmp|css|doc|docx|gif|jpeg|jpg|JPG|js|map|pdf|PDF|png|ppt|pptx|svg|ts|txt|xls|xlsx|xml

The script will crawl the site and compile a list of valid URLs into a new text file. When complete, the script will show a message and the location of the generated file:

Fetching URLs for example.com

Finished with 1 result!

File Location:
/c/Users/username/Desktop/example-com.txt

If a file of the same name already exists at the location (e.g. if you previously ran the script for the same URL), the original file will be overwritten.

Excluded Files and Directories

The script, by default, filters out many file extensions that are commonly not needed.

The list of file extensions can be passed via the --exclude flag, or provided via the interactive mode.

Excluded Files

  • .bmp
  • .css
  • .doc
  • .docx
  • .gif
  • .jpeg
  • .jpg
  • .JPG
  • .js
  • .map
  • .pdf
  • .PDF
  • .png
  • .ppt
  • .pptx
  • .svg
  • .ts
  • .txt
  • .xls
  • .xlsx
  • .xml

Excluded Directories

In addition, specific site files and directories (including common WordPress paths) are also ignored.

  • /wp-content/uploads/
  • /feed/
  • /category/
  • /tag/
  • /page/
  • /widgets.php/
  • /wp-json/
  • xmlrpc

Advanced Usage

The script should filter out most unwanted file types and directories; however, you can edit the regular expressions that filter out certain pages, directories, and file types by editing the fetchUrlsForDomain() function within the fetchurls.sh file.

Warning: If you're not familiar with grep or regular expressions, you can easily break the script.
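
As a rough illustration of the kind of extended-regexp filtering involved, the sketch below drops URLs ending in the default excluded extensions; it is a hypothetical example, not the script's actual code, and the variable and file names are made up:

    # Hypothetical sketch only -- not taken from fetchurls.sh.
    # Drop URLs that end in any of the default excluded extensions.
    EXCLUDED="bmp|css|doc|docx|gif|jpeg|jpg|JPG|js|map|pdf|PDF|png|ppt|pptx|svg|ts|txt|xls|xlsx|xml"
    grep -vE "\.(${EXCLUDED})$" urls.txt > filtered-urls.txt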

fetchurls's People

Contributors

adamdehaven


fetchurls's Issues

[Feature Request] Add Option for User-Agent

Hey Adam, thank you for your nice little script! :-)

I ran into the problem that the website I'd like to fetch all URLs from blocked the default wget User-Agent (currently "Wget/1.20.3 (linux-gnu)").
To progress with my task I manually changed your (current) script and added a User-Agent string (to the wget command) and it worked very well.

Question: Are you willing to add an option for the User-Agent?

If yes, I would prepare a PR … :-)

KR

Grep problem

I do not have the --max-count option in grep (BSD grep 2.5.1-FreeBSD) under macOS.

[Bug Report] does not run in Mac OS because of bash v3.2 - workaround

Hello, thank you very much for writing this script and taking the time to make it available. I have been looking for such a thing for quite a while now and lack of any of the skills to write it.

I have found and somewhat solved a compatibility issue with Mac OS. Sorry it's so long but I don't know how to do this properly and what can be left out. Also I have to admit I am extremely excited because I have never even halfway solved a programming(ish) problem in my whole life.

Summary

  • Script does not run on Mac OS because Mac OS has bash v3.2 only.
  • How to install bash 5 in Mac OS
  • A way to edit the .sh file so the correct version of bash will be called

Describe the bug

I read in #1 that this script requires bash >v4. So I investigated and found out that even modern up to date versions of Mac OS are running bash 3.2 unless it has been manually upgraded. This has something to do with Apple being unable or unwilling to comply with the GPL requirements for v >4. See this SE thread among other discussions online.

Also of note (something I only learned recently despite being a regular if casual terminal user for many years) is that zsh has been the default shell in Mac OS for some time now, probably because they didn't want to have an out-of-date shell for the rest of time.

To Reproduce

  1. download and run per instructions
$ ./fetchurls.sh

Fetch a list of unique URLs for a domain.

Enter the full domain URL ( https://example.com )
Domain URL: https://quotes.toscrape.com
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]

Save file to directory
./fetchurls.sh: line 492: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
usage: mkdir [-pv] [-m mode] directory ...

Save file as
./fetchurls.sh: line 505: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

Exclude files with matching extensions
./fetchurls.sh: line 518: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

Fetching URLs for

./fetchurls.sh: line 358: /.txt: Permission denied
^Cease wait... [ | ]

Environment

To verify bash version:

$ bash --version
bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin18)
Copyright (C) 2007 Free Software Foundation, Inc.

and also

$ which bash
which bash
/bin/bash

Apparently this is where bash 3 lives. Bash 4 lives in /usr/local/bin/bash

Workaround

  1. Install homebrew if not already.

  2. Install up to date bash (version 5 at time of writing)

    brew install bash
  3. bash 3 will still be active, and the Internet says it's better not to replace it completely in case you need it some day. The instructions I looked at all assumed you would want to make bash 5 your default shell. However, I am happy with zsh, so I was able to find this, which I believe makes it the default only when you call bash:

    sudo bash -c 'echo /usr/local/bin/bash >> /etc/shells'

So now if I run bash --version or which bash it still returns v 3 as originally. (The Internet said which should return both, but it didn't, strangely.) However if I actually switch to bash (by inputting command bash) and enter those commands it will report v5.

Even when I ran the script from the bash 5 prompt instead of zsh, I still got the same errors. (If you are thinking "this guy doesn't know anything about shell scripting", you are correct.) However, I changed the top from

#!/bin/shift

to

#!/usr/local/bin/bash

and the script ran. But it was taking so long that I thought it was hanging and wouldn't complete, and that's when I noticed the original said shift, not bash, which is the only thing I've ever seen at the top of a script. So I went to find out what that means, and after some dead ends I found something that sounds reasonable but I don't understand. By the time I had finished reading that page, the script had completed. It ran just fine without shift.

I also tried running it with both shebangs (new word I learned today) included but the result was the same. I ran the script in interactive mode with all defaults so perhaps a problem will arise at a later date. I'm sure you already have a good idea of the answer to that question.

Someone who understands what is going on here would probably be able to find a better solution but this seems to work.

not working on kali linux

Hi, hope you're doing well.
fetchurls is not working on Kali Linux, but the same process works fine on cPanel.
(screenshot: Capture)

v3.2.3 Non-Interactive Mode not allowing a few attributes

Describe the bug
Version v3.2.3
Non-interactive mode always overrides the -f, -l, and -e attributes with defaults.

To Reproduce
./fetchurls.sh -t -d http://example.com -l /tmp -f file -e "css"

Just review the troubleshooting output.

Made the following changes to correct the issue:

diff fetchurls.sh fetchurls.sh-ORIG
493c493
< elif [ -z "$USER_SAVE_LOCATION" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else
508c508
< elif [ -z "$USER_FILENAME" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else
521c521
< elif [ -z "$USER_EXCLUDED_EXTENTIONS" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else

Thanks for your efforts.

Permission issue to write file

Hey,

I keep getting the following errors, and the script never asks me for the location where I want to store the file. It did prompt me for the URL of the website I wanted to fetch URLs from. I'm not sure how to resolve the permission issue so it can write the file and get to the filename steps.

#    
#    Save file to location
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 103: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
usage: mkdir [-pv] [-m mode] directory ...
#    
#    Save file as
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 110: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
#    
#    Fetching URLs for 
#    
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 64: /.txt: Permission denied

Bash Error

Hey Adam!

Really cool concept. Do you know what this error might mean?

fetchurls-master/fetchurls.sh: line 61: read: -i: invalid option

Thanks,
Charlie

Question - fetchUrlsForDomain

In fetchUrlsForDomain, grep is used to search for user-excluded extensions after wget has downloaded the files. I couldn't see a written explanation for why extension filtering was done this way rather than with wget --reject or wget --reject-regex. Was this for reliability?

Not sure if this was the right place or way to ask, I'm very new and inexperienced. Thanks!

Fixes

I suggest you set a trap in your script, so that if someone presses Ctrl-D or the script fails, the color is reset in the terminal for the user. Insert this at the top of your script:

resetcolor () { # this is just a rough example; it can be implemented better
echo -e "\033[0m"
}

trap resetcolor INT TERM EXIT

You also need to create the directory that you suggest in the script. It fails to find the directory because it's not created:

echo "${COLOR_RESET}# "
read -e -p "# Save txt file as: ${COLOR_CYAN}" -i "${filename}" SAVEFILENAME

mkdir -p "$savelocation/$SAVEFILENAME" # you need to place this here

savefilename=$SAVEFILENAME

I'd suggest you rewrite it, as it's a bit messy and won't handle errors properly.
What if someone has renamed their Desktop folder, deleted it, or wants to put the file somewhere else? Crazy, I know, but your script will fail.

What if someone enters an incomplete or non-existent URL, for example http://www.goodfg.comdwdd?
You could handle this in a while true loop and check the existence of the URL.

Here is something I threw together as an example:

while true
do
  # CHECK IF URL EXISTS AND IS ONLINE
  # ($CONVERTCASE is assumed to hold the URL entered by the user)
  if [[ "$#" -eq "0" ]] ; then
        echo -e "${YELLOW}Example${NC} : checkurl msn.com"
        echo
        break
  fi

  # Quick reachability check with curl first
  if curl --output /dev/null --silent --head --fail "$CONVERTCASE" ; then

    CHECKURL=$(lynx -dump "http://downforeveryoneorjustme.com/$CONVERTCASE" | grep -o "It's just you")

    if [[ -n "$CHECKURL" ]] ; then
          echo -en "Url exists so let's continue" > /dev/null
          break
    else
          echo -en "Url doesn't exist, let's try again"
          sleep 2
    fi
  fi
done
