hydra-link-checker's Introduction

Hydra: multithreaded site-crawling link checker in Python

A Python program that crawls (or rather, slithers 🐍) a website for links and prints a YAML report of broken links.

Requires

Python 3.6 or higher.

There are no external dependencies, Neo.

Usage

$ python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL

Positional arguments:

  • URL: The URL of the website to crawl. Ensure the URL is absolute, including the scheme, e.g. https://example.com.

Optional arguments:

  • -h, --help: Show help message and exit
  • --config CONFIG, -c CONFIG: Path to a configuration file

The broken-links report is printed to stdout, so you may want to redirect it to a file.

The report is YAML-formatted. To save it to a file, run:

python hydra.py [URL] > [PATH/TO/FILE.yaml]

You can add the current date to the filename using a command substitution, such as:

python hydra.py [URL] > /path/to/$(date '+%Y_%m_%d')_report.yaml
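
If you launch Hydra from another Python script rather than from a shell, the same dated filename can be built with the standard library. This is only a sketch: the helper names dated_report_path and save_report are hypothetical, and save_report assumes hydra.py sits in the current working directory.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def dated_report_path(directory, today=None):
    """Return <directory>/YYYY_MM_DD_report.yaml, like date '+%Y_%m_%d'."""
    today = today or date.today()
    return Path(directory) / f"{today:%Y_%m_%d}_report.yaml"


def save_report(url, directory):
    """Run hydra.py and write its stdout to a dated file.

    Assumes hydra.py is in the current working directory.
    """
    report_path = dated_report_path(directory)
    with open(report_path, "w", encoding="utf-8") as report:
        # check=False: a non-zero exit code may simply mean broken links were found.
        subprocess.run([sys.executable, "hydra.py", url], stdout=report, check=False)
    return report_path
```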

To see how long Hydra takes to check your site, prefix the command with time:

time python hydra.py [URL]

GitHub Action

You can easily incorporate Hydra as part of an automated process using the link-snitch action.

Configuration

Hydra can accept an optional JSON configuration file for specific parameters, for example:

{
    "OK": [
        200,
        999,
        403
    ],
    "attrs": [
        "href"
    ],
    "exclude_scheme_prefixes": [
        "tel"
    ],
    "tags": [
        "a",
        "img"
    ],
    "threads": 25,
    "timeout": 30,
    "graceful_exit": "True"
}

To use a configuration file, supply the filename:

python hydra.py https://example.com --config ./hydra-config.json
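
To illustrate how a file like this relates to the built-in values, here is a minimal sketch that merges JSON settings over the defaults documented in this README. The name load_config is hypothetical and this is not necessarily how Hydra reads its configuration; graceful_exit is left out of the defaults because no default is documented for it.

```python
import copy
import json

# Defaults copied from the settings documented in this README.
DEFAULTS = {
    "OK": [200, 999],
    "attrs": ["href", "src"],
    "exclude_scheme_prefixes": ["tel:", "javascript:"],
    "tags": ["a", "link", "img", "script"],
    "threads": 50,
    "timeout": 60,
}


def load_config(text=None):
    """Return the defaults, overridden by any keys found in the JSON text."""
    # Deep copy so callers cannot mutate the shared default lists.
    config = copy.deepcopy(DEFAULTS)
    if text:
        config.update(json.loads(text))
    return config


config = load_config('{"threads": 25, "tags": ["a", "img"]}')
print(config["threads"], config["tags"], config["timeout"])
```

Keys missing from the file keep their default values, so a configuration file only needs to list the settings it changes.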

Possible settings:

  • OK - HTTP response codes to consider as a successful link check. Defaults to [200, 999].
  • attrs - Attributes of the HTML tags to check for links. Defaults to ["href", "src"].
  • exclude_scheme_prefixes - URL scheme prefixes to exclude from checking. Defaults to ["tel:", "javascript:"].
  • tags - HTML tags to check for links. Defaults to ["a", "link", "img", "script"].
  • threads - Maximum workers to run. Defaults to 50.
  • timeout - Maximum seconds to wait for HTTP response. Defaults to 60.
  • graceful_exit - If set to True, return exit code 0 even when broken links are present; otherwise, return exit code 1 when broken links are found.
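
As a rough illustration of how tags, attrs, and exclude_scheme_prefixes interact, the sketch below collects candidate links from HTML using only the standard library. The LinkExtractor class is hypothetical and written for this example; it is not Hydra's actual parser.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect attribute values from selected tags, skipping excluded schemes."""

    def __init__(self, tags, attrs, exclude_scheme_prefixes):
        super().__init__()
        self.tags = set(tags)
        self.attrs = set(attrs)
        # str.startswith accepts a tuple of prefixes.
        self.excluded = tuple(exclude_scheme_prefixes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        for name, value in attrs:
            if name in self.attrs and value and not value.startswith(self.excluded):
                self.links.append(value)


extractor = LinkExtractor(
    tags=["a", "link", "img", "script"],
    attrs=["href", "src"],
    exclude_scheme_prefixes=["tel:", "javascript:"],
)
extractor.feed(
    '<a href="https://example.com/">home</a>'
    '<a href="tel:+15555550100">call</a>'
    '<img src="/logo.png">'
    '<div href="/ignored">not a checked tag</div>'
)
print(extractor.links)  # ['https://example.com/', '/logo.png']
```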

Test

Run:

python -m unittest tests/test.py

hydra-link-checker's People

Contributors

krinkle, luisfmelo, mz0, stevezieglerva, tushar5526, victoriadrake

hydra-link-checker's Issues

SSL problem

I tried your code but it seems that it is failing due to SSL issues.
I tried a few of my sites as well as your personal blog: https://victoria.dev

Checking links in logged in pages

Is your feature request related to a problem? Please describe.
Use case: checking for broken links in logged-in pages.

Describe the solution you'd like
Hydra supplies a username/password to log in, then checks for broken links inside the logged-in pages.

Describe alternatives you've considered
Other than providing a username/password, we may be able to provide a session/cookie for login.

Additional context
Nothing.
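
The session/cookie alternative could look roughly like this with the standard library. It is a sketch of the requested feature, not something this README says Hydra supports; the function name build_request and the cookie value are made up for illustration.

```python
from urllib.request import Request


def build_request(url, session_cookie=None):
    """Build a request that carries a session cookie, if one is given."""
    headers = {}
    if session_cookie:
        # Hypothetical cookie string copied from a logged-in browser session.
        headers["Cookie"] = session_cookie
    return Request(url, headers=headers)


request = build_request("https://example.com/account", "sessionid=abc123")
print(request.get_header("Cookie"))  # sessionid=abc123
```

The resulting object can be passed to urllib.request.urlopen in place of a bare URL string.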

Add the ability to use a configuration file

Is your feature request related to a problem? Please describe.
Hydra has a number of configurable variables. It would be nice to be able to use an external file to set these.

Describe the solution you'd like
An external file can set configuration variables for hydra's constants, like TAGS, EXCLUDE_SCHEME_PREFIXES, THREADS, TIMEOUT, and OK.

Exclude tel: links

Describe the bug
tel: links are reported as bad links and should just be excluded from checks.

To Reproduce
Steps to reproduce the behavior:

  1. Run the script on a site that includes telephone links

Expected behavior
tel: links should be skipped rather than reported as bad.

Screenshots
NA

Desktop (please complete the following information):

  • OS: macOS

Smartphone (please complete the following information):
NA

I've got a fix and will fork and submit a PR

Possibly add end-to-end testing on a real site as part of build process

Is your feature request related to a problem? Please describe.
This is not a problem with the script as is, but more of an issue with building confidence for developers that nothing major was broken. This can be important for a complex process like link checking that can easily have unseen issues given the multi-threaded, high volume of transactions.

Describe the solution you'd like
Execute a script as part of the test process (somehow in GitHub Actions?) that runs Hydra against one or more known test websites and then compares the Hydra output to a text file of expected results (at least compare the front matter). This could also be used to time performance, to ensure changes do not make it slower. It could also be used to check configuration settings outside of unit tests, such as testing whether performance is slower when using only 1 thread.

I created sites like this when trying to build my own link checker: https://github.com/stevezieglerva/lnkchk_test_sites

It could be done in a shell script to avoid trying to shoehorn into a unit test framework.
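
The comparison step could also be sketched in Python with difflib. The function name diff_reports and the report contents below are hypothetical; the real layout of Hydra's YAML report is not assumed here.

```python
import difflib


def diff_reports(actual, expected):
    """Return unified-diff lines between two reports; empty means they match."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


# Made-up report text, used only to exercise the comparison.
expected = "broken:\n- https://example.com/missing\n"
actual = expected + "- https://example.com/also-missing\n"

for line in diff_reports(actual, expected):
    print(line)
```

A build step could fail whenever the returned list is non-empty, and print the diff so the regression is visible in the log.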

Describe alternatives you've considered
I don't have alternatives. I realize that this introduces some external dependencies into the build process which may not be desirable.

Additional context

Inconsistent results after multiple runs on the same URL

@victoriadrake thanks for sharing this, it's really useful.

I've been testing it in PyCharm with a URL, and almost every time I run it, I get a different number of total links and broken links. The URL I have been using: https://www.chiark.greenend.org.uk/~sgtatham/putty/mirrors.html

Would you mind running that URL a few times to see if you can replicate the issue? I wonder if it's my environment somehow that's causing that.

Thanks!

EDIT: just as a sanity check, I also ran hydra.py from a Debian 10 venv (3.73) and noticed the same behavior.

Allow ignoring some codes

Is your feature request related to a problem? Please describe.
Some links will always return codes that are not 200 (such as websites that try to block scripts). The user may wish to ignore these.

Describe the solution you'd like
Add the ability to list status codes to ignore (don't count as broken links).
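
A minimal sketch of the requested behavior, using an accepted-codes list like the OK setting in the example configuration file. The function name is_broken is hypothetical and not taken from Hydra's source.

```python
def is_broken(status_code, ok_codes=(200, 999)):
    """Return True when a response code is not in the accepted list."""
    return status_code not in ok_codes


# Treat 403 as acceptable, as in the example configuration file.
ok_codes = [200, 999, 403]
print([code for code in (200, 403, 404) if is_broken(code, ok_codes)])  # [404]
```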
