Link Checker Service

The Link Checker web service runs cached and otherwise optimized broken link checks.

Endpoints:

  • /checkUrls checks a batch at once
  • /checkUrls/stream returns results as they arrive using JSON streaming
  • /version returns the server version
  • /livez, /readyz health checks

Quickstart Options

  • get the binary into $GOPATH/bin:

    go get -u github.com/siemens/link-checker-service

    ↓

    link-checker-service serve

  • download it from the releases

  • run the service dockerized, without installing Go, and navigate to the sample UI:

    docker-compose up --build

  • run from source:

    go run . serve

Motivation

For a website author willing to provide link-checking functionality, few options are available. Browser requests to other domains are most likely to be blocked by CORS. Building the link-checking functionality into the back-end might compromise the stability of the service through exhaustion of various resources.

Thus, to minimize risk, a link checker should be isolated into a separate service. While there are several websites providing the functionality, these may not have access to hosts on a private network, and are otherwise not under your control.

Checking whether a link is broken seems like a trivial task, but consider checking a thousand links a thousand times. Several optimizations, as well as work-arounds for the peculiarities of servers, gateways, CDNs, and proxies, need to be applied. This repository contains an implementation of such a service.

Usage

Example Request

Start the server, e.g. link-checker-service serve, and send the following request body to http://localhost:8080/checkUrls:

{
    "urls": [
        {
            "url":"https://google.com",
            "context": "0"
        },
        {
            "url":"https://ashdfkjhdf.com/kajhsd",
            "context": "1"
        }
    ]
}

e.g. via HTTPie on Windows cmd:

http POST localhost:8080/checkUrls urls:="[{"""url""":"""https://google.com""","""context""":"""0"""},{"""url""":"""https://baskldjha.com/loaksd""","""context""":"""1"""}]"

or in *sh:

http POST localhost:8080/checkUrls urls:='[{"url":"https://google.com","context":"0"},{"url":"https://baskldjha.com/loaksd","context":"1"}]'

The context field allows correlating the requests on the client side.

Sample response:

{
    "result": "complete",
    "urls": [
        {
            "context": "1",
            "error": "cannot access 'https://baskldjha.com/loaksd'... no such host",
            "http_status": 528,
            "status": "broken",
            "timestamp": 1599132784,
            "body_patterns_found": [],
            "url": "https://baskldjha.com/loaksd"
        },
        {
            "context": "0",
            "error": "",
            "http_status": 200,
            "status": "ok",
            "timestamp": 1599132784,
            "body_patterns_found": [],
            "url": "https://google.com"
        }
    ]
}
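
The request and response shapes above can also be used from a small Go client. The following is a minimal sketch, not part of the service: the type and function names (URLRequest, URLStatus, checkURLs, brokenURLs) are made up for illustration, and the struct fields are taken only from the sample JSON shown above, so the real service may return additional fields.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// URLRequest mirrors one entry of the "urls" array in the sample request.
type URLRequest struct {
	URL     string `json:"url"`
	Context string `json:"context"`
}

// CheckURLsRequest mirrors the sample request body.
type CheckURLsRequest struct {
	URLs []URLRequest `json:"urls"`
}

// URLStatus mirrors one entry of the "urls" array in the sample response.
type URLStatus struct {
	Context           string   `json:"context"`
	Error             string   `json:"error"`
	HTTPStatus        int      `json:"http_status"`
	Status            string   `json:"status"`
	Timestamp         int64    `json:"timestamp"`
	BodyPatternsFound []string `json:"body_patterns_found"`
	URL               string   `json:"url"`
}

// CheckURLsResponse mirrors the sample response body.
type CheckURLsResponse struct {
	Result string      `json:"result"`
	URLs   []URLStatus `json:"urls"`
}

// decodeResponse parses a /checkUrls response body.
func decodeResponse(data []byte) (CheckURLsResponse, error) {
	var resp CheckURLsResponse
	err := json.Unmarshal(data, &resp)
	return resp, err
}

// brokenURLs returns the URLs reported with the status "broken".
func brokenURLs(resp CheckURLsResponse) []string {
	var broken []string
	for _, u := range resp.URLs {
		if u.Status == "broken" {
			broken = append(broken, u.URL)
		}
	}
	return broken
}

// checkURLs posts the URLs to the endpoint, using each list index as the context.
func checkURLs(endpoint string, urls []string) (CheckURLsResponse, error) {
	req := CheckURLsRequest{}
	for i, u := range urls {
		req.URLs = append(req.URLs, URLRequest{URL: u, Context: fmt.Sprint(i)})
	}
	body, err := json.Marshal(req)
	if err != nil {
		return CheckURLsResponse{}, err
	}
	httpResp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return CheckURLsResponse{}, err
	}
	defer httpResp.Body.Close()
	if httpResp.StatusCode != http.StatusOK {
		return CheckURLsResponse{}, fmt.Errorf("unexpected HTTP status %d", httpResp.StatusCode)
	}
	data, err := io.ReadAll(httpResp.Body)
	if err != nil {
		return CheckURLsResponse{}, err
	}
	return decodeResponse(data)
}

func main() {
	// Assumes the server was started locally with: link-checker-service serve
	resp, err := checkURLs("http://localhost:8080/checkUrls",
		[]string{"https://google.com", "https://baskldjha.com/loaksd"})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("result:", resp.Result, "broken:", brokenURLs(resp))
}
```

Because the context is set to the list index here, a client can map each result back to the position of the URL it submitted, regardless of the order in which results are returned.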

Large Requests Using JSON Streaming

JSON streaming can be used to improve the client user experience: the client can render results as they arrive instead of waiting for the whole check to complete.

In the sample HTTPie request, post the streaming request to the /checkUrls/stream endpoint:

http --stream POST localhost:8080/checkUrls/stream ...

URL check result objects will be streamed continuously, delimited by a newline character \n, as they become available. These can then be rendered immediately. E.g. see the sample UI.
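
A client can consume such a newline-delimited stream line by line. The sketch below is illustrative only: it assumes that each streamed line is a JSON object with the same fields as the "urls" entries of the sample response, and the names StreamedResult, readStream and collect are hypothetical. In a real client, the reader would be the body of the HTTP response from /checkUrls/stream; here a string stands in for it.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// StreamedResult holds the fields used in this sketch. The field names are
// assumed to match the "urls" entries of the sample /checkUrls response.
type StreamedResult struct {
	Context string `json:"context"`
	Status  string `json:"status"`
	URL     string `json:"url"`
}

// readStream decodes newline-delimited JSON objects from r and passes each
// one to handle as soon as its line has been read.
// Note: bufio.Scanner rejects lines longer than 64 KiB by default.
func readStream(r io.Reader, handle func(StreamedResult)) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // tolerate blank lines between objects
		}
		var res StreamedResult
		if err := json.Unmarshal([]byte(line), &res); err != nil {
			return fmt.Errorf("decoding %q: %w", line, err)
		}
		handle(res)
	}
	return scanner.Err()
}

// collect gathers all results of a stream given as a string.
func collect(stream string) ([]StreamedResult, error) {
	var out []StreamedResult
	err := readStream(strings.NewReader(stream), func(r StreamedResult) {
		out = append(out, r)
	})
	return out, err
}

func main() {
	sample := `{"context":"0","status":"ok","url":"https://google.com"}` + "\n" +
		`{"context":"1","status":"broken","url":"https://baskldjha.com/loaksd"}` + "\n"
	results, err := collect(sample)
	if err != nil {
		fmt.Println("stream error:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s %s (context %s)\n", r.Status, r.URL, r.Context)
	}
}
```

Passing a callback to readStream, rather than returning a slice, is what lets a UI render each result immediately.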

Sample Front-Ends

  • For a programmatic large URL list check, see test/large_list_check, which crawls a markdown page for URLs and checks them via the running link checker service
  • For an example of a simple page that uses the service to check links and displays the results using jQuery, see test/jquery_example

Configuration

For up-to-date help, check link-checker-service help or link-checker-service help <command>.

To override the service port, define the PORT environment variable.

To bind to another address, configure the bindAddress option, e.g.: ... serve -a 127.0.0.1:8080

Config File

A sample configuration file is available, with most possible configuration options listed.

Start the app with the path to the configuration file: --config <path-to-config-toml>.

Environment Variables

Most configuration values can also be overridden via environment variables in the 12-factor fashion.

The variables found in the configuration file can be upper-cased and prefixed with LCS_ to override.

Arrays of strings can be defined delimited by a space, e.g.:

LCS_CORSORIGINS="http://localhost:8080 http://localhost:8092"

For complex keys, such as HTTPClient.userAgent, take the uppercase key and replace the dot with an underscore:

LCS_HTTPCLIENT_USERAGENT="lcs/2.0"
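
The naming rule can be expressed in a few lines of Go. This is a sketch of the convention as described above, not the service's own configuration code. The helper names envName and splitList are hypothetical, and the key spelling corsOrigins is an assumption inferred from LCS_CORSORIGINS.

```go
package main

import (
	"fmt"
	"strings"
)

// envName derives the environment variable name for a configuration key:
// dots become underscores, the key is upper-cased and prefixed with LCS_.
func envName(key string) string {
	return "LCS_" + strings.ToUpper(strings.ReplaceAll(key, ".", "_"))
}

// splitList splits a space-delimited environment value into its items.
func splitList(value string) []string {
	return strings.Fields(value)
}

func main() {
	fmt.Println(envName("HTTPClient.userAgent")) // LCS_HTTPCLIENT_USERAGENT
	fmt.Println(envName("corsOrigins"))          // LCS_CORSORIGINS
	fmt.Println(splitList("http://localhost:8080 http://localhost:8092"))
}
```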

Authentication

The server implements a simple optional authentication via JWT token validation using a public certificate (middleware: github.com/appleboy/gin-jwt).

Currently, the JWT middleware requires a dummy private certificate to be configured, even though it is not used for validation.

See the configuration file and the serve command help for detailed settings.

URL Checker Plugins

URLs may be checked using different methods, e.g. with an HTTP client with or without using a proxy. Depending on the connectivity available to the link checker service, the sequence of checks can be influenced via a configuration of the URL Checker Plugins.

E.g.:

urlCheckerPlugins = [
    "urlcheck-noproxy",
    "urlcheck",
    "urlcheck-pac",
]

By default, the urlcheck plugin is used, which uses an HTTP client with a proxy, if one is configured, and without one, if not. urlcheck-noproxy uses a client explicitly without a proxy set. urlcheck-pac generates a client for each URL depending on the proxy configuration returned via the PAC script, configured via the pacScriptURL option. Only the first proxy returned by the PAC script will be used.

Advanced Configuration

The link checker can optionally detect patterns within successful HTTP response bodies, e.g. in pages with authentication. This configuration is only possible via the configuration file:

# enable searching for patterns here
searchForBodyPatterns = true

# define Go Regex patterns and their names in this manner
[[bodyPatterns]]
name = "authentication redirect"
regex = "Authentication Redirect"

[[bodyPatterns]]
name = "google"
regex = "google"

The names of the found patterns will be available in the URL check results.
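
To illustrate how named patterns map to the body_patterns_found field, the sketch below matches a response body against the two patterns from the configuration above using Go's regexp package. It is not the service's implementation. The BodyPattern type and the findPatterns function are hypothetical names, and the sample body is invented.

```go
package main

import (
	"fmt"
	"regexp"
)

// BodyPattern pairs a pattern name with a Go regular expression,
// like a [[bodyPatterns]] entry in the configuration file.
type BodyPattern struct {
	Name  string
	Regex string
}

// findPatterns returns the names of all patterns that match the body,
// in the order in which the patterns are listed.
func findPatterns(body string, patterns []BodyPattern) ([]string, error) {
	found := []string{}
	for _, p := range patterns {
		re, err := regexp.Compile(p.Regex)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", p.Name, err)
		}
		if re.MatchString(body) {
			found = append(found, p.Name)
		}
	}
	return found, nil
}

func main() {
	patterns := []BodyPattern{
		{Name: "authentication redirect", Regex: "Authentication Redirect"},
		{Name: "google", Regex: "google"},
	}
	// Invented response body for illustration.
	body := "<html><title>Authentication Redirect</title></html>"
	found, err := findPatterns(body, patterns)
	if err != nil {
		fmt.Println("invalid pattern:", err)
		return
	}
	fmt.Println(found) // [authentication redirect]
}
```

Go regular expressions are case-sensitive by default; prefix a pattern with (?i) to match regardless of case.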

Using a Custom Configuration

E.g. when a proxy is needed for the HTTP client, see the sample .link-checker-service.toml, and start the server with the argument: --config .link-checker-service.toml

Alternatively, set the client proxy via an environment variable: LCS_PROXY=http://myproxy:8080

Development

See development.md

Request Optimization Architecture

[Figure: optimization chain]

Rate limiting based on IPs can be turned on in the configuration via a rate specification. See ulule/limiter.

Blocked IPs will run into HTTP 429 and are unblocked once the sliding window duration has passed. For example, with a limit of 10-S, running

hey -m POST -n 1000 -c 200 -T "application/json" -t 30 -D sample_request_body.json http://localhost:8080/checkUrls

yields:

Status code distribution:
  [200] 10 responses
  [429] 990 responses
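
The rate specification 10-S follows the "<limit>-<period>" format used by ulule/limiter, where the period is one of S, M, H or D. The sketch below is an illustration only and not the library's parser; parseRate and rejectedInWindow are hypothetical helpers. It also reproduces the arithmetic behind the distribution above, under the assumption that all 1000 requests arrive from one IP within a single window.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRate splits a "<limit>-<period>" specification such as "10-S"
// into the request limit and the window duration.
func parseRate(spec string) (int, time.Duration, error) {
	parts := strings.Split(spec, "-")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid rate %q: expected <limit>-<period>", spec)
	}
	limit, err := strconv.Atoi(parts[0])
	if err != nil || limit <= 0 {
		return 0, 0, fmt.Errorf("invalid limit in rate %q", spec)
	}
	periods := map[string]time.Duration{
		"S": time.Second,
		"M": time.Minute,
		"H": time.Hour,
		"D": 24 * time.Hour,
	}
	period, ok := periods[strings.ToUpper(parts[1])]
	if !ok {
		return 0, 0, fmt.Errorf("invalid period in rate %q", spec)
	}
	return limit, period, nil
}

// rejectedInWindow returns how many of total requests exceed the limit,
// assuming they all come from one IP within a single window.
func rejectedInWindow(total, limit int) int {
	if total <= limit {
		return 0
	}
	return total - limit
}

func main() {
	limit, period, err := parseRate("10-S")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d requests per %s\n", limit, period)
	fmt.Printf("[200] %d responses\n", 1000-rejectedInWindow(1000, limit))
	fmt.Printf("[429] %d responses\n", rejectedInWindow(1000, limit))
}
```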

Dependencies

Alternatives

the alternatives that are not URL list check web services:

some URL check services exist, albeit not open source (as of 02.09.2020)

License

    Copyright 2020-2021 Siemens AG and contributors as noted in the AUTHORS file.

    This Source Code Form is subject to the terms of the Mozilla Public
    License, v. 2.0. If a copy of the MPL was not distributed with this
    file, You can obtain one at http://mozilla.org/MPL/2.0/

The following sample code folders are licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

The testing work-around for streaming responses has been adapted from gin (Copyright Manu Martinez-Almeida, MIT License)

Disclaimer

The external hyperlinks found in this repository, and the information contained therein, do not constitute endorsement by the authors, and are used either for documentation purposes, or as examples.

Contributors

d-led
