muffet's Introduction

Muffet

Muffet is a website link checker which scrapes and inspects all pages in a website recursively.

Features

  • Massive speed
  • High compatibility with web browsers
  • Support for many tags (a, img, link, script, etc.)
  • Multiple output formats (text, JSON, and JUnit XML)

Installation

go install github.com/raviqqe/muffet/v2@latest

Homebrew

brew install muffet

Usage

muffet https://shady.bakery.hotland

For more information, see muffet --help.

Docker

docker run raviqqe/muffet https://shady.bakery.hotland

GitHub Action

Currently, we do not provide any official one. Feel free to create an issue if you want!

License

MIT

muffet's People

Contributors

azeemba, dependabot-preview[bot], dependabot[bot], drichardson, johnfreeborg, kemokemo, lgtm-com[bot], lornajane, lpar, matlockx, muhme, nwidger, pmwmedia, popey, pschlump, raviqqe, renovate[bot], shivjm, xyproto


muffet's Issues

HTML `<base>`-element not respected

From MDN:

The HTML <base> element specifies the base URL to use for all relative URLs contained within a document. There can be only one <base> element in a document.

Muffet, however, always resolves relative URLs against the current document's URL, even when a <base> element is present.


To reproduce I created three example files hosted at:

http://files.basex.org/.temp-muffet/

Each file contains the same three links and the same <base> element, shared by all three documents:

<html>
  <head>
    <base href="/.temp-muffet/">
  </head>
  <body>
    <a href=".">Home</a>
    <a href="sub1">Sub 1</a>
    <a href="sub1/sub2">Sub 2</a>
  </body>
</html>

All links are valid, yet muffet reports ERRORs:

⇒  muffet -v http://files.basex.org/.temp-muffet/
http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/sub1/sub2
http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/sub1/sub2
http://files.basex.org/.temp-muffet/sub1/sub2
        OK      http://files.basex.org/.temp-muffet/sub1/
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1 (invalid status code 404)
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1/sub2 (invalid status code 404)
http://files.basex.org/.temp-muffet/sub1/
        OK      http://files.basex.org/.temp-muffet/sub1/
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1 (invalid status code 404)
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1/sub2 (invalid status code 404)
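For reference, Go's net/url can model the resolution the HTML spec requires: the document's base URL is the base href resolved against the document's own URL, and each relative link is then resolved against that base. A minimal sketch using the URLs from the reproduction above:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The page at docURL declares <base href="/.temp-muffet/">.
	docURL, _ := url.Parse("http://files.basex.org/.temp-muffet/sub1/sub2")
	baseHref, _ := url.Parse("/.temp-muffet/")

	// The effective base URL is the base href resolved against the
	// document's URL; links must be resolved against it, not the document.
	base := docURL.ResolveReference(baseHref)

	for _, href := range []string{".", "sub1", "sub1/sub2"} {
		link, _ := url.Parse(href)
		// Prints the same URLs a browser would request.
		fmt.Println(base.ResolveReference(link))
	}
}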

Meaningful description?

I just starred this on GitHub, and I know that when I need it later, I won't be able to find it. Could you put "link checker" or something meaningful in the description / tags of the repo? Awesome project, btw!

Spaces are deleted from internal links (bookmarks)

Hello, and thanks for this tool, it's been really useful!

I encountered a small issue with href attributes:

With the following HTML:

<span id"this bookmark">foo</span>
<a href="#this bookmark">Click</a>

The checker returns id thisbookmark not found. The space inside the href is being deleted.

Any chance of avoiding this deletion (or is it done by an HTML parsing library you use)?

Thanks!

support http?

Thank you @raviqqe. It looks simple and useful.

I checked an http site with muffet.

It works on the index page, but it changes the site's links to https.

So it does not find the linked pages.

Do you have a plan to add support for http sites?

Links on redirected subdomain pages aren't handled properly

Say I have a 301 redirect on my site example.org:

example.org/foo/bar -> zoo.example.org/haz

Now, if a page on this site has href="/somewhere", the tool reports a 404 because it tries to fetch:

example.org/somewhere, which of course it can't find, because the /somewhere page is on the zoo subdomain.
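In Go's standard HTTP client, the final URL after redirects is available on the response; relative links need to be resolved against it rather than the originally requested URL. A sketch of the distinction (the URLs are the hypothetical ones from this report):

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// net/http follows the 301 automatically; resp.Request.URL then holds
	// the final URL, e.g. http://zoo.example.org/haz rather than
	// http://example.org/foo/bar.
	resp, err := http.Get("http://example.org/foo/bar")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	link, _ := url.Parse("/somewhere")
	// Resolving against the final URL yields http://zoo.example.org/somewhere,
	// which is the page a browser would actually fetch.
	fmt.Println(resp.Request.URL.ResolveReference(link))
}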

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/golangci/[email protected] requires
	gopkg.in/[email protected] requires
	gopkg.in/[email protected]: invalid version: git -c protocol.version=0 fetch --unshallow -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /opt/go/gopath/pkg/mod/cache/vcs/9241c28341fcedca6a799ab7a465dd6924dc5d94044cbfabb75778817250adfc: exit status 128:
	fatal: The remote end hung up unexpectedly

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/grpc-ecosystem/[email protected] requires
	gopkg.in/[email protected]: invalid version: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /opt/go/gopath/pkg/mod/cache/vcs/748bced43cf7672b862fbc52430e98581510f4f2c34fb30c0064b7102a68ae2c: exit status 128:
	fatal: The remote end hung up unexpectedly

View the update logs.

Error: "no free connections available to host"

Many requests fail with "no free connections available to host", even when I specify -c1 on the command line. The error seems to come directly from fasthttp, but it looks like muffet is trying to create new connections before the previous ones have been closed completely.
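The error text comes from fasthttp, which enforces a per-host connection cap and fails immediately (rather than queueing) when the pool is exhausted. Raising the cap is one client-side workaround; a sketch, not muffet's actual configuration:

package main

import (
	"fmt"

	"github.com/valyala/fasthttp"
)

func main() {
	// fasthttp returns "no free connections available to host" as soon as
	// MaxConnsPerHost connections are in flight to a single host. Raising
	// the cap (or throttling/retrying in the caller) avoids the error.
	client := &fasthttp.Client{MaxConnsPerHost: 1024}

	status, _, err := client.Get(nil, "https://example.com/")
	fmt.Println(status, err)
}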

Doesn't support links with hashes

Testing this tool with my website gives a lot of great information (thanks!), but it runs into a few issues, one of them being a link like:

https://en.wikipedia.org/wiki/Bell_Labs#1970s

It reports this link as a 400, I would guess because it's sending the full URL to the server instead of requesting only the portion

https://en.wikipedia.org/wiki/Bell_Labs

that a web browser or other client would request; the #1970s fragment is not part of the server request path.
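The fix is to strip the fragment before issuing the request and check it separately against the ids in the fetched page. A small sketch with net/url:

package main

import (
	"fmt"
	"net/url"
)

// splitFragment returns the URL to send to the server (fragment removed,
// since fragments are resolved client-side) plus the fragment itself.
func splitFragment(raw string) (string, string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	fragment := u.Fragment
	u.Fragment = ""
	return u.String(), fragment, nil
}

func main() {
	addr, frag, _ := splitFragment("https://en.wikipedia.org/wiki/Bell_Labs#1970s")
	fmt.Println(addr) // https://en.wikipedia.org/wiki/Bell_Labs
	fmt.Println(frag) // "1970s"; check this against the page's ids
}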

--follow-robots-txt and --follow-sitemap-xml don't respect --skip-tls-verification

The following commands fail to bypass TLS verification even when the --skip-tls-verification / -x flag is added:

$ muffet -s -x https://untrusted-root.badssl.com/
Get https://untrusted-root.badssl.com/sitemap.xml: x509: certificate signed by unknown authority
$ muffet -r -x https://untrusted-root.badssl.com/
Get https://untrusted-root.badssl.com/robots.txt: x509: certificate signed by unknown authority
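Whatever HTTP library is used, the fix is for the robots.txt and sitemap.xml fetches to share the main client's TLS configuration instead of using a default one. A net/http sketch of the intended behavior (muffet itself is built on fasthttp, which has an equivalent TLSConfig field):

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// One shared client whose transport skips certificate verification;
	// every fetch, including robots.txt and sitemap.xml, must go through it.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://untrusted-root.badssl.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}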

Repeated link checks and zillions of timeouts

I just tried:

muffet --timeout 60 https://istio.io

The result is 8,000 lines of output indicating that a great many pages experienced timeout errors. Any given page is shown many times in the output, which seems to indicate it is being processed multiple times (I would expect any given link to be visited only once during a given scan).

When I use the -v option, the output grows to at least 50,000 lines, and this time both pages that time out and pages that don't appear multiple times in the output.

So the first problem is that the same pages seem to be visited multiple times.

The second problem is the timeouts themselves. I gave it a 60-second timeout, and I don't believe our CDN is unable to deliver the desired content within 60 seconds for so many pages. When accessing the links from a browser, they appear instantaneously.

I changed my command line to:

muffet -c 32 --timeout 10 https://istio.io

This eliminated all the timeout errors, so the problem is on my side and not on the CDN. I'm running this on a Mac; perhaps it has some limit on the number of open sockets (or something similar) that is being reported as timeouts by the link checker. It might be good to dynamically detect this and tune down the concurrency accordingly.

Url cache issue

Muffet visits the same URL many times. Shouldn't it visit each URL just once?

In this case it makes LinkedIn send us a 999 status code:

muffet -c 10 http://www.ximenavengoechea.com/
http://www.ximenavengoechea.com/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/illustrations/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/contact/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/products/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/design/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/faqs/
	999	http://www.linkedin.com/in/ximenavengoechea
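Deduplicating is cheap with a concurrent set; a minimal sketch of the idea (hypothetical, not muffet's actual code):

package main

import (
	"fmt"
	"sync"
)

// checked records every URL that has already been requested, so each
// link is fetched at most once per run even when many goroutines
// scrape pages concurrently.
var checked sync.Map

func shouldCheck(url string) bool {
	_, alreadySeen := checked.LoadOrStore(url, struct{}{})
	return !alreadySeen
}

func main() {
	u := "http://www.linkedin.com/in/ximenavengoechea"
	fmt.Println(shouldCheck(u)) // true: first sighting, go fetch it
	fmt.Println(shouldCheck(u)) // false: already checked, reuse the result
}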

Use http status code instead of OK/FAIL

I had an idea that may make this more useful to the average user.

Instead of showing OK/FAIL, show the http status code. 2xx would be green, 3xx would be blue, 4xx would be yellow, 5xx would be red.

I could submit a PR if you think this is a good idea.

Split concurrency option into two

Currently, the concurrency option seems to define how many HTTP requests muffet makes at the same time overall.

That alone is not very useful on its own: I'm happy for muffet to make 512 concurrent HTTP requests, but not if all of those requests hit the same server.

Browsers typically have two settings (e.g. network.http.max-connections and network.http.max-connections-per-server in Firefox) to separate the two cases, and I believe muffet should separate them as well.
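A sketch of what the two limits could look like internally, using channel semaphores (the names and numbers are illustrative, not muffet's design):

package main

import (
	"net/url"
	"sync"
)

const (
	maxTotal   = 512 // overall in-flight requests
	maxPerHost = 8   // in-flight requests per server
)

var (
	global  = make(chan struct{}, maxTotal)
	mu      sync.Mutex
	perHost = map[string]chan struct{}{}
)

// acquire blocks until both the global and the per-host limit allow
// another request, and returns a function that releases both slots.
func acquire(rawURL string) (release func(), err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}

	mu.Lock()
	h, ok := perHost[u.Host]
	if !ok {
		h = make(chan struct{}, maxPerHost)
		perHost[u.Host] = h
	}
	mu.Unlock()

	global <- struct{}{} // count against the overall limit
	h <- struct{}{}      // and against this host's limit
	return func() { <-h; <-global }, nil
}

func main() {
	release, err := acquire("https://example.com/page")
	if err != nil {
		panic(err)
	}
	defer release()
	// ... perform the request here ...
}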

feature request: scan sitemap.xml

Hello,

I'm quite impressed by this project you have here! I was wondering if it would be possible to use muffet to check links in a particular sitemap?

Cheers,
Matt
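For what it's worth, extracting the URLs from a sitemap is a small amount of code; a rough sketch of the parsing side, assuming the standard sitemaps.org <urlset> format:

package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// urlset mirrors the minimal sitemaps.org structure: <urlset><url><loc>.
type urlset struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

func main() {
	resp, err := http.Get("https://example.com/sitemap.xml")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s urlset
	if err := xml.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	for _, u := range s.URLs {
		fmt.Println(u.Loc) // each of these would be fed to the checker
	}
}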

Question: limit check to img tags?

Hi Yota,

Just a question: is it possible to limit the check to a given tag like <img>?

In my special case I just want to check some specific links to images on a single URL.

Something like --include "page-0-cover-big.jpg"

Joerg

Publish binary releases for easier usage

Please consider uploading binary releases to GitHub Releases, so that people can quickly use this tool without installing a Go toolchain (which can be helpful in continuous integration). This is made easy thanks to tools like GoReleaser 🙂

Add support to read input HTML from a directory

When producing a web site from a static site generator like Hugo, it'd be highly desirable to be able to verify the links in a statically generated site on disk. So basically:

  • Instead of specifying a URL to fetch from, let me specify a file system directory

  • Iterate through all the HTML files and analyze them.

  • Provide a mapping layer to map the public location of the web site to the directory. So basically make it so https://foo/bar gets translated to mysite/bar.

I could then check the site's links before publishing.

I currently use htmlproofer for this task, but it is awfully slow and requires Ruby. Muffet seems like a much superior solution.

Thanks.
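Until something like this exists, one workaround is to serve the generated directory locally and point muffet at it; a sketch using Go's built-in file server ("public" is assumed here to be the generator's output directory):

package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve the statically generated site so that muffet (or any link
	// checker) can crawl it over HTTP.
	http.Handle("/", http.FileServer(http.Dir("public")))
	log.Println("serving on http://localhost:8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}

Then, in another terminal: muffet http://localhost:8000/. This doesn't provide the URL-mapping layer, but it covers the basic case.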

Problems with anchors

Using:

muffet -c 32 --timeout 10 --exclude http://localhost --exclude https://groups.google.com/forum --exclude https://github.com/istio/istio.io/issues/new.* https://istio.io 

Produces a bunch of incorrect error messages of the form:

https://istio.io/docs/reference/config/policy-and-telemetry/adapters/memquota/
	id #duration not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration
https://istio.io/docs/reference/config/networking/v1alpha3/envoy-filter/
	id #struct not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#struct
https://istio.io/docs/reference/config/policy-and-telemetry/adapters/kubernetesenv/
	id #duration not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration

These ID values are all present in the target HTML page, so these errors shouldn't be produced.

x509: certificate signed by unknown authority error

I wanted to check my site (https://lazyd.org) with muffet. The site has a "COMODO CA Limited" certificate.

But it exits with an x509: certificate signed by unknown authority error.

If you know what causes it, please let me know.

Also, maybe adding an -insecure flag would be great, like git -c http.sslVerify=false.

Feature Request: Colors

Hello.

Would it be possible to implement a parameter that enables/disables colors in muffet?

Something like:

-C, --color <auto | always | never>    Use color [default: auto]

It would help to force color on some "dumb" terminals, or to disable it completely.

undefined: sync.Map

eparra@eparra-zscaler:~$ go get -u github.com/raviqqe/muffet
# github.com/raviqqe/muffet
go/src/github.com/raviqqe/muffet/fetcher.go:17: undefined: sync.Map

Cache already checked URLs

Hey, this looks mighty cool already. I noticed that when you point it at a URL, it requests the URLs it finds in links several times. I think it would be a good idea to add a cache, so that each URL is only checked once. Thanks!

Spaces removed from links (all URLs)

Related to #44; however, now I've come across a site (published on Read the Docs) which has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

Perhaps instead of removing spaces, use the net/url Parse function only? Or remove spaces only from the URL.Fragment?

muffet/scraper.go

Lines 48 to 54 in 4998c9b

s := normalizeURL(scrape.Attr(n, a))
if s == "" || sc.isURLExcluded(s) {
	continue
}
u, err := url.Parse(s)

muffet/scraper.go

Lines 82 to 90 in 4998c9b

func normalizeURL(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1
		}
		return r
	}, s)
}

https://play.golang.org/p/42kUw1Rg23m
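One possible fix along the lines suggested above: trim only the surrounding whitespace and let net/url take care of escaping interior spaces (a sketch, not necessarily the change the project made):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL trims leading/trailing whitespace but keeps interior
// spaces intact, so paths with spaces survive and get percent-escaped.
func normalizeURL(s string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(s))
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	s, _ := normalizeURL("  https://example.readthedocs.io/My Page Name.html ")
	fmt.Println(s) // https://example.readthedocs.io/My%20Page%20Name.html
}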

Feature request: report error on links to redirects

I'm enjoying the performance of this tool! But to improve the user experience of the web and lower server costs, I like linking directly to the current URL of a page instead of linking to an old URL that is simply a redirect.

Could an option be added that reports all URLs leading to redirects as broken instead of valid? I tried -l 0 to turn off redirect following, but muffet still seems to follow at least one initial redirect.
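Detecting this is straightforward with Go's standard client, which can be told to surface the first response instead of following redirects; a sketch (the URL is illustrative):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// http.ErrUseLastResponse makes the client return the 3xx response
	// itself, so a redirecting link can be reported instead of followed.
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}

	resp, err := client.Get("http://example.com/old-url")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 && resp.StatusCode < 400 {
		fmt.Println("redirects to:", resp.Header.Get("Location"))
	}
}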

error when reading response headers: small read buffer

If I run muffet http://www.flickr.com/issf2018 I get this output:

error when reading response headers: small read buffer. Increase ReadBufferSize. Buffer size=4096, contents: "HTTP/1.1 302 Found\r\nDate: Fri, 24 Aug 2018 15:29:52 GMT\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Length: 0\r\nP3p: policyref=\"https://policies.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR CUR ADM DEV"...".com https://*.maps.api.here.com https://*.maps.cit.api.here.com https://*.ads.yahoo.com https://cdn.siftscience.com; connect-src https://*.flickr.com https://*.flickr.net http://*.flickr.net https://"
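As the message says, this is fasthttp's response-header buffer (4096 bytes by default) being too small for Flickr's very large headers. Enlarging it on the client avoids the error; a sketch, not muffet's actual configuration:

package main

import (
	"fmt"

	"github.com/valyala/fasthttp"
)

func main() {
	// Allow response headers of up to 16 KiB instead of the 4 KiB default.
	client := &fasthttp.Client{ReadBufferSize: 16 * 1024}

	status, _, err := client.Get(nil, "http://www.flickr.com/issf2018")
	if err != nil {
		panic(err)
	}
	fmt.Println(status)
}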

Identical URLs requested multiple times

When I run muffet against a local site, I see in the logs that some pages are being requested many times in a single run. This seems unnecessary and puts extra load on the server being tested.

Here is a simple example. Create "test.html" with this content:

<html><body>
<a href="/foo.html">foo</a>
<a href="/test2.html">test2</a>
</body></html>

and "test2.html" with this:

<html><body>
<a href="/foo.html">foo</a>
</body></html>

Then serve this content with python3 -m http.server.

And run muffet http://localhost:8000/test.html.

The python http.server output I get is this:

Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -

This shows that "/foo.html" was requested multiple times.

Strangely, small changes to those HTML files cause different results. If I add a link to test.html, muffet requests foo.html only once in the run.

Add verbose option

It's better not to print success outputs when the --verbose option is off, as users probably want to focus only on the errors that need fixing.

Install error?

When trying to install I receive the following error:

$ go get -u github.com/raviqqe/muffet

# github.com/raviqqe/muffet

go/src/github.com/raviqqe/muffet/fetcher.go:17: undefined: sync.Map

Any suggestions? Sorry, this is my first go-around with Go...


Named anchors

First of all, thanks for the amazing tool.

I noticed that muffet only supports anchors to HTML elements with an id attribute.

Consider that anchors also work on HTML elements with a name attribute.

<a name="Dependency_Scope" data-devgib="tagged">Dependency Scope</a>

Some examples:
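A sketch of how a checker can collect both kinds of anchor targets while walking the DOM, using the golang.org/x/net/html parser (illustrative, not muffet's actual scraper):

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// collectAnchors gathers both id and name attribute values, since a
// fragment like #Dependency_Scope may legitimately target either.
func collectAnchors(n *html.Node, targets map[string]bool) {
	if n.Type == html.ElementNode {
		for _, a := range n.Attr {
			if a.Key == "id" || a.Key == "name" {
				targets[a.Val] = true
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectAnchors(c, targets)
	}
}

func main() {
	doc, _ := html.Parse(strings.NewReader(
		`<a name="Dependency_Scope">Dependency Scope</a>`))
	targets := map[string]bool{}
	collectAnchors(doc, targets)
	fmt.Println(targets["Dependency_Scope"]) // true
}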

Add HTTP authentication to options

Currently, Muffet has no way to supply the username and password for an htpasswd-protected website.

This feature request is to add username and password options that would allow Muffet to log in to a site in order to check it.
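Hypothetical --username / --password options would boil down to attaching a basic-auth header to every request, roughly like this net/http sketch:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/protected/", nil)
	if err != nil {
		panic(err)
	}
	// Sends "Authorization: Basic <base64(user:secret)>".
	req.SetBasicAuth("user", "secret")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}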

Handle redirects

An HTTP redirect isn't necessarily a failing HTTP status code.

ERROR   XXX (invalid status code 307)
