muffet's Introduction

Muffet

Muffet is a website link checker which scrapes and inspects all pages in a website recursively.

Features

  • Massive speed
  • High compatibility with web browsers
  • Support for many tags (a, img, link, script, etc.)
  • Multiple output formats (text, JSON, and JUnit XML)

Installation

go install github.com/raviqqe/muffet/v2@latest

Homebrew

brew install muffet

Usage

muffet https://shady.bakery.hotland

For more information, see muffet --help.

Docker

docker run raviqqe/muffet https://shady.bakery.hotland

GitHub Action

Currently, we do not provide any official one. Feel free to create an issue if you want!

License

MIT

muffet's People

Contributors

azeemba, dependabot-preview[bot], dependabot[bot], drichardson, johnfreeborg, kemokemo, lgtm-com[bot], lornajane, lpar, matlockx, muhme, nwidger, pmwmedia, popey, pschlump, raviqqe, renovate[bot], shivjm, xyproto


muffet's Issues

HTML `<base>`-element not respected

From MDN:

The HTML <base> element specifies the base URL to use for all relative URLs contained within a document. There can be only one <base> element in a document.

Muffet, however, always resolves relative URLs against the current document's URL, even when a <base> element is present.


To reproduce I created three example files hosted at:

http://files.basex.org/.temp-muffet/

Each file contains the same three links and the same <base> element, shared by all three documents:

<html>
  <head>
    <base href="/.temp-muffet/">
  </head>
  <body>
    <a href=".">Home</a>
    <a href="sub1">Sub 1</a>
    <a href="sub1/sub2">Sub 2</a>
  </body>
</html>

All links are valid, yet muffet reports ERRORs:

⇒  muffet -v http://files.basex.org/.temp-muffet/
http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/sub1/sub2
http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/
        OK      http://files.basex.org/.temp-muffet/sub1
        OK      http://files.basex.org/.temp-muffet/sub1/sub2
http://files.basex.org/.temp-muffet/sub1/sub2
        OK      http://files.basex.org/.temp-muffet/sub1/
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1 (invalid status code 404)
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1/sub2 (invalid status code 404)
http://files.basex.org/.temp-muffet/sub1/
        OK      http://files.basex.org/.temp-muffet/sub1/
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1 (invalid status code 404)
        ERROR   http://files.basex.org/.temp-muffet/sub1/sub1/sub2 (invalid status code 404)
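For reference, Go's net/url can model the resolution the HTML spec requires: the document's base URL is the base href resolved against the document's own URL, and each relative link is then resolved against that base. A minimal sketch using the URLs from the reproduction above:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The page at docURL declares <base href="/.temp-muffet/">.
	docURL, _ := url.Parse("http://files.basex.org/.temp-muffet/sub1/sub2")
	baseHref, _ := url.Parse("/.temp-muffet/")

	// The effective base URL is the base href resolved against the
	// document's URL; links must be resolved against it, not the document.
	base := docURL.ResolveReference(baseHref)

	for _, href := range []string{".", "sub1", "sub1/sub2"} {
		link, _ := url.Parse(href)
		// Prints the same URLs a browser would request.
		fmt.Println(base.ResolveReference(link))
	}
}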

Meaningful description?

I just starred this on GitHub, and I know that when I need it later, I won't be able to find it. Could you put "link checker" or something meaningful in the description / tags of the repo? Awesome project, btw!

Spaces are deleted from internal links (bookmarks)

Hello, and thanks for this tool, it's been really useful!

I encountered a small issue with href attributes:

With the following HTML:

<span id"this bookmark">foo</span>
<a href="#this bookmark">Click</a>

The checker returns id thisbookmark not found. The space inside the href is being deleted.

Any chance of avoiding this deletion (or is it done by an HTML parsing library you use)?

Thanks!

support http?

Thank you @raviqqe. It looks simple and useful.

I checked an http site with muffet.

It works on the index page, but it changes the site's links to https.

So it does not find the linked pages.

Do you have a plan to add support for http sites?

Links on redirected subdomain pages aren't handled properly

Say I have a 301 redirect on my site example.org:

example.org/foo/bar -> zoo.example.org/haz

Now, if a page on this site has href="/somewhere", the tool reports a 404 because it tries to fetch:

example.org/somewhere, which of course it can't find, because the /somewhere page is on the zoo subdomain.
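In Go's standard HTTP client, the final URL after redirects is available on the response; relative links need to be resolved against it rather than the originally requested URL. A sketch of the distinction (the URLs are the hypothetical ones from this report):

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// net/http follows the 301 automatically; resp.Request.URL then holds
	// the final URL, e.g. http://zoo.example.org/haz rather than
	// http://example.org/foo/bar.
	resp, err := http.Get("http://example.org/foo/bar")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	link, _ := url.Parse("/somewhere")
	// Resolving against the final URL yields http://zoo.example.org/somewhere,
	// which is the page a browser would actually fetch.
	fmt.Println(resp.Request.URL.ResolveReference(link))
}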

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/golangci/[email protected] requires
	gopkg.in/[email protected] requires
	gopkg.in/[email protected]: invalid version: git -c protocol.version=0 fetch --unshallow -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /opt/go/gopath/pkg/mod/cache/vcs/9241c28341fcedca6a799ab7a465dd6924dc5d94044cbfabb75778817250adfc: exit status 128:
	fatal: The remote end hung up unexpectedly

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/grpc-ecosystem/[email protected] requires
	gopkg.in/[email protected]: invalid version: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /opt/go/gopath/pkg/mod/cache/vcs/748bced43cf7672b862fbc52430e98581510f4f2c34fb30c0064b7102a68ae2c: exit status 128:
	fatal: The remote end hung up unexpectedly

View the update logs.

Error: "no free connections available to host"

Many requests fail with "no free connections available to host", even when I specify -c1 on the command line. The error seems to come directly from fasthttp, but it looks like muffet is trying to create new connections before the previous ones have been closed completely.
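The error text comes from fasthttp, which enforces a per-host connection cap and fails immediately (rather than queueing) when the pool is exhausted. Raising the cap is one client-side workaround; a sketch, not muffet's actual configuration:

package main

import (
	"fmt"

	"github.com/valyala/fasthttp"
)

func main() {
	// fasthttp returns "no free connections available to host" as soon as
	// MaxConnsPerHost connections are in flight to a single host. Raising
	// the cap (or throttling/retrying in the caller) avoids the error.
	client := &fasthttp.Client{MaxConnsPerHost: 1024}

	status, _, err := client.Get(nil, "https://example.com/")
	fmt.Println(status, err)
}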

Doesn't support links with hashes

Testing this tool with my website gives a lot of great information (thanks!), but it runs into a few issues, one of them being a link like:

https://en.wikipedia.org/wiki/Bell_Labs#1970s

It reports this link as a 400, I would guess because it's sending the full URL to the server instead of requesting only the portion

https://en.wikipedia.org/wiki/Bell_Labs

that a web browser or other client would request; the #1970s fragment is not part of the server request path.
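The fix is to strip the fragment before issuing the request and check it separately against the ids in the fetched page. A small sketch with net/url:

package main

import (
	"fmt"
	"net/url"
)

// splitFragment returns the URL to send to the server (fragment removed,
// since fragments are resolved client-side) plus the fragment itself.
func splitFragment(raw string) (string, string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	fragment := u.Fragment
	u.Fragment = ""
	return u.String(), fragment, nil
}

func main() {
	addr, frag, _ := splitFragment("https://en.wikipedia.org/wiki/Bell_Labs#1970s")
	fmt.Println(addr) // https://en.wikipedia.org/wiki/Bell_Labs
	fmt.Println(frag) // "1970s"; check this against the page's ids
}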

--follow-robots-txt and --follow-sitemap-xml don't respect --skip-tls-verification

The following commands fail to bypass TLS verification even when the --skip-tls-verification / -x flag is added:

$ muffet -s -x https://untrusted-root.badssl.com/
Get https://untrusted-root.badssl.com/sitemap.xml: x509: certificate signed by unknown authority
$ muffet -r -x https://untrusted-root.badssl.com/
Get https://untrusted-root.badssl.com/robots.txt: x509: certificate signed by unknown authority
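Whatever HTTP library is used, the fix is for the robots.txt and sitemap.xml fetches to share the main client's TLS configuration instead of using a default one. A net/http sketch of the intended behavior (muffet itself is built on fasthttp, which has an equivalent TLSConfig field):

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// One shared client whose transport skips certificate verification;
	// every fetch, including robots.txt and sitemap.xml, must go through it.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://untrusted-root.badssl.com/robots.txt")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}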

Repeated link checks and zillions of timeouts

I just tried:

muffet --timeout 60 https://istio.io

The result is 8,000 lines of output indicating that a great many pages experienced timeout errors. Any given page is shown many times in the output, which seems to indicate it is being processed multiple times (I would expect any given link to be visited only once during a given scan).

When I use the -v option, the output grows to at least 50,000 lines, and this time both pages that time out and pages that don't appear multiple times in the output.

So the first problem is that the same pages seem to be visited multiple times.

The second problem is the timeouts themselves. I gave it a 60-second timeout, and I don't believe our CDN is unable to deliver the desired content within 60 seconds for so many pages. When accessing the links from a browser, they appear instantaneously.

I changed my command line to:

muffet -c 32 --timeout 10 https://istio.io

This eliminated all the timeout errors, so the problem is on my side and not on the CDN. I'm running this on a Mac; perhaps it has some limit on the number of open sockets (or something similar) that is being reported as timeouts by the link checker. It might be good to dynamically detect this and tune down the concurrency accordingly.

Url cache issue

Muffet visits the same URL many times. Shouldn't it visit each URL just once?

In this case it makes LinkedIn send us a 999 status code:

muffet -c 10 http://www.ximenavengoechea.com/
http://www.ximenavengoechea.com/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/illustrations/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/contact/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/products/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/design/
	999	http://www.linkedin.com/in/ximenavengoechea
http://www.ximenavengoechea.com/faqs/
	999	http://www.linkedin.com/in/ximenavengoechea
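Deduplicating is cheap with a concurrent set; a minimal sketch of the idea (hypothetical, not muffet's actual code):

package main

import (
	"fmt"
	"sync"
)

// checked records every URL that has already been requested, so each
// link is fetched at most once per run even when many goroutines
// scrape pages concurrently.
var checked sync.Map

func shouldCheck(url string) bool {
	_, alreadySeen := checked.LoadOrStore(url, struct{}{})
	return !alreadySeen
}

func main() {
	u := "http://www.linkedin.com/in/ximenavengoechea"
	fmt.Println(shouldCheck(u)) // true: first sighting, go fetch it
	fmt.Println(shouldCheck(u)) // false: already checked, reuse the result
}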

Use http status code instead of OK/FAIL

I had an idea that may make this more useful to the average user.

Instead of showing OK/FAIL, show the http status code. 2xx would be green, 3xx would be blue, 4xx would be yellow, 5xx would be red.

I could submit a PR if you think this is a good idea.

Split concurrency option into two

Currently, the concurrency option seems to define how many HTTP requests muffet makes at the same time overall.

That alone is not very useful on its own: I'm happy for muffet to make 512 concurrent HTTP requests, but not if all of those requests hit the same server.

Browsers typically have two settings (e.g. network.http.max-connections and network.http.max-connections-per-server in Firefox) to separate the two cases, and I believe muffet should separate them as well.
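A sketch of what the two limits could look like internally, using channel semaphores (the names and numbers are illustrative, not muffet's design):

package main

import (
	"net/url"
	"sync"
)

const (
	maxTotal   = 512 // overall in-flight requests
	maxPerHost = 8   // in-flight requests per server
)

var (
	global  = make(chan struct{}, maxTotal)
	mu      sync.Mutex
	perHost = map[string]chan struct{}{}
)

// acquire blocks until both the global and the per-host limit allow
// another request, and returns a function that releases both slots.
func acquire(rawURL string) (release func(), err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}

	mu.Lock()
	h, ok := perHost[u.Host]
	if !ok {
		h = make(chan struct{}, maxPerHost)
		perHost[u.Host] = h
	}
	mu.Unlock()

	global <- struct{}{} // count against the overall limit
	h <- struct{}{}      // and against this host's limit
	return func() { <-h; <-global }, nil
}

func main() {
	release, err := acquire("https://example.com/page")
	if err != nil {
		panic(err)
	}
	defer release()
	// ... perform the request here ...
}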

feature request: scan sitemap.xml

Hello,

I'm quite impressed by this project you have here! I was wondering if it would be possible to use muffet to check links in a particular sitemap?

Cheers,
Matt
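For what it's worth, extracting the URLs from a sitemap is a small amount of code; a rough sketch of the parsing side, assuming the standard sitemaps.org <urlset> format:

package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// urlset mirrors the minimal sitemaps.org structure: <urlset><url><loc>.
type urlset struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

func main() {
	resp, err := http.Get("https://example.com/sitemap.xml")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s urlset
	if err := xml.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	for _, u := range s.URLs {
		fmt.Println(u.Loc) // each of these would be fed to the checker
	}
}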

Question: limit check to img tags?

Hi Yota,

Just a question: is it possible to limit the check to a given tag like <img>?

In my special case I just want to check some specific links to images on a single URL.

Something like --include "page-0-cover-big.jpg"

Joerg

Publish binary releases for easier usage

Please consider uploading binary releases to GitHub Releases, so that people can quickly use this tool without installing a Go toolchain (which can be helpful in continuous integration). This is made easy thanks to tools like GoReleaser 🙂

Add support to read input HTML from a directory

When producing a web site from a static site generator like Hugo, it'd be highly desirable to be able to verify the links in a statically generated site on disk. So basically:

  • Instead of specifying a URL to fetch from, let me specify a file system directory

  • Iterate through all the HTML files and analyze them.

  • Provide a mapping layer to map the public location of the web site to the directory. So basically make it so https://foo/bar gets translated to mysite/bar.

I could then check the site's links before publishing.

I currently use htmlproofer for this task, but it is awfully slow and requires Ruby. Muffet seems like a much superior solution.

Thanks.
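Until something like this exists, one workaround is to serve the generated directory locally and point muffet at it; a sketch using Go's built-in file server ("public" is assumed here to be the generator's output directory):

package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve the statically generated site so that muffet (or any link
	// checker) can crawl it over HTTP.
	http.Handle("/", http.FileServer(http.Dir("public")))
	log.Println("serving on http://localhost:8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}

Then, in another terminal: muffet http://localhost:8000/. This doesn't provide the URL-mapping layer, but it covers the basic case.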

Problems with anchors

Using:

muffet -c 32 --timeout 10 --exclude http://localhost --exclude https://groups.google.com/forum --exclude https://github.com/istio/istio.io/issues/new.* https://istio.io 

Produces a bunch of incorrect error messages of the form:

https://istio.io/docs/reference/config/policy-and-telemetry/adapters/memquota/
	id #duration not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration
https://istio.io/docs/reference/config/networking/v1alpha3/envoy-filter/
	id #struct not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#struct
https://istio.io/docs/reference/config/policy-and-telemetry/adapters/kubernetesenv/
	id #duration not found	https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration

These ID values are all present in the target HTML page, so these errors shouldn't be produced.

x509: certificate signed by unknown authority error

I wanted to check my site (https://lazyd.org) with muffet. The site has a "COMODO CA Limited" certificate.

But it exits with an x509: certificate signed by unknown authority error.

If you know what causes it, please let me know.

Also, maybe adding an -insecure flag would be great, like git -c http.sslVerify=false.

Feature Request: Colors

Hello.

Would it be possible to implement a parameter that enables/disables colors in muffet?

Something like:

-C, --color <auto | always | never>    Use color [default: auto]

It would help to force color on some "dumb" terminals, or to disable it completely.

undefined: sync.Map

eparra@eparra-zscaler:~$ go get -u github.com/raviqqe/muffet
# github.com/raviqqe/muffet
go/src/github.com/raviqqe/muffet/fetcher.go:17: undefined: sync.Map

Cache already checked URLs

Hey, this looks mighty cool already. I noticed that when you point it at a URL, it requests the URLs it finds in links several times. I think it would be a good idea to add a cache, so that each URL is only checked once. Thanks!

Spaces removed from links (all URLs)

Related to #44; however, now I've come across a site (published on Read the Docs) which has many spaces in its filenames. Though the IDs are fine, the spaces get removed from the path, which breaks the links and results in many false 404s.

Perhaps instead of removing spaces, use the net/url Parse function only? Or remove spaces only from the URL.Fragment?

muffet/scraper.go

Lines 48 to 54 in 4998c9b

s := normalizeURL(scrape.Attr(n, a))
if s == "" || sc.isURLExcluded(s) {
	continue
}
u, err := url.Parse(s)

muffet/scraper.go

Lines 82 to 90 in 4998c9b

func normalizeURL(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1
		}
		return r
	}, s)
}

https://play.golang.org/p/42kUw1Rg23m
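One possible fix along the lines suggested above: trim only the surrounding whitespace and let net/url take care of escaping interior spaces (a sketch, not necessarily the change the project made):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL trims leading/trailing whitespace but keeps interior
// spaces intact, so paths with spaces survive and get percent-escaped.
func normalizeURL(s string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(s))
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	s, _ := normalizeURL("  https://example.readthedocs.io/My Page Name.html ")
	fmt.Println(s) // https://example.readthedocs.io/My%20Page%20Name.html
}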

Feature request: report error on links to redirects

I'm enjoying the performance of this tool! But to improve the user experience of the web and lower server costs, I like linking directly to the current URL of a page instead of linking to an old URL that is simply a redirect.

Could an option be added that reports all URLs leading to redirects as broken instead of valid? I tried -l 0 to turn off redirect following, but muffet still seems to follow at least one initial redirect.
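Detecting this is straightforward with Go's standard client, which can be told to surface the first response instead of following redirects; a sketch (the URL is illustrative):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// http.ErrUseLastResponse makes the client return the 3xx response
	// itself, so a redirecting link can be reported instead of followed.
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}

	resp, err := client.Get("http://example.com/old-url")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 300 && resp.StatusCode < 400 {
		fmt.Println("redirects to:", resp.Header.Get("Location"))
	}
}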

error when reading response headers: small read buffer

If I run muffet http://www.flickr.com/issf2018 I get this output:

error when reading response headers: small read buffer. Increase ReadBufferSize. Buffer size=4096, contents: "HTTP/1.1 302 Found\r\nDate: Fri, 24 Aug 2018 15:29:52 GMT\r\nContent-Type: text/html; charset=UTF-8\r\nContent-Length: 0\r\nP3p: policyref=\"https://policies.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR CUR ADM DEV"...".com https://*.maps.api.here.com https://*.maps.cit.api.here.com https://*.ads.yahoo.com https://cdn.siftscience.com; connect-src https://*.flickr.com https://*.flickr.net http://*.flickr.net https://"
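As the message says, this is fasthttp's response-header buffer (4096 bytes by default) being too small for Flickr's very large headers. Enlarging it on the client avoids the error; a sketch, not muffet's actual configuration:

package main

import (
	"fmt"

	"github.com/valyala/fasthttp"
)

func main() {
	// Allow response headers of up to 16 KiB instead of the 4 KiB default.
	client := &fasthttp.Client{ReadBufferSize: 16 * 1024}

	status, _, err := client.Get(nil, "http://www.flickr.com/issf2018")
	if err != nil {
		panic(err)
	}
	fmt.Println(status)
}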

Identical URLs requested multiple times

When I run muffet against a local site, I see in the logs that some pages are being requested many times in a single run. This seems unnecessary and puts extra load on the server being tested.

Here is a simple example. Create "test.html" with this content:

<html><body>
<a href="/foo.html">foo</a>
<a href="/test2.html">test2</a>
</body></html>

and "test2.html" with this:

<html><body>
<a href="/foo.html">foo</a>
</body></html>

Then serve this content with python3 -m http.server.

And run muffet http://localhost:8000/test.html.

The python http.server output I get is this:

Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /test2.html HTTP/1.1" 200 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -
127.0.0.1 - - [24/Aug/2018 16:14:01] code 404, message File not found
127.0.0.1 - - [24/Aug/2018 16:14:01] "GET /foo.html HTTP/1.1" 404 -

This shows that "/foo.html" was requested multiple times.

Strangely, small changes to those HTML files cause different results. If I add a link to test.html, muffet requests foo.html only once in the run.

Add verbose option

It's better not to print success outputs when the --verbose option is off, as users probably want to focus only on the errors that need fixing.

Install error?

When trying to install I receive the following error:

$ go get -u github.com/raviqqe/muffet

# github.com/raviqqe/muffet

go/src/github.com/raviqqe/muffet/fetcher.go:17: undefined: sync.Map

Any suggestions? Sorry, this is my first go-around with Go...


Named anchors

First of all, thanks for the amazing tool.

I noticed that muffet only supports anchors to HTML elements with an id attribute.

Consider that anchors also work on HTML elements with a name attribute.

<a name="Dependency_Scope" data-devgib="tagged">Dependency Scope</a>

Some examples:
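A sketch of how a checker can collect both kinds of anchor targets while walking the DOM, using the golang.org/x/net/html parser (illustrative, not muffet's actual scraper):

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// collectAnchors gathers both id and name attribute values, since a
// fragment like #Dependency_Scope may legitimately target either.
func collectAnchors(n *html.Node, targets map[string]bool) {
	if n.Type == html.ElementNode {
		for _, a := range n.Attr {
			if a.Key == "id" || a.Key == "name" {
				targets[a.Val] = true
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectAnchors(c, targets)
	}
}

func main() {
	doc, _ := html.Parse(strings.NewReader(
		`<a name="Dependency_Scope">Dependency Scope</a>`))
	targets := map[string]bool{}
	collectAnchors(doc, targets)
	fmt.Println(targets["Dependency_Scope"]) // true
}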

Add HTTP authentication to options

Currently, Muffet has no way to supply the username and password for an htpasswd-protected website.

This feature request is to add username and password options that would allow Muffet to log in to a site in order to check it.
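Hypothetical --username / --password options would boil down to attaching a basic-auth header to every request, roughly like this net/http sketch:

package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/protected/", nil)
	if err != nil {
		panic(err)
	}
	// Sends "Authorization: Basic <base64(user:secret)>".
	req.SetBasicAuth("user", "secret")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}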

Handle redirects

An HTTP redirect isn't necessarily a failing HTTP status code.

ERROR   XXX (invalid status code 307)
