
Spidy

A tool that crawls websites to find domain names and check their availability.

Install

git clone https://github.com/twiny/spidy.git
cd ./spidy

# build
go build -o bin/spidy -v cmd/spidy/main.go

# run
./bin/spidy -c config/config.yaml -u https://github.com

Usage

NAME:
   Spidy - Domain name scraper

USAGE:
   spidy [global options] command [command options] [arguments...]

VERSION:
   2.0.0

COMMANDS:
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --config path, -c path  path to config file
   --help, -h              show help (default: false)
   --urls urls, -u urls    urls of page to scrape  (accepts multiple inputs)
   --version, -v           print the version (default: false)

Configuration

# main crawler config
crawler:
    max_depth: 10 # max depth of pages to visit per website.
    # filter: [] # regexp filter
    rate_limit: "1/5s" # 1 request per 5 sec
    max_body_size: "20MB" # max page body size
    user_agents: # array of user-agents
      - "Spidy/2.1; +https://github.com/ twiny/spidy"
    # proxies: [] # array of proxy. http(s), SOCKS5
# Logs
log:
    rotate: 7 # log rotation
    path: "./log" # log directory
# Store
store:
    ttl: "24h" # keep cache for 24h 
    path: "./store" # store directory
# Results
result:
    path: ./result # result directory
parallel: 3 # number of concurrent workers
timeout: "5m" # request timeout
tlds: ["biz", "cc", "com", "edu", "info", "net", "org", "tv"] # array of domain extension to check.
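The comment on `rate_limit` suggests an "n/duration" format (1 request per 5 seconds). A minimal sketch of how such a value could be turned into a per-request delay — the format is inferred from the config comment, and `parseRateLimit` is an illustrative helper, not spidy's actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRateLimit interprets strings like "1/5s" (1 request per 5
// seconds) and returns the minimum delay between requests.
func parseRateLimit(s string) (time.Duration, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("invalid rate limit %q", s)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid request count in %q", s)
	}
	window, err := time.ParseDuration(parts[1])
	if err != nil {
		return 0, err
	}
	return window / time.Duration(n), nil
}

func main() {
	d, err := parseRateLimit("1/5s")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 5s between requests
}
```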

TODO

  • Add support to more writers.
  • Add terminal logging.
  • Add test cases.

Issues

NOTE: This package is provided "as is" with no guarantee. Use it at your own risk and always test it yourself before using it in a production environment. If you find any issues, please create a new issue.

spidy's People

Contributors

twiny

spidy's Issues

panic: send on closed channel

Encountered this error with go version go1.15.6 linux/amd64 on Ubuntu 20.04.1 LTS and the following command line:

curl -v -silent -l https://twitter.com --stderr - | awk '/^content-security-policy:/' | grep -Eo "[a-zA-Z0-9./?=_-]*" | sed -e '/\./!d' -e '/[^A-Za-z0-9._-]/d' -e 's/^\.//' | sort -u | httpx -o spidy/input.txt && spidy -config spidy/setting.yaml

Welcome, Spidy is running.
false schema.org
false googletagmanager.com
false giphy.com
false facebook.net
false google-analytics.com
false vine.co
false twitter.com
false t.co
false doubleclick.net
false vineapp.com
false nytimes.com
false twimg.com
false pscp.tv
false google.com
false apple.com
false bit.ly
false happs.tv
false n.tw
false jquery.com
false wordpress.com
false pocoo.org
false apache.org
false jquery.org
false gruntjs.com
false emberjs.com
false github.com
false mattt.me
false bower.io
false stylus-lang.com
false ogp.me
false ethicspointvp.com
false jamsadr.com
false dataprotection.ie
false privacyshield.gov
panic: send on closed channel

goroutine 172 [running]:
github.com/superiss/spidy/crawler.(*Spider).extract(0xc000276600, 0xc000586000, 0x3b7d, 0x3e00)
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:268 +0x105
github.com/superiss/spidy/crawler.(*Spider).Run.func4(0xc000276600, 0xc00006cea0)
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:352 +0x71
created by github.com/superiss/spidy/crawler.(*Spider).Run
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:350 +0x21b

setting.yaml:

Engine:
  worker: 100
  parallel: 10
  depth: 10
  urls: ./spidy/input.txt
  proxies: []
  #tlds: [com, net, io, co, ly, me, us, at, st, so]
  tlds: []
  random_delay: 5s
  timeout: 30s

httpx can be found here: https://github.com/projectdiscovery/httpx
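The panic above occurs when a goroutine sends on a results channel after another goroutine has already closed it. The standard Go fix for this pattern (a general sketch, not necessarily how spidy resolved the issue) is to close the channel only after every sender has finished, coordinated with a `sync.WaitGroup`:

```go
package main

import (
	"fmt"
	"sync"
)

// collect fans n worker goroutines into one results channel,
// much like spidy's extract goroutines feed a shared channel.
func collect(n int) int {
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- fmt.Sprintf("domain-%d.com", id)
		}(i)
	}

	// Close the channel only once every sender has returned;
	// closing earlier is what triggers "send on closed channel".
	go func() {
		wg.Wait()
		close(results)
	}()

	count := 0
	for range results {
		count++
	}
	return count
}

func main() {
	fmt.Println(collect(3)) // 3
}
```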

Document how -u can accept multiple urls

Is your feature request related to a problem? Please describe.
I am not sure if it's -u url1.com url2.com, -u url1.com,url2.com, or -u url1.com -u url2.com.

I'm sorry, I'm not familiar enough with Go yet to figure it out, but if you explain, I'll submit a pull request to update the README.
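In Go CLIs, a flag described as "accepts multiple inputs" usually means the flag is repeated, one value per occurrence (i.e. -u url1.com -u url2.com). A minimal stdlib sketch of that pattern — this mirrors the common convention, it is not spidy's actual flag-handling code:

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// urlList collects every occurrence of a repeated -u flag.
type urlList []string

func (u *urlList) String() string     { return strings.Join(*u, ",") }
func (u *urlList) Set(v string) error { *u = append(*u, v); return nil }

// parseURLs parses args as if they were spidy's command line.
func parseURLs(args []string) []string {
	var urls urlList
	fs := flag.NewFlagSet("spidy", flag.ContinueOnError)
	fs.Var(&urls, "u", "url of page to scrape (repeatable)")
	fs.Parse(args)
	return urls
}

func main() {
	// Equivalent to: spidy -u https://github.com -u https://gitlab.com
	got := parseURLs([]string{"-u", "https://github.com", "-u", "https://gitlab.com"})
	fmt.Println(len(got)) // 2
}
```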

Returns .tv domains as available even though they are not

Describe the bug
The crawler marks domains like justin.tv (the old twitch.tv) as available in the results .csv when they are not.

To Reproduce

  1. install
  2. run on any website
  3. wait to see results

Expected behavior
.tv domains are correctly categorized.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):
Tested on Windows 10, Linux, and Alpine Linux.
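A common cause for this class of bug is that availability is inferred from WHOIS response text, and each registry words its "not found" reply differently, so a checker that only knows one generic marker misreports TLDs like .tv. A hedged stdlib sketch of the idea — the marker strings and the `available` helper are illustrative, not spidy's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// notFoundMarkers maps a TLD to the phrase its WHOIS server uses
// for unregistered names. Entries here are illustrative: if a TLD's
// real marker differs from what the checker expects, registered
// domains get misreported as available.
var notFoundMarkers = map[string]string{
	"com": "No match for",
	"net": "No match for",
	"tv":  "No Data Found", // hypothetical registry-specific phrasing
}

// available reports whether the WHOIS response indicates an
// unregistered domain for the given TLD.
func available(tld, whoisResponse string) bool {
	marker, ok := notFoundMarkers[tld]
	if !ok {
		// Unknown TLD: refusing to guess avoids false positives.
		return false
	}
	return strings.Contains(whoisResponse, marker)
}

func main() {
	// A registered .tv domain returns full record data, not a
	// "not found" marker, so it must not be reported as available.
	resp := "Domain Name: JUSTIN.TV\nRegistrar: ..."
	fmt.Println(available("tv", resp)) // false
}
```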
