
Spidy

A tool that crawls websites to find domain names and check their availability.

Install

git clone https://github.com/twiny/spidy.git
cd ./spidy

# build
go build -o bin/spidy -v cmd/spidy/main.go

# run
./bin/spidy -c config/config.yaml -u https://github.com

Usage

NAME:
   Spidy - Domain name scraper

USAGE:
   spidy [global options] command [command options] [arguments...]

VERSION:
   2.0.0

COMMANDS:
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --config path, -c path  path to config file
   --help, -h              show help (default: false)
   --urls urls, -u urls    urls of page to scrape  (accepts multiple inputs)
   --version, -v           print the version (default: false)

Configuration

# main crawler config
crawler:
    max_depth: 10 # max depth of pages to visit per website.
    # filter: [] # regexp filter
    rate_limit: "1/5s" # 1 request per 5 sec
    max_body_size: "20MB" # max page body size
    user_agents: # array of user-agents
      - "Spidy/2.1; +https://github.com/ twiny/spidy"
    # proxies: [] # array of proxy. http(s), SOCKS5
# Logs
log:
    rotate: 7 # log rotation
    path: "./log" # log directory
# Store
store:
    ttl: "24h" # keep cache for 24h 
    path: "./store" # store directory
# Results
result:
    path: ./result # result directory
parallel: 3 # number of concurrent workers
timeout: "5m" # request timeout
tlds: ["biz", "cc", "com", "edu", "info", "net", "org", "tv"] # array of domain extension to check.
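The comment on `rate_limit` suggests an "n/duration" format (1 request per 5 seconds). A minimal sketch of how such a value could be turned into a per-request delay — the format is inferred from the config comment, and `parseRateLimit` is an illustrative helper, not spidy's actual parser:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRateLimit interprets strings like "1/5s" (1 request per 5
// seconds) and returns the minimum delay between requests.
func parseRateLimit(s string) (time.Duration, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("invalid rate limit %q", s)
	}
	n, err := strconv.Atoi(parts[0])
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid request count in %q", s)
	}
	window, err := time.ParseDuration(parts[1])
	if err != nil {
		return 0, err
	}
	return window / time.Duration(n), nil
}

func main() {
	d, err := parseRateLimit("1/5s")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // 5s between requests
}
```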

TODO

  • Add support to more writers.
  • Add terminal logging.
  • Add test cases.

Issues

NOTE: This package is provided "as is" with no guarantee. Use it at your own risk and always test it yourself before using it in a production environment. If you find any issues, please create a new issue.

spidy's People

Contributors

twiny

spidy's Issues

panic: send on closed channel

Encountered this error with go version go1.15.6 linux/amd64 on Ubuntu 20.04.1 LTS and the following command line:

curl -v -silent -l https://twitter.com --stderr - | awk '/^content-security-policy:/' | grep -Eo "[a-zA-Z0-9./?=_-]*" | sed -e '/\./!d' -e '/[^A-Za-z0-9._-]/d' -e 's/^\.//' | sort -u | httpx -o spidy/input.txt && spidy -config spidy/setting.yaml

Welcome, Spidy is running.
false schema.org
false googletagmanager.com
false giphy.com
false facebook.net
false google-analytics.com
false vine.co
false twitter.com
false t.co
false doubleclick.net
false vineapp.com
false nytimes.com
false twimg.com
false pscp.tv
false google.com
false apple.com
false bit.ly
false happs.tv
false n.tw
false jquery.com
false wordpress.com
false pocoo.org
false apache.org
false jquery.org
false gruntjs.com
false emberjs.com
false github.com
false mattt.me
false bower.io
false stylus-lang.com
false ogp.me
false ethicspointvp.com
false jamsadr.com
false dataprotection.ie
false privacyshield.gov
panic: send on closed channel

goroutine 172 [running]:
github.com/superiss/spidy/crawler.(*Spider).extract(0xc000276600, 0xc000586000, 0x3b7d, 0x3e00)
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:268 +0x105
github.com/superiss/spidy/crawler.(*Spider).Run.func4(0xc000276600, 0xc00006cea0)
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:352 +0x71
created by github.com/superiss/spidy/crawler.(*Spider).Run
        /root/go/pkg/mod/github.com/superiss/[email protected]/crawler/crawler.go:350 +0x21b

setting.yaml:

Engine:
  worker: 100
  parallel: 10
  depth: 10
  urls: ./spidy/input.txt
  proxies: []
  #tlds: [com, net, io, co, ly, me, us, at, st, so]
  tlds: []
  random_delay: 5s
  timeout: 30s

httpx can be found here: https://github.com/projectdiscovery/httpx
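The panic above occurs when a goroutine sends on a results channel after another goroutine has already closed it. The standard Go fix for this pattern (a general sketch, not necessarily how spidy resolved the issue) is to close the channel only after every sender has finished, coordinated with a `sync.WaitGroup`:

```go
package main

import (
	"fmt"
	"sync"
)

// collect fans n worker goroutines into one results channel,
// much like spidy's extract goroutines feed a shared channel.
func collect(n int) int {
	results := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			results <- fmt.Sprintf("domain-%d.com", id)
		}(i)
	}

	// Close the channel only once every sender has returned;
	// closing earlier is what triggers "send on closed channel".
	go func() {
		wg.Wait()
		close(results)
	}()

	count := 0
	for range results {
		count++
	}
	return count
}

func main() {
	fmt.Println(collect(3)) // 3
}
```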

Document how -u can accept multiple urls

Is your feature request related to a problem? Please describe.
I am not sure if it's -u url1.com url2.com, -u url1.com,url2.com, or -u url1.com -u url2.com.

I'm sorry, I'm not familiar enough with Go yet to figure it out, but if you explain, I'll submit a pull request to update the README.
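In Go CLIs, a flag described as "accepts multiple inputs" usually means the flag is repeated, one value per occurrence (i.e. -u url1.com -u url2.com). A minimal stdlib sketch of that pattern — this mirrors the common convention, it is not spidy's actual flag-handling code:

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// urlList collects every occurrence of a repeated -u flag.
type urlList []string

func (u *urlList) String() string     { return strings.Join(*u, ",") }
func (u *urlList) Set(v string) error { *u = append(*u, v); return nil }

// parseURLs parses args as if they were spidy's command line.
func parseURLs(args []string) []string {
	var urls urlList
	fs := flag.NewFlagSet("spidy", flag.ContinueOnError)
	fs.Var(&urls, "u", "url of page to scrape (repeatable)")
	fs.Parse(args)
	return urls
}

func main() {
	// Equivalent to: spidy -u https://github.com -u https://gitlab.com
	got := parseURLs([]string{"-u", "https://github.com", "-u", "https://gitlab.com"})
	fmt.Println(len(got)) // 2
}
```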

Returns .tv domains as available even though they are not

Describe the bug
The crawler marks domains like justin.tv (the old twitch.tv) as available in the results .csv when they are not.

To Reproduce

  1. install
  2. run on any website
  3. wait to see results

Expected behavior
.tv domains are correctly categorized.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):
Tested on Windows 10, Linux, and Alpine Linux.
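A common cause for this class of bug is that availability is inferred from WHOIS response text, and each registry words its "not found" reply differently, so a checker that only knows one generic marker misreports TLDs like .tv. A hedged stdlib sketch of the idea — the marker strings and the `available` helper are illustrative, not spidy's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// notFoundMarkers maps a TLD to the phrase its WHOIS server uses
// for unregistered names. Entries here are illustrative: if a TLD's
// real marker differs from what the checker expects, registered
// domains get misreported as available.
var notFoundMarkers = map[string]string{
	"com": "No match for",
	"net": "No match for",
	"tv":  "No Data Found", // hypothetical registry-specific phrasing
}

// available reports whether the WHOIS response indicates an
// unregistered domain for the given TLD.
func available(tld, whoisResponse string) bool {
	marker, ok := notFoundMarkers[tld]
	if !ok {
		// Unknown TLD: refusing to guess avoids false positives.
		return false
	}
	return strings.Contains(whoisResponse, marker)
}

func main() {
	// A registered .tv domain returns full record data, not a
	// "not found" marker, so it must not be reported as available.
	resp := "Domain Name: JUSTIN.TV\nRegistrar: ..."
	fmt.Println(available("tv", resp)) // false
}
```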
