Giter VIP home page Giter VIP logo

blue-pugs-engine-python's Introduction

Blue Pugs (beta) v0.3

$$$$$$$$  $$$$   $$$$  $$$$   $$$$$$     $$$$$$$  $$$$  $$$$    $$$$$$    $$$$$
$$$$$$$$$ $$$$   $$$$  $$$$  $$$$$$$$    $$$$$$$$ $$$$  $$$$  $$$$$$$$$  $$$$$$$
$$$$ $$$$ $$$$   $$$$  $$$$ $$$$  $$$$   $$$$ $$$ $$$$  $$$$ $$$$       $$$$
$$$$$$$$  $$$$   $$$$  $$$$ $$$$$$$$$$   $$$$$$$$ $$$$  $$$$ $$$$ $$$$$$ $$$$$$$
$$$$ $$$$ $$$$   $$$$  $$$$ $$$$         $$$$$$$  $$$$  $$$$ $$$$  $$$$$     $$$$  $$
$$$$$$$$$ $$$$$$$ $$$$$$$$   $$$$        $$$       $$$$$$$$  $$$$$$$$$$  $$$$$$$  $$$$
$$$$$$$$  $$$$$$$  $$$$$$     $$$$$$     $$$        $$$$$$     $$$$$$$    $$$$$    $$

Description

This small utility script was made to crawl websites for email addresses. It uses multiprocessing threads to get multiple workers to scrape the web pages, extract emails and links, and dumps them in a *.csv file.

It is quite easy to use but is in early development stage. I don't know if it will be maintained. This will depend on whether it is a useful script or not. The code may be considered as ugly, but it works (at least for me). So if you can make it better, go ahead.

This script uses celery's billiard instead of the normal multiprocessing from python. It is there compatible with celery and can easily be integrated in a task by calling main() directly.

Known bugs

Currently, breaking the script ctrl+c might end up with orphan processes. This is currently the major issue with this script.

Installation

Prerequisite

Everything you need to install is in the requirements.txt file. However, some noteworthy libraries are listed below

beautifulsoup4
lxml
pandas
pyzmq
celery

You just need to download the repository, and install the requirements.

git clone https://github.com/guanicoe/Blue-pugs-engine
python3 -m pip install -r requirements.txt
chmod +x bluePugs.py

Usage

There are only two required flag -u which sets the target url, and -d which is a list of domains to limit the search to.

#example
$ python3 bluePugs.py -u https://domain.com -d domainA domainB

Defaults

  • WORKERS: By default, the number of worker is set to 10, modify this -w depending on you CPU power (more is not always better).
  • LIMIT: By default, there is a set of 1000 page limit to scan. This is a lot! But if the website has fewer accessible page, it will scan all. You can nevertheless specify no limits -ul.
  • OUTPUT DIRECTORY: Two files are output. One with unique emails and a second with two columns: email, URL (with duplicates). The latter enables you to see where the email was found.
usage: bluePugs.py [-h] -u URL -d DOMAIN [DOMAIN ...] [-w WORKERS] [-l LIMIT] [-o OUTPUT_DIR] [--version]

This small utility script was made to crawl websites for email addresses. It uses multiprocessing threads to get multiple workers to scrape the web pages,
extract emails and links, and dumps them in a *.csv file.

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     Url to crawl
  -d DOMAIN [DOMAIN ...], --domain DOMAIN [DOMAIN ...]
                        Domain name to keep in scope (ex: -d domain1 domain2). The first domain will be used as name for output.
  -w WORKERS, --workers WORKERS
                        Number of workers (default: 10)
  -l LIMIT, --limit LIMIT
                        Limite the number of pages to crawl (default: 1000)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Specify which directory to save the date. (default is URL)
  --version             Returns the version number

TO-DO

  • finish README :)

License

This work CC BY 4.0

G

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.