Mightycrawler - A multithreaded web crawler written in Java

Getting it

Warning: Mightycrawler is experimental. Some features are not fully implemented.

git clone https://github.com/kristoffer-dyrkorn/mightycrawler.git
cd mightycrawler
mvn clean package

Description

Mightycrawler is a multithreaded web crawler with reporting capabilities. Resources are downloaded in parallel threads (managed by a thread pool of configurable size), while content parsing and storage to disk are done by the main thread. During crawling, statistics about the web site's performance are gathered and stored in a database. After crawling is done, user-specified queries are run against the database to produce reports in plain text format.
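
The division of work described above can be sketched roughly as follows. This is a minimal illustration and not mightycrawler's actual code: the class name CrawlSketch and the methods fetch and crawl are hypothetical, and fetch returns placeholder content instead of making an HTTP request. Downloads run on a fixed-size pool, and each result is handled on the calling thread as it completes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CrawlSketch {

    /** Hypothetical stand-in for an HTTP download; returns placeholder content. */
    static String fetch(String url) {
        return "<html>content of " + url + "</html>";
    }

    /**
     * Submits every URL to a fixed-size thread pool, then collects each
     * result on the calling (main) thread in completion order.
     */
    static List<String> crawl(List<String> urls, int downloadThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(downloadThreads);
        CompletionService<String> completed = new ExecutorCompletionService<>(pool);
        try {
            for (String url : urls) {
                completed.submit(() -> fetch(url)); // runs on a pool thread
            }
            List<String> handled = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                // take() blocks until the next download has finished
                String content = completed.take().get();
                handled.add(content); // parsing and storage to disk would happen here
            }
            return handled;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for downloads", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A download task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = crawl(Arrays.asList("http://localhost/a", "http://localhost/b"), 2);
        System.out.println(pages.size() + " pages handled on the main thread");
    }
}
```

Keeping parsing and storage on one thread means the result list needs no locking; only the downloads, which spend most of their time waiting on the network, are parallelised.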

Warning! Mightycrawler is indeed mighty and can generate a lot of requests in a short time. Please use this tool responsibly and with care: ensure there is a reasonable delay between download requests.

Quick start

After downloading and building, edit crawler.properties to suit your needs.

Then run mightycrawler by typing

java -jar target/mightycrawler-[version].jar [myconfig.properties]

Configuration

All options for running mightycrawler are given in the crawler.properties file:

  • startURL: Where to start crawling. Host names must end with a "/".

  • includePattern: Restrict crawling to URLs matching this regex. Defaults to everything under startURL.

  • excludePattern: Among the included URLs, exclude those matching this regex. Defaults to none (don't exclude any).

  • urlFile: Visit all the URLs in this file (one URL per line). If provided, this overrides all of the settings above and turns off crawling.

  • extractPattern: Extract links from content types matching this regex. Defaults to none (if nothing is specified).

  • linkPattern: Consider content captured by this regex as links to resources. Defaults to none (if nothing is specified).

  • storePattern: Download and save to disk all content types matching this regex. Defaults to none (if nothing is specified).

  • userAgent: The user agent the crawler reports to be.

  • useCookies: Set to false to disable cookie support, thus disabling session stickiness. Defaults to true (i.e. sticky sessions are supported).

  • followRedirects: Specify whether the crawler should follow redirects. If true, only the final URL (after redirect) will be logged. Defaults to true.

  • downloadThreads: Number of threads for downloading content.

  • maxVisits: Stop after visiting this number of pages.

  • maxDownloads: Stop after downloading this number of pages.

  • maxRecursion: Stop when reaching this recursion level.

  • maxTime: Stop when running for this number of minutes.

  • downloadDelay: For each resource, pause for this number of seconds before downloading.

  • responseTimeout: Move on if a request gets no response after this number of seconds.

  • crawlerTimeout: Stop crawling if no new URLs are discovered within this number of seconds.

  • outputDirectory: Where to put the downloaded web pages. MANDATORY.

  • reportDirectory: Where to put download statistics. MANDATORY.

  • databaseDirectory: Where to put the crawler database. MANDATORY.

  • reportSQL: SQL statements that are run against the crawler database after completion. Output is sent to the filename specified. Multiple statements can be provided, one per line, each line ending with \ and the next line starting with |.
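
To illustrate how a few of these settings and their documented defaults fit together, here is a minimal sketch using java.util.Properties. It is not taken from the mightycrawler source: the class ConfigSketch, its methods, and the sample values are all hypothetical, and it assumes that includePattern and excludePattern must match the whole URL, which the option list does not state.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;

class ConfigSketch {

    /** Parses properties text; a real run would read the crawler.properties file instead. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read settings", e);
        }
        return props;
    }

    /** Returns a mandatory setting, or fails if it is missing or blank. */
    static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing mandatory setting: " + key);
        }
        return value.trim();
    }

    /** Reads a boolean setting, falling back to the given default when absent. */
    static boolean flag(Properties props, String key, boolean defaultValue) {
        return Boolean.parseBoolean(props.getProperty(key, String.valueOf(defaultValue)).trim());
    }

    /**
     * Applies includePattern first, then excludePattern.
     * Defaults follow the option list: include everything under startURL, exclude nothing.
     * Assumption: a pattern has to match the entire URL.
     */
    static boolean shouldVisit(Properties props, String url) {
        String include = props.getProperty("includePattern");
        String exclude = props.getProperty("excludePattern");
        boolean included = (include == null)
                ? url.startsWith(props.getProperty("startURL", ""))
                : Pattern.matches(include, url);
        boolean excluded = exclude != null && Pattern.matches(exclude, url);
        return included && !excluded;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        String sample = String.join("\n",
                "startURL=http://localhost:8080/",
                "excludePattern=.*[.]pdf",
                "followRedirects=false",
                "outputDirectory=out",
                "reportDirectory=reports",
                "databaseDirectory=db");
        Properties props = parse(sample);
        System.out.println(required(props, "outputDirectory"));
        System.out.println(flag(props, "useCookies", true));
        System.out.println(flag(props, "followRedirects", true));
        System.out.println(shouldVisit(props, "http://localhost:8080/doc.pdf"));
    }
}
```

The sample writes the dot as [.] because a backslash is an escape character in the properties file format, so a pattern such as .*\.pdf would need its backslash doubled in the file.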

Example:

Tables in the crawler database

DOWNLOADS
---------
url VARCHAR(4095)
http_code INTEGER default 0
content_type VARCHAR(255)
response_time INTEGER default 0
downloaded_at DATETIME default NOW
downladed BOOLEAN

LINKS
-----
url_from VARCHAR(4095)
url_to VARCHAR(4095)

TODO

  • Support stopping and resuming the crawler
  • Improve the display of various run-time statistics
