Mightycrawler - A multithreaded web crawler written in Java

Getting it

Warning: Mightycrawler is experimental. Some features are not fully implemented.

git clone git://github.com/kristoffer-dyrkorn/mightycrawler.git
cd mightycrawler
mvn clean package

Description

Mightycrawler is a multithreaded web crawler with reporting capabilities. Resources are downloaded in parallel threads (managed by a thread pool of configurable size), while content parsing and storage to disk are done by the main thread. During crawling, various statistics about the web site's performance are gathered and stored in a database. After crawling is done, user-specified queries are run against the database to produce reports in plain text format.
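
The division of work described above can be illustrated with a minimal sketch. It assumes a fixed-size pool from java.util.concurrent and is not taken from mightycrawler's source: the class PoolSketch, the method fetchAll and the fetch function (a stand-in for a real HTTP download) are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

class PoolSketch {

    // Runs fetch for every URL on a pool of downloadThreads workers and
    // collects the results on the calling thread, in completion order.
    static List<String> fetchAll(List<String> urls, int downloadThreads,
                                 Function<String, String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(downloadThreads);
        CompletionService<String> finished = new ExecutorCompletionService<>(pool);
        List<String> stored = new ArrayList<>();
        try {
            for (String url : urls) {
                finished.submit(() -> fetch.apply(url));
            }
            for (int i = 0; i < urls.size(); i++) {
                // take() blocks until some download has finished; a real
                // crawler would parse and store the content at this point.
                stored.add(finished.take().get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for downloads", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a download task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return stored;
    }

    public static void main(String[] args) {
        // Hypothetical URLs; the lambda only simulates a download.
        List<String> urls = Arrays.asList("http://example.com/", "http://example.com/a");
        System.out.println(fetchAll(urls, 2, url -> "content of " + url));
    }
}
```

Because results are consumed as they complete, the order of the returned list can differ from the order of the input URLs.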

Warning! Mightycrawler is indeed mighty and can generate a lot of requests in a short time. Please use this tool responsibly and with care: ensure there is a reasonable delay between download requests.

Quick start

After downloading and building, edit crawler.properties to suit your needs.

Then run mightycrawler by typing

java -jar target/mightycrawler-[version].jar [myconfig.properties]

Configuration

All options for running mightycrawler are given in the crawler.properties file:

startURL: Where to start crawling. Host names must end with a "/".
includePattern: Restrict crawling to URLs matching this regex. Defaults to everything under startURL.
excludePattern: Among the included URLs, exclude those matching this regex. Defaults to none (don't exclude any).
urlFile: Visit all the URLs in this file (one URL per line). If provided, it overrides all of the above settings and turns off crawling.
extractPattern: Extract links from content types matching this regex. Defaults to none (if nothing is specified).
linkPattern: Consider content captured by this regex as links to resources. Defaults to none (if nothing is specified).
storePattern: Download and save to disk all content types matching this regex. Defaults to none (if nothing is specified).
userAgent: The user agent the crawler reports to be.
useCookies: Set to false to disable cookie support, thus disabling session stickiness. Defaults to true (i.e. supporting sticky sessions).
followRedirects: Specify whether the crawler should follow redirects. If true, only the final URL (after redirect) will be logged. Defaults to true.
downloadThreads: Number of threads for downloading content.
maxVisits: Stop after visiting this number of pages.
maxDownloads: Stop after downloading this number of pages.
maxRecursion: Stop when reaching this recursion level.
maxTime: Stop after running for this number of minutes.
downloadDelay: For each resource, pause for this number of seconds before downloading.
responseTimeout: Move on if a request gets no response after this number of seconds.
crawlerTimeout: Stop crawling if no new URLs are discovered within this number of seconds.
outputDirectory: Where to put the downloaded web pages. MANDATORY.
reportDirectory: Where to put download statistics. MANDATORY.
databaseDirectory: Where to put the crawler database. MANDATORY.
reportSQL: SQL statements that are run against the crawler database after completion. Output is sent to the filename specified. Multiple SQL statements can be provided, one per line, each line ending with \ and the next line starting with |.

Example:

[email protected]\
|[email protected]\
|[email protected]
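
Two of the settings above can be illustrated with a short Java sketch. It rests on assumptions that are not confirmed by mightycrawler's source: that a URL is crawled when it fully matches includePattern and does not fully match excludePattern, and that the configuration is read with java.util.Properties, where a trailing \ continues a value on the next line so that | ends up separating the reportSQL entries. The class ConfigSketch, its methods, the example.com URLs and the placeholder entries are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

class ConfigSketch {

    // Assumption: full-match semantics; exclusion is applied among included URLs.
    static boolean shouldVisit(String url, Pattern include, Pattern exclude) {
        return include.matcher(url).matches() && !exclude.matcher(url).matches();
    }

    // Parses properties text and splits the reportSQL value on "|".
    static List<String> reportEntries(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("reportSQL", "");
        if (value.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(value.split("\\|"));
    }

    public static void main(String[] args) {
        Pattern include = Pattern.compile("http://example\\.com/.*");
        Pattern exclude = Pattern.compile(".*\\.pdf");
        System.out.println(shouldVisit("http://example.com/docs/", include, exclude)); // true
        System.out.println(shouldVisit("http://example.com/a.pdf", include, exclude)); // false

        // Placeholder entries; each line ends with \ and the next starts with |.
        String text = "reportSQL=first entry\\\n|second entry\\\n|third entry\n";
        System.out.println(reportEntries(text)); // [first entry, second entry, third entry]
    }
}
```

Under these assumptions, splitting on | would also break up an entry whose SQL itself contains that character, so such statements would need different handling.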

Tables in the crawler database

DOWNLOADS
---------
url VARCHAR(4095)
http_code INTEGER default 0
content_type VARCHAR(255)
response_time INTEGER default 0
downloaded_at DATETIME default NOW
downloaded BOOLEAN

LINKS
-----
url_from VARCHAR(4095)
url_to VARCHAR(4095)

TODO

Support stopping and resuming the crawler
Improve the display of various run-time statistics
