
WWW-Crawler-Mojo

WWW::Crawler::Mojo is a web crawling framework written in Perl on top of the Mojolicious toolkit, allowing you to write your own crawlers rapidly.

This software is considered alpha quality and is not recommended for regular use.

Features

  • Easy to write rules for steering your crawler.
  • Lets you use Mojo::URL for URL manipulation, Mojo::Message::Response for response manipulation and Mojo::DOM for DOM inspection.
  • Internally uses Mojo::UserAgent, a full-featured non-blocking I/O HTTP and WebSocket user agent with IPv6, TLS, SNI, IDNA, HTTP/SOCKS5 proxy, Comet (long polling), keep-alive, connection pooling, timeout, cookie, multipart and gzip compression support, plus multiple event loops.
  • Throttles connections with max-connection and max-connection-per-host options.
  • Depth detection.
  • Tracks 301 HTTP redirects.
  • Detects network errors and retries with your own rules.
  • Shuffles the queue periodically if requested.
  • Crawls beyond basic authentication.
  • Crawls even error documents.
  • Emulates form submission.
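The "retry with your own rules" feature means retry policy is up to you. A minimal pure-Perl sketch of such a rule is below; the wiring shown in the trailing comments (the error event name, callback signature, and requeue call) is an assumption for illustration, not taken from the module's documentation.

```perl
use strict;
use warnings;

# Hypothetical retry rule: allow at most $max_retries attempts per URL,
# tracking counts in a plain hash.
sub make_retry_rule {
    my ($max_retries) = @_;
    my %attempts;
    return sub {
        my ($url) = @_;
        return ++$attempts{$url} <= $max_retries;  # true while a retry is still allowed
    };
}

# Assumed wiring (names are illustrative only):
# my $should_retry = make_retry_rule(3);
# $bot->on(error => sub {
#     my ($bot, $msg, $job) = @_;
#     $bot->requeue($job) if $should_retry->($job->url->to_string);
# });
```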

Requirements

  • Perl 5.14 or higher
  • Mojolicious 6.0 or higher

Installation

$ curl -L cpanmin.us | perl - -n WWW::Crawler::Mojo

Synopsis

use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;

    # $css_selector is a placeholder: pass a CSS selector to narrow which
    # elements are scraped, or call $scrape->() to scrape everything
    for my $job ($scrape->($css_selector)) {
        if (...) {
            $bot->enqueue($job);
        }
    }
});

$bot->enqueue('http://example.com/');
$bot->crawl;

Documentation

Examples

Restricting scraping URLs by status code.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless ($res->code == 200);
    $bot->enqueue($_) for $scrape->();
});

Restricting scraping URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless $job->url->host eq 'example.com';
    $bot->enqueue($_) for $scrape->();
});

Restricting followed URLs by depth.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $job ($scrape->()) {
        next unless ($job->depth < 5);
        $bot->enqueue($job);
    }
});

Restricting followed URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $job ($scrape->()) {
        $bot->enqueue($job) if $job->url->host eq 'example.com';
    }
});

Restricting followed URLs by referrer's host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $job ($scrape->()) {
        $bot->enqueue($job) if $job->referrer->url->host eq 'example.com';
    }
});

Excluding URLs by path.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $job ($scrape->()) {
        $bot->enqueue($job) unless ($job->url->path =~ qr{^/foo/});
    }
});

Crawling only preset URLs.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    # DO SOMETHING
});

$bot->enqueue(
    'http://example.com/1',
    'http://example.com/3',
    'http://example.com/5',
);

$bot->crawl;

Speeding up with more concurrent connections.

$bot->max_conn(5);
$bot->max_conn_per_host(5);

Authentication. The user agent automatically reuses the credentials for the host.

$bot->enqueue('http://jamadam:password@example.com/');

You can also fulfill any prerequisites, such as submitting a login form in advance, so that a login session is established via cookies or the like without any further work on your part.

my $bot = WWW::Crawler::Mojo->new;
$bot->ua->post('http://example.com/admin/login', form => {
    username => 'jamadam',
    password => 'password',
});
$bot->enqueue('http://example.com/admin/');
$bot->crawl;

Other examples

  • WWW-Flatten
  • See the scripts under the example directory.

Broad crawling

Although the module is only well tested for "focused crawls" at this point, you can also use it for endless crawling by taking special care of memory usage, including:

  • Restricting the queue size yourself.

  • Replacing the redundancy detector code.

    $bot->queue->redundancy(sub {...});
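For the redundancy callback above, one way to keep memory bounded on an endless crawl is to remember only the N most recently seen URLs, forgetting the oldest beyond that. A sketch under that assumption (the attachment shown in the trailing comments assumes the callback receives a job object and returns true for jobs to skip, which may not match the module's actual contract):

```perl
use strict;
use warnings;

# Hypothetical bounded dedup: remembers at most $max_keys URLs, evicting
# the oldest once the cap is exceeded. Returns 1 for already-seen URLs.
sub make_bounded_seen {
    my ($max_keys) = @_;
    my (%seen, @order);
    return sub {
        my ($url) = @_;
        return 1 if $seen{$url};   # already seen: redundant
        $seen{$url} = 1;
        push @order, $url;
        delete $seen{ shift @order } if @order > $max_keys;
        return 0;                  # new URL: not redundant
    };
}

# Assumed wiring (illustrative only):
# my $check = make_bounded_seen(100_000);
# $bot->queue->redundancy(sub { $check->($_[0]->url->to_string) });
```

The trade-off is that a URL evicted from the window can be crawled again later; for a broad crawl that is usually acceptable in exchange for bounded memory.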

Copyright

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Contributors

jamadam, harshals, zoffixznet, gfdev
