WWW-Crawler-Mojo

WWW::Crawler::Mojo is a web crawling framework written in Perl on top of the Mojo toolkit, allowing you to write your own crawler quickly.

This software is considered to be alpha quality and isn't recommended for regular usage.

Features

  • Easy to define the rules for your crawler.
  • Lets you use Mojo::URL for URL manipulation, Mojo::Message::Response for response manipulation and Mojo::DOM for DOM inspection.
  • Internally uses Mojo::UserAgent, which is a full featured non-blocking I/O HTTP and WebSocket user agent, with IPv6, TLS, SNI, IDNA, HTTP/SOCKS5 proxy, Comet (long polling), keep-alive, connection pooling, timeout, cookie, multipart, gzip compression and multiple event loop support.
  • Throttles connections with the max_conn and max_conn_per_host options.
  • Depth detection.
  • Tracks 301 HTTP redirects.
  • Detects network errors and retries according to your own rules.
  • Shuffles queue periodically if indicated.
  • Crawls beyond basic authentication.
  • Crawls even error documents.
  • Form submission emulation.

Requirements

  • Perl 5.16 or higher
  • Mojolicious 8.12 or higher

Installation

$ curl -L cpanmin.us | perl - -n WWW::Crawler::Mojo

Synopsis

use WWW::Crawler::Mojo;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $child_job ($scrape->($css_selector)) {
        if (...) {
            $bot->enqueue($child_job);
        }
    }
});

$bot->enqueue('http://example.com/');
$bot->crawl;

Documentation

Examples

Restricting scraping URLs by status code.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless ($res->code == 200);
    $bot->enqueue($_) for $scrape->();
});

Restricting scraping URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    return unless ($job->url->host eq 'example.com');
    $bot->enqueue($_) for $scrape->();
});

Restrict following URLs by depth.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $child_job ($scrape->()) {
        next unless ($child_job->depth < 5);
        $bot->enqueue($child_job);
    }
});

Restrict following URLs by host.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $child_job ($scrape->()) {
        $bot->enqueue($child_job) if $child_job->url->host eq 'example.com';
    }
});

Exclude URLs from following by path.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    
    for my $child_job ($scrape->()) {
        $bot->enqueue($child_job)
            unless ($child_job->url->path =~ qr{^/foo/});
    }
});
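
The host, depth and path restrictions above can also be combined into a single predicate. The following is a minimal sketch, not part of the module: should_follow is a hypothetical helper name, and the host, depth limit and excluded path are simply the values used in the examples above. It relies only on Mojo::URL.

```perl
use strict;
use warnings;
use Mojo::URL;

# Hypothetical helper (not provided by WWW::Crawler::Mojo): returns 1 if a
# discovered URL passes all three restrictions, 0 otherwise.
sub should_follow {
    my ($url, $depth) = @_;

    # Accept either a plain string or a Mojo::URL object.
    $url = Mojo::URL->new($url) unless ref $url;

    return 0 unless ($url->host // '') eq 'example.com';    # stay on one host
    return 0 unless $depth < 5;                             # limit the depth
    return 0 if $url->path =~ m{^/foo/};                    # skip /foo/ paths
    return 1;
}

# Inside a res handler, the predicate could replace the individual checks:
#   $bot->enqueue($_)
#       for grep { should_follow($_->url, $_->depth) } $scrape->();
```

Keeping the rules in one function makes them easy to check in isolation before running a crawl.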

Crawl only preset URLs.

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;
    # DO SOMETHING
});

$bot->enqueue(
    'http://example.com/1',
    'http://example.com/3',
    'http://example.com/5',
);

$bot->crawl;

Speed up.

$bot->max_conn(5);
$bot->max_conn_per_host(5);

Authentication. The user agent automatically reuses the credential for the host.

$bot->enqueue('http://jamadam:password@example.com');

You can fulfill any prerequisites, such as submitting a login form, before crawling. The login session is then established with a cookie or a similar mechanism, which the user agent handles so you don't have to worry about it.

my $bot = WWW::Crawler::Mojo->new;
$bot->ua->post('http://example.com/admin/login', form => {
    username => 'jamadam',
    password => 'password',
});
$bot->enqueue('http://example.com/admin/');
$bot->crawl;

Other examples

  • WWW-Flatten
  • See the scripts under the example directory.

Broad crawling

Although the module is only well tested for "focused crawling" at this point, you can also use it for endless crawling if you take special care of memory usage, including:

  • Restrict the queue size yourself.

  • Replace the redundancy detection code.

    $bot->queue->redundancy(sub {...});
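
The first point, restricting the queue size, could be sketched as follows. This is only an illustration under assumptions: $MAX_QUEUE is a hypothetical limit, and the sketch assumes that the queue object exposes a length method returning the number of pending jobs, so check that against the queue class you actually use.

```perl
use strict;
use warnings;
use WWW::Crawler::Mojo;

# Hypothetical limit; tune it to your memory budget.
my $MAX_QUEUE = 1000;

my $bot = WWW::Crawler::Mojo->new;

$bot->on(res => sub {
    my ($bot, $scrape, $job, $res) = @_;

    for my $child_job ($scrape->()) {
        # Assumption: $bot->queue->length reports the number of queued jobs.
        # Stop enqueueing once the queue is full; remaining links are dropped.
        last if $bot->queue->length >= $MAX_QUEUE;
        $bot->enqueue($child_job);
    }
});
```

Dropping links when the queue is full keeps memory bounded at the cost of coverage; whether that trade-off is acceptable depends on the crawl.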

Copyright

Copyright (C) jamadam

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Contributors

jamadam, harshals, manwar, zoffixznet, gfdev

www-crawler-mojo's Issues

Can't locate object method "delay" via package "Mojo::IOLoop"

The test suite started to fail:

Can't locate object method "delay" via package "Mojo::IOLoop" at t/user_agent_userinfo.t line 194.
# Looks like your test exited with 29 just after 37.
t/user_agent_userinfo.t .. 
Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 9/46 subtests 

Statistical analysis suggests that this is caused by Mojolicious 9.x:

****************************************************************
Regression 'mod:Mojolicious'
****************************************************************
Name           	       Theta	      StdErr	 T-stat
[0='const']    	      1.0000	      0.0000	17688362411884998.00
[1='eq_8.15']  	     -0.0000	      0.0000	  -0.87
[2='eq_8.17']  	     -0.0000	      0.0000	  -1.56
[3='eq_8.24']  	     -0.0000	      0.0000	  -2.47
[4='eq_8.25']  	      0.0000	      0.0000	   1.20
[5='eq_8.26']  	     -0.0000	      0.0000	  -3.36
[6='eq_8.36']  	      0.0000	      0.0000	   2.55
[7='eq_8.50']  	      0.0000	      0.0000	   3.12
[8='eq_9.0']   	     -1.0000	      0.0000	-13701346608582370.00
[9='eq_9.01']  	     -1.0000	      0.0000	-16466969246716122.00
[10='eq_9.02'] 	     -1.0000	      0.0000	-14949394751584572.00

R^2= 1.000, N= 79, K= 11
****************************************************************

Endless loops while running test suite

My smoker systems started to report fails which look like endless loops:

Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15.
Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15, <DATA> line 54.
Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15, <DATA> line 54.
Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15, <DATA> line 54.
Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15.
...
t/empty.t ................ 
Failed 1/1 subtests 
(segfault)

and

Use of uninitialized value $chunk in concatenation (.) or string at /home/cpansand/.cpan/build/2016052215/Mojolicious-6.62-WFL1bX/blib/lib/Mojo/Asset/Memory.pm line 15.

#   Failed test 'right length'
#   at t/practical.t line 55.
#          got: '1'
#     expected: '10'
Can't call method "depth" on an undefined value at t/practical.t line 63, <DATA> line 117.
# Looks like your test exited with 255 just after 4.
t/practical.t ............ 
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 27/30 subtests 

Statistical analysis of fail reports suggests that the problem is caused by newer Test::More or newer Mojolicious, though I think it is Mojolicious:

****************************************************************
(3)
****************************************************************
Regression 'mod:Test::More'
****************************************************************
Name                   Theta          StdErr     T-stat
[0='const']           1.0000          0.0000    158260669785923584.00
[1='eq_1.302015']    -1.0000          0.0000    -31340282667675008.00
[2='eq_1.302019']    -1.0000          0.0000    -43477549705383432.00

R^2= 1.000, N= 55, K= 3
****************************************************************
(4)
****************************************************************
Regression 'mod:Mojolicious'
****************************************************************
Name                   Theta          StdErr     T-stat
[0='const']           1.0000          0.0000    63325106270015104.00
[1='eq_6.39']        -0.0000          0.0000      -0.79
[2='eq_6.40']         0.0000          0.0000       0.00
[3='eq_6.41']        -0.0000          0.0000      -1.57
[4='eq_6.47']        -0.0000          0.0000      -1.17
[5='eq_6.56']        -0.0000          0.0000      -2.34
[6='eq_6.58']         0.0000          0.0000       0.00
[7='eq_6.60']        -0.0000          0.0000      -1.57
[8='eq_6.61']         0.0000          0.0000       2.34
[9='eq_6.62']        -1.0000          0.0000    -41456013267638128.00

R^2= 1.000, N= 55, K= 10
****************************************************************
