
NAME

WWW::Crawler::Lite - A single-threaded crawler/spider for the web.

SYNOPSIS

use strict;
use warnings;
use WWW::Crawler::Lite;

my %pages = ( );
my $pattern = 'https?://example\.com\/';
my %links = ( );
my $downloaded = 0;

my $crawler;
$crawler = WWW::Crawler::Lite->new(
  agent       => 'MySuperBot/1.0',
  url_pattern => $pattern,
  http_accept => [qw( text/plain text/html application/xhtml+xml )],
  link_parser => 'default',
  on_response => sub {
    my ($url, $res) = @_;

    warn "$url contains " . $res->content;
    $downloaded++;
    # Stop after a handful of pages:
    $crawler->stop() if $downloaded > 5;
  },
  follow_ok   => sub {
    my ($url) = @_;

    # If you like this url and want to use it, then return a true value:
    return 1;
  },
  on_link     => sub {
    my ($from, $to, $text) = @_;

    return if exists($pages{$to}) && $pages{$to} eq 'BAD';
    $pages{$to}++;
    $links{$to} ||= [ ];
    push @{$links{$to}}, { from => $from, text => $text };
  },
  on_bad_url  => sub {
    my ($url) = @_;

    # Mark this url as 'bad':
    $pages{$url} = 'BAD';
  },
);
$crawler->crawl( url => "http://example.com/" );

warn "DONE!!!!!";

use Data::Dumper;
for my $url ( sort keys %links ) {
  warn "$url ($pages{$url} incoming links) -> " . Dumper($links{$url});
}

DESCRIPTION

WWW::Crawler::Lite is a single-threaded spider/crawler for the web. It can be used within a mod_perl, CGI or Catalyst-style environment because it does not fork or use threads.

The callback-based interface is fast and simple, allowing you to focus on processing the data that WWW::Crawler::Lite extracts from the target website.

PUBLIC METHODS

new( %args )

Creates and returns a new WWW::Crawler::Lite object.

The %args hash is not required, but may contain the following elements (a combined example follows the list):

  • agent - String

Used as the user-agent string for HTTP requests.

Default Value: WWW-Crawler-Lite/$VERSION $^O

  • url_pattern - RegExp or String

New links that do not match this pattern will not be added to the processing queue.

Default Value: https?://.+

  • http_accept - ArrayRef

A list of acceptable MIME types; this can be used to filter out unwanted responses.

Default Value: [qw( text/html text/plain application/xhtml+xml )]

  • link_parser - String

Valid values: 'default' and 'HTML::LinkExtor'

The default value is 'default', which uses a naive regexp to do the link parsing.

The upshot of using 'default' is that the regexp will also find the hyperlinked text or alt-text (of a hyperlinked img tag) and pass that text to your 'on_link' handler.

  • on_response($url, $response) - CodeRef

Called whenever a successful response is returned.

  • on_link($from, $to, $text) - CodeRef

Called whenever a new link is found. Arguments are:

- $from

The URL that is linked *from*

- $to

The URL that is linked *to*

- $text

The anchor text (e.g. the HTML within the link: <a href="...">__This Text Here__</a>)

  • on_bad_url($url) - CodeRef

Called whenever an unsuccessful response is received.

  • delay_seconds - Number

Indicates the length of time (in seconds) that the crawler should pause before making each request. This can be useful when you want to spider a website rather than launch a denial-of-service attack against it.
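
A minimal constructor sketch combining several of these options (the bot name, the example.org pattern, and the two-second delay are placeholder values; the response object is used only via its content method, as in the SYNOPSIS above):

  use strict;
  use warnings;
  use WWW::Crawler::Lite;

  my $crawler;
  $crawler = WWW::Crawler::Lite->new(
    agent         => 'MyBot/1.0',                    # placeholder user-agent string
    url_pattern   => qr{^https?://example\.org/},    # a compiled regexp works as well as a string
    http_accept   => [qw( text/html application/xhtml+xml )],
    link_parser   => 'HTML::LinkExtor',              # use HTML::LinkExtor instead of the naive regexp
    delay_seconds => 2,                              # pause between requests to be polite
    on_response   => sub {
      my ($url, $res) = @_;
      warn "$url returned " . length($res->content) . " bytes\n";
    },
    on_bad_url    => sub {
      my ($url) = @_;
      warn "Could not fetch $url\n";
    },
  );
  $crawler->crawl( url => 'http://example.org/' );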

stop( )

Causes the crawler to stop processing its queue of URLs.
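
For example, a sketch of stopping a crawl from inside a callback once enough pages have been fetched (the limit of 100 pages is arbitrary):

  my $fetched = 0;
  my $crawler;
  $crawler = WWW::Crawler::Lite->new(
    url_pattern => qr{^https?://example\.org/},
    on_response => sub {
      my ($url, $res) = @_;
      $fetched++;
      # Halt the queue once enough pages have been seen:
      $crawler->stop() if $fetched >= 100;
    },
  );

  # crawl() returns once the queue is exhausted or stop() has been called:
  $crawler->crawl( url => 'http://example.org/' );
  warn "Fetched $fetched pages.\n";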

AUTHOR

John Drago [email protected]

COPYRIGHT

This software is Free software and may be used and redistributed under the same terms as Perl itself.

