
robots.js

robots.js is a parser for robots.txt files for node.js.

Installation

It's recommended to install via npm:

$ npm install -g robots
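This installs the package globally. For a per-project install, the same package can be added locally instead (a minimal variation, not from the original docs):

$ npm install robots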

Usage

Here's an example of using robots.js:

var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access) {
      if (access) {
        // parse url
      }
    });
  }
});

The default crawler user-agent is:

Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0

Here's an example of using a different user-agent and a more detailed callback:

var robots = require('robots')
  , parser = new robots.RobotsParser(
                'http://nodeguide.ru/robots.txt',
                'Mozilla/5.0 (compatible; RobotTxtBot/1.0)',
                after_parse
            );
            
function after_parse(parser, success) {
  if(success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access, url, reason) {
      if (access) {
        console.log(' url: '+url+', access: '+access);
        // parse url ...
      }
    });
  }
};
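The reason argument passed to the canFetch callback (described in the API section below) can also be inspected, for example to log why a URL was refused. A minimal sketch, with an invented bot name:

var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    parser.canFetch('SomeBot/1.0', '/private/', function (access, url, reason) {
      if (!access) {
        // reason.type is one of: 'statusCode', 'entry', 'defaultEntry', 'noRule'
        console.log('blocked: ' + url + ' (' + reason.type + ')');
      }
    });
  }
});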

Here's an example of getting the list of sitemaps:

var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    parser.getSitemaps(function(sitemaps) {
      // sitemaps is an array of sitemap URLs
    });
  }
});

Here's an example of getCrawlDelay usage:

var robots = require('robots')
  , parser = new robots.RobotsParser();

// for example:
//
// $ curl -s http://nodeguide.ru/robots.txt
//
// User-agent: Google-bot
// Disallow: /
// Crawl-delay: 2
//
// User-agent: *
// Disallow: /
// Crawl-delay: 2

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if(success) {
    var GoogleBotDelay = parser.getCrawlDelay("Google-bot");
    // ...
  }
});

An example of passing options to the HTTP request:

var options = {
  headers: {
    Authorization: "Basic " + new Buffer("username:password").toString("base64")
  }
};

var robots = require('robots')
  , parser = new robots.RobotsParser(null, options);

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  ...
});
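Note that on current Node.js versions the Buffer constructor is deprecated; the same header can be built with Buffer.from (a minimal sketch of the same options object, with placeholder credentials):

var options = {
  headers: {
    // Buffer.from replaces the deprecated new Buffer(...) constructor
    Authorization: "Basic " + Buffer.from("username:password").toString("base64")
  }
};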

API

RobotsParser — main class. This class provides a set of methods to read, parse and answer questions about a single robots.txt file.

  • setUrl(url, read) — sets the URL referring to a robots.txt file. By default, it invokes the read() method. If read is a function, it is called once the remote file is downloaded and parsed; it receives two arguments: the parser itself, and a boolean which is true if the remote file was successfully parsed.
  • read(after_parse) — reads the robots.txt URL and feeds it to the parser.
  • parse(lines) — parses the input lines from a robots.txt file.
  • canFetch(userAgent, url, callback) — using the parsed robots.txt, decides whether userAgent can fetch url. Callback signature: function callback(access, url, reason) { ... } where:
    • access — whether this url can be fetched. true/false.
    • url — target url.
    • reason — reason for the access decision. Object:
      • type — valid values: 'statusCode', 'entry', 'defaultEntry', 'noRule'
      • entry — an instance of lib/Entry.js. Only for types 'entry' and 'defaultEntry'.
      • statusCode — HTTP response status code for url. Only for type 'statusCode'.
  • canFetchSync(userAgent, url) — using the parsed robots.txt, decides whether userAgent can fetch url. Returns true/false.
  • getCrawlDelay(userAgent) — returns the Crawl-delay for the given userAgent.
  • getSitemaps(callback) — gets the Sitemaps from the parsed robots.txt; the callback receives an array of sitemap URLs.
  • getDisallowedPaths(userAgent) — gets the paths explicitly disallowed for the specified user agent AND * (see the sketch after this list).
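For the methods without examples above, here's a minimal sketch combining parse(), canFetchSync(), getDisallowedPaths() and getCrawlDelay(). It assumes parse() accepts the robots.txt content as an array of lines, as described above; the rules and the expected results in the comments are illustrative only:

var robots = require('robots')
  , parser = new robots.RobotsParser();

// Feed robots.txt content to the parser directly, without an HTTP request.
parser.parse([
  'User-agent: *',
  'Disallow: /private/',
  'Crawl-delay: 2'
]);

parser.canFetchSync('SomeBot/1.0', '/index.html');   // expected: true
parser.canFetchSync('SomeBot/1.0', '/private/data'); // expected: false
parser.getDisallowedPaths('SomeBot/1.0');            // expected: [ '/private/' ]
parser.getCrawlDelay('*');                           // expected: 2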

License

See LICENSE file.
