
robots-parser's Introduction

Robots Parser

A robots.txt parser which aims to be compliant with the draft specification.

The parser currently supports:

  • User-agent:
  • Allow:
  • Disallow:
  • Sitemap:
  • Crawl-delay:
  • Host:
  • Paths with wildcards (*) and EOL matching ($)

Installation

Via NPM:

npm install robots-parser

or via Yarn:

yarn add robots-parser

Usage

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /dir/',
	'Disallow: /test.html',
	'Allow: /dir/test.html',
	'Allow: /test.html',
	'Crawl-delay: 1',
	'Sitemap: http://example.com/sitemap.xml',
	'Host: example.com'
].join('\n'));

robots.isAllowed('http://www.example.com/test.html', 'Sams-Bot/1.0'); // true
robots.isAllowed('http://www.example.com/dir/test.html', 'Sams-Bot/1.0'); // true
robots.isDisallowed('http://www.example.com/dir/test2.html', 'Sams-Bot/1.0'); // true
robots.getCrawlDelay('Sams-Bot/1.0'); // 1
robots.getSitemaps(); // ['http://example.com/sitemap.xml']
robots.getPreferredHost(); // example.com

isAllowed(url, [ua])

boolean or undefined

Returns true if crawling the specified URL is allowed for the specified user-agent.

This will return undefined if the URL isn't valid for this robots.txt.
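
For example, a URL on a different host than the robots.txt is not valid for it, so the check returns undefined rather than a boolean (a minimal sketch using placeholder hosts):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /private/'
].join('\n'));

robots.isAllowed('http://www.example.com/private/page.html', 'Sams-Bot/1.0'); // false
robots.isAllowed('http://www.example.com/index.html', 'Sams-Bot/1.0');        // true
robots.isAllowed('http://other.example.org/index.html', 'Sams-Bot/1.0');      // undefined (different host)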

isDisallowed(url, [ua])

boolean or undefined

Returns true if crawling the specified URL is not allowed for the specified user-agent.

This will return undefined if the URL isn't valid for this robots.txt.

getMatchingLineNumber(url, [ua])

number or undefined

Returns the line number of the matching directive for the specified URL and user-agent if any.

Line numbers start at 1 and go up (1-based indexing).

Returns -1 if there is no matching directive. If a rule is manually added without a lineNumber then this will return undefined for that rule.
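
For example, with line numbers counted from the first line of the robots.txt (a sketch, assuming the most specific matching rule's line is the one reported):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',         // line 1
	'Disallow: /dir/',       // line 2
	'Allow: /dir/test.html'  // line 3
].join('\n'));

robots.getMatchingLineNumber('http://www.example.com/dir/test.html', 'Sams-Bot/1.0');  // 3
robots.getMatchingLineNumber('http://www.example.com/dir/other.html', 'Sams-Bot/1.0'); // 2
robots.getMatchingLineNumber('http://www.example.com/index.html', 'Sams-Bot/1.0');     // -1 (no matching directive)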

getCrawlDelay([ua])

number or undefined

Returns the number of seconds the specified user-agent should wait between requests.

Returns undefined if no crawl delay has been specified for this user-agent.

getSitemaps()

array

Returns an array of sitemap URLs specified by the sitemap: directive.

getPreferredHost()

string or null

Returns the preferred host name specified by the host: directive or null if there isn't one.

Changes

Version 3.0.1

  • Fixed bug with https: URLs defaulting to port 80 instead of 443 if no port is specified. Thanks to @dskvr for reporting

    This affects comparing URLs with the default HTTPS port to URLs without it. For example, comparing https://example.com/ to https://example.com:443/ or vice versa.

    They should be treated as equivalent but weren't due to the incorrect port being used for https:.
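
    A rough illustration of the affected case (a sketch; before 3.0.1 the second check could return undefined because the two URLs were treated as different origins):

    var robotsParser = require('robots-parser');

    var robots = robotsParser('https://example.com/robots.txt', 'User-agent: *\nDisallow: /private/');

    // https://example.com/ and https://example.com:443/ are the same origin,
    // so both checks should return a boolean rather than undefined.
    robots.isAllowed('https://example.com/page.html');     // true
    robots.isAllowed('https://example.com:443/page.html'); // true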

Version 3.0.0

  • Changed to using global URL object instead of importing. – Thanks to @brendankenny

Version 2.4.0:

  • Added TypeScript definitions
    – Thanks to @danhab99 for creating them
  • Added SECURITY.md policy and CodeQL scanning

Version 2.3.0:

  • Fixed a bug where passing "constructor" as the user-agent to isAllowed() / isDisallowed() would throw an error.

  • Added support for relative URLs. This does not affect the default behavior, so it is safe to upgrade.

    Relative matching is only allowed if both the robots.txt URL and the URLs being checked are relative.

    For example:

    var robots = robotsParser('/robots.txt', [
        'User-agent: *',
        'Disallow: /dir/',
        'Disallow: /test.html',
        'Allow: /dir/test.html',
        'Allow: /test.html'
    ].join('\n'));
    
    robots.isAllowed('/test.html', 'Sams-Bot/1.0'); // false
    robots.isAllowed('/dir/test.html', 'Sams-Bot/1.0'); // true
    robots.isDisallowed('/dir/test2.html', 'Sams-Bot/1.0'); // true

Version 2.2.0:

  • Fixed a bug with matching wildcard patterns against some URLs – Thanks to @ckylape for reporting and fixing
  • Changed matching algorithm to match Google's implementation in google/robotstxt
  • Changed order of precedence to match current spec
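
    Under the current draft spec the most specific (longest) matching rule wins, and Allow is preferred when an Allow and a Disallow rule match with equal specificity. A sketch of the expected behaviour:

    var robotsParser = require('robots-parser');

    var robots = robotsParser('http://www.example.com/robots.txt', [
        'User-agent: *',
        'Disallow: /page',
        'Allow: /page.html'
    ].join('\n'));

    // 'Allow: /page.html' is longer (more specific) than 'Disallow: /page'
    robots.isAllowed('http://www.example.com/page.html', 'Sams-Bot/1.0'); // true
    robots.isAllowed('http://www.example.com/page.php', 'Sams-Bot/1.0');  // false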

Version 2.1.1:

  • Fixed a bug that could be exploited to make rule checking take a very long time – Thanks to @andeanfog

Version 2.1.0:

  • Removed use of the punycode module APIs as the new URL API handles it
  • Improved test coverage
  • Added tests for percent encoded paths and improved support
  • Added getMatchingLineNumber() method
  • Fixed bug with comments on same line as directive

Version 2.0.0:

This release is not 100% backwards compatible as it now uses the new URL APIs which are not supported in Node < 7.

  • Updated code to not use deprecated URL module APIs. – Thanks to @kdzwinel

Version 1.0.2:

  • Fixed error caused by invalid URLs missing the protocol.

Version 1.0.1:

  • Fixed a bug with the "user-agent" rule being treated as case-sensitive. – Thanks to @brendonboshell
  • Improved test coverage. – Thanks to @schornio

Version 1.0.0:

  • Initial release.

License

The MIT License (MIT)

Copyright (c) 2014 Sam Clarke

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

robots-parser's People

Contributors

brendankenny, brendonboshell, danhab99, dependabot[bot], kdzwinel, samclarke

robots-parser's Issues

Returns true for disallowed route

Hey, nice project! I recently started using it in my personal projects, but I get this:

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /dir/',
	'Disallow: /test.html',
	'Allow: /dir/test.html',
	'Allow: /test.html',
	'Crawl-delay: 1',
	'Sitemap: http://example.com/sitemap.xml',
	'Host: example.com'
].join('\n'));

console.log(robots.isAllowed('http://www.example.com/test.html', 'Sams-Bot/1.0')); // expected: false

When I run node myParserFile.js, the console.log prints true.

Empty 'Disallow:' statement incorrectly gobbles the next statement

For a robots.txt file like the following:

User-agent: *
Disallow:

Sitemap: https://www.site.com/sitemap_index.xml

The robots-parser library seems to take the next non-empty line after an empty Disallow, leading to the following parsed output:

Robots {
  _url: URL {
    href: 'user-agent: *Disallow:Sitemap: https://www.site.com/sitemap_index.xml',
    origin: 'null',
    protocol: 'user-agent:',
    username: '',
    password: '',
    host: '',
    hostname: '',
    port: '',
    pathname: ' *Disallow:Sitemap: https://www.site.com/sitemap_index.xml',
    search: '',
    searchParams: URLSearchParams {},
    hash: ''
  },
  _rules: [Object: null prototype] {},
  _sitemaps: [],
  _preferredHost: null
}

I believe an empty Disallow: line is supported in the spec by this ABNF:
rule = *WS ("allow" / "disallow") *WS ":" *WS (path-pattern / empty-pattern) EOL
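
A minimal reproduction sketch of the report (assuming the expected behaviour is that an empty Disallow: allows everything and the Sitemap: line is still picked up):

var robotsParser = require('robots-parser');

var robots = robotsParser('https://www.site.com/robots.txt', [
	'User-agent: *',
	'Disallow:',
	'',
	'Sitemap: https://www.site.com/sitemap_index.xml'
].join('\n'));

robots.isAllowed('https://www.site.com/page.html', 'Sams-Bot/1.0'); // expected: true
robots.getSitemaps(); // expected: ['https://www.site.com/sitemap_index.xml']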

Incorrectly rejecting deep directories with wildcards

const robotsParser = require('robots-parser')

const rules = [
	'User-agent: *',
	'Disallow: /dir*',
]

const robots = robotsParser('http://www.example.com/robots.txt', rules.join('\n'))

const tests = [
	robots.isAllowed('http://www.example.com/test.html'),
	robots.isAllowed('http://www.example.com/directory.html'),
	robots.isAllowed('http://www.example.com/dirty/test.html'),
	robots.isAllowed('http://www.example.com/folder/dir.html'),
	robots.isAllowed('http://www.example.com/hello/world/dir/test.html'),
]

console.log(tests) // [ true, false, false, false, false ]

// should return: [ true, false, false, true, true ]

Google and other engines work as expected with the last two test URLs.

The library should validate the document before processing it

Hi @samclarke ,

I have a script that watches multiple robots.txt files from websites, but in some cases a site has no robots.txt and still serves fallback content. The issue is that your library will report isAllowed() -> true even if HTML code is passed.

  it('should not confirm it can be indexed', async () => {
    const body = `<html></html>`; // fallback HTML returned instead of a robots.txt

    // robotsUrl and rootUrl are defined elsewhere in the test suite
    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    expect(canBeIndexed).toBeFalsy();
  });

(this test will fail, whereas it should pass, or better, it should throw since there are both isDisallowed and isAllowed)

Did I miss something to check the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check if there is any HTML inside the file... hoping the website does not return another format (JSON, raw text...). But it's a bit hacky, no?

EDIT2: a point of view https://stackoverflow.com/a/31598530/3608410
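
One possible workaround along the lines of the EDIT above is to reject bodies that look like HTML before handing them to the parser (a sketch; the heuristic check is an assumption and will not catch every non-robots.txt response):

var robotsParser = require('robots-parser');

// Hypothetical helper: refuse to parse anything that looks like an HTML document.
function parseRobotsIfPlausible(url, body) {
	if (/^\s*</.test(body) || /<html[\s>]/i.test(body)) {
		throw new Error('Response does not look like a robots.txt file');
	}
	return robotsParser(url, body);
}

parseRobotsIfPlausible('https://example.com/robots.txt', 'User-agent: *\nDisallow:'); // parses
parseRobotsIfPlausible('https://example.com/robots.txt', '<html></html>');            // throws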

Crash when passing an invalid URL

It would be nice to handle invalid URLs better instead of crashing. I had to search for quite a while to find out why it was not working, as it was not clear that robots-parser does not accept a URL without a protocol.

It seems it could be done by validating the result of this call at line 173 in Robots.js

this._url = libUrl.parse(url);

The same validation would also be needed in Robots.prototype.isAllowed

Let me know if you'd like me to submit a pull request for this.
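
Until then, a caller-side workaround is to validate the URL before passing it in (a sketch, assuming the WHATWG URL constructor is available as a global):

var robotsParser = require('robots-parser');

// Hypothetical wrapper: reject URLs the URL constructor cannot parse
// (for example ones missing the protocol) instead of letting the parser crash.
function safeRobotsParser(url, contents) {
	try {
		new URL(url);
	} catch (e) {
		throw new Error('Invalid robots.txt URL: ' + url);
	}
	return robotsParser(url, contents);
}

safeRobotsParser('www.example.com/robots.txt', 'User-agent: *\nDisallow:'); // throws: missing protocol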

Library adds port when port is never defined.

robots-parser appears to add a port number to the provided URL, which, more often than not, breaks the ability to parse a robots.txt file. In my case it made the library completely unusable after going through a sample of 500 URLs. There does not seem to be a way to override this.

Robots-parser not working

I just can't seem to make robots-parser work properly. Even visiting http://mywebsite.com/robots.txt gives a 404 error page.

I have inserted this code in my app.js:

var robotsParser = require('robots-parser');

var robots = robotsParser('http://kahon-menthos984.c9users.io/robots.txt', [
	'User-agent: *',
	'Allow: /',
	'Sitemap: http://kahon-menthos984.c9users.io/sitemap.xml',
	'Host: http://kahon-menthos984.c9users.io/'
].join('\n'));

I apologize for the inconvenience but I hope you can help me.

Preferred host breaks isAllowed

The parsedUrl.hostname !== this._url.hostname logic in _getRule doesn't work when the robots.txt's host is different from the request host and the sitemap uses the robots.txt's preferred host.

When the robots.txt sets a preferred host that differs from the crawled request, e.g. 'example.com' instead of 'www.example.com', and the sitemap uses the preferred host (like example.com/about), then isAllowed always fails.

A workaround is to replace the host in each sitemap link tested with isAllowed by the preferred host; a fix would be to check against the preferred host too, but to be honest, the logic is trying to be too clever.

Files without User-agents can't add rules

If a robots.txt doesn't specify any User-agents, like this one, it should default to *. In your code a User-agent is required to add a rule, but you can get around this by adding

if (userAgents.length <= 0) { userAgents.push('*'); }

to the top of Robots.prototype.addRule.

Need help maintaining this project?

Hi @samclarke, I noticed that there are some issues and PRs open, perhaps you didn't get around to reviewing them.

If you like, I'd love to help you maintain this project by reviewing the PRs, adding TypeScript support, more comprehensive tests for different Node versions, and setting up community health files to encourage future contributions.

Although it's just a small file you wrote several years ago, robots-parser has 300k+ downloads per week, so it would be nice to have some systems in place to update (dev) dependencies and ensure project maintenance. What do you say? You can add me as a repository collaborator and I'll get started! 😄

Support for URL object

Hi,

With version 1 we could use URL objects this way:

let myUrl = new URL('http://foo.bar');
robotsTxt.isAllowed(myUrl);

With version 2 it's no longer possible and for this reason existing code breaks when upgrading.

The reason is that URL.parse(url), which was used in version 1, accepts a URL object as a parameter, while new URL.URL(url) does not.

Can we consider adding back support for URL objects?
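
In the meantime, a workaround is to convert the URL object to a string before calling the parser (a sketch; foo.bar is the placeholder host from the example above):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://foo.bar/robots.txt', 'User-agent: *\nDisallow: /private/');

var myUrl = new URL('http://foo.bar/private/page.html');

// Pass the string form of the URL object rather than the object itself.
robots.isAllowed(myUrl.href);    // false
robots.isAllowed(String(myUrl)); // equivalent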

redos vulnerability

Repro steps

const robotsParser = require('robots-parser');
const r = robotsParser('https://localhost/robots.txt', [
  'User-agent: * ',
  'Allow: **************************.js*',
].join('\n'));

r.isAllowed('https://localhost/index.html', 'bot name');

This takes a long time to parse.
Adding more expressions or making the URL longer makes it take more time as well.

Throws error with a robots.txt with a colon at the end or beginning of a line

TypeError: Cannot read property 'indexOf' of null
    at parseRobots (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/Robots.js:102:23)
    at new Robots (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/Robots.js:181:2)
    at module.exports (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/index.js:4:9)
    at /home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:1479:50
    at decodeAndReturnResponse (/home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:496:17)
    at IncomingMessage.<anonymous> (/home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:505:21)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:926:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I fixed it by removing

if (!line)
    return null;

from the trimLine function.

Support for ignoring protocols and ports

Hi,

The below code outputs false then true, because the first parser is created with a URL that uses the HTTP protocol, and the second one with HTTPS.

var robotsParser = require("robots-parser");

var robots1 = robotsParser('http://www.example.com/robots.txt', ["User-agent: *", "Disallow:"].join("\n"));
var robots2 = robotsParser('https://www.example.com/robots.txt', ["User-agent: *", "Disallow:"].join("\n"));

console.log(robots1.isDisallowed("http://www.example.com/test", "useragent"));
console.log(robots2.isDisallowed("http://www.example.com/test", "useragent"));

I understand that based on specifications a robots.txt file is only valid for URLs with the same protocol, host, and port. However, this is hugely inconvenient for programs that are trying to use this module for crawling websites with links that occasionally swap between HTTP and HTTPS.

Should there not be an option to ignore lines 325-330 in Robots.js, or at least limit it to only checking the URL's hostname and/or port? It otherwise makes this module useless for people who are dealing with websites that swap protocols and don't want to take on the inefficiency of needlessly parsing the same robots.txt file twice, for HTTP and HTTPS.
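
A caller-side workaround, pending such an option, is to normalise the checked URL onto the robots.txt's protocol before testing it (a sketch; it assumes only the scheme differs and the host is the same):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', ['User-agent: *', 'Disallow:'].join('\n'));

// Hypothetical helper: force the checked URL onto the protocol the parser was created with.
function isDisallowedIgnoringProtocol(robots, url, ua) {
	var normalised = new URL(url);
	normalised.protocol = 'http:';
	return robots.isDisallowed(normalised.href, ua);
}

isDisallowedIgnoringProtocol(robots, 'https://www.example.com/test', 'useragent'); // false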
