
robots-parser's Introduction

Robots Parser

A robots.txt parser which aims to be compliant with the draft specification.

The parser currently supports:

  • User-agent:
  • Allow:
  • Disallow:
  • Sitemap:
  • Crawl-delay:
  • Host:
  • Paths with wildcards (*) and EOL matching ($)

Installation

Via NPM:

npm install robots-parser

or via Yarn:

yarn add robots-parser

Usage

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /dir/',
	'Disallow: /test.html',
	'Allow: /dir/test.html',
	'Allow: /test.html',
	'Crawl-delay: 1',
	'Sitemap: http://example.com/sitemap.xml',
	'Host: example.com'
].join('\n'));

robots.isAllowed('http://www.example.com/test.html', 'Sams-Bot/1.0'); // true
robots.isAllowed('http://www.example.com/dir/test.html', 'Sams-Bot/1.0'); // true
robots.isDisallowed('http://www.example.com/dir/test2.html', 'Sams-Bot/1.0'); // true
robots.getCrawlDelay('Sams-Bot/1.0'); // 1
robots.getSitemaps(); // ['http://example.com/sitemap.xml']
robots.getPreferredHost(); // example.com

isAllowed(url, [ua])

boolean or undefined

Returns true if crawling the specified URL is allowed for the specified user-agent.

This will return undefined if the URL isn't valid for this robots.txt.
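
For example, a URL on a different host than the robots.txt is not valid for it, so the check returns undefined rather than a boolean (a minimal sketch using placeholder hosts):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /private/'
].join('\n'));

robots.isAllowed('http://www.example.com/private/page.html', 'Sams-Bot/1.0'); // false
robots.isAllowed('http://www.example.com/index.html', 'Sams-Bot/1.0');        // true
robots.isAllowed('http://other.example.org/index.html', 'Sams-Bot/1.0');      // undefined (different host)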

isDisallowed(url, [ua])

boolean or undefined

Returns true if crawling the specified URL is not allowed for the specified user-agent.

This will return undefined if the URL isn't valid for this robots.txt.

getMatchingLineNumber(url, [ua])

number or undefined

Returns the line number of the matching directive for the specified URL and user-agent if any.

Line numbers start at 1 and go up (1-based indexing).

Returns -1 if there is no matching directive. If a rule is manually added without a lineNumber then this will return undefined for that rule.
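
For example, with line numbers counted from the first line of the robots.txt (a sketch, assuming the most specific matching rule's line is the one reported):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',         // line 1
	'Disallow: /dir/',       // line 2
	'Allow: /dir/test.html'  // line 3
].join('\n'));

robots.getMatchingLineNumber('http://www.example.com/dir/test.html', 'Sams-Bot/1.0');  // 3
robots.getMatchingLineNumber('http://www.example.com/dir/other.html', 'Sams-Bot/1.0'); // 2
robots.getMatchingLineNumber('http://www.example.com/index.html', 'Sams-Bot/1.0');     // -1 (no matching directive)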

getCrawlDelay([ua])

number or undefined

Returns the number of seconds the specified user-agent should wait between requests.

Returns undefined if no crawl delay has been specified for this user-agent.

getSitemaps()

array

Returns an array of sitemap URLs specified by the sitemap: directive.

getPreferredHost()

string or null

Returns the preferred host name specified by the host: directive or null if there isn't one.

Changes

Version 3.0.1

  • Fixed bug with https: URLs defaulting to port 80 instead of 443 if no port is specified. Thanks to @dskvr for reporting

    This affects comparing URLs with the default HTTPS port to URLs without it. For example, comparing https://example.com/ to https://example.com:443/ or vice versa.

    They should be treated as equivalent but weren't due to the incorrect port being used for https:.
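
    A rough illustration of the affected case (a sketch; before 3.0.1 the second check could return undefined because the two URLs were treated as different origins):

    var robotsParser = require('robots-parser');

    var robots = robotsParser('https://example.com/robots.txt', 'User-agent: *\nDisallow: /private/');

    // https://example.com/ and https://example.com:443/ are the same origin,
    // so both checks should return a boolean rather than undefined.
    robots.isAllowed('https://example.com/page.html');     // true
    robots.isAllowed('https://example.com:443/page.html'); // true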

Version 3.0.0

  • Changed to using global URL object instead of importing. – Thanks to @brendankenny

Version 2.4.0:

  • Added TypeScript definitions
    – Thanks to @danhab99 for creating them
  • Added SECURITY.md policy and CodeQL scanning

Version 2.3.0:

  • Fixed a bug where passing "constructor" as the user-agent to isAllowed() / isDisallowed() would throw an error.

  • Added support for relative URLs. This does not affect the default behavior, so it is safe to upgrade.

    Relative matching is only allowed if both the robots.txt URL and the URLs being checked are relative.

    For example:

    var robots = robotsParser('/robots.txt', [
        'User-agent: *',
        'Disallow: /dir/',
        'Disallow: /test.html',
        'Allow: /dir/test.html',
        'Allow: /test.html'
    ].join('\n'));
    
    robots.isAllowed('/test.html', 'Sams-Bot/1.0'); // false
    robots.isAllowed('/dir/test.html', 'Sams-Bot/1.0'); // true
    robots.isDisallowed('/dir/test2.html', 'Sams-Bot/1.0'); // true

Version 2.2.0:

  • Fixed a bug with matching wildcard patterns against some URLs – Thanks to @ckylape for reporting and fixing
  • Changed matching algorithm to match Google's implementation in google/robotstxt
  • Changed order of precedence to match current spec
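
    Under the current draft spec the most specific (longest) matching rule wins, and Allow is preferred when an Allow and a Disallow rule match with equal specificity. A sketch of the expected behaviour:

    var robotsParser = require('robots-parser');

    var robots = robotsParser('http://www.example.com/robots.txt', [
        'User-agent: *',
        'Disallow: /page',
        'Allow: /page.html'
    ].join('\n'));

    // 'Allow: /page.html' is longer (more specific) than 'Disallow: /page'
    robots.isAllowed('http://www.example.com/page.html', 'Sams-Bot/1.0'); // true
    robots.isAllowed('http://www.example.com/page.php', 'Sams-Bot/1.0');  // false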

Version 2.1.1:

  • Fixed a bug that could be exploited to make rule checking take a very long time – Thanks to @andeanfog

Version 2.1.0:

  • Removed use of the punycode module APIs as the new URL API handles it
  • Improved test coverage
  • Added tests for percent encoded paths and improved support
  • Added getMatchingLineNumber() method
  • Fixed bug with comments on same line as directive

Version 2.0.0:

This release is not 100% backwards compatible as it now uses the new URL APIs which are not supported in Node < 7.

  • Updated code to not use deprecated URL module APIs. – Thanks to @kdzwinel

Version 1.0.2:

  • Fixed error caused by invalid URLs missing the protocol.

Version 1.0.1:

  • Fixed a bug with the "user-agent" rule being treated as case-sensitive. – Thanks to @brendonboshell
  • Improved test coverage. – Thanks to @schornio

Version 1.0.0:

  • Initial release.

License

The MIT License (MIT)

Copyright (c) 2014 Sam Clarke

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

robots-parser's People

Contributors

brendankenny, brendonboshell, danhab99, dependabot[bot], kdzwinel, samclarke

robots-parser's Issues

Returns true for disallowed route

Hey, nice project! I recently started using it in my personal projects, but I get this:

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', [
	'User-agent: *',
	'Disallow: /dir/',
	'Disallow: /test.html',
	'Allow: /dir/test.html',
	'Allow: /test.html',
	'Crawl-delay: 1',
	'Sitemap: http://example.com/sitemap.xml',
	'Host: example.com'
].join('\n'));

console.log(robots.isAllowed('http://www.example.com/test.html', 'Sams-Bot/1.0')); // expected: false

When I run node myParserFile.js, the console.log prints true.

Empty 'Disallow:' statement incorrectly gobbles the next statement

For a robots.txt file like the following:

User-agent: *
Disallow:

Sitemap: https://www.site.com/sitemap_index.xml

The robots-parser library seems to take the next non-empty line after an empty Disallow, leading to the following parsed output:

Robots {
  _url: URL {
    href: 'user-agent: *Disallow:Sitemap: https://www.site.com/sitemap_index.xml',
    origin: 'null',
    protocol: 'user-agent:',
    username: '',
    password: '',
    host: '',
    hostname: '',
    port: '',
    pathname: ' *Disallow:Sitemap: https://www.site.com/sitemap_index.xml',
    search: '',
    searchParams: URLSearchParams {},
    hash: ''
  },
  _rules: [Object: null prototype] {},
  _sitemaps: [],
  _preferredHost: null
}

I believe an empty Disallow: line is supported in the spec by this ABNF:
rule = *WS ("allow" / "disallow") *WS ":" *WS (path-pattern / empty-pattern) EOL
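
A minimal reproduction sketch of the report (assuming the expected behaviour is that an empty Disallow: allows everything and the Sitemap: line is still picked up):

var robotsParser = require('robots-parser');

var robots = robotsParser('https://www.site.com/robots.txt', [
	'User-agent: *',
	'Disallow:',
	'',
	'Sitemap: https://www.site.com/sitemap_index.xml'
].join('\n'));

robots.isAllowed('https://www.site.com/page.html', 'Sams-Bot/1.0'); // expected: true
robots.getSitemaps(); // expected: ['https://www.site.com/sitemap_index.xml']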

Incorrectly rejecting deep directories with wildcards

const robotsParser = require('robots-parser')

const rules = [
	'User-agent: *',
	'Disallow: /dir*',
]

const robots = robotsParser('http://www.example.com/robots.txt', rules.join('\n'))

const tests = [
	robots.isAllowed('http://www.example.com/test.html'),
	robots.isAllowed('http://www.example.com/directory.html'),
	robots.isAllowed('http://www.example.com/dirty/test.html'),
	robots.isAllowed('http://www.example.com/folder/dir.html'),
	robots.isAllowed('http://www.example.com/hello/world/dir/test.html'),
]

console.log(tests) // [ true, false, false, false, false ]

// should return: [ true, false, false, true, true ]

Google and other engines work as expected with the last two test URLs.

The library should validate the document before processing it

Hi @samclarke ,

I have a script that watches multiple robots.txt files from websites, but in some cases a site has no robots.txt and still serves fallback content. The issue is that your library will report isAllowed() -> true even if HTML code is passed.

  it('should not confirm it can be indexed', async () => {
    const body = `<html></html>`; // fallback HTML returned instead of a robots.txt

    // robotsUrl and rootUrl are defined elsewhere in the test suite
    const robots = robotsParser(robotsUrl, body);
    const canBeIndexed = robots.isAllowed(rootUrl);

    expect(canBeIndexed).toBeFalsy();
  });

(this test will fail, whereas it should pass, or better, it should throw since there are both isDisallowed and isAllowed)

Did I miss something to check the robots.txt format?

Does it make sense to throw an error instead of allowing/disallowing something based on nothing?

Thank you,

EDIT: a workaround could be to check if there is any HTML inside the file... hoping the website does not return another format (JSON, raw text...). But it's a bit hacky, no?

EDIT2: a point of view https://stackoverflow.com/a/31598530/3608410
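
One possible workaround along the lines of the EDIT above is to reject bodies that look like HTML before handing them to the parser (a sketch; the heuristic check is an assumption and will not catch every non-robots.txt response):

var robotsParser = require('robots-parser');

// Hypothetical helper: refuse to parse anything that looks like an HTML document.
function parseRobotsIfPlausible(url, body) {
	if (/^\s*</.test(body) || /<html[\s>]/i.test(body)) {
		throw new Error('Response does not look like a robots.txt file');
	}
	return robotsParser(url, body);
}

parseRobotsIfPlausible('https://example.com/robots.txt', 'User-agent: *\nDisallow:'); // parses
parseRobotsIfPlausible('https://example.com/robots.txt', '<html></html>');            // throws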

Crash when passing an invalid URL

It would be nice to handle invalid URLs better instead of crashing. I had to search for quite a while to find out why it was not working, as it was not clear that robots-parser does not accept a URL without a protocol.

It seems it could be done by validating the result of this call at line 173 in Robots.js

this._url = libUrl.parse(url);

The same validation would also be needed in Robots.prototype.isAllowed

Let me know if you'd like me to submit a pull request for this.
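
Until then, a caller-side workaround is to validate the URL before passing it in (a sketch, assuming the WHATWG URL constructor is available as a global):

var robotsParser = require('robots-parser');

// Hypothetical wrapper: reject URLs the URL constructor cannot parse
// (for example ones missing the protocol) instead of letting the parser crash.
function safeRobotsParser(url, contents) {
	try {
		new URL(url);
	} catch (e) {
		throw new Error('Invalid robots.txt URL: ' + url);
	}
	return robotsParser(url, contents);
}

safeRobotsParser('www.example.com/robots.txt', 'User-agent: *\nDisallow:'); // throws: missing protocol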

Library adds port when port is never defined.

robots-parser appears to add a port number to the provided URL, which, more often than not, breaks the ability to parse a robots.txt file. In my case it made the library completely unusable after going through a sample of 500 URLs. There does not seem to be a way to override this.

Robots-parser not working

I just can't seem to make robots-parser work properly. Even visiting http://mywebsite.com/robots.txt gives a 404 error page.

I have inserted this code in my app.js:

var robotsParser = require('robots-parser');

var robots = robotsParser('http://kahon-menthos984.c9users.io/robots.txt', [
	'User-agent: *',
	'Allow: /',
	'Sitemap: http://kahon-menthos984.c9users.io/sitemap.xml',
	'Host: http://kahon-menthos984.c9users.io/'
].join('\n'));

I apologize for the inconvenience but I hope you can help me.

Preferred host breaks isAllowed

The parsedUrl.hostname !== this._url.hostname logic in _getRule doesn't work when the robots.txt's host is different from the request host and the sitemap uses the robots.txt's preferred host.

When the robots.txt sets a preferred host that differs from the crawled request, e.g. 'example.com' instead of 'www.example.com', and the sitemap uses the preferred host (like example.com/about), then isAllowed always fails.

A workaround is to replace the host in each sitemap link tested with isAllowed by the preferred host; a fix would be to check against the preferred host too, but to be honest, the logic is trying to be too clever.

Files without User-agents can't add rules

If a robots.txt doesn't specify any User-agents, like this one, it should default to *. In your code a User-agent is required to add a rule, but you can get around this by adding

if (userAgents.length <= 0) { userAgents.push('*'); }

to the top of Robots.prototype.addRule.

Need help maintaining this project?

Hi @samclarke, I noticed that there are some issues and PRs open, perhaps you didn't get around to reviewing them.

If you like, I'd love to help you maintain this project by reviewing the PRs, adding TypeScript support, more comprehensive tests for different Node versions, and setting up community health files to encourage future contributions.

Although it's just a small file you wrote several years ago, robots-parser has 300k+ downloads per week, so it would be nice to have some systems in place to update (dev) dependencies and ensure project maintenance. What do you say? You can add me as a repository collaborator and I'll get started! 😄

Support for URL object

Hi,

With version 1 we could use URL objects this way:

let myUrl = new URL('http://foo.bar');
robotsTxt.isAllowed(myUrl);

With version 2 it's no longer possible and for this reason existing code breaks when upgrading.

The reason is that URL.parse(url), which was used in version 1, accepts a URL object as a parameter, while new URL.URL(url) does not.

Can we consider adding back support for URL objects?
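
In the meantime, a workaround is to convert the URL object to a string before calling the parser (a sketch; foo.bar is the placeholder host from the example above):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://foo.bar/robots.txt', 'User-agent: *\nDisallow: /private/');

var myUrl = new URL('http://foo.bar/private/page.html');

// Pass the string form of the URL object rather than the object itself.
robots.isAllowed(myUrl.href);    // false
robots.isAllowed(String(myUrl)); // equivalent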

redos vulnerability

Repro steps

const robotsParser = require('robots-parser');
const r = robotsParser('https://localhost/robots.txt', [
  'User-agent: * ',
  'Allow: **************************.js*',
].join('\n'));

r.isAllowed('https://localhost/index.html', 'bot name');

This takes a long time to parse.
Adding more expressions or making the URL longer makes it take more time as well.

Throws error with a robots.txt with a colon at the end or beginning of a line

TypeError: Cannot read property 'indexOf' of null
    at parseRobots (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/Robots.js:102:23)
    at new Robots (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/Robots.js:181:2)
    at module.exports (/home/cat/Projects/firm-email-crawler/node_modules/robots-parser/index.js:4:9)
    at /home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:1479:50
    at decodeAndReturnResponse (/home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:496:17)
    at IncomingMessage.<anonymous> (/home/cat/Projects/firm-email-crawler/node_modules/simplecrawler/lib/crawler.js:505:21)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:926:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I fixed it by removing

if (!line)
    return null;

from the trimLine function.

Support for ignoring protocols and ports

Hi,

The below code outputs false then true, because the first parser is created with a URL that uses the HTTP protocol, and the second one with HTTPS.

var robotsParser = require("robots-parser");

var robots1 = robotsParser('http://www.example.com/robots.txt', ["User-agent: *", "Disallow:"].join("\n"));
var robots2 = robotsParser('https://www.example.com/robots.txt', ["User-agent: *", "Disallow:"].join("\n"));

console.log(robots1.isDisallowed("http://www.example.com/test", "useragent"));
console.log(robots2.isDisallowed("http://www.example.com/test", "useragent"));

I understand that based on specifications a robots.txt file is only valid for URLs with the same protocol, host, and port. However, this is hugely inconvenient for programs that are trying to use this module for crawling websites with links that occasionally swap between HTTP and HTTPS.

Should there not be an option to ignore lines 325-330 in Robots.js, or at least limit it to only checking the URL's hostname and/or port? It otherwise makes this module useless for people who are dealing with websites that swap protocols and don't want to take on the inefficiency of needlessly parsing the same robots.txt file twice, for HTTP and HTTPS.
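
A caller-side workaround, pending such an option, is to normalise the checked URL onto the robots.txt's protocol before testing it (a sketch; it assumes only the scheme differs and the host is the same):

var robotsParser = require('robots-parser');

var robots = robotsParser('http://www.example.com/robots.txt', ['User-agent: *', 'Disallow:'].join('\n'));

// Hypothetical helper: force the checked URL onto the protocol the parser was created with.
function isDisallowedIgnoringProtocol(robots, url, ua) {
	var normalised = new URL(url);
	normalised.protocol = 'http:';
	return robots.isDisallowed(normalised.href, ua);
}

isDisallowedIgnoringProtocol(robots, 'https://www.example.com/test', 'useragent'); // false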
