
crawl-streams's Introduction

Crawl Streams

Tools for operating on the event streams relating to our crawler activity.

Commands

When this package is installed, the following commands become available.

crawlstreams.launcher

The launcher command can be used to launch crawls according to a crawl specification. Each specification contains the crawl schedule as well as the seeds and other configuration. Currently, only a proprietary JSON spec. is supported, but crawlspec support is planned for the future.

The current spec. looks like:

{
	"id": 57875,
	"title": "Bae Colwyn Mentre Treftadaeth Treflun | Colwyn Bay Townscape Heritage Initiative",
	"seeds": [
		"https://www.colwynbaythi.co.uk/"
	],
	"depth": "CAPPED",
	"scope": "subdomains",
	"ignoreRobotsTxt": true,
	"schedules": [
		{
			"startDate": "2017-10-17 09:00:00",
			\"endDate": "",
			"frequency": "QUARTERLY"
		}
	],
	"watched": false,
	"documentUrlScheme": null,
	"loginPageUrl": "",
	"logoutUrl": "",
	"secretId": ""
}
  • TBA - the newer fields around parallel queues, etc. need to be added in. See #3

The launcher is designed to be run hourly, and will enqueue all URLs that are due that hour. Finer-grained launching of requests is not yet supported. The crawler itself uses the embedded launch timestamp to determine if the request has already been satisfied, making the requests idempotent. This means it's okay if the launcher accidentally runs multiple times per hour.
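To make the schedule handling concrete, here is a rough sketch of the kind of "is this due now?" check the launcher performs against the spec format shown above. This is illustrative only: the real crawlstreams.launcher code and its frequency handling may well differ, and the feed filename is just an example.

    # Rough sketch of an hourly "is this crawl due now?" check, assuming the
    # JSON spec format shown above. The frequency rules below are illustrative
    # assumptions, not the actual crawlstreams.launcher implementation.
    import json
    from datetime import datetime

    def is_due(schedule, now):
        start = datetime.strptime(schedule['startDate'], '%Y-%m-%d %H:%M:%S')
        if now < start or now.hour != start.hour:
            return False
        freq = schedule['frequency']
        if freq == 'DAILY':
            return True
        if freq == 'WEEKLY':
            return now.weekday() == start.weekday()
        if freq == 'QUARTERLY':
            return now.day == start.day and (now.month - start.month) % 3 == 0
        return False

    now = datetime.now()
    with open('crawl_feed_npld.jsonl') as feed:
        for line in feed:
            spec = json.loads(line)
            if any(is_due(s, now) for s in spec.get('schedules', [])):
                print('Due this hour:', spec['seeds'])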

As an example, launching crawls for the current hour for NPLD looks like this:

$ launcher -k crawler06.n45.bl.uk:9094 fc.tocrawl.npld /shared/crawl_feed_npld.jsonl

and the command will log what happens and report the number of crawls launched.

Development Setup

Outside Docker

To develop directly on your host machine, you'll need Snappy and build tools. e.g. for RHEL/CentOS:

sudo yum install snappy-devel
sudo yum install gcc gcc-c++ libtool

With these in place, this should work:

git clone https://github.com/ukwa/crawl-streams.git
cd crawl-streams/
virtualenv -p python3.7 venv
source venv/bin/activate
python setup.py install

Supporting services

The provided docker-compose.yml file is intended to be used to spin-up local versions of Kafka suitable for developing against. Run

$ docker-compose up -d kafka

A Kafka service should then be running on host port 9094. Kafka has its own protocol, not HTTP, so you can't talk to it via curl etc. However, there is also a generic Kafka UI you can run like this:

$  docker-compose up -d ui

At which point you should be able to visit port 9990 (e.g. http://dev1.n45.wa.bl.uk:9990/) and have a look around.

Running the development version

You should now be able to edit the Python source files and run them to test against the given Kafka service. For example, to submit a URL to the NPLD frequent crawl's to-crawl topic, you can run:

$ python -m crawlstreams.submit -k dev1:9094 fc.tocrawl.npld http://a-test-string.com
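If you prefer to produce a message directly rather than use the submit command, a minimal kafka-python sketch against the same dev broker looks like the following. Note that the payload field names ("url", "isSeed") are assumptions for illustration and may not match the message schema crawlstreams.submit actually emits.

    # Minimal sketch: send a crawl request straight to the topic with kafka-python.
    # The payload field names are assumptions, not the confirmed message schema.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='dev1:9094',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    producer.send('fc.tocrawl.npld', {'url': 'http://a-test-string.com', 'isSeed': True})
    producer.flush()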

To run the reporting script to analyse the contents of the fc.crawled topic, you use:

$ python -m crawlstreams.report -k dev1:9094 -t -1 -q fc.crawled
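For a sense of what such a report involves, the following sketch consumes fc.crawled and tallies records by status code using kafka-python. The "status_code" field name is an assumption about the crawl-log message format.

    # Sketch of the kind of analysis the report script performs: consume the
    # fc.crawled topic and count records by status code. Field names are assumed.
    import json
    from collections import Counter
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'fc.crawled',
        bootstrap_servers='dev1:9094',
        auto_offset_reset='earliest',
        consumer_timeout_ms=10000)  # stop after 10s with no new messages

    counts = Counter()
    for message in consumer:
        record = json.loads(message.value)
        counts[record.get('status_code')] += 1

    print(counts.most_common())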

But this topic will need populating with test data. You can use Kafka's own tools to pull some data from the live service, like this:

$ docker run -i --net host wurstmeister/kafka:2.12-2.1.0 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server crawler05.n45.bl.uk:9094 --topic fc.crawled --max-messages 100 > messages.json

You now have 100 messages from the live system in a JSON file. You can submit these to the local Kafka like this:

$ cat messages.json | docker run -i --net host wurstmeister/kafka:2.12-2.1.0 /opt/kafka/bin/kafka-console-producer.sh --broker-list dev1:9094 --topic fc.crawled

Running on live data

When performing read operations, it's fine to run against the live system. e.g. to see the raw fc.crawled feed from the live Crawler05 instance, you can use:

$  python -m crawlstreams.report -k crawler05.n45.bl.uk:9094 -t -1 -q fc.crawled -r | head

Running on a test crawler

$ python -m crawlstreams.launcher -k crawler05.n45.bl.uk:9094 fc.tocrawl.npld ~/crawl_feed_npld.jsonl

Running inside Docker

TBA

$ docker-compose build
$ docker-compose run crawlstreams -k kafka:9092 ...

docker run --net host -ti wurstmeister/kafka:2.12-2.1.0 /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server crawler06.n45.bl.uk:9094 --from-beginning --topic fc.tocrawl.npld --max-messages 100

crawl-streams's People

Contributors

anjackson, dependabot[bot], ldbiz

crawl-streams's Issues

Add some send-to-crawl-db support

Building on the shared kevals library, add hooks to pull data from Kafka or local files and push to a crawl log database.

For crawl streams from Kafka, some field mapping is required to map Heritrix's fields to the warcprox ones (see the sketch after this list):

  • size > wire_bytes
  • mimetype > content_type
  • seed > source
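A minimal sketch of that remapping, assuming crawl-log records arrive as flat dicts:

    # Minimal sketch of the Heritrix -> warcprox field remapping listed above,
    # assuming each crawl-log record is a flat dict.
    HERITRIX_TO_WARCPROX = {
        'size': 'wire_bytes',
        'mimetype': 'content_type',
        'seed': 'source',
    }

    def remap(record):
        return {HERITRIX_TO_WARCPROX.get(k, k): v for k, v in record.items()}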

Also:

Switch to a keyed, compacted crawl configurations topic

To avoid issues like ukwa/ukwa-heritrix#76, we could:

  • When storing in Kafka, key the launch requests by distinct seed/host configuration.
  • Switch to compacted Kafka so only the latest crawl spec. is kept per unique site key.
  • Ensure Heritrix always reads the whole Kafka topic, thus reconstructing the up-to-date configuration.

The cleanest option would be to gather seeds together and assemble an overall launch spec per host, then change how Heritrix processes that spec so it sets up the queue configuration and then enqueues all the seeds, with their launchTimestamp reflecting the most recent launches.
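As a sketch of the keying idea, the following uses kafka-python to key each launch request by the seed's hostname, so a log-compacted topic would retain only the latest spec per site. The topic name and message shape are assumptions for illustration.

    # Sketch: key launch requests by seed host so a compacted topic keeps only
    # the most recent crawl spec per site. Topic name and payload are assumed.
    import json
    from urllib.parse import urlparse
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='dev1:9094',
        key_serializer=lambda k: k.encode('utf-8'),
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    spec = {'seeds': ['https://www.colwynbaythi.co.uk/'], 'scope': 'subdomains'}
    site_key = urlparse(spec['seeds'][0]).hostname
    producer.send('fc.tocrawl.npld', key=site_key, value=spec)
    producer.flush()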

Add a REST API interface to engage with the crawl engine

Should we add a FastAPI module so the crawler can have a REST API for launching a crawl? Possibly also for managing scope and seeds files? This could also expose crawl stats, etc. The overall approach would be to avoid forcing services that interact with the crawler to go through Kafka, at least for crawl launches.

This could be made consistent with the Browsertrix-Crawler API/model, so we essentially have two separate crawl engines that we can interact with in consistent ways.
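As a rough illustration of the idea (not a design decision), a minimal FastAPI endpoint that forwards launch requests to Kafka might look like this; the endpoint path, request fields and default topic are all hypothetical.

    # Hypothetical sketch of a FastAPI launch endpoint forwarding requests to
    # Kafka. Endpoint path, request fields and payload shape are assumptions.
    import json
    from fastapi import FastAPI
    from kafka import KafkaProducer
    from pydantic import BaseModel

    app = FastAPI()
    producer = KafkaProducer(
        bootstrap_servers='dev1:9094',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    class LaunchRequest(BaseModel):
        url: str
        topic: str = 'fc.tocrawl.npld'

    @app.post('/launch')
    def launch(req: LaunchRequest):
        producer.send(req.topic, {'url': req.url, 'isSeed': True})
        producer.flush()
        return {'launched': req.url}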

Option to support forcing a crawl of 'domaincrawl' by-permission URLs

Currently, non-NPLD (by-permission) but domain-crawl frequency URLs (like this one) don't get picked up. It's not clear whether they get bundled in with the domain crawl (DC), and to be honest they probably shouldn't be, as they won't be marked as BYPM.

So, they should really be treated as annual, and the launcher should be able to launch them without there being a matching current time. The options should also be set up so that they will not force a recrawl (i.e. do not set a launchTimestamp; just use the default recrawl settings, which give a recrawl period of a year).
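A sketch of that launcher behaviour, with illustrative (hypothetical) field names:

    # Sketch of the proposed handling: by-permission, domain-crawl frequency
    # URLs are enqueued WITHOUT a launchTimestamp, so the crawler's default
    # recrawl period (roughly a year) applies. Field names are assumptions.
    def build_launch_message(url, launch_timestamp, force_recrawl=True):
        message = {'url': url, 'isSeed': True}
        if force_recrawl:
            # Normal case: embed the timestamp to force a fetch this launch window.
            message['launchTimestamp'] = launch_timestamp
        # By-permission case: omit the timestamp and let default recrawl rules decide.
        return message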

Pipe Errors when connecting to Kafka

In some situations (I haven't pinned down a reliably reproducible case), a BrokenPipeError is raised, e.g. when piping the report output through head.

This can be "fixed" as per the documentation (h/t Stackoverflow) by catching the error as so:

(eg. in show_raw_stream)

   try:
        for message in consumer:
        :
    except BrokenPipeError:
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE

I haven't committed the fix because I don't know how much of an issue it is outside of my environment (it occurs for me using the last example in the readme); I'm not sure that the fix is appropriate; and it makes the code less readable.

Noted here for review in case it's been encountered before and is considered worth implementing.
