crawl-streams's Introduction

UKWA (prototype)

This is the initial prototype for the updated UKWA website.

crawl-streams's People

Contributors

anjackson, dependabot[bot], ldbiz

crawl-streams's Issues

Add some send-to-crawl-db support

Building on the shared kevals library, add hooks to pull data from Kafka or local files and push to a crawl log database.

For crawl streams from Kafka, some field mapping is required to map Heritrix's fields to the warcprox ones:

  • size > wire_bytes
  • mimetype > content_type
  • seed > source
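The mapping above could be applied with a simple rename step before records are pushed to the crawl log database. This is a minimal sketch: the function name `to_warcprox_fields` and the example record are assumptions for illustration, and only the three field pairs listed above come from this issue.

```python
# Heritrix field name -> warcprox field name, as listed in this issue.
HERITRIX_TO_WARCPROX = {
    "size": "wire_bytes",
    "mimetype": "content_type",
    "seed": "source",
}


def to_warcprox_fields(record: dict) -> dict:
    """Return a copy of a crawl-log record with Heritrix field names renamed.

    Fields without a mapping are passed through unchanged.
    """
    return {HERITRIX_TO_WARCPROX.get(key, key): value for key, value in record.items()}


# Hypothetical record, for illustration only.
example = {"url": "http://example.org/", "size": 1234, "mimetype": "text/html"}
print(to_warcprox_fields(example))
```

Keeping the mapping in a single dictionary means further field pairs can be added without touching the rename logic.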

Also:

Add a REST API interface to engage with the crawl engine

Should we add a FastAPI module so the crawler can have a REST API for launching a crawl, and possibly also for managing scope and seeds files? This could also expose crawl stats etc. The overall approach would be to avoid forcing services that interact with the crawler to go through Kafka, at least for crawl launches.

This could be made consistent with the Browsertrix-Crawler API/model, so we essentially have two separate crawl engines that we can interact with in consistent ways.
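To make the proposal more concrete, here is a framework-agnostic sketch of the operations such an API might expose. It deliberately uses only the standard library; a FastAPI route would be a thin wrapper around each method. Every name here (`LaunchRequest`, `CrawlApi`, the suggested routes in the comments) is hypothetical and not taken from this repository or from Browsertrix-Crawler.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class LaunchRequest:
    """Hypothetical request body for launching a crawl."""
    seeds: List[str]
    scope: List[str] = field(default_factory=list)


class CrawlApi:
    """In-memory stand-in for the crawl engine behind a REST layer."""

    def __init__(self) -> None:
        self._crawls: Dict[int, LaunchRequest] = {}

    def launch(self, request: LaunchRequest) -> dict:
        # Could back something like POST /crawls (route name is an assumption).
        if not request.seeds:
            raise ValueError("at least one seed is required")
        crawl_id = len(self._crawls) + 1
        self._crawls[crawl_id] = request
        return {"id": crawl_id, **asdict(request)}

    def stats(self) -> dict:
        # Could back something like GET /stats (route name is an assumption).
        return {"crawls_launched": len(self._crawls)}
```

Keeping the logic separate from the web framework would also make it easier to give both crawl engines the same interface.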

Pipe Errors when connecting to Kafka

In some situations (I haven't tried to pin down a consistently reproducible case) a BrokenPipeError is raised, e.g. when piping the report output through head.

This can be "fixed" as per the documentation (h/t Stackoverflow) by catching the error like so:

(e.g. in show_raw_stream)

    import os
    import sys

    try:
        for message in consumer:
            ...  # per-message handling, elided in the original
        # Flush here so a broken pipe is raised inside this try block
        sys.stdout.flush()
    except BrokenPipeError:
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE

I haven't committed the fix because I don't know how much of an issue it is outside of my environment (it occurs for me using the last example in the readme); I'm not sure that the fix is appropriate; and it makes the code less readable.

Noted here for review in case it's been encountered before and is considered worth implementing.
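On the readability concern, one option would be to move the handling into a decorator so each command function stays unchanged. The sketch below is an assumption about how that could look, not code from this repository: the name `exit_quietly_on_broken_pipe` is made up, and the handler body follows the pattern from the Python documentation.

```python
import functools
import os
import sys


def exit_quietly_on_broken_pipe(func):
    """Wrap a CLI entry point so a closed stdout pipe ends the process cleanly."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
            # Flush inside the try block so a broken pipe surfaces here
            # rather than during interpreter shutdown.
            sys.stdout.flush()
            return result
        except BrokenPipeError:
            try:
                # Redirect remaining output to devnull to avoid a second
                # BrokenPipeError when Python flushes stdout on exit.
                devnull = os.open(os.devnull, os.O_WRONLY)
                os.dup2(devnull, sys.stdout.fileno())
            except (AttributeError, OSError, ValueError):
                pass  # stdout has no usable file descriptor
            sys.exit(1)  # Python exits with error code 1 on EPIPE

    return wrapper
```

A function such as show_raw_stream could then be decorated with `@exit_quietly_on_broken_pipe` and keep its loop body as it is.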

Option to support forcing a crawl of 'domaincrawl' by-permission URLs

Currently, non-NPLD (by-permission) but domain-crawl frequency URLs (like this one) don't get picked up. It's not clear they get bundled in with the DC, and TBH they probably shouldn't as they won't be marked as BYPM.

So, they should really be treated as annual, and the launcher should be able to launch them without there being a matching current time. The options should also be set up such that they will not force a recrawl (i.e. do not set a launchTimestamp, just use the default recrawl settings, which is a recrawl period of a year).
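A rough sketch of the launcher decision described above is shown below. All names are hypothetical: `plan_launch`, the `frequency` and `by_permission` target fields, the "DOMAINCRAWL" label and the `force_bypm_domaincrawl` option are assumptions for illustration, not the launcher's actual data model. Only `launchTimestamp` is taken from this issue.

```python
from typing import Optional


def plan_launch(target: dict, due_now: bool, now: str,
                force_bypm_domaincrawl: bool = False) -> Optional[dict]:
    """Return a launch request for `target`, or None if it should not launch.

    All field names and frequency labels here are hypothetical.
    """
    bypm_dc = target["frequency"] == "DOMAINCRAWL" and target["by_permission"]
    if bypm_dc:
        if not force_bypm_domaincrawl:
            return None  # without the option, these are not picked up
        # Launch regardless of schedule, but leave launchTimestamp unset so
        # the default recrawl period applies and no recrawl is forced.
        return {"url": target["url"]}
    if due_now:
        # Regular targets launch on schedule with an explicit timestamp.
        return {"url": target["url"], "launchTimestamp": now}
    return None
```

Making the behaviour opt-in keeps ordinary scheduled launches unaffected.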

Switch to a keyed, compacted crawl configurations topic

To avoid issues like ukwa/ukwa-heritrix#76, we could:

  • When storing in Kafka, key the launch requests by distinct seed/host configuration.
  • Switch to a compacted Kafka topic so only the latest crawl spec. is kept per unique site key.
  • Ensure Heritrix always reads the whole Kafka topic, thus reconstructing the up-to-date configuration.

Cleanest option would be to gather seeds together and assemble an overall launch spec per host, then change how Heritrix processes that to set up the queue config and then enqueue all the seeds with their launchTimestamp used to reflect the most recent launches.
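The keying and replay idea can be illustrated without a broker. The sketch below models the topic as a plain list of (key, value) records and assumes the key is the seed's hostname, which is only one possible reading of "distinct seed/host configuration". The names `site_key`, `send` and `replay` are made up for this example.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse

Record = Tuple[str, dict]


def site_key(seed: str) -> str:
    """Derive a per-site key from a seed URL (assumption: key by hostname)."""
    return urlparse(seed).hostname or seed


def send(topic: List[Record], seed: str, spec: dict) -> None:
    """Append a launch spec to the in-memory 'topic', keyed by site."""
    topic.append((site_key(seed), spec))


def replay(topic: List[Record]) -> Dict[str, dict]:
    """Read the whole topic in order, keeping the last spec seen per key.

    A reader of a compacted topic ends up with the same result, whether or
    not older records for a key have already been discarded.
    """
    latest: Dict[str, dict] = {}
    for key, spec in topic:
        latest[key] = spec
    return latest


topic: List[Record] = []
send(topic, "http://example.org/a", {"depth": 1})
send(topic, "http://example.com/", {"depth": 2})
send(topic, "http://example.org/b", {"depth": 3})
print(replay(topic))
```

Because the reader always rebuilds its state from the full topic, a later spec for a site simply replaces the earlier one.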
