This is the initial prototype for the updated UKWA website.
ukwa / crawl-streams
Tools for working with UKWA crawler event streams
License: Apache License 2.0
Building on the shared kevals library, add hooks to pull data from Kafka or local files and push to a crawl log database.
For crawl streams from Kafka, some field mapping is required to map Heritrix's fields to the warcprox ones:
size > wire_bytes
mimetype > content_type
seed > source
Also, annotations and extra_info need mapping to string arrays, and a warc_writer field needs to be added to mark files as from heritrix3 or warcprox.
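The field mapping described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the comma-separated handling of string annotations, the "key=value" flattening of dictionaries, and the default of heritrix3 for warc_writer are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Heritrix field name -> warcprox field name, as listed in the issue text.
FIELD_MAP = {"size": "wire_bytes", "mimetype": "content_type", "seed": "source"}


def as_string_list(value: Any) -> List[str]:
    """Coerce a field value into a list of strings (hypothetical helper)."""
    if value is None or value == "":
        return []
    if isinstance(value, str):
        # Assumption: string-valued fields are comma-separated.
        return [part for part in value.split(",") if part]
    if isinstance(value, dict):
        # Assumption: dictionaries are flattened to "key=value" strings.
        return [f"{key}={val}" for key, val in value.items()]
    if isinstance(value, (list, tuple, set)):
        return [str(item) for item in value]
    return [str(value)]


def heritrix_to_warcprox(event: Dict[str, Any], writer: str = "heritrix3") -> Dict[str, Any]:
    """Rename Heritrix fields to their warcprox equivalents (hypothetical helper)."""
    out = {FIELD_MAP.get(key, key): val for key, val in event.items()}
    for field in ("annotations", "extra_info"):
        if field in out:
            out[field] = as_string_list(out[field])
    # Mark which tool wrote the record, unless the event already says so.
    out.setdefault("warc_writer", writer)
    return out
```

Events that already carry a warc_writer value (for example, ones coming from warcprox) pass through with that value unchanged.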
The primary task is to move the ukwa-manage Luigi task launcher code into here as a standalone command-line tool that can be run manually or invoked from Airflow.
Then, that tool needs to be updated to use the newer crawl configurations from W3ACT, once they have been added to the crawl feed export, as per ukwa/python-w3act#15
Should we add a FastAPI module so the crawler can have a REST API for launching a crawl? Possibly also for managing scope and seeds files? This could also expose crawl stats etc. The overall approach would be to avoid forcing services that interact with the crawler to go through Kafka, at least for crawl launches.
This could be made consistent with the Browsertrix-Crawler API/model, so we essentially have two separate crawl engines that we can interact with in consistent ways.
In some situations (I haven't pursued an always reproducible case) a BrokenPipeError is raised, e.g. when piping the report output through head.
This can be "fixed" as per the documentation (h/t Stack Overflow) by catching the error like so (e.g. in show_raw_stream):
    import os
    import sys

    try:
        for message in consumer:
            ...
    except BrokenPipeError:
        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)  # Python exits with error code 1 on EPIPE
I haven't committed the fix because I don't know how much of an issue it is outside of my environment (it occurs for me using the last example in the readme); I'm not sure that the fix is appropriate; and it makes the code less readable.
Noted here for review in case it's been encountered before and is considered worth implementing.
This line does not report the right queue/topic:
crawl-streams/crawlstreams/launcher.py
Line 165 in 75741ad
It should use this.queue rather than a hard-coded value.
Currently, non-NPLD (by-permission) URLs that have a domain-crawl frequency (like this one) don't get picked up. It's not clear that they get bundled in with the domain crawl (DC), and TBH they probably shouldn't be, as they won't be marked as BYPM.
So, they should really be treated as annual, and the launcher should be able to launch them without there being a matching current time. The options should also be set up such that they will not force a recrawl (i.e. do not set a launchTimestamp, just use the default recrawl settings, which give a recrawl period of a year).
We could use https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams and make a client to the page-level URL changes stream, and look for UK URLs and domains, and feed them into the crawl input stream.
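As a rough sketch of the filtering step such a client would need, the following parses server-sent-event lines and yields external links on .uk hosts. It makes several assumptions that should be checked against the EventStreams documentation: that the relevant stream delivers one JSON object per "data:" line, and that each event has an added_links list whose entries carry link and external fields. The function names are hypothetical, and no network connection or Kafka producer is shown.

```python
import json
from typing import Iterable, Iterator
from urllib.parse import urlsplit


def is_uk_url(url: str) -> bool:
    """Return True if the URL's host is under the .uk top-level domain."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        # Malformed URLs (e.g. bad IPv6 literals) are simply skipped.
        return False
    return host == "uk" or host.endswith(".uk")


def uk_urls_from_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield external .uk links from an iterable of SSE lines.

    Simplification: each event is assumed to fit on a single "data:" line.
    """
    for line in lines:
        if not line.startswith("data:"):
            continue
        try:
            event = json.loads(line[len("data:"):])
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        # Assumed payload shape: {"added_links": [{"link": ..., "external": ...}]}
        for link in event.get("added_links", []):
            url = link.get("link", "")
            if link.get("external") and is_uk_url(url):
                yield url
```

Each URL yielded here would then be submitted to the crawl input stream; matching on the host suffix alone will miss UK sites on other top-level domains, so a real client would likely need an additional lookup.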
To avoid issues like ukwa/ukwa-heritrix#76, we could:
The cleanest option would be to gather seeds together and assemble an overall launch spec per host, then change how Heritrix processes that to set up the queue config and then enqueue all the seeds with their launchTimestamp used to reflect the most recent launches.
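The gathering step in that option might look something like the sketch below, which groups launch requests by host and keeps the most recent launchTimestamp per host. The request shape (a dict with url and an optional launchTimestamp), the spec layout, and the function name are assumptions for illustration only; timestamps are assumed to be fixed-width strings such as 14-digit values, so that string comparison matches chronological order. How Heritrix would consume the resulting spec is not covered.

```python
from typing import Any, Dict, Iterable
from urllib.parse import urlsplit


def build_host_launch_specs(requests: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group seed launch requests into one launch spec per host (hypothetical)."""
    specs: Dict[str, Dict[str, Any]] = {}
    for req in requests:
        url = req["url"]
        host = urlsplit(url).hostname or ""
        spec = specs.setdefault(
            host, {"host": host, "seeds": [], "launchTimestamp": None}
        )
        # Keep seeds in first-seen order, without duplicates.
        if url not in spec["seeds"]:
            spec["seeds"].append(url)
        # Track the most recent launch requested for this host, if any.
        ts = req.get("launchTimestamp")
        if ts and (spec["launchTimestamp"] is None or ts > spec["launchTimestamp"]):
            spec["launchTimestamp"] = ts
    return specs
```

Hosts whose seeds carry no launchTimestamp keep a value of None, which leaves room for falling back to the default recrawl settings.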