
UKWA Manage

Tools for managing the UK Web Archive

Dependencies

This codebase contains many of the command-line tools used to run automation tasks at the UK Web Archive, via the Docker container version, orchestrated by Apache Airflow. This runs local commands and MrJob Hadoop jobs, the latter written in either Java or Python and able to run on the older or newer Hadoop clusters. As such, the dependencies are quite complex.

Requirements in builds are handled via requirements.txt files, which pin specific versions of dependencies. Therefore, for the builds to remain consistent, the versions of modules that appear in multiple places (e.g. requests) have to be synchronised across all of them. For example, upgrading requests means updating all five Python codebases, otherwise the build will fail.

Similarly, upgrading the version of Python needs to be done across all dependencies, including the Docker base image (as well as all nodes in the Hadoop cluster, with the mrjob_*.conf files updated to refer to it), to ensure full compatibility.

Getting started

n.b. we currently run Python 3.7 on the Hadoop cluster, so streaming Hadoop tasks need to stick to that version.

Set up a Python 3.7 environment

  sudo yum install snappy-devel
  sudo pip install virtualenv
  virtualenv -p python3.7 venv
  source venv/bin/activate

Install UKWA modules and other required dependencies:

  pip install --no-cache --upgrade https://github.com/ukwa/hapy/archive/master.zip
  pip install --no-cache --upgrade https://github.com/ukwa/python-w3act/archive/master.zip
  pip install --no-cache --upgrade https://github.com/ukwa/crawl-streams/archive/master.zip
  pip install -r requirements.txt

Running the tools

To run the tools during development:

  export PYTHONPATH=.
  python lib/store/cmd.py -h

To install:

  python setup.py install

then e.g.

  store -h

Or they can be built and run via Docker, which is useful for tasks that need to run Hadoop jobs, and for rolling out to production, e.g.

docker-compose build tasks
docker-compose run tasks store -h

Management Commands:

The main management commands are trackdb, store and windex:

trackdb

This tool is for directly working with the TrackDB, which we use to keep track of what's going on. See <lib/trackdb/README.md> for details.

store

This tool is for working with the HDFS store via the WebHDFS API, e.g. uploading and downloading files. See <lib/store/README.md> for details.

windex

This tool is for managing our CDX and Solr indexes - e.g. running indexing jobs. It talks to the TrackDB, and can also talk to the HDFS store if needed. See <lib/windex/README.md> for details.

Code and configuration

The older versions of this codebase are in the prototype folder, so we can copy in and update tasks as we need. The tools are defined in sub-folders of the lib folder, and some Luigi tasks are defined in the tasks folder.

A Luigi configuration file is not currently included, as we have to use two different files to provide two different levels of integration. In short, ingest services are given write access to HDFS via the Hadoop command line, while access services have limited read-only access via our proxied WebHDFS gateway.

Example: Manually Processing a WARC collection

This probably needs to be simplified and moved to a separate page

We collected some WARCs for EThOS as an experiment.

A script like this was used to upload them:

#!/bin/bash
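# Mount the warcprox output directory into the container, then use the Dockerised
# 'store put' command to upload each WARC to HDFS under /1_data/ethos/.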
for WARC in warcs/*
do
  docker run -i -v /mnt/lr10/warcprox/warcs:/warcs ukwa/ukwa-manage store put ${WARC} /1_data/ethos/${WARC}
done

Note that we're using the Docker image to run the tasks, to avoid having to install the software on the host machine.

The files can now be listed using:

docker run -i ukwa/ukwa-manage store list -I /1_data/ethos/warcs > ethos-warcs.ids
docker run -i ukwa/ukwa-manage store list -j /1_data/ethos/warcs > ethos-warcs.jsonl

The JSONL format can be imported into TrackDB (this defaults to using the DEV TrackDB).

cat ethos-warcs.jsonl | docker run -i ukwa/ukwa-manage trackdb files import -

These can then be manipulated to set them up as a kind of content stream:

cat ethos-warcs.ids | trackdb files update --set stream_s ethos -
cat ethos-warcs.ids | trackdb files update --set kind_s warcs -

......

Heritrix Jargon

Notes on queue precedence

A queue's precedence is determined by the precedence provider, usually based on the last crawled URI. Note that a lower precedence value means 'higher priority'.

Precedence is used to determine which queues are brought from inactive to active first. Once the precedence of a queue exceeds the 'floor' (255 by default), it is considered ineligible and won't be crawled any further.

The vocabulary here is confusing: the 'floor' refers to the lowest priority, but it is actually the highest allowed integer value.

In practice, unless you use a special precedence policy or tinker with the precedence floor, you will never hit an ineligible condition.

A use for this would be a precedence policy that gradually lowers the precedence (cumulatively) as it encounters more and more 'junky' URLs. But I'm not aware of anyone using it in that manner.


ukwa-manage's Issues

Improve monitoring and reporting

  • Tasks: Push all successful chain events into ElasticsearchTargets as well as on-disk, for monitoring?
  • Tasks: Add warnings when crucial tasks remain incomplete for long periods, and patch them into the dashboard display (instead of the current queues).
    • e.g. Note that the screen-shotter was failing and no notification was emitted.
    • This is to be done by pushing success metrics to Prometheus, which can be configured to raise the alert.
  • HDFS: summary stats to look for issues. See #29
  • Crawls: Stats on each launch and stage, broken down by host, somewhere useful.
  • Crawls: Dead seeds report needed, along with crawl logs and reports including crawl delays (as per #1).
  • Crawl: Consider cross-checking against H3 crawl reports (see here)
  • Crawl: Fix up report stats in H3 processors

Improve title-level metadata export to make access clear

The title-level metadata export task and associated template should be extended to make the terms clear.

The licence status can be determined as per this fragment of a similar task

if target.get('hasOpenAccessLicense', False):

Based on this, we can compose a URL to Ericom/LD UKWA or OA UKWA as appropriate. We can also populate a field to express the access terms. I propose we add a dc:rights field that contains the relevant text.
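A rough sketch of how this could work, assuming a hypothetical fieldUrls field and illustrative access-point URL prefixes (only hasOpenAccessLicense comes from the fragment above):

# Sketch only: field names other than 'hasOpenAccessLicense' and the URL prefixes
# below are assumptions, not the actual export task.
OA_PREFIX = "https://www.webarchive.org.uk/wayback/archive/"  # assumed OA access point
LD_PREFIX = "https://bl.ldls.org.uk/welcome.html?ark="        # assumed LD access point

def access_fields(target):
    url = (target.get('fieldUrls') or [{}])[0].get('url', '')  # assumed W3ACT field
    if target.get('hasOpenAccessLicense', False):
        access_url = OA_PREFIX + url
        rights = "Freely available via the UK Web Archive."
    else:
        access_url = LD_PREFIX + url
        rights = "Available only on Legal Deposit Library premises."
    return access_url, rights  # the rights text would populate the dc:rights field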

Move To HDFS needs to be more robust and scalable

The process is currently single-threaded, and does not cope well with errors such as:

: Max retries exceeded with url: /webhdfs/v1/1_data/dc/2016/heritrix/output/warcs/dc0-20160810/BL-20161012062734039-07354-13342~crawler04~8443.warc.gz?user.name=hdfs&op=GETFILESTATUS (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f2a3ab3ed90>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
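A minimal sketch of one way to harden this, assuming the transfers go through the requests library (the WebHDFS URL in the comment is illustrative):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def webhdfs_session(retries=5, backoff=2.0):
    # Retry transient connection/DNS failures and common gateway errors with
    # exponential back-off, rather than failing the whole transfer run.
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=backoff,
                  status_forcelist=[500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retry))
    return session

# e.g. webhdfs_session().get('http://namenode:14000/webhdfs/v1/...?op=GETFILESTATUS&user.name=hdfs')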

Luigi issues

This ticket is for collecting the issues we've hit with Luigi itself, especially while working on Hadoop Streaming tasks.

  • Depends on mechanize which is unsupported on Python3.
  • Module packaging logic using wrong syntax to find paths. Needs p = list(package.__path__)[0] not p = package.__path__[0]
  • Module packaging skips dotfiles which breaks packaging of tldextract
  • Module packaging does not support egg files and it's not clear why (from the git history it seems simply to have never been implemented)
  • Improve documentation on how to use extra_files (i.e. needs a list)

Set up beta and production deployment via Docker Compose

We will need slightly different Docker Compose files for the beta and production deployments of the new crawl engine. The dev setup needs instances of all necessary services, whereas the beta and production deployments may rely on systems defined in other Compose files or as long-lived VM or bare-metal deployments.

Finish remainder of the crawl workflow (SIPs and submission)

  • Mint ARKs for each payload file, incrementally.
  • Store full WARC-to-ARK identifier mapping somewhere and update it over time.
  • Build current METS SIP for final package (avoiding hard-coded fields - see below)
  • Submit final SIP to DLS.
  • Verify SIP appears in DLS.

There are a number of hard-coded fields in the SIP generation process. The ones in creator.py refer only to the temporary BagIt that is used to send the SIP to DLS so that's probably acceptable. However, the others (in mets.py) become part of the METS in the archival package.

Note that we could get the ClamD version string by posting the string VERSION to its API.

Other fields should probably get picked up from the general configuration file, but note that this does not seem to be picked up correctly outside of the Celery run-time (TBC).

Shine DB Backup

The new containerised Shine DB needs an automated backup. As we already have a luigi workflow handling w3act, that should be used.

Move core and document harvester post-crawl management workflows to Luigi

The original version of this system used a chain of RabbitMQ queues, which did provide robust scalability but provided poor support for status monitoring (simply because you can't peek into queues easily). It also tended to lock up everything when there were problematic items at the head of a queue.

Hence, I am currently porting the system to use Luigi to control the workflow. This is a very robust and battle-tested framework developed and used by Spotify, which chains tasks together in a manner similar to Makefiles. It relies on simple files (either on local disk or HDFS) to record the status of the workflows, and builds checkpointing/recovery on this foundation. This simple approach suits us well, and reduces the complexity of the state-management part of the workflows. It also helps us make Heritrix's file conventions explicit.

The following tasks need to be done ahead of the internal test release:

  • Switch the monitoring dashboard to use a task output, hence avoiding repeatedly hitting the services to get the status.
  • Test MoveToHdfs works on the ACT-DDB test system.
  • Allow log lines with no WARC information to be accepted rather than throw errors.
  • Pick up any additional WARC files from the relevant folders in the assembler code.
  • Cache the crawl feeds via a separate task, to avoid hitting W3ACT too much.
  • Post-process WARC files and POST them to OutbackCDX/tinycdxserver (WHEN? Could be initiated by assembly, and be in complete chunks, or just be part of the WARC file processing chain?)
  • Move WREN WARC files to separate folders based on job/launch IDs
  • Add watched target example to W3ACT test DB.
  • Post-process WARC files and POST documents and metadata to W3ACT.
  • Set up automated run via Docker.
  • Review checkpoint settings, use e.g. 6 hours, only keep the last one.
  • Review seed prioritisation logic - still based on hop-path length? Apparently not - was using CostUriPrecedencePolicy with unit cost, not HopsUriPrecedencePolicy. See here
  • Add ignoreForTransclusionsRobotsPolicy? Not for now, commented out.
  • Check externalGeoLookup-include is not included.
  • Check scope logging is off.
  • Add font format file extension to the list of acceptable URI file extensions.
  • Add our ID to the end of the user agent for PhantomJS calls (webrender needs redeployment)
  • Update and validate ACT-DDB deployment.
  • Fix document spotting logic - should not be simple prefix, but work as well as the original https://www.gov.uk/government/publications?departments%5B%5D=cabinet-office to http://(uk,gov,www,)/government/ normalisation that H3 did.
  • Make the Wayback check more thorough: it only checks the index, not whether the content is actually available, so we get failures when move-to-hdfs lags (another reason to make sure that's done first!)
  • Check the document scanning process is waiting for jobs to finish before looking for the documents.
  • Add a kill-toethreads task to h3cc if possible.
  • Make document metadata extractor use the W3ACT task to download and cache the feed/export.

Scan for new Targets and enqueue the URLs

The current crawl system looks for recent additions to the Targets for a given crawl and pushes them into that crawl. This should be re-factored to run under Luigi. This could inject the URLs via the Kafka feed rather than use the action directory.

Make SIP data fields easier to update

There are a number of hard-coded fields in the SIP generation process. The ones in creator.py refer only to the temporary BagIt that is used to send the SIP to DLS so that's probably acceptable. However, the others (in mets.py) become part of the METS in the archival package.

Note that we could get the ClamD version string by posting the string VERSION to its API.

Other fields should probably get picked up from the general configuration file, but note that this does not seem to be picked up correctly outside of the Celery run-time (TBC).

HDFS Filesystem Dump in CSV Output

Currently the filesystem dump is in JSONL. We need CSV for some uses; so either a rejig of the original, or an alternative output.

Processes outside the target analysis procedure that use the current format will have to be updated if the format changes, e.g. the Turing load.

Support setting of sheet associations based on W3ACT feed data

Leading on from ukwa/w3act#503

The basic processes are in https://github.com/ukwa/python-shepherd/blob/master/python-w3act/w3act/job.py but I'd rather move to using the h3cc script approach currently under development here: https://github.com/ukwa/python-shepherd/blob/master/agents/h3cc.py

We need to

  • Finish switch to hapy rather than our own heritrix3 python code
  • Add a command to h3cc that takes a sheet name and a list of hosts/SURTs and ensures the sheet associations are set up (i.e. moving the job.py logic into a templated Heritrix3 script)
  • Add a new agent (or extend the launcher) in order to get all the sheet associations and then execute the relevant series of h3cc commands that sets all the sheet up.
    • if it's a new script, add it to the cron tab so it gets run at least daily.
  • Robots.txt hardcoded - should allow W3ACT override to work.

Expose screenshots in the dashboard.

The screenshots are now getting generated and indexed. Dashboard would ideally show the last X screenshots for the current time. Could be done as a statistic extracted during CDX indexing, or based on crawl feed + lookup (+ warning if no screenshot can be found?).

  • Need to update webrender deployment.

Clear persist-log databases occasionally

After running for 6 months, the persist-log databases for daily and weekly crawls are becoming unmanageable (>1TB). We should either:

  • Do de-duplication differently, e.g. from CDX service.
  • Or occasionally clear-out the old stuff.

Add in more link farms

We need to add these link farms to the blockAll sheet and the excludes.txt.

http://(uk,co,cdssl,
http://(uk,co,yeomanryhouse,www,
http://(uk,org,grettonvillage,www,
http://(uk,co,car-specs-features,

Ensure IDs are kept for published data via OAI-PMH

We need to ensure our system of publishing metadata records using OAI-PMH formats properly records the IDs we use. If we lose track of these, we can't easily submit update or delete records that refer to those IDs. We should probably store IDs in a DB and also record all submissions to downstream systems.

For reference, here's a deletion template (from here):

<?xml version="1.0"  encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
        http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
	<ListRecords xmlns="">
		<record>
			<header status="deleted">
				<identifier>SOURCERECORDID</identifier>
				<datestamp>YYYY-MM-DD</datestamp>
			</header>
			<metadata/>
		</record>
<!-- Copy everything below this line to delete additional records
		<record>
			<header status="deleted">
				<identifier>SOURCERECORID_2</identifier>
				<datestamp>YYYY-MM-DD</datestamp>
			</header>
			<metadata/>
		</record>
	Copy everything above this line to delete additional records -->
	</ListRecords>
</OAI-PMH>

Resolve some minor outstanding issues

  • Make all requires() repeatable by performing the list operation in a parent task. The issue will resolve itself as we switch to a simpler workflow.
  • Extract document metadata from archived site rather than the live one.
  • Only perform extraction if landing page and document url match.
  • Allow move_to_hdfs to overwrite the temp file.
  • Make rendered versions available via the dashboard UI.
  • Fix imagemap rendering so it works correctly.
  • Ensure the system is requiring the landing-page and document URLs to be on the watched target? NOTE: it is currently requiring only the document URL to be watched. This is probably better than discarding documents just because we found them via an unexpected route.
  • Fix this edge case:
2016-11-04T16:40:52.815Z SEVERE Web rendering http://webrender.bl.uk:8010/render failed with unexpected exception: org.json.JSONException: JSONObject["href"] not a string. (in thread 'ToeThread #39: http://www.thisismoney.co.uk/'; in processor 'wrenderHttp')

which arises because the href can be a dict!

            "href": {
              "animVal": "https://img4.sentifi.com/enginev1.11041653/images/widget_livemonitor/chart_bg.png", 
              "baseVal": "https://img4.sentifi.com/enginev1.11041653/images/widget_livemonitor/chart_bg.png"
            }, 
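A small sketch of how the renderer output could be normalised to cope with this, treating href as either a plain string or an SVG-style animated value object:

def href_to_str(href):
    # SVG elements report 'href' as an object with 'animVal'/'baseVal' members;
    # plain links report a simple string.
    if isinstance(href, dict):
        return href.get('baseVal') or href.get('animVal') or ''
    return str(href)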

Extend dashboard to cover more services

The dashboard is helpful, but missing...

  • FC queues: sips, sip-submitted, sip-error, qa
  • Split BETA and UKWA, add Interject
  • webrender x2
  • WebHDFS x2
  • pdf2htmlex
  • HAR daemon, webrender queues
  • Celery/movetohdfs/supervisor
  • ELK/Monitrix?
  • DLS services?
  • clamd x2 (see pyClamd)
  • Disk-space % full on Crawler displays.
  • DC phantomjs-domain queue.
  • Also add w3actqueue?
  • Solrs? crawl_state_solr if any?

Cope with viral payloads

When viruses are detected, the process gets broken for a couple of reasons. Firstly, the formatting of the line is odd and confuses the crawl.job.output:parse_crawl_logs function:

Got assemble_job_output for: daily/20160809153831
2016-08-10T02:55:32.381Z   200      60671 http://www.bbc.co.uk/webwise/guides/java-and-javascript LL http://www.bbc.co.uk/radio/player/bbc_radio_five_live text/html #196 20160810025532090+177 sha1:MALYVT3J2MQWW3K3AEYE452E62WGP774 - ip:212.58.246.95,duplicate:digest,ip:212.58.244.66,1: stream: Html.Exploit.CVE_2016_3326-3 FOUND {"contentSize":61206}

 stream:
 No JSON object could be decoded

But also note that the viral WARC filename is missing, which will break things downstream. This case probably also breaks how the system currently builds up the full WARC path.
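A rough sketch of a more tolerant parse, assuming the standard Heritrix crawl.log field layout; the field names and the recovery heuristic are illustrative, not the current parse_crawl_logs code:

import json

CRAWL_LOG_FIELDS = ['timestamp', 'status', 'size', 'url', 'hop_path', 'via', 'mime',
                    'thread', 'fetch_ts', 'digest', 'source', 'annotations']

def parse_crawl_log_line(line):
    # Split into at most 12 fields so any viral-scan text ('stream: ... FOUND')
    # and the trailing JSON stay together in the final 'annotations' field.
    entry = dict(zip(CRAWL_LOG_FIELDS, line.strip().split(None, 11)))
    extra = entry.get('annotations', '')
    try:
        # Only decode from the first '{' onwards, ignoring anything injected before it.
        entry['extra_info'] = json.loads(extra[extra.index('{'):])
    except ValueError:
        entry['extra_info'] = {}
    return entry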

Ideas for next release

  • Use the ssh hooks to check on or act on remote servers. Consider centrally orchestrating H3 jobs, i.e. mapping daily to crawler06.bl.uk etc.
  • Consider restarting from the last checkpoint rather than a clean restart?

Review seed/host rendering loop

The H3 configuration that posts seeds/hosts to a queue for rendering should not bother if it's anything other than a 200. I need to review the implementation and avoid it posting pointless requests to the web renderer.

Shift to Hadoop-based workflow and complete

  • Re-factor so that outputs get shovelled up to HDFS first, and everything else goes from there. See e.g. terasort.py
  • Add Hadoop-based CDX-indexing job rather than running locally.
  • Add Hadoop-based document-extraction job, rather than running locally.
  • Refactor and clean up the code. The crawl and tasks folders should probably be unified into purpose-based folders, e.g. h3/tasks.py, h3/hapy/, hdfs/tasks.py, etc. There's also a lot of copies of older iterations that should be merged down once the flow is working from end to end.
  • Moving WREN files into the job launch folder.
  • FTP of Nominet data to HDFS
  • Also copy all assembled/package files up to HDFS.
  • Delete files once uploaded to HDFS (MORE CHECKS PLEASE! Once indexed?)
  • Generate summary statistics from crawl logs
  • Closing open WARCs and removing .lck files for old/dead jobs.
  • The task that gets log files and pushes them up to HDFS (hdfssync).
  • Extract basic WARC metadata and generate graphs etc.
  • Scan Crawler03 and make sure we have the old SIPs.
  • Make move-to-hdfs work for DC
  • Generate incremental packages for all crawls.
  • Update GeoIP databases?

Other ideas

  • More sanity-checks in validate job, e.g. check logs are not empty, check there are no additional WARC files, i.e. not mentioned in logs, etc.
  • Store ARKs and hashes in launch folder and in the ZIP. See CrawlJobOutput.
  • Create validate sip to inspect the store for content and verify it.
  • Add a test W3ACT to the docker system, populated appropriately, and set up to crawl a Dockerized test site (acid-crawl idea).

Move to use Strategic Ingest

  • Add separate process to generate new-style SIPs for incremental packages.
  • Change process to create packages that conform to the strategic ingest profile.
  • Test against IRC.
  • Check various license terms are allowed.

Add tools for crawl stream ops

We have

  • submit: enqueue a URI for crawling
  • crawlstreams: inspect or summarise the crawl queues

Additional features or tools we would find useful...

  • cdx: emit the crawl stream as CDX lines
  • uris: just emit the URIs, used to perform sort | uniq etc
  • graph: emit graph form, e.g. fine details or summarised on hop depth.
  • hops: summarise activity by hops from seeds.

Add additional third-party whitelists

We need an additional source of whitelisted URLs/SURTs for the Open Access service, in the form of a simple text file, alongside the WCT one.

We also need to investigate whether we can use W3ACT fields to store additional required URLs. e.g. the currently un-used URL 'White list' could be used to provide added URLs for whitelisting (and indeed for crawling!)

Fix issue with temp files showing up in warcs-by-day lists

We're hitting indexing failures because the warcs-by-day list for a particular day picked up WARCs with a .warc.gz.temp suffix. They have since been renamed, but as the system only keys on the number of WARC files found, the list file is not being updated. This leads to repeated failures.

The warcs-by-day generator should ignore files that do not end in .arc.gz or .warc.gz.
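A one-line sketch of the proposed filter (the function name is illustrative):

def is_finished_warc(filename):
    # Ignore .warc.gz.temp and any other in-progress or unrelated files.
    return filename.endswith('.warc.gz') or filename.endswith('.arc.gz')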

Add task to pull URLs and metadata in from GDELT

The GDELT Project (by Kalev Leetaru) offers a rich data feed under very open terms.

There is a feed file, updated every 15 minutes, that links to a TSV in a ZIP file which can be parsed and interpreted to generate a list of URLs with associated metadata (including the lat/lon of each event).

A task could grab this file, parse and push it into an appropriate crawl stream, if it appears to be in scope. If done with care, this could include some of the additional metadata and pass it along to the indexer.
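A rough sketch of such a task, assuming the GDELT 2.0 'lastupdate' pointer and the events-export column layout (SOURCEURL last, ActionGeo lat/long near the end); the column positions should be verified before relying on them:

import csv
import io
import zipfile
import requests

LASTUPDATE = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

def latest_gdelt_urls():
    # The first line of lastupdate.txt is "<size> <md5> <zip-url>" for the events export.
    first_line = requests.get(LASTUPDATE, timeout=30).text.splitlines()[0]
    zip_url = first_line.split()[-1]
    payload = requests.get(zip_url, timeout=120).content
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        with zf.open(zf.namelist()[0]) as tsv:
            for row in csv.reader(io.TextIOWrapper(tsv, 'utf-8'), delimiter='\t'):
                # SOURCEURL, ActionGeo_Lat, ActionGeo_Long (assumed column positions)
                yield row[-1], row[-5], row[-4]

Each yielded URL could then be scope-checked and, if in scope, pushed into the appropriate crawl stream with the lat/lon carried along as metadata.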

Generate a holdings Bloom-filter dataset

To enable optimisation of things like Memento aggregator lookups, it would be good to build a Bloom filter that covers all the URLs we hold and publish it in some serialised, re-usable form, so others can reliably estimate our holdings without having to query us directly.

We would need:

  • Tooling to generate such a filter from CDX files and from OutbackCDX (assuming we can scan the index effectively).
  • A task to generate the filter and publish it as a dataset on a regular basis.
  • Tooling capable of using the filter so the results can be integrated into third-party services.
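A minimal sketch of the CDX-side tooling, using a hand-rolled filter so there is no dependency on a particular Bloom-filter library; the sizing parameters are illustrative and would need to be derived from the expected URL count and target false-positive rate:

import hashlib
import sys

class BloomFilter:
    def __init__(self, size_bits=1 << 30, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive the hash positions from slices of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode('utf-8')).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], 'big') % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

if __name__ == '__main__':
    # Feed CDX lines on stdin; the (SURT) URL key is assumed to be the first field.
    bf = BloomFilter()
    for line in sys.stdin:
        bf.add(line.split(' ', 1)[0])
    sys.stdout.buffer.write(bytes(bf.bits))  # serialise the raw bit array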

Aggregatable Crawl Log Analysis

We've done some work on log analysis (see log_analysis.py) but it suffers in a couple of areas. It's difficult to merge the results from multiple sets of log files, and the document harvesting logic is mixed in too, which was thought to be necessary (to avoid re-parsing the same log files multiple times) but now seems like a modest yield in terms of optimisation for a significant jump in complexity.

Now we've worked with Pandas a bit, I'm proposing we modify how we analyse crawl logs, to make it easier to manipulate them with Pandas. Instead of outputting a complex stats object, we output tabular data of this rough form:

  host       date        urls  bytes  2xx  3xx  duration  viruses  revisits  html  geoGB
  bl.uk      2018-02-01  1     50     1    0    5         0        1         1     0
  bbc.co.uk  2018-02-01  6     1020   4    2    508       0        1         1     0

i.e. the analysis here aggregates at the host + day level, meaning every host we visit gets a separate line in the output for each day. We visit a large number of hosts, but it should still be possible to load large sets of these intermediate files into Pandas to generate analyses that start to get more useful. Even if the same 'key' appears in multiple files, we can handle this easily enough in Pandas.

This should help us provide results for things like, for any given timeframe:

  • How many hosts did we visit?
  • How much data did we download (!= bytes retained or stored compressed)
  • How many viruses did we detect?
  • How many URLs did we download?
  • What was the distribution of the number of URLs downloaded across hosts?
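A rough sketch of the aggregation step, assuming the per-crawl summaries are written out as TSV files with the columns shown in the table above (the paths and file layout are illustrative):

import glob
import pandas as pd

# Load all the per-crawl host+day summary files.
frames = [pd.read_csv(path, sep='\t') for path in glob.glob('crawl-log-summaries/*.tsv')]
df = pd.concat(frames, ignore_index=True)

# The same (host, date) key may appear in several files; summing merges them cleanly.
summary = df.groupby(['host', 'date'], as_index=False).sum(numeric_only=True)

print("Hosts visited:", summary['host'].nunique())
print("URLs downloaded:", summary['urls'].sum())
print("Bytes downloaded:", summary['bytes'].sum())
print("Viruses detected:", summary['viruses'].sum())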

Restart pywb after updating configuration files

If we update the OA allow list or any other pywb configuration file, it isn't picked up unless we restart (at least until ukwa/ukwa-pywb#36 anyway).

Alternatively, we could use the Docker Python SDK and add code to restart the service, e.g. connect to Docker on the access service, use the DockerClient services.get() call with access_pywb as the service ID, then call force_update to restart it. If we run two instances instead of one (scale=2) this should work without a gap in service.

(I guess the script could temporarily set scale=2 during the update process, but maybe that's overkill. Running two all the time is no big deal.)
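A minimal sketch of that approach, using the Docker SDK for Python; the connection URL is an assumption, while the access_pywb service name and force_update call come from the suggestion above:

import docker

# Connect to the Docker engine on the access host (connection method is an assumption).
client = docker.DockerClient(base_url='ssh://access-host')
service = client.services.get('access_pywb')
service.force_update()  # with scale=2, tasks restart in turn without a gap in service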

Targets can only belong to one Collection.

Due to the way the Targets schema is defined, each Target can only have one parent collection. This means that if we only have one Solr document per Target, each Target will only show up in one collection. I also note that by using the Target ID as the id of the Solr document, Targets could collide with Collections in that Solr index.

We can work around this by making the id for Targets include the Collection ID as well.
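A small sketch of that workaround, emitting one Solr document per (Target, Collection) pair with a namespaced id; the field names are illustrative:

def target_docs(target, collection_ids):
    for cid in collection_ids:
        yield {
            'id': 'target-%s-collection-%s' % (target['id'], cid),  # avoids id collisions
            'type': 'target',
            'collection_id': cid,
            'title': target.get('title', ''),
        }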

HDFS Filesystem Dump - Parse Path into Components

NB: #30 HDFS Filesystem Dump in CSV Output is a prerequisite - do that first.

Currently the filesystem dump outputs the path as is.

Example:
/0_original/dc/crawler04/heritrix/output/slashpage/warcs/slashpage-20150731/BL-20150808162719568-00037-2112-crawler04-8443.warc.gz

Deriving information from the path during analysis is not practical. See HDFS Analysis #29 for motivation.

We need to look at a sample of the current output and try and identify main path components and use them to form new columns, which we can then analyse on.

Parsing the path during the initial dump process seems a good place for this. However as that process provides output for other processes, it will be better to have it as a secondary task.

i.e.

  1. dump filesystem to csv > used by process a
  2. parse path to components, add columns to csv > used by process b

2 above is the functionality to be provided in this issue. 1 is provided in #30

Example Columns:
crawl type/application type (dc/none/etc)
server (crawler04/etc)
application (heritrix/etc)
function (logging/output/etc.)
filetype (warc/gz/etc)
timestamp

So example CSV dump using the path above:

current (pipe delimited only because I've copied it straight from Excel):

3 | 2016-05-18T11:18:00 | hdfs | /0_original/dc/crawler04/heritrix/output/slashpage/warcs/slashpage-20150731/BL-20150808162719568-00037-2112-crawler04-8443.warc.gz | 1.01E+09 | supergroup | -rw-r--r--

becomes something like:

3 | 2016-05-18T11:18:00 | hdfs | /0_original/dc/crawler04/heritrix/output/slashpage/warcs/slashpage-20150731/BL-20150808162719568-00037-2112-crawler04-8443.warc.gz | 1.01E+09 | supergroup | -rw-r--r-- | dc | crawler04 | log | 2015-08-04-010557 | warc

The important thing is to analyse a sample of the current output paths and work out what the most likely components of interest are - they may change in position in different paths; that would need to be catered for. These components would indicate what the new columns should be called and what they would contain. Any changes to these columns would need to be reflected in both the dump and analysis code.
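A rough sketch of what the secondary parsing task could look like; the component heuristics are illustrative guesses based on the example path above, not a definitive parser:

import re

def parse_path(path):
    parts = path.strip('/').split('/')
    cols = {
        'crawl_type': parts[1] if len(parts) > 1 else '',                   # e.g. 'dc'
        'server': next((p for p in parts if p.startswith('crawler')), ''),  # e.g. 'crawler04'
        'application': 'heritrix' if 'heritrix' in parts else '',
        'function': 'output' if 'output' in parts else ('logs' if 'logs' in parts else ''),
        'filetype': 'warc.gz' if path.endswith('.warc.gz') else path.rsplit('.', 1)[-1],
    }
    m = re.search(r'-(\d{17})-', path)  # e.g. the timestamp in BL-20150808162719568-...
    cols['timestamp'] = m.group(1) if m else ''
    return cols

The resulting values would then be appended as additional columns on each CSV row.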

HDFS Stats - more intelligent handling of path parsing required

#31 HDFS Path Parsing attempts to break down the path of files into components that can be reported on, e.g. "images", "heritrix", "server03" etc.

While it works well as a first effort, there are at least two shortfalls that will potentially cause missing, out of date or misleading statistics.

Both of these issues are down to the paths having no formal structure, but an ad-hoc manual naming that - at best - fits into structures that have evolved over time.

Issue 1: component positioning within the path.

Examples:

  1. /0_original/dc/server04/heritrix/output/logs/intranet-20150601101519/20150601121316/alerts.log.gz

  2. /heritrix/output/wayback/cdx-index/20160609200001/_logs/history/file...

In the first example, "heritrix" represents application-level information; in the second, server- or user-level information. As it currently stands, the logic makes no attempt to work out where in the path the value has been found, so in the second example "heritrix" will be recorded as both the server and the application component if it comes prior to (say) "wayback" in the application check list.

We work around this by placing items like "heritrix" that appear in earlier checklists at the end of subsequent checklists. So in the application check, "wayback" is checked first and is recorded as the application. A better solution would be to take into account whereabouts in the path the value appears, but without a formal path structure this can only be an intelligent guess.

Issue 2: static definitions of components of interest.

By manually examining a large sample file we've derived and hardcoded the sort of information we want to report on, e.g. server: server01, server02, yoda, etc.

Naturally some of this information will go out of date or become less relevant over time. Unless there is a regularly scheduled manual review, or an automatic process to work out what those values should be (and then a corresponding change to analysis code) the statistics will start to become unrepresentative. So some sort of process to handle that needs to be put in place.

Both of these issues are listed here as low-priority enhancements.

HDFS analysis and reporting

We need to perform some regular checks and reporting on the contents of HDFS. The overall idea is that we set up a daily task that runs in the night and makes a full list of all the contents of HDFS. This fine-grained data can then be analysed to generate metrics and reports. Useful metrics should be sent to Prometheus via its push gateway. More detailed summaries can either be sent to Elasticsearch or just emitted as data files (CSV, JSON, JSONL, etc.) which can be visualised using Apache Zeppelin or Jupyter notebooks.

The initial part, listing the files on HDFS, is here, and that file includes some early experiments in generating derivatives/summaries from there.

The file listings are copied onto HDFS, placed in dated folders and filenames, like this:

/9_processing/access-task-state/2017-11/hdfs/2017-11-22-all-files-list.jsonl.gz

It's compressed line-separated JSON a.k.a. JSONLines, and should be easy to parse either in a map-reduce job or directly.

The kinds of things we want to know are:

  • Are there any WARCs or ARCs in unexpected places? (they should be under /ia or /data or /heritrix)
  • What are the file and byte totals for each top-level directory, broken down by file extension?

Possible metrics to keep an eye on:

  • Total number of files and bytes on HDFS.
  • Total number of files and bytes of WARCs and logs, tagged by crawl stream (daily, dc0, etc.)
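A rough sketch of the kind of daily analysis intended, assuming each JSONL record carries at least a filename/path and a length field (the exact field names should be checked against the real listing output):

import gzip
import json
import os
from collections import defaultdict

EXPECTED_WARC_ROOTS = ('/ia', '/data', '/heritrix')

totals = defaultdict(lambda: [0, 0])   # (top-level dir, extension) -> [files, bytes]
unexpected_warcs = []

with gzip.open('2017-11-22-all-files-list.jsonl.gz', 'rt') as f:
    for line in f:
        rec = json.loads(line)
        path = rec.get('filename') or rec.get('path', '')   # assumed field names
        size = int(rec.get('length', 0))
        top = '/' + path.lstrip('/').split('/', 1)[0]
        ext = os.path.splitext(path)[1]
        totals[(top, ext)][0] += 1
        totals[(top, ext)][1] += size
        if path.endswith(('.warc.gz', '.arc.gz')) and not path.startswith(EXPECTED_WARC_ROOTS):
            unexpected_warcs.append(path)

for (top, ext), (files, size) in sorted(totals.items()):
    print(top, ext, files, size)
print("WARCs/ARCs outside expected locations:", len(unexpected_warcs))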

Use W3ACT DB table dumps instead of the API

In some cases, e.g. when extracting all Targets, using the W3ACT API is extremely slow. Instead, we could regularly dump tables as CSV and use those to generate the various outputs and datasets we want.

Improve reporting of crawl delays

In checkforuris, I suggest we also report crawl delays, i.e. when it scans for a URL and finds nothing has been crawled, it should log the delay if it exceeds some configurable acceptable tolerance (e.g. 2 hours), along the lines of:

Launched at [DATE] not crawled at [DATE] delay > [CURRENT GAP]

If the system logs 'delayed' crawl behaviour in this way, we can follow that more closely in Kibana and debug the crawls.
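A minimal sketch of the proposed check, with hypothetical names; the warning text follows the format above:

import datetime
import logging

ACCEPTABLE_DELAY = datetime.timedelta(hours=2)   # configurable tolerance

def check_crawl_delay(launch_time, crawled=False, now=None):
    now = now or datetime.datetime.utcnow()
    gap = now - launch_time
    if not crawled and gap > ACCEPTABLE_DELAY:
        logging.warning("Launched at %s not crawled at %s delay > %s",
                        launch_time.isoformat(), now.isoformat(), gap)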
