

ukwa-services

Deployment configuration for almost all UKWA services.


Introduction

These Docker Stack configurations and related scripts are used to launch and manage our main services. No internal or sensitive data is kept here -- that is stored in the internal ukwa-services-env repository as environment variable scripts required for deployment, or as part of the CI/CD system.

Note that some services are not deployed via containers, e.g. the Hadoop clusters, and the Solr and OutbackCDX indexes. This includes a dedicated API server that acts as an intermediary for calls to various internal systems, allowing the implementation details of the current deployment to be kept separate from their external identity.

For example, our OutbackCDX service is accessed internally as cdx.api.wa.bl.uk. Over recent years, this service has been migrated to new hardware on a number of occasions, but using the cdx.api.wa.bl.uk proxy alias has allowed us to minimise downtime when migrating or upgrading the service.

These other services are documented elsewhere, but where the containerised stacks interact with them, that interaction will be made clear.

Service Stacks

Service stacks are grouped by broad service area, e.g. access contains the stacks that provide the access services, and the access README provides detailed documentation on how the access services are deployed. The service areas are:

  • ingest covers all services relating to the curation and ingest of web archives
  • access covers all services relating to how we make the web archives accessible to our audiences
  • manage covers all internal services relating to the management of the web archive, including automation and workflows that orchestrate activities from ingest to storage and then to access

For a high-level overview of how these service stacks interact, see the section on technical architecture below.

Within each sub-folder, e.g. access/website, we have a docker-compose.yml file which should be used for all deployment contexts (e.g. dev, beta and prod). Any necessary variations should be defined via environment variables.

These variables, and any other context-specific configuration, should be held in subdirectories. For example, if access/website/docker-compose.yml is the main stack definition file, any additional services needed only on dev might be declared in access/website/dev/docker-compose.yml and would be deployed separately.

The process for updating and deploying components is described in the deployment section below.

High-Level Technical Architecture

This is a high-level introduction to the technical components that make up our web archiving services. The primary goal is to provide an overview of the whole system, with a particular focus on knowing where to look if something goes wrong.

Some wider contextual information can be found at:

Note that the images on this page can be found in this Google Slides presentation.

Overview

High-level technical overview of the UKWA systems

The life-cycle of our web archives can be broken down into five main stages, along with management and monitoring processes covering the whole life-cycle, and the underlying infrastructure that supports it all. Each stage is defined by its interfaces, with the data standards and protocols that define what goes into and out of that stage (see below for more details). This allows each stage to evolve independently, as long as its 'contract' with the other stages is maintained.

There are multiple ingest streams, integrating different capture processes into a single overall workflow, starting with the curation tools that we use to drive the web crawlers. Those harvesting processes pull resources off the web and store them in archival form, to be transferred onto HDFS. From there, we can ingest the content into other long-term stores, and it can then be used to provide access to individual resources both internally and externally, for all the Legal Deposit libraries. As the system complexities and service levels vary significantly across the different access channels, we identify them as distinct services, while only having one (unified) harvesting service.

In order to be able to find items of interest among the billions of resources we hold, we run a range of data-mining processes on our collections that generate appropriate metadata, which is then combined with manually-generated annotations (supplied by our curators) and used to build our catalogue records and indexes. These records drive the discovery process, allowing users to find content which can then be displayed over the open web or via the reading room access service (as appropriate).

Areas

Manage

The critical management component is Apache Airflow, which orchestrates almost all web archive activity. For staff, it is accessible at http://airflow.api.wa.bl.uk. Each workflow (or DAG in Airflow terminology) is accessible via the management interface, and the description supplied with each one provides documentation on what the task does. Where possible, each individual task in a workflow involves running a single command-line application wrapped in a versioned Docker container. Developing our tools as individual command-line applications is intended to make them easier to develop, test and maintain. The Airflow deployment and workflows are defined in the ./manage folder, in ./manage/airflow.
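
As an illustration of this pattern (not the actual production DAG), a workflow step that wraps a versioned, containerised command-line tool might look something like the following sketch; the image tag, command and schedule are placeholders:

# Sketch only: an Airflow DAG step wrapping a versioned, containerised CLI tool.
# The image tag, command and schedule are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_indexing_workflow",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    index_warcs = DockerOperator(
        task_id="index_warcs",
        image="ukwa/ukwa-manage:1.2.3",        # always a tagged version, never 'latest'
        command="windex cdx-index --dry-run",  # hypothetical CLI invocation
    )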

Another important component is TrackDB, which contains a list of all the files on our storage systems, and is used by Airflow tasks to keep track of what's been indexed, etc.

See manage for more details.

Ingest

Covers curation services and crawl services, everything leading to WARCs and logs to store, and metadata for access.

See ingest for more details.

Storage

Storage systems are not deployed as containers, so there are no details here. We currently have multiple Hadoop clusters, and many of the tasks and components here rely on interacting with those clusters through their normal APIs.

Process

There are various Airflow tasks that process the data from W3ACT or from the Hadoop storage. We use the Python MrJob library to run tasks, which are defined in the ukwa/ukwa-manage repository. That is quite a complex system, as it supports Hadoop 0.20.x and Hadoop 3.x, and supports tasks written in Java and Python. See ukwa/ukwa-manage for more information.

Access

Our two main access services are:

  • The UK Web Archive open access service, online at https://www.webarchive.org.uk/
  • The Legal Deposit Access Service, only available in Legal Deposit Library reading rooms.

See access for more details.

Monitoring

Monitoring runs independently of all other systems, on separate dedicated hardware. It is based on Prometheus, with alerts defined for the major critical processes. See https://github.com/ukwa/ukwa-monitor for details.

Interfaces

There are data standards/protocols that isolate parts of the system so they can evolve independently (see How do you cut a monolith in half? for more on this idea).

  • Curate > Crawl: crawl feeds (seeds, frequencies, etc.) and the NEVER-CRAWL list. Generated from W3ACT; see the w3act_export workflow.
  • Crawl > Storage: WARC/WACZ files and logs. These are stored locally then moved to HDFS using Cron jobs (FC) and Airflow (DC, see copy_to_hdfs_crawler08).
  • Storage > Process: WARC/WACZ files and logs, plus metadata from W3ACT exports. This covers indexing tasks like CDX generation, full-text indexing, etc.
  • Process > Access: WARCs/WACZ on HDFS via an HTTP API + TrackDB, the OutbackCDX API, the Solr full-text and Collections APIs, and data exported by w3act_export (allows.aclj, blocks.aclj). As the collection is large, access is powered by APIs rather than file-level standards (an example CDX lookup is sketched below).
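
As a rough illustration of the 'Process > Access' contract, a capture lookup against an OutbackCDX collection can be done over plain HTTP; in this sketch the collection name ("data-heritrix") and the limit parameter are assumptions rather than confirmed configuration:

# Sketch of a capture lookup against OutbackCDX over plain HTTP.
# The collection name ("data-heritrix") and the 'limit' parameter are assumptions.
import requests

CDX_ENDPOINT = "http://cdx.api.wa.bl.uk/data-heritrix"

def lookup_captures(url, limit=10):
    """Return the raw CDX lines recorded for captures of the given URL."""
    response = requests.get(CDX_ENDPOINT, params={"url": url, "limit": limit})
    response.raise_for_status()
    return [line.split(" ") for line in response.text.splitlines() if line]

if __name__ == "__main__":
    for fields in lookup_captures("https://www.bl.uk/"):
        print(fields)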

Infrastructure

Access & Updates

A central server known as wash is used to log into all systems, and runs updates, logging etc. at the system level via Cron jobs.

A pair of servers use IP-failover to host the *.api.wa.bl.uk domains, running NGINX to proxy internal services to the appropriate back-end system.

Container Platforms

At the time of writing, we use Docker Swarm for production container deployment, and have a set of servers hosting PROD, BETA and DEV swarms.

Networks

The systems configured or maintained by the web archiving technical team are located on the following networks.

  • WA Public Network (194.66.232.82 to .94): All public services and crawlers. Note that the crawlers require unrestricted access to the open web, and so outgoing connections on any port are allowed from this network without going through the web proxy. However, very few incoming connections are allowed, each corresponding to a curatorial or access service component. These restrictions are implemented by the corporate firewall.
  • WA Internal Network: Internal service component network. Service components are held here to keep them off the public network, but provide various back-end services for our public network and for systems held on other internal networks. This means the components that act as integration points with other service teams are held here.
  • WA Private Network: The private network's primary role is to isolate the Hadoop cluster and HDFS from the rest of the networks, providing dedicated network capacity for cluster processes without affecting the wider network.
  • DLS Access Network: The BSP, LDN, NLW and NLW Access VLANs. Although we are not responsible for these network areas, we also deploy service components onto specific machines within the DLS access infrastructure, as part of the Legal Deposit Access Service.

Software

Almost our entire stack is open source, and the most critical components are co-maintained with other IIPC members. Specifically, the Heritrix crawler and the PyWB playback components (along with the standards and practices that they depend upon, like WARC) are crucial to the work of all the IIPC members, and to maintaining access to this content over the long term.

Current upgrade work in progress:

  • Reading Room access currently depends on OpenWayback but should be replaced with a modernised PyWB service through the Legal Deposit Access Solution project.
  • Adoption of Browsertrix Cloud for one-off crawls, with the intent to move all Frequent Crawls into it eventually.
  • A new approach is needed to manage monitoring and replication of content across H020, H3 BSP and H3 NLS.
  • Full-scale fulltext indexing remains a challenge and new workflows are needed.
  • All servers and containers need forward migration, e.g. to the latest version of RedHat, updated dependent libraries, etc. As we have a fairly large estate, this is an ongoing task. Generally, this can be done without major downtime; e.g. using Hadoop means it's relatively straightforward to take a storage node out and upgrade its operating system without interrupting the service.

Deployment Process

First, individual components should be developed and tested on developers' own machines/VMs, using the Docker Compose files within each tool's repository, e.g. w3act.

These are intended to be self-contained, i.e. where possible they should not depend on external services, but instead use dummy ones populated with test data.

Once a new version of a component is close to completion, we will want to run it against internal APIs for integration testing and/or user acceptance testing, and that's what the files in this repository are for. A copy of this repository is available on the shared storage of the DEV Swarm, and that's where new builds of containers should be tested.

Once we're happy with the set of Swarm stacks, we can tag the whole configuration set for release through BETA and then to PROD.

Whoever is performing the roll-out will then review the tagged ukwa-services configuration:

  • check they understand what has been changed, which should be indicated in the relevant pull request(s) or commit(s)
  • review setup, especially the prod/beta/dev-specific configurations, and check they are up to date and sensible
  • check no sensitive data or secrets have crept into this repository (rather than ukwa-services-env)
  • check all containers specify a tagged version to deploy
  • check the right API endpoints are in use
  • run any tests supplied for the component
  • run the service-level regression testing suite, https://github.com/ukwa/docker-robot-framework, to check if the public-facing services are behaving as expected.

ukwa-services's People

Contributors

anjackson, gilhoggarth, ldbiz


ukwa-services's Issues

Switch Kafka UIs from Trifecta to provectus/kafka-ui

We currently use Trifecta to check on Kafka queues, but it is not that widely used. We can switch to akhq, which seems to be more widely used and well supported.

version: '3.7'
services:
  akhq:
    image: tchiotludo/akhq
    ports:
      # Expose the AKHQ web UI on port 58080 of the host
      - "58080:8080"
    environment:
      # AKHQ reads its configuration from this environment variable
      AKHQ_CONFIGURATION: |
        akhq:
          connections:
            # Connection to the FC crawler's Kafka broker
            fc:
              properties:
                bootstrap.servers: "192.168.45.34:9094"
          security:
            # Default to read-only access
            default-group: reader

Remove all hard-coded domain names

The stacks hardcode the www/beta/dev domains in a few places, whereas it should be possible to pick these up from Host headers/context.

Crawl Log Viewing

Current plan is to siphon crawl events into a large database, for recently crawled FC material. We will use Solr at first because we know how to run it at scale, reconsidering CockroachDB later if we need e.g. proper SQL or ACID transactions etc.

Start with a simple Solr indexed version of the standard crawl log, so we can:

  • Find/filter the crawl log (see crawl-log-viewer) but much quicker than using Kafka.
  • Find crawl launch outcomes.

See ukwa/crawl-db#1
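
As a sketch of that first step, crawl-log lines could be parsed and pushed into a Solr core via the standard JSON update endpoint; the Solr host, core name and field names here are assumptions, not an agreed schema:

# Sketch: parse Heritrix crawl.log lines and push them to a Solr core.
# The Solr host, core name and field names are assumptions, not an agreed schema.
import json

import requests

SOLR_UPDATE = "http://solr.example.host:8983/solr/crawl_log/update?commit=true"

def parse_crawl_log_line(line):
    # Heritrix crawl.log lines are whitespace-delimited; keep a few key fields.
    parts = line.split()
    return {
        "timestamp": parts[0],
        "status_code": parts[1],
        "content_length": parts[2],
        "url": parts[3],
        "via": parts[5],
        "content_type": parts[6],
    }

def index_crawl_log(lines):
    docs = [parse_crawl_log_line(l) for l in lines if l.strip()]
    response = requests.post(SOLR_UPDATE, data=json.dumps(docs),
                             headers={"Content-Type": "application/json"})
    response.raise_for_status()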

Need to find a way to tidy up the H3 log parsing code and related code that is spread around:

Update webrenderer to 2.3.2

Update the access-time webrenderer to 2.2.x, and set RUN_BEHAVIOURS=false, to speed up rendering of cards etc.

Deduplicate tests?

The repeated tests in browse.robot might lend themselves to being simplified/deduplicated using a custom keyword.

Update CDX Indexing to handle PyWB-style indexes, OPTIONS/HEAD/POST with parameters.

To resolve some complex playback issues (Twitter, HuffPo) we need to be able to play back POST requests.

This requires some coordination with Ilya as he's been changing how he does it.

Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).


Updating the Java stack is quite involved: ukwa/webarchive-discovery#244

Might be time to switch to Python for this MR Job. Use the PyWB indexer and POST the results to OutbackCDX.

Also need OutbackCDX 0.8.0 to handle the lookups properly.

Some other examples of similar code:

MrJob

Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where we can get to it:
(This breaks data locality, but streaming through large files is not performant because they get read into memory)
(A FileInputFormat that could reliably split block GZip files would be the only workable fix)
(But TBH this is pretty fast as it is)
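
Putting the two ideas above together, a minimal sketch of a Python MrJob that uses mapper_raw, runs a pywb-style CDXJ indexer over each WARC, and POSTs the result to OutbackCDX might look like this (the OutbackCDX endpoint is hypothetical, and it assumes the cdxj-indexer CLI and the requests library are available on the task nodes):

# Sketch only: MrJob + mapper_raw + pywb-style CDXJ indexing + POST to OutbackCDX.
# The OutbackCDX endpoint is hypothetical, and this assumes the cdxj-indexer CLI
# and the requests library are available on the Hadoop task nodes.
import subprocess

import requests
from mrjob.job import MRJob

OUTBACKCDX_COLLECTION = "http://cdx.example.host:8080/test-index"  # hypothetical

class CdxIndexJob(MRJob):

    def mapper_raw(self, warc_path, warc_uri):
        # mapper_raw hands us a local copy of the WARC file for this input split.
        cdxj = subprocess.run(
            ["cdxj-indexer", warc_path],
            check=True, capture_output=True, text=True,
        ).stdout
        # POST the generated index lines straight into the OutbackCDX collection.
        response = requests.post(OUTBACKCDX_COLLECTION, data=cdxj.encode("utf-8"))
        response.raise_for_status()
        yield warc_uri, len(cdxj.splitlines())

if __name__ == "__main__":
    CdxIndexJob.run()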

Play with a WARC processor with https://pypi.org/project/boilerpy3/ and e.g. spaCy

  • Verify with @ikreymer when the approach has been finalised, and what version of OutbackCDX it works with.
  • CDXJ Indexer indexes metadata records, which is what we want for video metadata etc. Are the application/warc-fields entries from Heritrix3 metadata records okay in the CDX?
  • Index recent material into a fresh Outback (>= 0.8.0) index and check playback.
  • Convert metadata URIs to embed URNs.
  • Drop 451/429?

See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing

Shift website deployment to PROD Swarm

The website is currently deployed on the access server. It should be running from the PROD Swarm.

  • Include PyWB 2.6.3 in deployment.
  • Deploy to BETA Swarm.
  • Run tests against beta.webarchive.org.uk to verify all is working as expected
  • Deploy to PROD Swarm.
  • Run tests against prod1.n45.wa.bl.uk to verify all is working as expected.
  • Set up API proxy website.api.wa.bl.uk:80, pointing to prod1:80
  • Run tests against website.n45.wa.bl.uk to verify all is working as expected.

The final deployment steps will need to be done together, to make sure everything is consistent. This should also be done early or late in the day to minimise disruption.

  • Update public-facing NGINX proxy configuration to match BETA, using website.api.wa.bl.uk:80 as the back-end.
  • Run tests against www.webarchive.org.uk to make sure all is well.

Ensure block list gets updated from W3ACT to the FC

The archivist role can add the problematic URL to W3ACT already, under a Black List field.

Then, we need to pick up white_list and black_list URLs from targets.csv and include them in the crawl feeds. They should be combined with the in-scope and NEVER-CRAWL lists (respectively).

After that, we need to check the crawler will pick up changes to the scope and block files, and add a w3act_export service to the FC stack that pulls and updates them. This does mean the block list might lag behind the launches a little, so we probably want to update them more often than daily.

(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow #-delimited lines for regexes?) For example:

https://www.bl.uk/?mobile=on
#twitter\.com/.*?lang=#

Hmm.
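
For what it's worth, a small sketch of how such a mixed plain-URL/regex block file could be interpreted (the format itself is only a proposal at this point):

# Sketch of a parser for the proposed format: plain lines are exact URL blocks,
# lines wrapped in '#' are treated as regular expressions. The format itself is
# only a proposal, not an agreed standard.
import re

def load_block_list(path):
    exact, patterns = set(), []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#") and line.endswith("#") and len(line) > 2:
                patterns.append(re.compile(line[1:-1]))
            else:
                exact.add(line)
    return exact, patterns

def is_blocked(url, exact, patterns):
    return url in exact or any(p.search(url) for p in patterns)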

Also, take ukwa/ukwa-heritrix#85 into account

Issues with cookies and sessions on W3ACT

This is to note any outcomes from ukwa/w3act#662

  • Carlos is hitting examples where the inner iframe of playback says login, and this ends up logging into Wayback within the frame. The login redirect should target the full page frame. Possibly a new frame, and then recommend a reload?
  • The /wayback/*/ to /wayback/archive/*/ redirect is not working; it ended up at http://prod1/act/wayback/archive//https://www.scottishpower.co.uk/
  • Determine if the JSESSIONID Set-Cookie headers are what's tripping up everything else - see e.g. this page.
  • Copy the W3ACT cookie into every response as a Set-Cookie header, when viewing Wayback.
  • Deal with the SameSite warning (see below, and ukwa/w3act#663).
Cookie “PLAY_SESSION” will be soon rejected because it has the “SameSite” attribute set to “None” or an invalid value, without the “secure” attribute. To know more about the “SameSite” attribute, read https://developer.mozilla.org/docs/Web/HTTP/Headers/Set-Cookie/SameSite

Make CDX index backfill workflow for DC2018 and DC2019

It seems the 2018 and 2019 domain crawls may not have been CDX indexed. We need to design a suitable Airflow DAG that will be able to perform these backfill tasks.

The idiomatic Airflow version would be a proper backfilling task, with a start date in e.g. 2010, using the last-modified date of the files on HDFS, where each run works through the WARCs available for its period in chunks: e.g. an @monthly task that lists all WARCs corresponding to the previous month, and then indexes them in chunks of e.g. 2000 WARCs.

This would mean changing the windex utility to (a) be able to filter on a date range instead of X years back, and (b) be able to loop over all matching WARCs rather than just running one batch.
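
A sketch of what the back-filling DAG could look like, with catchup enabled so Airflow creates one run per month since the start date; the start date and the windex flags are placeholders, since the date-filtering options described above don't exist yet:

# Sketch of a back-filling CDX indexing DAG: catchup=True makes Airflow create
# one run per month since start_date. The start date is a placeholder, and the
# windex date-filtering flags shown here do not exist yet (see above).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cdx_backfill",
    schedule_interval="@monthly",
    start_date=datetime(2018, 1, 1),
    catchup=True,
    max_active_runs=1,
) as dag:
    index_month = BashOperator(
        task_id="index_warcs_for_month",
        bash_command=(
            "windex cdx-index "
            "--after {{ data_interval_start.strftime('%Y-%m-%d') }} "
            "--before {{ data_interval_end.strftime('%Y-%m-%d') }} "
            "--batch-size 2000"
        ),
    )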

Improve crawl configuration and launch management

Building on #83, improve crawl management:

  • Make crawl launches 'back-fillable' so we can re-run launches if they don't happen:
    • Needs date-stamped crawl feed files.
    • Needs a separate task that is dependent on the data export, or the current w3act_export needs to be made back-fillable.
  • Blocks, seeds and scope files in use by the crawlers need to be updated:
    • Blocks and scope managed via Watched Files, less clear if/how seeds should be blanket-updated.
    • Not clear how best to do that. Probably push rather than pull, as this means Airflow is always in charge of things. But then, a shared volume updated directly by Airflow? Or files made available and a remote task or service prompted to pull them down?
  • Launch metrics need to be posted to Prometheus.

Evaluate integration of SolrWayback

Rather than continuing to roll our own search and visualization tool, we should consider adopting SolrWayback and collaborating with the NAS team on it. We could start by making it available as an internal tool, within the W3ACT stack. For this, we need to:

  • Dockerise it in a way that allows it to work with our Solr indexes.
  • Cope with content_type being either single or multiValued.
  • Ensure the required configuration options can be overridden from environment variables.
  • Tomcat should log the response from the server when calls to Solr fail: add alternative logback config for Docker so logs go to the console not a file.
  • Implement HTTP and/or WebHDFS WARC record retrieval back-ends so it can work properly with our WARC store.
  • SolrWayback relies on url_norm for many queries, but our older indexes do not contain it. Can we work around this somehow?
  • To facilitate automated CI testing, switch to a test WARC as well as test Solr documents, i.e. use consistent records so we can search and replay from the test system.
  • Allow WAR name to be overridden on launch, so I can use act#solrwayback and hence get it to deploy in the right place.
  • Allow alternative playback engine to be overridden. Added ALT_PLAYBACK_PREFIX env var.
  • Also override the path regex, as our Solr index holds the file name only, not the full path.
  • ukwa/ukwa-warc-server#12
  • Adjust SolrWayback so it can be deployed at an alternative path (e.g. /act/solrwayback).

It's possible we can't resolve some of these issues without re-indexing. In which case, this will have to wait.

Notes on public use moved to #73

Integrate reporting systems into W3ACT stack

See the old Ingest Stack ideas: https://github.com/anjackson/ukwa-services/tree/intranet-2020/manage/intranet

These should be part of the core W3ACT stack, using the new auth mechanism, etc.

  • Use Metabase as main report system, rather than a faceted browser. Fall back on Voila notebooks if necessary.
  • Use Grafana as the main report system, rather than a faceted browser. Fall back on Voila notebooks if necessary.
  • #70
  • #38
  • API and recent screenshots
  • [ ] TrackDB browser ???
  • [ ] Airflow Dashboard ?

host vs host_no_auth env vars

The new host_no_auth env var was introduced to get around errors raised in robot runs by the inline authentication for the dev service, but it looks a bit extraneous/messy. Might be useful to revisit.

Access Website Stack improvement ideas

To be broken down into issues and milestones...

  • Push metrics from w3act_export to Prometheus.
  • Resolve webrecorder/pywb#591 (this is currently worked-around by dropping revisits)
  • Resolve ukwa/ukwa-pywb#61 (Fixing Twitter requires indexing POST requests etc.)
  • Find any and all places where webrender-puppeteer is used with a proxy and ensure there is no trailing slash, because this makes puppeteer go crazy. ukwa/webrender-puppeteer#13
  • Consider hooking Flask app into Sentry
  • Make recently-crawled screenshots (and content!) available. To do so, we have to finish off the warc-server idea:
    • Update warc-server to maintain filename-HDFS mapping?
    • Then proxy across crawler warc-server and HDFS warc-server?
    • Or have a separate redirect service that bounces HDFS filenames to WebHDFS, and to the crawler warc-server otherwise?
    • Also, update warc-server to update shared file list in background thread. Could be separate services, using NGINX proxy_next_upstream to manage them.

See also ukwa/ukwa-access-api#1 and the other issues https://github.com/ukwa/ukwa-access-api/issues

Document how to deploy the PyWB-backed NPLD Access System

This covers the deployment files and documentation for running the PyWB reading room services as a Docker Swarm service stack. The documentation starts at: https://github.com/ukwa/npld-access-stack#readme

Note that, at this time, PDFs and ePubs are not handled properly. PDFs will be rendered in the browser directly, for example. This will remain the case until ukwa/ukwa-pywb#74 is complete.

  • Confirm required network location. Do we need to be on the Access VLAN?
  • Ensure staff access can be separated out. May require separate IP address.
  • Understand various redundancies/backup services needed.
  • Verify assumption that all failover redirection, SSL encryption, authentication, token validation or user identification are handled upstream.
  • Understand reporting needs and whether this is all handled upstream.
  • Configure logging as appropriate.
  • Consider training options, e.g. this
  • If offline installation needed, @anjackson to document how that can be done (following this example or this one that does multiple images).
  • @anjackson Add NGINX rules to map expected URLs to PyWB URLs.
  • @anjackson Add LOCKS_AUTH=admin:password
  • @anjackson Add in known test cases for manual testing below.
  • @anjackson Allow access to the multi-cluster WARC Server as warc-server.api.wa.bl.uk
  • @anjackson ukwa/ukwa-pywb#77
  • @anjackson ukwa/ukwa-pywb#78
  • ukwa/docker-robot-framework#6

Accessible & Secure NPLD Access Project

This ticket summarises the overall status of this project, also known as the 'Ericom Replacement Project'. In short, we need to be able to access NPLD content from the UK LDL in a way that meets accessibility needs while also maintaining sufficient security. The current approach uses a remote desktop that is accessed via an HTML canvas, and as such this does not provide a route that meets accessibility legislation. The proposed solution makes the content more accessible, while carefully managing the security issues and NPLD constraints (e.g. the single-concurrent-use lock).

The solution works by extending our PyWB service to provide access to PDFs and ePubs, and ensuring that the resulting web service is only accessed via authorised web browsers that prevent copies of items being taken away. For some reading rooms, this requires a dedicated NPLD Player.

There are two main work streams:

  • UKWA Team helping App Support deploy the PyWB services and integrate them into LDL Reading Rooms.
    • #69
    • #86
    • Supporting all expected URL forms, including IDs with no prefix, e.g. vdc_100031420983.0x000001 rather than ark:/81055/vdc_100031420983.0x000001, and including /welcome.html?ID....
  • Webrecorder creating or extending the tools to make this possible:

Add tasks to run under Airflow

n.b. Some tasks need to be run on other services, but Airflow can make the SSH connection dependencies nice and explicit (see example), and the remote task can still just run a Docker run command so we can manage the software distribution (note we might need a docker pull ukwa/ukwa-manage:latest as part of the remote script if we're running latest).
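
For example, a hedged sketch of a remote task expressed with Airflow's SSHOperator, where the remote command is still just a Docker pull-and-run; the connection id, image and command are placeholders:

# Sketch only: an Airflow SSHOperator task where the remote command is still
# just a Docker pull-and-run. Connection id, image and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="remote_task_example",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    remote_step = SSHOperator(
        task_id="run_remote_docker_task",
        ssh_conn_id="crawler_host",  # defined as an Airflow Connection
        command=(
            "docker pull ukwa/ukwa-manage:latest && "
            "docker run --rm ukwa/ukwa-manage:latest --help"
        ),
    )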

Should also rationalise what's in:

[x] Add https://github.com/epoch8/airflow-exporter so we can integrate with Prometheus.

Ingest

  • w3act_backup (daily):
    • Back-up W3ACT PostgreSQL database to HDFS as CSV and SQL.
    • Clean out older backups??? (optional)
  • w3act_export (hourly):
    • Update blocks/allows.txt/aclj and annotations.json.
    • Update the Collection Solr.
    • Update crawl feed/job specifications.
    • python-w3act, w3act run_w3act-qa-check prod (currently, a weekly cron of 0 8 * * Mon ssh access@prod2 'sh /home/access/github/python-w3act/run_w3act-qa-check prod' >> /var/log/w3act-qa-check.log 2>&1)
    • GenerateW3ACTTitleExport (optional)
  • Analyse Crawl Logs/Run Document Harvester (hourly) see here
  • Update Intranet Reports (GenerateHDFSReports, dead seeds, etc.) (optional) currently some in ukwa-manage here but also GenerateHDFSReports and ukwa-reports.

Crawl

  • crawl_launch (hourly):
    • Launch crawls, based on crawl job specifications (hourly) (from here to crawlstreams).
  • crawl_warc_tidy (hourly): (to be part of ukwa-manage? OR h3cc?)
    • Close old open WARCs. (optional)
    • Move WEBRENDER WARCs see here.
  • Move WARCs and crawl logs to HDFS (hourly) ??? (optional) (new store command)
  • Refillers, e.g. (optional)
    • post-process Kafka log or CDX and re-queue URLs that match certain criteria.
    • Run with CrawlCache and do screenshots/device emulation.

Management

  • Update TrackDB from HDFS:
    • Update all HDFS daily, update WARCs locations hourly.
    • Metrics (generate metrics from TrackDB and push to the Prometheus push gateway) ??? OR stats_pusher.
  • Back up TrackDB to HDFS. (optional)
  • HDFS file hash job to TrackDB. (optional)
  • #102

Access

  • Update CDX Index with latest WARCs on HDFS, based on TrackDB (hourly) website/scripts/run-cdx-indexer.sh
  • #63
  • Back-up the Shine PostgreSQL database (daily)
  • Run the test suite (daily after the above updates?) and raise an alert if the website is misbehaving (optional)

Enable warcprox deduplication

I think part of the reason warcprox is pulling in so much data is that it does not do any deduplication. We should enable deduplication.

Need to check it's the 'right kind' of deduplication, something we can cope with at playback time.

Switch UKWA Docker image builds to standard workflow

We need to make sure all important Docker images are scanned for security issues as part of the GitHub Actions process, before the images are pushed to Docker Hub.

To do this, we can reuse GitHub Actions workflows across repositories, to ensure we build, scan and upload Docker Images consistently.

This is an example of a container that uses the shared workflow: https://github.com/ukwa/ukwa-warc-server/blob/master/.github/workflows/push-to-docker-hub.yml

The task here is to go through the stacks in this repository and update every referenced container build to re-use this shared workflow. Every change should be proposed as a PR on each repository, and linked here for @anjackson to review.

Consider adapting SolrWayback for public use

Beyond internal access, to use SolrWayback as a public service (replacing the faceted search part of ukwa-ui), we need to consider more complex issues:

  • Add localization support: netarchivesuite/solrwayback#23
  • Implement as an accessible, responsive design, e.g. follow Warclight Bootstrap for basic facet design and UI. Consider TailwindCSS for layout with Shoelaces for components, to be consistent with Webrecorder's work.
  • Pages like the Toolbox etc. should also be routed and bookmarkable.
  • Display an error message in the UI when calls to the back-end fail, rather than appearing to still be busy.
  • Security review - make sure all APIs can safely be made public.

As well as some minor changes:

  • Make sure the indexer config default (and in SolrWayback Docker Compose) uses url_norm.
  • Add support for common deployment context headers (X-Forwarded-Proto, X-Forwarded-Host etc.) to SolrWayback, so hardcoding the base URL is no longer necessary.
  • Possible optimisation: change ArcHTTPResolver to check the response for Content-Bytes rather than making two requests per request, where the first probes for Accept-Ranges: bytes. (Or at least only check the first time?)

This needs to be weighed up against the difficulties in adapting the off-the-shelf options.

Simplify and separate W3ACT AirFlow tasks

Currently, one file contains three workflows, because they share code for dumping the W3ACT DB, and each runs its own dump to avoid conflicts due to workflows running simultaneously. To be a bit more canonical-Airflow in style and a bit easier to manage, the workflows could be changed as follows:

  • For w3act_export
    • This runs hourly, to keep access services up to date. As it's the most frequent, this is the one that should export W3ACT data.
    • Rather than maintaining a single folder and replacing it, we should use a shared per-run folder, like /var/tmp/w3act_export_2021-12-10T09:00:00Z/, so that each run gets its own output folder.
    • These will get cleaned up automatically every 30 days by the OS.
  • For w3act_backup and w3act_report
    • These would use Airflow's ExternalTaskSensor to await the completion of the w3act_export workflow for the hour at which they run, e.g. 2021-12-10T00:00:00Z. They would then refer to the corresponding W3ACT DB dump and use that instead of a separate dump (see the sketch below).

This would make it easier to keep them in separate files, which is also more canonical for Airflow, and makes things a bit easier to understand.
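
A minimal sketch of the proposed arrangement for w3act_backup, assuming the DAG ids above and that waiting on the whole w3act_export DAG run (rather than a specific task) is sufficient; the backup command itself is a placeholder:

# Sketch of the proposed split: w3act_backup waits for the hourly w3act_export
# run that shares its logical (midnight) timestamp, then reuses that run's dump.
# DAG ids and the shared folder layout are taken from the description above;
# the backup command itself is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="w3act_backup",
    schedule_interval="@daily",
    start_date=datetime(2021, 12, 1),
    catchup=False,
) as dag:
    # Waits for the w3act_export DAG run that has the same logical date.
    wait_for_export = ExternalTaskSensor(
        task_id="wait_for_w3act_export",
        external_dag_id="w3act_export",
        external_task_id=None,
    )
    backup = BashOperator(
        task_id="backup_to_hdfs",
        bash_command="echo 'back up the dump from /var/tmp/w3act_export_{{ ts }}/'",
    )
    wait_for_export >> backup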

Open Access filters for the main website

It would be very useful to be able to browse and filter Targets in Collections based on whether they are OA or not. e.g.

  • In this collection, show only OA items.
  • Show all OA items relating to this subject.

Rich screenshot support via IIIF server layer

If we wrap IIIF around the page screenshotter, we get a lot of the features we'll need, like easy specification of sizes etc, for different purposes.

To make this work, given the format of IIIF URIs, we could use PWIDs and Base64-encode them, e.g.

urn:pwid:webarchive.org.uk:2008-11-29T00:41:42Z:page:http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

Becomes...

dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4

Which we use as the identifier in the IIIF {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format} URLs, like this:

/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg

This uses the page level precision-spec, as this is what makes sense in this context. The prefix of the URL would have to be used to distinguish between the archived and crawl-time images.

/render/archive/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
/render/capture/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
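
A small sketch of how the identifier could be constructed in code, assuming plain (non-URL-safe) Base64 is acceptable for these PWIDs:

# Sketch of building the IIIF identifier from a PWID, following the scheme above.
# Plain (non-URL-safe) Base64 is an assumption here.
import base64

def pwid(timestamp, url, archive="webarchive.org.uk", precision="page"):
    return f"urn:pwid:{archive}:{timestamp}:{precision}:{url}"

def iiif_screenshot_url(timestamp, url, region="0,0,200,200", size="full"):
    identifier = base64.b64encode(pwid(timestamp, url).encode("utf-8")).decode("ascii")
    # IIIF Image API template: {identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"/iiif/2/{identifier}/{region}/{size}/0/default.jpg"

if __name__ == "__main__":
    print(iiif_screenshot_url(
        "2008-11-29T00:41:42Z",
        "http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx",
    ))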

This could be done by running a Cantaloupe IIIF image server, which wraps plain image servers nicely, is used by our partners, and has lots of nice features like handling caching. This would pass the Base64 PWID on to a modified webrender-puppeteer which would decode the pwid64 and render the page at full size and ideally at high resolution. Cantaloupe would then cache this output and handle generating all necessary derivatives.

Cantaloupe can also overlay e.g. the UKWA logo which might work quite nicely.

(We could also add http://labs.mementoweb.org/aggregator_config/archivelist.xml and use that to determine the right web archive endpoint for other archives.)

Add metrics to the w3act exporter task service

The script that runs w3act exports should also post metrics to Prometheus. The proposal is to shift to being powered by ukwa-manage rather than just python-w3act, define the task script there, and add code to post-process the python-w3act output and post metrics to Prometheus.
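
A minimal sketch of the metrics-pushing part, using the standard prometheus_client library; the gateway address, job name and metric names are assumptions:

# Sketch of pushing export metrics to a Prometheus push gateway using the
# standard prometheus_client library. The gateway address, job name and metric
# names are assumptions.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_export_metrics(num_targets, num_collections, gateway="pushgateway:9091"):
    registry = CollectorRegistry()
    Gauge("ukwa_w3act_export_targets_total",
          "Number of Targets in the latest W3ACT export",
          registry=registry).set(num_targets)
    Gauge("ukwa_w3act_export_collections_total",
          "Number of Collections in the latest W3ACT export",
          registry=registry).set(num_collections)
    push_to_gateway(gateway, job="w3act_export", registry=registry)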

Update webrender-puppeteer to 1.0.14 and reduce load

Some screenshot problems need fixing by updating the webrender-api service to use the latest webrender-puppeteer release. However, even then, there will be problems getting the timings right, because the machine is so heavily loaded when running screenshots in the morning. We need to spread things out a bit more, so we must reduce the number of workers for webrender-api.

Modify setup and docs to improve failover procedures

When one crawler froze up and we switched to another, this caused problems because both were using the same networked Gluster filesystem (used for Kafka and Prometheus), whereas the crawl state (frontier and caches) was held locally. This caused problems with Kafka and Prometheus on startup.

This ticket is to consider how to handle this:

  • Only have crawl output on Gluster?
  • Move Kafka/Prometheus/etc. onto local disk?
  • Make Kafka a distinct, fully distributed service? (Similar to how the crawl-time CDX is a separate service).
  • And improve documentation to cover crawler failover.

Use rclone for Hadoop 3 copy-to-hdfs tasks

Add an Airflow DAG, based on the rclone/rclone Docker image, running e.g.

rclone copy --hdfs-namenode h3nn.wa.bl.uk:54310 --hdfs-username ingest  --max-age 24h --no-traverse /mnt/gluster/fc/heritrix/output :hdfs:/heritrix/output --include "*.warc.gz" --include "crawl.log.cp*"


rclone copy --max-age 48h --no-traverse /mnt/gluster/fc/heritrix/output/frequent-npld hadoop3:/heritrix/output/frequent-npld --include "*.warc.gz" --include "crawl.log.cp*"
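
A sketch of how this could be expressed as a DAG task using DockerOperator and the official rclone/rclone image; the bind mount and the assumption that a remote named 'hadoop3' is already configured inside the container are illustrative only:

# Sketch of the rclone copy above expressed as a DockerOperator task. The bind
# mount and the assumption that a remote named 'hadoop3' is already configured
# inside the container are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

with DAG(
    dag_id="copy_to_hdfs_rclone",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    rclone_copy = DockerOperator(
        task_id="rclone_copy_warcs",
        image="rclone/rclone:latest",
        command=(
            'copy --max-age 48h --no-traverse '
            '/mnt/gluster/fc/heritrix/output/frequent-npld '
            'hadoop3:/heritrix/output/frequent-npld '
            '--include "*.warc.gz" --include "crawl.log.cp*"'
        ),
        mounts=[Mount("/mnt/gluster/fc", "/mnt/gluster/fc", type="bind")],
    )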

Add more tests to cover w3act access limits

The newer W3ACT stack improves the authentication method used to access e.g. QA Wayback, but it also makes it a bit easier to accidentally remove the access limit. To look out for this, we need some additional automated tests to check logins are required etc.

To check that the following areas are only accessible if logged into W3ACT:

  • /act/wayback/
  • /act/nbapps/
  • /act/logs/

The test system is at https://github.com/ukwa/ukwa-services/tree/master/ingest/ingest_tests and is a set of Robot Framework tests that perform some simple live-system tests. i.e. all tests added to here should be safe to run on live/production services.
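
As a stop-gap sketch (in plain Python rather than Robot Framework), the kind of check needed is roughly: request each protected path anonymously and fail if it returns a 200; the host name here is a placeholder:

# Stop-gap sketch (plain Python rather than Robot Framework): each protected
# path should not return a 200 to an anonymous client. The host is a placeholder.
import requests

PROTECTED_PATHS = ["/act/wayback/", "/act/nbapps/", "/act/logs/"]

def check_login_required(base_url="https://dev.example.host"):
    failures = []
    for path in PROTECTED_PATHS:
        response = requests.get(base_url + path, allow_redirects=False)
        # Expect a redirect to the login page, or an explicit 401/403.
        if response.status_code not in (301, 302, 303, 307, 401, 403):
            failures.append((path, response.status_code))
    return failures

if __name__ == "__main__":
    for path, status in check_login_required():
        print(f"UNEXPECTED: {path} returned {status} without authentication")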

Some earlier instances missing in Wayback

Ensure Collection Areas are handled correctly

While investigating showing the Collection Areas, @min2ha found that the data in the Collections Solr didn't match up with what was in W3ACT. Following team discussion, it seems that a significant change of data model has happened.

Specifically, in the schema, the collectionAreaId is not multiValued. This means each Collection can only belong to one Collection Area (which it seems I had assumed was part of the intention, to make the list manageable, but it seems there are many Collections in multiple Collection Areas).

So, to fix this, we need to:

  • Change the ukwa-ui-collections-solr schema so it allows multiple collection areas, and deploy that to DEV.
  • Change the python-w3act scripts to make use of the multiple values for the collection areas.
  • Keep working on ukwa-ui to take advantage of this data.

Switch 'launch now' crawl launch process to Airflow crawl tasks

Rather than depending on processes running on Ingest, the crawls should be launched from Airflow. First, we'll just port the current mechanism over. (see #??? for planned improvements).

  • First use the bypm crawl launch as a test case, check it all works fine.
  • Check those crawls launched properly and down-stream tasks are working as expected.
  • Then add the npld crawl launch task, working in the same way.
