

ukwa-services

Deployment configuration for almost all UKWA services.


Introduction

These Docker Stack configurations and related scripts are used to launch and manage our main services. No internal or sensitive data is kept here -- that is stored in the internal ukwa-services-env repository as environment variable scripts required for deployment, or as part of the CI/CD system.

Note that some services are not deployed via containers, e.g. the Hadoop clusters, and the Solr and OutbackCDX indexes. This includes a dedicated API server that acts as an intermediary for calls to various internal systems, allowing the implementation details of the current deployment to be kept separate from their external identity.

For example, our OutbackCDX service is accessed internally as cdx.api.wa.bl.uk. Over recent years, this service has been migrated to new hardware on a number of occasions, but using the cdx.api.wa.bl.uk proxy alias has allowed us to minimise downtime when migrating or upgrading the service.

These other services are documented elsewhere, but where the containerised stacks interact with them, that interaction will be made clear.

Service Stacks

Service stacks are grouped by broad service area, e.g. access contains the stacks that provide the access services, and the access README provides detailed documentation on how the access services are deployed. The service areas are:

  • ingest covers all services relating to the curation and ingest of web archives
  • access covers all services relating to how we make the web archives accessible to our audiences
  • manage covers all internal services relating to the management of the web archive, including automation and workflows that orchestrate activities from ingest to storage and then to access

For a high-level overview of how these service stacks interact, see the section on technical architecture below.

Within each sub-folder, e.g. access/website, we have a docker-compose.yml file which should be used for all deployment contexts (e.g. dev, beta and prod). Any necessary variations should be defined via environment variables.

These variables, and any other context-specific configuration, should be held in subdirectories. For example, if access/website/docker-compose.yml is the main stack definition file, any additional services needed only on dev might be declared in access/website/dev/docker-compose.yml and would be deployed separately.

The process for updating and deploying components is described in the deployment section below.

High-Level Technical Architecture

This is a high-level introduction to the technical components that make up our web archiving services. The primary goal is to provide an overview of the whole system, with a particular focus on knowing where to look if something goes wrong.

Some wider contextual information can be found at:

Note that the images on this page can be found in this Google Slides presentation.

Overview

High-level technical overview of the UKWA systems

The life-cycle of our web archives can be broken down into five main stages, along with management and monitoring processes covering the whole life-cycle, and the underlying infrastructure that supports it all. Each stage is defined by its interfaces, with the data standards and protocols that define what goes into and out of that stage (see below for more details). This allows each stage to evolve independently, as long as its 'contract' with the other stages is maintained.

There are multiple ingest streams, integrating different capture processes into a single overall workflow, starting with the curation tools that we use to drive the web crawlers. Those harvesting processes pull resources off the web and store them in archival form, to be transferred onto HDFS. From there, we can ingest the content into other long-term stores, and it can then be used to provide access to individual resources both internally and externally, for all the Legal Deposit libraries. As the system complexities and service levels vary significantly across the different access channels, we identify them as distinct services, while only having one (unified) harvesting service.

In order to be able to find items of interest among the billions of resources we hold, we run a range of data-mining processes on our collections that generate appropriate metadata, which is then combined with manually-generated annotations (supplied by our curators) and used to build our catalogue records and indexes. These records drive the discovery process, allowing users to find content which can then be displayed over the open web or via the reading room access service (as appropriate).

Areas

Manage

The critical management component is Apache Airflow, which orchestrates almost all web archive activity. For staff, it is accessible at http://airflow.api.wa.bl.uk. Each workflow (or DAG in Airflow terminology) is accessible via the management interface, and the description supplied with each one provides documentation on what the task does. Where possible, each individual task in a workflow involves running a single command-line application wrapped in a versioned Docker container. Developing our tools as individual command-line applications is intended to make them easier to develop, test and maintain. The Airflow deployment and workflows are defined in the ./manage folder, in ./manage/airflow.
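
As an illustration of this pattern (not the actual production DAG), a workflow step that wraps a versioned, containerised command-line tool might look something like the following sketch; the image tag, command and schedule are placeholders:

# Sketch only: an Airflow DAG step wrapping a versioned, containerised CLI tool.
# The image tag, command and schedule are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_indexing_workflow",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    index_warcs = DockerOperator(
        task_id="index_warcs",
        image="ukwa/ukwa-manage:1.2.3",        # always a tagged version, never 'latest'
        command="windex cdx-index --dry-run",  # hypothetical CLI invocation
    )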

Another important component is TrackDB, which contains a list of all the files on our storage systems, and is used by Airflow tasks to keep track of what's been indexed, etc.

See manage for more details.

Ingest

Covers curation services and crawl services, everything leading to WARCs and logs to store, and metadata for access.

See ingest for more details.

Storage

Storage systems are not deployed as containers, so there are no details here. We currently have multiple Hadoop clusters, and many of the tasks and components here rely on interacting with those clusters through their normal APIs.

Process

There are various Airflow tasks that process the data from W3ACT or from the Hadoop storage. We use the Python MrJob library to run tasks, which are defined in the ukwa/ukwa-manage repository. That is quite a complex system, as it supports Hadoop 0.20.x and Hadoop 3.x, and supports tasks written in Java and Python. See ukwa/ukwa-manage for more information.

Access

Our two main access services are:

  • The UK Web Archive open access service, online at https://www.webarchive.org.uk/
  • The Legal Deposit Access Service, only available in Legal Deposit Library reading rooms.

See access for more details.

Monitoring

Monitoring runs independently of all other systems, on separate dedicated hardware. It is based on Prometheus, with alerts defined for the major critical processes. See https://github.com/ukwa/ukwa-monitor for details.

Interfaces

There are data standards/protocols that isolate parts of the system so they can evolve independently (see How do you cut a monolith in half? for more on this idea).

  • Curate > Crawl: crawl feeds (seeds, frequencies, etc.) and the NEVER-CRAWL list. Generated from W3ACT; see the w3act_export workflow.
  • Crawl > Storage: WARC/WACZ files and logs. These are stored locally then moved to HDFS using Cron jobs (FC) and Airflow (DC, see copy_to_hdfs_crawler08).
  • Storage > Process: WARC/WACZ files and logs, plus metadata from W3ACT exports. This covers indexing tasks like CDX generation, full-text indexing, etc.
  • Process > Access: WARCs/WACZ on HDFS via an HTTP API + TrackDB, the OutbackCDX API, the Solr full-text and Collections APIs, and data exported by w3act_export (allows.aclj, blocks.aclj). As the collection is large, access is powered by APIs rather than file-level standards (an example CDX lookup is sketched below).
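
As a rough illustration of the 'Process > Access' contract, a capture lookup against an OutbackCDX collection can be done over plain HTTP; in this sketch the collection name ("data-heritrix") and the limit parameter are assumptions rather than confirmed configuration:

# Sketch of a capture lookup against OutbackCDX over plain HTTP.
# The collection name ("data-heritrix") and the 'limit' parameter are assumptions.
import requests

CDX_ENDPOINT = "http://cdx.api.wa.bl.uk/data-heritrix"

def lookup_captures(url, limit=10):
    """Return the raw CDX lines recorded for captures of the given URL."""
    response = requests.get(CDX_ENDPOINT, params={"url": url, "limit": limit})
    response.raise_for_status()
    return [line.split(" ") for line in response.text.splitlines() if line]

if __name__ == "__main__":
    for fields in lookup_captures("https://www.bl.uk/"):
        print(fields)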

Infrastructure

Access & Updates

A central server known as wash is used to log into all systems, and runs updates, logging etc. at the system level via Cron jobs.

A pair of servers use IP-failover to host the *.api.wa.bl.uk domains, running NGINX to proxy internal services to the appropriate back-end system.

Container Platforms

At the time of writing, we use Docker Swarm for production container deployment, and have a set of servers hosting PROD, BETA and DEV swarms.

Networks

The systems configured or maintained by the web archiving technical team are located on the following networks.

  • WA Public Network (194.66.232.82 to .94): All public services and crawlers. Note that the crawlers require unrestricted access to the open web, and so outgoing connections on any port are allowed from this network without going through the web proxy. However, very few incoming connections are allowed, each corresponding to a curatorial or access service component. These restrictions are implemented by the corporate firewall.
  • WA Internal Network: Internal service component network. Service components are held here to keep them off the public network, but provide various back-end services for our public network and for systems held on other internal networks. This means the components that act as integration points with other service teams are held here.
  • WA Private Network: The private network's primary role is to isolate the Hadoop cluster and HDFS from the rest of the networks, providing dedicated network capacity for cluster processes without affecting the wider network.
  • DLS Access Network: The BSP, LDN, NLW and NLW Access VLANs. Although we are not responsible for these network areas, we also deploy service components onto specific machines within the DLS access infrastructure, as part of the Legal Deposit Access Service.

Software

Almost our entire stack is open source, and the most critical components are co-maintained with other IIPC members. Specifically, the Heritrix crawler and the PyWB playback components (along with the standards and practices that they depend upon, like WARC) are crucial to the work of all the IIPC members, and to maintaining access to this content over the long term.

Current upgrade work in progress:

  • Reading Room access currently depends on OpenWayback but should be replaced with a modernised PyWB service through the Legal Deposit Access Solution project.
  • Adoption of Browsertrix Cloud for one-off crawls, with the intent to move all Frequent Crawls into it eventually.
  • A new approach is needed to manage monitoring and replication of content across H020, H3 BSP and H3 NLS.
  • Full-scale fulltext indexing remains a challenge and new workflows are needed.
  • All servers and containers need forward migration, e.g. to the latest version of RedHat, updated dependent libraries, etc. As we have a fairly large estate, this is an ongoing task. Generally, this can be done without major downtime; e.g. using Hadoop means it's relatively straightforward to take a storage node out and upgrade its operating system without interrupting the service.

Deployment Process

First, individual components should be developed and tested on developers' own machines/VMs, using the Docker Compose files within each tool's repository, e.g. w3act.

These are intended to be self-contained, i.e. where possible they should not depend on external services, but instead use dummy ones populated with test data.

Once a new version of a component is close to completion, we will want to run it against internal APIs for integration testing and/or user acceptance testing, and that's what the files in this repository are for. A copy of this repository is available on the shared storage of the DEV Swarm, and that's where new builds of containers should be tested.

Once we're happy with the set of Swarm stacks, we can tag the whole configuration set for release through BETA and then to PROD.

Whoever is performing the roll-out will then review the tagged ukwa-services configuration:

  • check they understand what has been changed, which should be indicated in the relevant pull request(s) or commit(s)
  • review setup, especially the prod/beta/dev-specific configurations, and check they are up to date and sensible
  • check no sensitive data or secrets have crept into this repository (rather than ukwa-services-env)
  • check all containers specify a tagged version to deploy
  • check the right API endpoints are in use
  • run any tests supplied for the component
  • run the service-level regression testing suite, https://github.com/ukwa/docker-robot-framework, to check if the public-facing services are behaving as expected.

ukwa-services's People

Contributors

anjackson, gilhoggarth, ldbiz


ukwa-services's Issues

Switch Kafka UIs from Trifecta to provectus/kafka-ui

We currently use Trifecta to check on Kafka queues, but it is not that widely used. We can switch to akhq, which seems to be more widely used and well supported.

version: '3.7'
services:
  akhq:
    image: tchiotludo/akhq
    ports:
      # Expose the AKHQ web UI on port 58080 of the host
      - "58080:8080"
    environment:
      # AKHQ reads its configuration from this environment variable
      AKHQ_CONFIGURATION: |
        akhq:
          connections:
            # Connection to the FC crawler's Kafka broker
            fc:
              properties:
                bootstrap.servers: "192.168.45.34:9094"
          security:
            # Default to read-only access
            default-group: reader

Remove all hard-coded domain names

The stacks hardcode the www/beta/dev domains in a few places, whereas it should be possible to pick these up from Host headers/context.

Crawl Log Viewing

Current plan is to siphon crawl events into a large database, for recently crawled FC material. We will use Solr at first because we know how to run it at scale, reconsidering CockroachDB later if we need e.g. proper SQL or ACID transactions etc.

Start with a simple Solr indexed version of the standard crawl log, so we can:

  • Find/filter the crawl log (see crawl-log-viewer) but much quicker than using Kafka.
  • Find crawl launch outcomes.

See ukwa/crawl-db#1
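
As a sketch of that first step, crawl-log lines could be parsed and pushed into a Solr core via the standard JSON update endpoint; the Solr host, core name and field names here are assumptions, not an agreed schema:

# Sketch: parse Heritrix crawl.log lines and push them to a Solr core.
# The Solr host, core name and field names are assumptions, not an agreed schema.
import json

import requests

SOLR_UPDATE = "http://solr.example.host:8983/solr/crawl_log/update?commit=true"

def parse_crawl_log_line(line):
    # Heritrix crawl.log lines are whitespace-delimited; keep a few key fields.
    parts = line.split()
    return {
        "timestamp": parts[0],
        "status_code": parts[1],
        "content_length": parts[2],
        "url": parts[3],
        "via": parts[5],
        "content_type": parts[6],
    }

def index_crawl_log(lines):
    docs = [parse_crawl_log_line(l) for l in lines if l.strip()]
    response = requests.post(SOLR_UPDATE, data=json.dumps(docs),
                             headers={"Content-Type": "application/json"})
    response.raise_for_status()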

Need to find a way to tidy up the H3 log parsing code and related code that is spread around:

Update webrenderer to 2.3.2

Update the access-time webrenderer to 2.2.x, and set RUN_BEHAVIOURS=false, to speed up rendering of cards etc.

Deduplicate tests?

The repeated tests in browse.robot might lend themselves to being simplified/deduplicated using a custom keyword.

Update CDX Indexing to handle PyWB-style indexes, OPTIONS/HEAD/POST with parameters.

To resolve some complex playback issues (Twitter, HuffPo) we need to be able to play back POST requests.

This requires some coordination with Ilya as he's been changing how he does it.

Once the indexing scheme is stable, we need to use a version of OutbackCDX that supports it, and re-index the CDX data (at least the last couple of years).


Updating the Java stack is quite involved: ukwa/webarchive-discovery#244

Might be time to switch to Python for this MR Job. Use the PyWB indexer and POST the results to OutbackCDX.

Also need OutbackCDX 0.8.0 to handle the lookups properly.

Some other examples of similar code:

MrJob

Using mapper_raw means MrJob arranges for a copy of each WARC to be placed where we can get to it:
(This breaks data locality, but streaming through large files is not performant because they get read into memory)
(A FileInputFormat that could reliably split block GZip files would be the only workable fix)
(But TBH this is pretty fast as it is)
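
Putting the two ideas above together, a minimal sketch of a Python MrJob that uses mapper_raw, runs a pywb-style CDXJ indexer over each WARC, and POSTs the result to OutbackCDX might look like this (the OutbackCDX endpoint is hypothetical, and it assumes the cdxj-indexer CLI and the requests library are available on the task nodes):

# Sketch only: MrJob + mapper_raw + pywb-style CDXJ indexing + POST to OutbackCDX.
# The OutbackCDX endpoint is hypothetical, and this assumes the cdxj-indexer CLI
# and the requests library are available on the Hadoop task nodes.
import subprocess

import requests
from mrjob.job import MRJob

OUTBACKCDX_COLLECTION = "http://cdx.example.host:8080/test-index"  # hypothetical

class CdxIndexJob(MRJob):

    def mapper_raw(self, warc_path, warc_uri):
        # mapper_raw hands us a local copy of the WARC file for this input split.
        cdxj = subprocess.run(
            ["cdxj-indexer", warc_path],
            check=True, capture_output=True, text=True,
        ).stdout
        # POST the generated index lines straight into the OutbackCDX collection.
        response = requests.post(OUTBACKCDX_COLLECTION, data=cdxj.encode("utf-8"))
        response.raise_for_status()
        yield warc_uri, len(cdxj.splitlines())

if __name__ == "__main__":
    CdxIndexJob.run()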

Play with a WARC processor with https://pypi.org/project/boilerpy3/ and e.g. spaCy

  • Verify with @ikreymer when the approach has been finalised, and what version of OutbackCDX it works with.
  • CDXJ Indexer indexes metadata records, which is what we want for video metadata etc. Are the application/warc-fields entries from Heritrix3 metadata records okay in the CDX?
  • Index recent material into a fresh Outback (>= 0.8.0) index and check playback.
  • Convert metadata URIs to embed URNs.
  • Drop 451/429?

See https://github.com/ukwa/ukwa-hadoop-tasks/tree/master/warc_indexing

Shift website deployment to PROD Swarm

The website is currently deployed on the access server. It should be running from the PROD Swarm.

  • Include PyWB 2.6.3 in deployment.
  • Deploy to BETA Swarm.
  • Run tests against beta.webarchive.org.uk to verify all is working as expected
  • Deploy to PROD Swarm.
  • Run tests against prod1.n45.wa.bl.uk to verify all is working as expected.
  • Set up API proxy website.api.wa.bl.uk:80, pointing to prod1:80
  • Run tests against website.n45.wa.bl.uk to verify all is working as expected.

The final deployment steps will need to be done together, to make sure everything is consistent. This should also be done early or late in the day to minimise disruption.

  • Update public-facing NGINX proxy configuration to match BETA, using website.api.wa.bl.uk:80 as the back-end.
  • Run tests against www.webarchive.org.uk to make sure all is well.

Ensure block list gets updated from W3ACT to the FC

The archivist role can add the problematic URL to W3ACT already, under a Black List field.

Then, we need to pick up white_list and black_list URLs from targets.csv and include them in the crawl feeds. They should be combined with the in-scope and NEVER-CRAWL lists (respectively).

After that, we need to check the crawler will pick up changes to the scope and block files, and add a w3act_export service to the FC stack that pulls and updates them. This does mean the block list might lag behind the launches a little, so we probably want to update them more often than daily.

(Clearly, we should consider wildcard/regex support, but that's more difficult to use. Maybe use plain URLs for URL blocks, but allow #-delimited lines for regexes?) For example:

https://www.bl.uk/?mobile=on
#twitter\.com/.*?lang=#

Hmm.
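
For what it's worth, a small sketch of how such a mixed plain-URL/regex block file could be interpreted (the format itself is only a proposal at this point):

# Sketch of a parser for the proposed format: plain lines are exact URL blocks,
# lines wrapped in '#' are treated as regular expressions. The format itself is
# only a proposal, not an agreed standard.
import re

def load_block_list(path):
    exact, patterns = set(), []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("#") and line.endswith("#") and len(line) > 2:
                patterns.append(re.compile(line[1:-1]))
            else:
                exact.add(line)
    return exact, patterns

def is_blocked(url, exact, patterns):
    return url in exact or any(p.search(url) for p in patterns)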

Also, take ukwa/ukwa-heritrix#85 into account

Issues with cookies and sessions on W3ACT

This is to note any outcomes from ukwa/w3act#662

  • Carlos is hitting examples where the inner iframe of playback says login, and this ends up logging into Wayback within the frame. The login redirect should target the full page frame. Possibly a new frame, and then recommend a reload?
  • The /wayback/*/ to /wayback/archive/*/ redirect is not working; it ended up at http://prod1/act/wayback/archive//https://www.scottishpower.co.uk/
  • Determine if the JSESSIONID Set-Cookie headers are what's tripping up everything else - see e.g. this page.
  • Copy the W3ACT cookie into every response as a Set-Cookie header, when viewing Wayback.
  • Deal with the SameSite warning (see below, and ukwa/w3act#663).
Cookie “PLAY_SESSION” will be soon rejected because it has the “SameSite” attribute set to “None” or an invalid value, without the “secure” attribute. To know more about the “SameSite” attribute, read https://developer.mozilla.org/docs/Web/HTTP/Headers/Set-Cookie/SameSite

Make CDX index backfill workflow for DC2018 and DC2019

It seems the 2018 and 2019 domain crawls may not have been CDX indexed. We need to design a suitable Airflow DAG that will be able to perform these backfill tasks.

The idiomatic Airflow version would be a proper backfilling task, with a start date in e.g. 2010, using the last-modified date of the files on HDFS, where each run works through the WARCs available for its period in chunks: e.g. an @monthly task that lists all WARCs corresponding to the previous month, and then indexes them in chunks of e.g. 2000 WARCs.

This would mean changing the windex utility to (a) be able to filter on a date range instead of X years back, and (b) be able to loop over all matching WARCs rather than just running one batch.
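
A sketch of what the back-filling DAG could look like, with catchup enabled so Airflow creates one run per month since the start date; the start date and the windex flags are placeholders, since the date-filtering options described above don't exist yet:

# Sketch of a back-filling CDX indexing DAG: catchup=True makes Airflow create
# one run per month since start_date. The start date is a placeholder, and the
# windex date-filtering flags shown here do not exist yet (see above).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cdx_backfill",
    schedule_interval="@monthly",
    start_date=datetime(2018, 1, 1),
    catchup=True,
    max_active_runs=1,
) as dag:
    index_month = BashOperator(
        task_id="index_warcs_for_month",
        bash_command=(
            "windex cdx-index "
            "--after {{ data_interval_start.strftime('%Y-%m-%d') }} "
            "--before {{ data_interval_end.strftime('%Y-%m-%d') }} "
            "--batch-size 2000"
        ),
    )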

Improve crawl configuration and launch management

Building on #83, improve crawl management:

  • Make crawl launches 'back-fillable' so we can re-run launches if they don't happen:
    • Needs date-stamped crawl feed files.
    • Needs a separate task that is dependent on the data export, or the current w3act_export needs to be made back-fillable.
  • Blocks, seeds and scope files in use by the crawlers need to be updated:
    • Blocks and scope managed via Watched Files, less clear if/how seeds should be blanket-updated.
    • Not clear how best to do that. Probably push rather than pull, as this means Airflow is always in charge of things. But then, a shared volume updated directly by Airflow? Or files made available and a remote task or service prompted to pull them down?
  • Launch metrics need to be posted to Prometheus.

Evaluate integration of SolrWayback

Rather than continuing to roll our own search and visualization tool, we should consider adopting SolrWayback and collaborating with the NAS team on it. We could start by making it available as an internal tool, within the W3ACT stack. For this, we need to:

  • Dockerise it in a way that allows it to work with our Solr indexes.
  • Cope with content_type being either single or multiValued.
  • Ensure the required configuration options can be overridden from environment variables.
  • Tomcat should log the response from the server when calls to Solr fail: add alternative logback config for Docker so logs go to the console not a file.
  • Implement HTTP and/or WebHDFS WARC record retrieval back-ends so it can work properly with our WARC store.
  • SolrWayback relies on url_norm for many queries, but our older indexes do not contain it. Can we work around this somehow?
  • To facilitate automated CI testing, switch to a test WARC as well as test Solr documents, i.e. use consistent records so we can search and replay from the test system.
  • Allow WAR name to be overridden on launch, so I can use act#solrwayback and hence get it to deploy in the right place.
  • Allow alternative playback engine to be overridden. Added ALT_PLAYBACK_PREFIX env var.
  • Also override the path regex, as our Solr index holds the file name only, not the full path.
  • ukwa/ukwa-warc-server#12
  • Adjust SolrWayback so it can be deployed at an alternative path (e.g. /act/solrwayback).

It's possible we can't resolve some of these issues without re-indexing. In which case, this will have to wait.

Notes on public use moved to #73

Integrate reporting systems into W3ACT stack

See the old Ingest Stack ideas: https://github.com/anjackson/ukwa-services/tree/intranet-2020/manage/intranet

These should be part of the core W3ACT stack, using the new auth mechanism, etc.

  • Use Metabase as main report system, rather than a faceted browser. Fall back on Voila notebooks if necessary.
  • Use Grafana as the main report system, rather than a faceted browser. Fall back on Voila notebooks if necessary.
  • #70
  • #38
  • API and recent screenshots
  • [ ] TrackDB browser ???
  • [ ] Airflow Dashboard ?

host vs host_no_auth env vars

The new host_no_auth env var was introduced to get around errors raised in robot runs by the inline authentication for the dev service, but it looks a bit extraneous/messy. Might be useful to revisit.

Access Website Stack improvement ideas

To be broken down into issues and milestones...

  • Push metrics from w3act_export to Prometheus.
  • Resolve webrecorder/pywb#591 (this is currently worked-around by dropping revisits)
  • Resolve ukwa/ukwa-pywb#61 (Fixing Twitter requires indexing POST requests etc.)
  • Find any and all places where webrender-puppeteer is used with a proxy and ensure there is no trailing slash, because this makes puppeteer go crazy. ukwa/webrender-puppeteer#13
  • Consider hooking Flask app into Sentry
  • Make recently-crawled screenshots (and content!) available. To do so, we have to finish off the warc-server idea:
    • Update warc-server to maintain filename-HDFS mapping?
    • Then proxy across crawler warc-server and HDFS warc-server?
    • Or have a separate redirect service that bounces HDFS filenames to WebHDFS, and to the crawler warc-server otherwise?
    • Also, update warc-server to update shared file list in background thread. Could be separate services, using NGINX proxy_next_upstream to manage them.

See also ukwa/ukwa-access-api#1 and the other issues https://github.com/ukwa/ukwa-access-api/issues

Document how to deploy the PyWB-backed NPLD Access System

This covers the deployment files and documentation for running the PyWB reading room services as a Docker Swarm service stack. The documentation starts at: https://github.com/ukwa/npld-access-stack#readme

Note that, at this time, PDFs and ePubs are not handled properly. PDFs will be rendered in the browser directly, for example. This will remain the case until ukwa/ukwa-pywb#74 is complete.

  • Confirm required network location. Do we need to be on the Access VLAN?
  • Ensure staff access can be separated out. May require separate IP address.
  • Understand various redundancies/backup services needed.
  • Verify assumption that all failover redirection, SSL encryption, authentication, token validation or user identification are handled upstream.
  • Understand reporting needs and whether this is all handled upstream.
  • Configure logging as appropriate.
  • Consider training options, e.g. this
  • If offline installation needed, @anjackson to document how that can be done (following this example or this one that does multiple images).
  • @anjackson Add NGINX rules to map expected URLs to PyWB URLs.
  • @anjackson Add LOCKS_AUTH=admin:password
  • @anjackson Add in known test cases for manual testing below.
  • @anjackson Allow access to the multi-cluster WARC Server as warc-server.api.wa.bl.uk
  • @anjackson ukwa/ukwa-pywb#77
  • @anjackson ukwa/ukwa-pywb#78
  • ukwa/docker-robot-framework#6

Accessible & Secure NPLD Access Project

This ticket summarises the overall status of this project, also known as the 'Ericom Replacement Project'. In short, we need to be able to access NPLD content from the UK LDL in a way that meets accessibility needs while also maintaining sufficient security. The current approach uses a remote desktop that is accessed via an HTML canvas, and as such this does not provide a route that meets accessibility legislation. The proposed solution makes the content more accessible, while carefully managing the security issues and NPLD constraints (e.g. the single-concurrent-use lock).

The solution works by extending our PyWB service to provide access to PDFs and ePubs, and ensuring that the resulting web service is only accessed via authorised web browsers that prevent copies of items being taken away. For some reading rooms, this requires a dedicated NPLD Player.

There are two main work streams:

  • UKWA Team helping App Support deploy the PyWB services and integrate them into LDL Reading Rooms.
    • #69
    • #86
    • Supporting all expected URL forms, including IDs with no prefix, e.g. vdc_100031420983.0x000001 rather than ark:/81055/vdc_100031420983.0x000001, and including /welcome.html?ID....
  • Webrecorder creating or extending the tools to make this possible:

Add tasks to run under Airflow

n.b. Some tasks need to be run on other services, but Airflow can make the SSH connection dependencies nice and explicit (see example), and the remote task can still just run a Docker run command so we can manage the software distribution (note we might need a docker pull ukwa/ukwa-manage:latest as part of the remote script if we're running latest).
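
For example, a hedged sketch of a remote task expressed with Airflow's SSHOperator, where the remote command is still just a Docker pull-and-run; the connection id, image and command are placeholders:

# Sketch only: an Airflow SSHOperator task where the remote command is still
# just a Docker pull-and-run. Connection id, image and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="remote_task_example",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    remote_step = SSHOperator(
        task_id="run_remote_docker_task",
        ssh_conn_id="crawler_host",  # defined as an Airflow Connection
        command=(
            "docker pull ukwa/ukwa-manage:latest && "
            "docker run --rm ukwa/ukwa-manage:latest --help"
        ),
    )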

Should also rationalise what's in:

[x] Add https://github.com/epoch8/airflow-exporter so we can integrate with Prometheus.

Ingest

  • w3act_backup (daily):
    • Back-up W3ACT PostgreSQL database to HDFS as CSV and SQL.
    • Clean out older backups??? (optional)
  • w3act_export (hourly):
    • Update blocks/allows.txt/aclj and annotations.json.
    • Update the Collection Solr.
    • Update crawl feed/job specifications.
    • python-w3act, w3act run_w3act-qa-check prod (currently, a weekly cron of 0 8 * * Mon ssh access@prod2 'sh /home/access/github/python-w3act/run_w3act-qa-check prod' >> /var/log/w3act-qa-check.log 2>&1)
    • GenerateW3ACTTitleExport (optional)
  • Analyse Crawl Logs/Run Document Harvester (hourly) see here
  • Update Intranet Reports (GenerateHDFSReports, dead seeds, etc.) (optional) currently some in ukwa-manage here but also GenerateHDFSReports and ukwa-reports.

Crawl

  • crawl_launch (hourly):
    • Launch crawls, based on crawl job specifications (hourly) (from here to crawlstreams).
  • crawl_warc_tidy (hourly): (to be part of ukwa-manage? OR h3cc?)
    • Close old open WARCs. (optional)
    • Move WEBRENDER WARCs see here.
  • Move WARCs and crawl logs to HDFS (hourly) ??? (optional) (new store command)
  • Refillers, e.g. (optional)
    • post-process Kafka log or CDX and re-queue URLs that match certain criteria.
    • Run with CrawlCache and do screenshots/device emulation.

Management

  • Update TrackDB from HDFS:
    • Update all HDFS daily, update WARCs locations hourly.
    • Metrics (generate metrics from TrackDB and push to the Prometheus push gateway) ??? OR stats_pusher.
  • Back up TrackDB to HDFS. (optional)
  • HDFS file hash job to TrackDB. (optional)
  • #102

Access

  • Update CDX Index with latest WARCs on HDFS, based on TrackDB (hourly) website/scripts/run-cdx-indexer.sh
  • #63
  • Back-up the Shine PostgreSQL database (daily)
  • Run the test suite (daily after the above updates?) and raise an alert if the website is misbehaving (optional)

Enable warcprox deduplication

I think part of the reason warcprox is pulling in so much data is that it does not do any deduplication. We should enable deduplication.

Need to check it's the 'right kind' of deduplication, something we can cope with at playback time.

Switch UKWA Docker image builds to standard workflow

We need to make sure all important Docker images are scanned for security issues as part of the GitHub Actions process, before the images are pushed to Docker Hub.

To do this, we can reuse GitHub Actions workflows across repositories, to ensure we build, scan and upload Docker Images consistently.

This is an example of a container that uses the shared workflow: https://github.com/ukwa/ukwa-warc-server/blob/master/.github/workflows/push-to-docker-hub.yml

The task here is to go through the stacks in this repository and update every referenced container build to re-use this shared workflow. Every change should be proposed as a PR on each repository, and linked here for @anjackson to review.

Consider adapting SolrWayback for public use

Beyond internal access, to use SolrWayback as a public service (replacing the faceted search part of ukwa-ui), we need to consider more complex issues:

  • Add localization support: netarchivesuite/solrwayback#23
  • Implement as an accessible, responsive design, e.g. follow Warclight Bootstrap for basic facet design and UI. Consider TailwindCSS for layout with Shoelaces for components, to be consistent with Webrecorder's work.
  • Pages like the Toolbox etc. should also be routed and bookmarkable.
  • Display an error message in the UI when calls to the back-end fail, rather than appearing to still be busy.
  • Security review - make sure all APIs can safely be made public.

As well as some minor changes:

  • Make sure the indexer config default (and in SolrWayback Docker Compose) uses url_norm.
  • Add support for common deployment context headers (X-Forwarded-Proto, X-Forwarded-Host etc.) to SolrWayback, so hardcoding the base URL is no longer necessary.
  • Possible optimisation: change ArcHTTPResolver to check the response for Content-Bytes rather than making two requests per request, where the first probes for Accept-Ranges: bytes. (Or at least only check the first time?)

This needs to be weighed up against the difficulties in adapting the off-the-shelf options.

Simplify and separate W3ACT AirFlow tasks

Currently, one file contains three workflows, because they share code for dumping the W3ACT DB, and each runs its own dump to avoid conflicts due to workflows running simultaneously. To be a bit more canonical-Airflow in style and a bit easier to manage, the workflows could be changed as follows:

  • For w3act_export
    • This runs hourly, to keep access services up to date. As it's the most frequent, this is the one that should export W3ACT data.
    • Rather than maintaining a single folder and replacing it, we should use a shared per-run folder, like /var/tmp/w3act_export_2021-12-10T09:00:00Z/, so that each run gets its own output folder.
    • These will get cleaned up automatically every 30 days by the OS.
  • For w3act_backup and w3act_report
    • These would use Airflow's ExternalTaskSensor to await the completion of the w3act_export workflow for the hour at which they run, e.g. 2021-12-10T00:00:00Z. They would then refer to the corresponding W3ACT DB dump and use that instead of a separate dump (see the sketch below).

This would make it easier to keep them in separate files, which is also more canonical for Airflow, and makes things a bit easier to understand.
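
A minimal sketch of the proposed arrangement for w3act_backup, assuming the DAG ids above and that waiting on the whole w3act_export DAG run (rather than a specific task) is sufficient; the backup command itself is a placeholder:

# Sketch of the proposed split: w3act_backup waits for the hourly w3act_export
# run that shares its logical (midnight) timestamp, then reuses that run's dump.
# DAG ids and the shared folder layout are taken from the description above;
# the backup command itself is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="w3act_backup",
    schedule_interval="@daily",
    start_date=datetime(2021, 12, 1),
    catchup=False,
) as dag:
    # Waits for the w3act_export DAG run that has the same logical date.
    wait_for_export = ExternalTaskSensor(
        task_id="wait_for_w3act_export",
        external_dag_id="w3act_export",
        external_task_id=None,
    )
    backup = BashOperator(
        task_id="backup_to_hdfs",
        bash_command="echo 'back up the dump from /var/tmp/w3act_export_{{ ts }}/'",
    )
    wait_for_export >> backup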

Open Access filters for the main website

It would be very useful to be able to browse and filter Targets in Collections based on whether they are OA or not. e.g.

  • In this collection, show only OA items.
  • Show all OA items relating to this subject.

Rich screenshot support via IIIF server layer

If we wrap IIIF around the page screenshotter, we get a lot of the features we'll need, like easy specification of sizes etc, for different purposes.

To make this work, given the format of IIIF URIs, we could use PWIDs and Base64-encode them, e.g.

urn:pwid:webarchive.org.uk:2008-11-29T00:41:42Z:page:http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

Becomes...

dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4

Which we use as the identifier in the IIIF {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format} URLs, like this:

/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg

This uses the page level precision-spec, as this is what makes sense in this context. The prefix of the URL would have to be used to distinguish between the archived and crawl-time images.

/render/archive/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
/render/capture/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
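
A small sketch of how the identifier could be constructed in code, assuming plain (non-URL-safe) Base64 is acceptable for these PWIDs:

# Sketch of building the IIIF identifier from a PWID, following the scheme above.
# Plain (non-URL-safe) Base64 is an assumption here.
import base64

def pwid(timestamp, url, archive="webarchive.org.uk", precision="page"):
    return f"urn:pwid:{archive}:{timestamp}:{precision}:{url}"

def iiif_screenshot_url(timestamp, url, region="0,0,200,200", size="full"):
    identifier = base64.b64encode(pwid(timestamp, url).encode("utf-8")).decode("ascii")
    # IIIF Image API template: {identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"/iiif/2/{identifier}/{region}/{size}/0/default.jpg"

if __name__ == "__main__":
    print(iiif_screenshot_url(
        "2008-11-29T00:41:42Z",
        "http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx",
    ))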

This could be done by running a Cantaloupe IIIF image server, which wraps plain image servers nicely, is used by our partners, and has lots of nice features like handling caching. This would pass the Base64 PWID on to a modified webrender-puppeteer which would decode the pwid64 and render the page at full size and ideally at high resolution. Cantaloupe would then cache this output and handle generating all necessary derivatives.

Cantaloupe can also overlay e.g. the UKWA logo which might work quite nicely.

(We could also add http://labs.mementoweb.org/aggregator_config/archivelist.xml and use that to determine the right web archive endpoint for other archives.)

Add metrics to the w3act exporter task service

The script that runs w3act exports should also post metrics to Prometheus. The proposal is to shift to being powered by ukwa-manage rather than just python-w3act, define the task script there, and add code to post-process the python-w3act output and post metrics to Prometheus.
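
A minimal sketch of the metrics-pushing part, using the standard prometheus_client library; the gateway address, job name and metric names are assumptions:

# Sketch of pushing export metrics to a Prometheus push gateway using the
# standard prometheus_client library. The gateway address, job name and metric
# names are assumptions.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def push_export_metrics(num_targets, num_collections, gateway="pushgateway:9091"):
    registry = CollectorRegistry()
    Gauge("ukwa_w3act_export_targets_total",
          "Number of Targets in the latest W3ACT export",
          registry=registry).set(num_targets)
    Gauge("ukwa_w3act_export_collections_total",
          "Number of Collections in the latest W3ACT export",
          registry=registry).set(num_collections)
    push_to_gateway(gateway, job="w3act_export", registry=registry)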

Update webrender-puppeteer to 1.0.14 and reduce load

Some screenshot problems need fixing by updating the webrender-api service to use the latest webrender-puppeteer release. However, even then, there will be problems getting the timings right, because the machine is so heavily loaded when running screenshots in the morning. We need to spread things out a bit more, so we must reduce the number of workers for webrender-api.

Modify setup and docs to improve failover procedures

When one crawler froze up and we switched to another, this caused problems because both were using the same networked Gluster filesystem (used for Kafka and Prometheus), whereas the crawl state (frontier and caches) was held locally. This caused problems with Kafka and Prometheus on startup.

This ticket is to consider how to handle this:

  • Only have crawl output on Gluster?
  • Move Kafka/Prometheus/etc. onto local disk?
  • Make Kafka a distinct, fully distributed service? (Similar to how the crawl-time CDX is a separate service).
  • And improve documentation to cover crawler failover.

Use rclone for Hadoop 3 copy-to-hdfs tasks

Add an Airflow DAG, based on the rclone/rclone Docker image, running e.g.

rclone copy --hdfs-namenode h3nn.wa.bl.uk:54310 --hdfs-username ingest  --max-age 24h --no-traverse /mnt/gluster/fc/heritrix/output :hdfs:/heritrix/output --include "*.warc.gz" --include "crawl.log.cp*"


rclone copy --max-age 48h --no-traverse /mnt/gluster/fc/heritrix/output/frequent-npld hadoop3:/heritrix/output/frequent-npld --include "*.warc.gz" --include "crawl.log.cp*"
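
A sketch of how this could be expressed as a DAG task using DockerOperator and the official rclone/rclone image; the bind mount and the assumption that a remote named 'hadoop3' is already configured inside the container are illustrative only:

# Sketch of the rclone copy above expressed as a DockerOperator task. The bind
# mount and the assumption that a remote named 'hadoop3' is already configured
# inside the container are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

with DAG(
    dag_id="copy_to_hdfs_rclone",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    rclone_copy = DockerOperator(
        task_id="rclone_copy_warcs",
        image="rclone/rclone:latest",
        command=(
            'copy --max-age 48h --no-traverse '
            '/mnt/gluster/fc/heritrix/output/frequent-npld '
            'hadoop3:/heritrix/output/frequent-npld '
            '--include "*.warc.gz" --include "crawl.log.cp*"'
        ),
        mounts=[Mount("/mnt/gluster/fc", "/mnt/gluster/fc", type="bind")],
    )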

Add more tests to cover w3act access limits

The newer W3ACT stack improves the authentication method used to access e.g. QA Wayback, but it also makes it a bit easier to accidentally remove the access limit. To look out for this, we need some additional automated tests to check logins are required etc.

To check that the following areas are only accessible if logged into W3ACT:

  • /act/wayback/
  • /act/nbapps/
  • /act/logs/

The test system is at https://github.com/ukwa/ukwa-services/tree/master/ingest/ingest_tests and is a set of Robot Framework tests that perform some simple live-system tests. i.e. all tests added to here should be safe to run on live/production services.
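
As a stop-gap sketch (in plain Python rather than Robot Framework), the kind of check needed is roughly: request each protected path anonymously and fail if it returns a 200; the host name here is a placeholder:

# Stop-gap sketch (plain Python rather than Robot Framework): each protected
# path should not return a 200 to an anonymous client. The host is a placeholder.
import requests

PROTECTED_PATHS = ["/act/wayback/", "/act/nbapps/", "/act/logs/"]

def check_login_required(base_url="https://dev.example.host"):
    failures = []
    for path in PROTECTED_PATHS:
        response = requests.get(base_url + path, allow_redirects=False)
        # Expect a redirect to the login page, or an explicit 401/403.
        if response.status_code not in (301, 302, 303, 307, 401, 403):
            failures.append((path, response.status_code))
    return failures

if __name__ == "__main__":
    for path, status in check_login_required():
        print(f"UNEXPECTED: {path} returned {status} without authentication")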

Some earlier instances missing in Wayback

Ensure Collection Areas are handled correctly

While investigating showing the Collection Areas, @min2ha found that the data in the Collections Solr didn't match up with what was in W3ACT. Following team discussion, it seems that a significant change of data model has happened.

Specifically, in the schema, the collectionAreaId is not multiValued. This means each Collection can only belong to one Collection Area (which it seems I had assumed was part of the intention, to make the list manageable, but it seems there are many Collections in multiple Collection Areas).

So, to fix this, we need to:

  • Change the ukwa-ui-collections-solr schema so it allows multiple collection areas, and deploy that to DEV.
  • Change the python-w3act scripts to make use of the multiple values for the collection areas.
  • Keep working on ukwa-ui to take advantage of this data.

Switch 'launch now' crawl launch process to Airflow crawl tasks

Rather than depending on processes running on Ingest, the crawls should be launched from Airflow. First, we'll just port the current mechanism over. (see #??? for planned improvements).

  • First use the bypm crawl launch as a test case, check it all works fine.
  • Check those crawls launched properly and down-stream tasks are working as expected.
  • Then add the npld crawl launch task, working in the same way.
