
ukwa-access-api's Introduction

UKWA Access API

This FastAPI application acts as a front-end for our access-time API services.

APIs

All APIs are documented using Swagger, and the system includes the Swagger UI. For example, when running in dev mode, you can go to:

http://localhost:8000/docs

and you'll get a UI that describes the APIs.

Wayback Resolver

This takes the timestamp and URL of interest, and redirects to the appropriate Wayback instance.
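As a rough sketch (the route and the Wayback base URL here are assumptions, not the service's actual configuration), such a resolver can be a simple redirect in FastAPI:

from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# Assumed base URL, for illustration only:
WAYBACK_PREFIX = "https://www.webarchive.org.uk/wayback/archive"

@app.get("/resolve/{timestamp}/{url:path}")
async def resolve(timestamp: str, url: str):
    # Redirect the client to the corresponding Wayback page:
    return RedirectResponse(f"{WAYBACK_PREFIX}/{timestamp}/{url}")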

IIIF Image API for rendering archived web pages

urn:pwid:webarchive.org.uk:1995-04-18T15:56:00Z:page:http://acid.matkelly.com

/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MTk5NS0wNC0xOFQxNTo1NjowMFo6cGFnZTpodHRwOi8vYWNpZC5tYXRrZWxseS5jb20==/0,0,1366,1366/300,/0/default.png
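The long token in the path is just the PWID encoded with Base64. A minimal sketch of constructing such a URL (the exact encoding variant, e.g. URL-safe vs. standard Base64, is an assumption):

import base64

pwid = "urn:pwid:webarchive.org.uk:1995-04-18T15:56:00Z:page:http://acid.matkelly.com"
token = base64.b64encode(pwid.encode("utf-8")).decode("ascii")
# Request a 300px-wide PNG of the region (0,0,1366,1366) via the IIIF Image API:
print(f"/iiif/2/{token}/0,0,1366,1366/300,/0/default.png")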

Development & Deployment

For development, you can run it (in a suitable virtualenv) using:

$ pip install -r requirements.txt
$ uvicorn ukwa_api.main:app --reload

For staging/beta/production, it's designed to run under Docker, using uvicorn as the runtime engine.

ukwa-access-api's People

Contributors

anjackson, dependabot[bot], ldbiz


ukwa-access-api's Issues

Handle HTTP 451 properly

Currently, HTTP 451 gets reported as an internal server error because Flask-RESTx does not recognise 451 as a valid HTTP code (its validation is based on Python 3.7's status-code list, and HTTP 451 was only added in Python 3.8).

I've proposed an update to Flask-RESTx, here: python-restx/flask-restx#262

Mementos API should include and honour the exclude and block lists

There's a disconnect between this project and the PyWB API, as it is the PyWB API that implements the access limitations.

In this API, it would be good to know what the block list is, so users can easily find out whether a given URL is accessible. Similarly, it is problematic that URLs that should be excluded can show up via this API.

Expose search as GraphQL API

The SearchKit toolkit has been refactored to use a GraphQL API, which gives me an idea.

Rather than coding directly against the Solr API, we could use the same conventions as SearchKit and expose a compatible GraphQL API.

There's good support for that kind of thing, e.g. https://graphene-python.org, which can be integrated with Flask.

We could even use the solr-sqlalchemy adapter and do something like https://medium.com/swlh/python-flask-with-graphql-server-with-sqlalchemy-and-graphene-and-sqlite-ac9fcc9d3d83 but two APIs might be overkill!
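As a minimal sketch of the Graphene approach (the Memento type, the search field and its q argument are illustrative assumptions, not SearchKit's actual schema):

import graphene

class Memento(graphene.ObjectType):
    url = graphene.String()
    timestamp = graphene.String()

class Query(graphene.ObjectType):
    search = graphene.List(Memento, q=graphene.String(required=True))

    def resolve_search(root, info, q):
        # A real resolver would translate the GraphQL query into a Solr query:
        return [Memento(url="http://example.com/", timestamp="20200101000000")]

schema = graphene.Schema(query=Query)
result = schema.execute('{ search(q: "example") { url timestamp } }')
print(result.data)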

Bring Save and Nominate together

Change or extend the API to be the Nominations API as well as Save An URL.

i.e. record all submitted URLs, optionally as nominations, where nominations have fields:

  • *Title
  • *URL
  • *Name
  • Email
  • Additional Information

The back-end should store all these in a proper DB table for analysis, e.g. tracking out-of-scope submissions. The existing page could also be ported over to this API; the fields marked * are required.
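As a sketch of what that table might look like (the column names are assumptions based on the fields listed above):

import sqlite3

conn = sqlite3.connect("nominations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS nominations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        url TEXT NOT NULL,
        name TEXT NOT NULL,
        email TEXT,
        additional_info TEXT,
        outcome TEXT,            -- e.g. 'Out Of Scope', 'Crawled At <date>'
        submitted_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()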

Ideas:

  • Follow an established async request pattern - this should also prevent trivial resubmissions by redirecting to a nomination status summary.
  • Fill out RSS feed items for Nominations.
  • Email (detailed?) nominations to the archivist account.
  • Update the Nominations table with outcomes (Out Of Scope, Crawled At, etc.).
  • Use simple file locking to safely start up when using multiple workers and an SQLite DB.
  • Use dbmate to manage schema.
  • Consider adding a sigil or tags - one or more public tags that allow users to subscribe to a feed and get updates.
  • Consider an iOS Shortcut (is there an Android equivalent? Tasker?) that integrates with the Save/Nominate API.
  • Consider supporting a webhook field, posting outcome events when the URL is processed.
  • Consider bulk submission and response?
  • Consider extending the API to allow bulk lookup of URL status (for Google Sheets style plugin etc.)

Review and fill out embedded API documentation.

The FastAPI annotations are used to add basic API documentation. This should be reviewed to see if it makes sense, and in particular, any existing parameters that have little or no documentation should be updated to include brief descriptions.

As an example: is the CDX API documentation clear? Does the current setup make it sufficiently clear that only limited values are allowed for particular parameters?

from typing import Optional, Union

from fastapi import Query
from pydantic import AnyHttpUrl

from . import schemas  # the app's local schemas module

async def lookup_url(
    url: AnyHttpUrl = Query(
        ...,
        title="URL to find.",
        description="URL to look for (will canonicalize the URL before running the query).",
        example='http://portico.bl.uk/'
    ),
    matchType: Optional[schemas.LookupMatchType] = Query(
        schemas.LookupMatchType.exact,
        title='Type of match to look for.'
    ),
    sort: Optional[schemas.LookupSort] = Query(
        schemas.LookupSort.default,
        title='Order to return results.'
    ),
    limit: Union[int, None] = Query(
        None,
        title='Number of matching records to return.'
    ),
):
    # Only put through allowed parameters:
    params = {
        'url': url,
        'matchType': matchType.value,
        'sort': sort.value,
        'limit': limit,
    }

Create caching screenshot service for OA material

Add an endpoint that uses https://github.com/ukwa/webrender-api to render screenshots of OA items via pywb in proxy mode. It should cache them essentially permanently, and keep the URL lists so the transclusions can be retained for whitelisting.

This has been made much easier by using an off-the-shelf caching IIIF server (Cantaloupe) as an intermediary. This now works reasonably well, see: ukwa/ukwa-services#24 (comment)

Some work remains:

  • Remove banner by having separate pywb config for rendering.
  • Chase up T:73492 (all o' Twitter)
  • Review exposed API and tidy up.
  • Actually do thumbs/full/card sizes properly
  • Update pywb and add in IIIF API hooks for social cards.
  • Note switch to https://flask-restx.readthedocs.io/en/latest/
  • Extend Flask-RESTx to add logo etc as per this
  • Switch to general OG markup rather than proprietary Twitter tags, as per
  • Report pywb returning that 304. (???)
  • Implement test suite to cover new functionality.

Expose UKWA entities, using IIIF concepts as appropriate

This is a sketch of a proposal for how to improve our API to make it more useful and standard. The current API is visible at: https://beta.webarchive.org.uk/api/

We wish to make our content easy to reuse elsewhere, leveraging the IIIF standard.

As we're using the IIIF standard, we can build on standard libraries, and use iiif-prezi to support Collection API v.2 and add prezi-2-to-3 to expose the same data as v3 manifests.

Building on #1, we could make the API more RESTful by focusing on exposing these nouns:

  • Mementos: a resource that encapsulates a prior state of the Original Resource. (as per the Memento spec)
    • Use the IIIF Image API to provide access to rendered versions of individual pages.
    • /api/mementos?... used for CDX queries.
    • /api/mementos/resolve/{timestamp}/{url} used to provide the resolution service required by e.g. Document Harvester catalogue entries.
    • /api/mementos/warc/{timestamp}/{url} used to access the WARC record for a specific memento.
    • /api/mementos/screenshot/{timestamp}/{url} a small extension to the IIIF interface to help users construct the right PWID from URL+timestamp. Redirects the user to the appropriate IIIF endpoint.
  • IIIF Image API: IIIF Image Service
    • /api/iiif/2/{PWID}/... exposes the IIIF Image service that gives access to images, using PWIDs to identify what to show.
  • Websites: the descriptive side of our current crawl Targets, encapsulating a set of Mementos and metadata describing an archived web site.
    • Use the IIIF Collection API to describe the set of seed snapshot images, effectively treating each time-stamped Memento as a different Canvas (which is how IIIF talks about the images).
    • Hierarchy: Website (Collection) > URLs (Sub-collection) > Mementos (Canvas)
    • /api/websites?... list/query websites, returning metadata and links to IIIF Collection manifests.
    • /api/websites/{site_id}/manifest.json - IIIF description of each website, including a top collection that lists all the per-URL manifests.
    • /api/websites/{site_id}/{url_id}/manifest.json etc. The manifest per URL associated with each site, pointing to all Mementos for that URL.
  • Collections: a.k.a. Topics & Themes; a hierarchy of groups of Websites and Mementos.
    • Hierarchy: Collection Areas (Collection) > Topics & Themes (Collection) > [Sub-Sections (Collection), Websites (Collections), Mementos (Canvas)]
    • /api/collections?... list/query all collections (flat list with parent IDs, links to IIIF Collections).
    • /api/collections/iiif/... Expose Collections as IIIF Collections.
    • Lowest level is usually linking to manifests for URLs at specific times based on collection metadata, in contrast to Websites, which expose the whole timeline for each URL.
  • Crawls: Just the top-level noun for crawler statistics. Not IIIF
    • /api/crawls/fc/livestats expose the live statistics coming from the Frequent Crawl stream.
  • Nominations: API to accept nominations from the website or other services (webhook-compatible). Not IIIF.
    • /api/nominations list nominations (?)
    • /api/nominations POST API to nominate a URL.
    • /api/nominations/save the current idea of a quick save-this-now-if-in-scope function.
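For illustration, a minimal sketch of what a v2 IIIF Collection for a single website might look like (the IDs and labels are hypothetical):

# Hypothetical IDs and labels, structured per the IIIF Presentation API v2:
website_collection = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://www.webarchive.org.uk/api/websites/123/manifest.json",
    "@type": "sc:Collection",
    "label": "Example Website",
    "manifests": [
        {
            "@id": "https://www.webarchive.org.uk/api/websites/123/456/manifest.json",
            "@type": "sc:Manifest",
            "label": "http://example.com/ over time",
        }
    ],
}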

Handle HTTPS URLs properly when screenshotting.

This works:

/api/screenshots/?url=http%3A%2F%2Fportico.bl.uk%2F&type=thumbnail&source=archive

This doesn't (connect timeout):

/api/screenshots/?url=https%3A%2F%2Fportico.bl.uk%2F&type=thumbnail&source=archive

Scope queries via API

For external parties to know which URLs we can crawl, and hence what is worth posting to the save endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried.

Essentially, GET /in-scope?url=http://test.url returns true/false.
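A minimal FastAPI sketch of such an endpoint (the suffix-based scope rule is a placeholder; a real implementation would consult the actual crawl-scope data):

from fastapi import FastAPI, Query
from pydantic import AnyHttpUrl

app = FastAPI()

# Placeholder rule; real scope would come from e.g. W3ACT-derived seed lists:
IN_SCOPE_SUFFIXES = (".uk",)

@app.get("/in-scope")
async def in_scope(url: AnyHttpUrl = Query(...)) -> bool:
    return url.host is not None and url.host.endswith(IN_SCOPE_SUFFIXES)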

n.b. this is similar to: ukwa/ukwa-heritrix#37

Enhancements to the screenshot API

Building on #1 ...

  • Support crawl-time screenshots as well as screenshots of archives? Not sure how; we could extend the PWID to include +original, e.g. urn:pwid:webarchive.org.uk:2020-10-10T00:00:00Z:page+original:https%253A%252F%252Fportico.bl.uk%252F
  • Include Memento-Datetime header in response so you can tell when the shot is from?
  • Update docs/README
  • Should parse/check PWID at the first pass through the API
  • Should also redirect to closest timestamp to reduce variation
  • Overlay UKWA logo via Cantaloupe configuration.
  • Also expose as oEmbed API?
  • Use iiif-prezi to create a Manifest API that generates IIIF Presentation API manifests for a URL over time (with some date range and max. N/min. date diff params).
  • Optimisation: reduce delay by e.g. hooking preview generation into pywb, so it's cached before use in a social card.
  • Publicise the approach and API etc.

Extending the API to expose crawl events as an RSS/Atom feed

At the level of individual URLs, expose CDX information as an RSS feed of crawl events, allowing users to be notified if a particularly interesting page is changed, e.g.:

/api/mementos/rss?url=http://example.com/

where (by default) only changes, i.e. crawls with different hashes, are reported.
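The deduplication logic is simple: emit a feed item only when the content hash differs from the previous capture. A sketch, assuming each CDX row carries a digest field:

def change_events(cdx_rows):
    # Yield only captures whose content hash differs from the previous one;
    # these become the items in the RSS feed.
    previous_digest = None
    for row in cdx_rows:
        if row["digest"] != previous_digest:
            yield row
            previous_digest = row["digest"]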

Add all safe and useful parameters to the CDX query API

The current CDX API only exposes four parameters:

async def lookup_url(
    url: AnyHttpUrl = Query(
        ...,
        title="URL to find.",
        description="URL to look for (will canonicalize the URL before running the query).",
        example='http://portico.bl.uk/'
    ),
    matchType: Optional[schemas.LookupMatchType] = Query(
        schemas.LookupMatchType.exact,
        title='Type of match to look for.'
    ),
    sort: Optional[schemas.LookupSort] = Query(
        schemas.LookupSort.default,
        title='Order to return results.'
    ),
    limit: Union[int, None] = Query(
        None,
        title='Number of matching records to return.'
    ),
):
    # Only put through allowed parameters:
    params = {
        'url': url,
        'matchType': matchType.value,
        'sort': sort.value,
        'limit': limit,
    }

The full back-end API offers more parameters (and more detailed documentation; see also #42):

https://nla.github.io/outbackcdx/api.html#operation/query

Not all parameters are safe/simple to add, so I propose we add:

  • closest
  • output
  • from
  • to
  • collapse / collapseToFirst
  • collapseToLast

In all cases, the FastAPI/Pydantic code/annotations can perform basic validation, after which the parameters can be copied and passed to the back-end API query.

The filter parameter is useful but potentially unsafe, as arbitrary regexes can be used in OutbackCDX. I'm not planning to expose that parameter directly, but in the future we might consider adding a non-regex version so that e.g. status-code filtering can be done, e.g. filter=!statuscode:429.

The fl parameter is not all that useful, and needs a custom parser, so it won't be implemented unless there is demand.
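A sketch of how the extra parameters might be declared (names and defaults are assumptions; from needs an alias, as it is a Python keyword):

from typing import Optional
from fastapi import Query

async def lookup_url_extended(
    # ... the existing url/matchType/sort/limit parameters as above ...
    closest: Optional[str] = Query(None, title='Timestamp to sort results against.'),
    from_ts: Optional[str] = Query(None, alias='from', title='Earliest timestamp to include.'),
    to_ts: Optional[str] = Query(None, alias='to', title='Latest timestamp to include.'),
    collapse: Optional[str] = Query(None, title='Field to collapse results on.'),
):
    ...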

Save API failing to work because it strips the double-slash

Requests like

https://beta.webarchive.org.uk/save/https%3A%2F%2F%2Fwww.bl.uk%2Fbritishlibrary%2F~%2Fmedia%2Fbl%2Fglobal%2Fabout%2520us%2Ffreedom%2520of%2520information%2Fpublication%2520scheme%2F4%2520decisions%2Fboard%2520meetings%2Fboard%2520meetings%25202018%2F180227-368%2Fblb%25201803.pdf

get a response like

{
  "result": {
    "ia": {
      "event": "save-page-now",
      "reason": "OK",
      "status": 200
    },
    "ukwa": {
      "event": "save-page-now",
      "reason": "Crawl Requested",
      "status": 201
    }
  },
  "url": "https:/www.bl.uk/britishlibrary/~/media/bl/global/about%20us/freedom%20of%20information/publication%20scheme/4%20decisions/board%20meetings/board%20meetings%202018/180227-368/blb%201803.pdf"
}

i.e. https:/www., so something is stripping one of the slashes.
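One plausible mechanism (an assumption, not a confirmed diagnosis): path normalisation collapses repeated slashes, which corrupts a URL embedded in the request path:

import posixpath

# normpath() collapses '//' into '/', mangling the embedded scheme:
print(posixpath.normpath("save/https://www.bl.uk/foo.pdf"))
# -> 'save/https:/www.bl.uk/foo.pdf'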

Save API needs to know the crawl scope

I just attempted to use the API to quickly crawl a new .co.uk site, but of course the frequent crawl only has the scope of the seeds it's seen (otherwise it'd be a domain crawl).

So, when popping URLs in through the API, the current permissible scope should be known, and if a URL is in scope, it should be marked as a seed when enqueuing it so that the scope is widened out.

Add Reconciliation API support for Archived Resources

As an additional mode of integration, and covering much of the utility of the now-broken Google Sheets add-on, we could support OpenRefine Reconciliation (see the API spec).

Starting with Archived Resources a.k.a. Mementos, users could pass in URLs and an optional target time, and this would be matched against our CDX. Later, this could be extended to cover other web archives, or other entities like Host or Target records, e.g. which Target does this URL belong to?

n.b. Can be manually tested using https://reconciliation-api.github.io/testbench/
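As a first step, the service would need to describe itself to clients. A sketch of the service manifest with placeholder values, loosely following the Reconciliation API spec:

# Placeholder values, structured per the Reconciliation service manifest:
RECONCILIATION_MANIFEST = {
    "versions": ["0.2"],
    "name": "UK Web Archive Mementos",
    "identifierSpace": "https://www.webarchive.org.uk/wayback/archive/",
    "schemaSpace": "https://www.webarchive.org.uk/",
    "defaultTypes": [{"id": "memento", "name": "Archived Resource"}],
}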

Switch to FastAPI and move documentation

FastAPI is a modern and widely-used framework for Python APIs with autogenerated OpenAPI documentation. This component could easily be switched to it, and used to generate the basic API docs. The detailed documentation, including examples etc., should be moved to ukwa-documentation, where it can be edited and localised without messing with the live service.

Collections API, add filter-by-organisation

We've had a request to make the nascent Collections API able to filter results by organisation. It's not 100% clear at what level this applies (top-level collections, collections, targets).
