
ukwa-access-api's Introduction

UKWA Access API

This FastAPI application acts as a front-end for our access-time API services.

APIs

All APIs are documented using Swagger, and the system includes the Swagger UI. For example, when running in dev mode, you can go to:

http://localhost:8000/docs

and you'll get a UI that describes the APIs.

Wayback Resolver

This takes the timestamp and URL of interest, and redirects to the appropriate Wayback instance.
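As a rough sketch (the route and the Wayback base URL here are assumptions, not the service's actual configuration), such a resolver can be a simple redirect in FastAPI:

from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# Assumed base URL, for illustration only:
WAYBACK_PREFIX = "https://www.webarchive.org.uk/wayback/archive"

@app.get("/resolve/{timestamp}/{url:path}")
async def resolve(timestamp: str, url: str):
    # Redirect the client to the corresponding Wayback page:
    return RedirectResponse(f"{WAYBACK_PREFIX}/{timestamp}/{url}")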

IIIF Image API for rendering archived web pages

urn:pwid:webarchive.org.uk:1995-04-18T15:56:00Z:page:http://acid.matkelly.com

/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MTk5NS0wNC0xOFQxNTo1NjowMFo6cGFnZTpodHRwOi8vYWNpZC5tYXRrZWxseS5jb20==/0,0,1366,1366/300,/0/default.png
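The long token in the path is just the PWID encoded with Base64. A minimal sketch of constructing such a URL (the exact encoding variant, e.g. URL-safe vs. standard Base64, is an assumption):

import base64

pwid = "urn:pwid:webarchive.org.uk:1995-04-18T15:56:00Z:page:http://acid.matkelly.com"
token = base64.b64encode(pwid.encode("utf-8")).decode("ascii")
# Request a 300px-wide PNG of the region (0,0,1366,1366) via the IIIF Image API:
print(f"/iiif/2/{token}/0,0,1366,1366/300,/0/default.png")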

Development & Deployment

For development, you can run it (in a suitable virtualenv) using:

$ pip install -r requirements.txt
$ uvicorn ukwa_api.main:app --reload

For staging/beta/production, it's designed to run under Docker, using uvicorn as the runtime engine.

ukwa-access-api's People

Contributors

anjackson, dependabot[bot], ldbiz


ukwa-access-api's Issues

Handle HTTP 451 properly

Currently, HTTP 451 gets reported as an internal server error because Flask-RESTx does not recognise 451 as a valid HTTP code (its validation is based on Python 3.7's status-code list, and HTTP 451 was only added in Python 3.8).

I've proposed an update to Flask-RESTx, here: python-restx/flask-restx#262

Mementos API should include and honour the exclude and block lists

There's a disconnect between this project and the PyWB API, as it is the PyWB API that implements the access limitations.

In this API, it would be good to know what the block list is, so users can easily find out whether a given URL is accessible. Similarly, it is problematic that URLs that should be excluded can show up via this API.

Expose search as GraphQL API

The SearchKit toolkit has been refactored to use a GraphQL API, which gives me an idea.

Rather than coding directly against the Solr API, we could use the same conventions as SearchKit and expose a compatible GraphQL API.

There's good support for that kind of thing, e.g. https://graphene-python.org, which can be integrated with Flask.

We could even use the solr-sqlalchemy adapter and do something like https://medium.com/swlh/python-flask-with-graphql-server-with-sqlalchemy-and-graphene-and-sqlite-ac9fcc9d3d83 but two APIs might be overkill!
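As a minimal sketch of the Graphene approach (the Memento type, the search field and its q argument are illustrative assumptions, not SearchKit's actual schema):

import graphene

class Memento(graphene.ObjectType):
    url = graphene.String()
    timestamp = graphene.String()

class Query(graphene.ObjectType):
    search = graphene.List(Memento, q=graphene.String(required=True))

    def resolve_search(root, info, q):
        # A real resolver would translate the GraphQL query into a Solr query:
        return [Memento(url="http://example.com/", timestamp="20200101000000")]

schema = graphene.Schema(query=Query)
result = schema.execute('{ search(q: "example") { url timestamp } }')
print(result.data)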

Bring Save and Nominate together

Change or extend the API to be the Nominations API as well as Save An URL.

i.e. record all submitted URLs, optionally as nominations, where nominations have fields:

  • *Title
  • *URL
  • *Name
  • Email
  • Additional Information

The back-end should store all these in a proper DB table for analysis, e.g. tracking out-of-scope submissions. The existing page could also be ported over to this API; the fields marked * are required.
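As a sketch of what that table might look like (the column names are assumptions based on the fields listed above):

import sqlite3

conn = sqlite3.connect("nominations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS nominations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        url TEXT NOT NULL,
        name TEXT NOT NULL,
        email TEXT,
        additional_info TEXT,
        outcome TEXT,            -- e.g. 'Out Of Scope', 'Crawled At <date>'
        submitted_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()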

Ideas:

  • Follow an established async request pattern - this should also prevent trivial resubmissions by redirecting to a nomination status summary.
  • Fill out RSS feed items for Nominations.
  • Email (detailed?) nominations to the archivist account.
  • Update the Nominations table with outcomes (Out Of Scope, Crawled At, etc.).
  • Use simple file locking to safely start up when using multiple workers and an SQLite DB.
  • Use dbmate to manage schema.
  • Consider adding a sigil or tags - one or more public tags that allow users to subscribe to a feed and get updates.
  • Consider an iOS Shortcut (is there an Android equivalent? Tasker?) that integrates with the Save/Nominate API.
  • Consider supporting a webhook field, posting outcome events when the URL is processed.
  • Consider bulk submission and response?
  • Consider extending the API to allow bulk lookup of URL status (for Google Sheets style plugin etc.)

Review and fill out embedded API documentation.

The FastAPI annotations are used to add basic API documentation. This should be reviewed to see if it makes sense, and in particular, any existing parameters that have little or no documentation should be updated to include brief descriptions.

As an example: is the CDX API documentation clear? Does the current setup make it sufficiently clear that only limited values are allowed for particular parameters?

from typing import Optional, Union

from fastapi import Query
from pydantic import AnyHttpUrl

from . import schemas  # the app's local schemas module

async def lookup_url(
    url: AnyHttpUrl = Query(
        ...,
        title="URL to find.",
        description="URL to look for (will canonicalize the URL before running the query).",
        example='http://portico.bl.uk/'
    ),
    matchType: Optional[schemas.LookupMatchType] = Query(
        schemas.LookupMatchType.exact,
        title='Type of match to look for.'
    ),
    sort: Optional[schemas.LookupSort] = Query(
        schemas.LookupSort.default,
        title='Order to return results.'
    ),
    limit: Union[int, None] = Query(
        None,
        title='Number of matching records to return.'
    ),
):
    # Only put through allowed parameters:
    params = {
        'url': url,
        'matchType': matchType.value,
        'sort': sort.value,
        'limit': limit,
    }

Create caching screenshot service for OA material

Add an endpoint that uses https://github.com/ukwa/webrender-api to render screenshots of OA items via pywb in proxy mode. It should cache them essentially permanently, and keep the URL lists so the transclusions can be retained for whitelisting.

This has been made much easier by using an off-the-shelf caching IIIF server (Cantaloupe) as an intermediary. This now works reasonably well, see: ukwa/ukwa-services#24 (comment)

Some work remains:

  • Remove banner by having separate pywb config for rendering.
  • Chase up T:73492 (all o' Twitter)
  • Review exposed API and tidy up.
  • Actually do thumbs/full/card sizes properly
  • Update pywb and add in IIIF API hooks for social cards.
  • Note switch to https://flask-restx.readthedocs.io/en/latest/
  • Extend Flask-RESTx to add logo etc as per this
  • Switch to general OG markup rather than proprietary Twitter tags, as per
  • Report pywb returning that 304. (???)
  • Implement test suite to cover new functionality.

Expose UKWA entities, using IIIF concepts as appropriate

This is a sketch of a proposal for how to improve our API to make it more useful and standard. The current API is visible at: https://beta.webarchive.org.uk/api/

We wish to make our content easy to reuse elsewhere, leveraging the IIIF standard.

As we're using the IIIF standard, we can build on standard libraries, and use iiif-prezi to support Collection API v.2 and add prezi-2-to-3 to expose the same data as v3 manifests.

Building on #1, we could make the API more RESTful by focusing on exposing these nouns:

  • Mementos: a resource that encapsulates a prior state of the Original Resource. (as per the Memento spec)
    • Use the IIIF Image API to provide access to rendered versions of individual pages.
    • /api/mementos?... used for CDX queries.
    • /api/mementos/resolve/{timestamp}/{url} used to provide the resolution service required by e.g. Document Harvester catalogue entries.
    • /api/mementos/warc/{timestamp}/{url} used to access the WARC record for a specific memento.
    • /api/mementos/screenshot/{timestamp}/{url} a small extension to the IIIF interface to help users construct the right PWID from URL+timestamp. Redirects the user to the appropriate IIIF endpoint.
  • IIIF Image API: IIIF Image Service
    • /api/iiif/2/{PWID}/... exposes the IIIF Image service that gives access to images, using PWIDs to identify what to show.
  • Websites: the descriptive side of our current crawl Targets, encapsulating a set of Mementos and metadata describing an archived web site.
    • Use the IIIF Collection API to describe the set of seed snapshot images, effectively treating each time-stamped Memento as a different Canvas (which is how IIIF talks about the images).
    • Hierarchy: Website (Collection) > URLs (Sub-collection) > Mementos (Canvas)
    • /api/websites?... list/query websites, returning metadata and links to IIIF Collection manifests.
    • /api/websites/{site_id}/manifest.json - IIIF description of each website, including a top collection that lists all the per-URL manifests.
    • /api/websites/{site_id}/{url_id}/manifest.json etc. The manifest per URL associated with each site, pointing to all Mementos for that URL.
  • Collections: a.k.a. Topics & Themes; a hierarchy of groups of Websites and Mementos.
    • Hierarchy: Collection Areas (Collection) > Topics & Themes (Collection) > [Sub-Sections (Collection), Websites (Collections), Mementos (Canvas)]
    • /api/collections?... list/query all collections (flat list with parent IDs, links to IIIF Collections).
    • /api/collections/iiif/... Expose Collections as IIIF Collections.
    • Lowest level is usually linking to manifests for URLs at specific times based on collection metadata, in contrast to Websites, which expose the whole timeline for each URL.
  • Crawls: Just the top-level noun for crawler statistics. Not IIIF
    • /api/crawls/fc/livestats expose the live statistics coming from the Frequent Crawl stream.
  • Nominations: API to accept nominations from the website or other services (webhook-compatible). Not IIIF.
    • /api/nominations list nominations (?)
    • /api/nominations POST API to nominate a URL.
    • /api/nominations/save the current idea of a quick save-this-now-if-in-scope function.
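For illustration, a minimal sketch of what a v2 IIIF Collection for a single website might look like (the IDs and labels are hypothetical):

# Hypothetical IDs and labels, structured per the IIIF Presentation API v2:
website_collection = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://www.webarchive.org.uk/api/websites/123/manifest.json",
    "@type": "sc:Collection",
    "label": "Example Website",
    "manifests": [
        {
            "@id": "https://www.webarchive.org.uk/api/websites/123/456/manifest.json",
            "@type": "sc:Manifest",
            "label": "http://example.com/ over time",
        }
    ],
}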

Handle HTTPS URLs properly when screenshotting.

This works:

/api/screenshots/?url=http%3A%2F%2Fportico.bl.uk%2F&type=thumbnail&source=archive

This doesn't (connect timeout):

/api/screenshots/?url=https%3A%2F%2Fportico.bl.uk%2F&type=thumbnail&source=archive

Scope queries via API

For external parties to know which URLs we can crawl, and hence what is worth posting to the save endpoint or what requires a new W3ACT record, we should allow the current permissible crawl scope to be queried.

Essentially, GET /in-scope?url=http://test.url returns true/false.
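A minimal FastAPI sketch of such an endpoint (the suffix-based scope rule is a placeholder; a real implementation would consult the actual crawl-scope data):

from fastapi import FastAPI, Query
from pydantic import AnyHttpUrl

app = FastAPI()

# Placeholder rule; real scope would come from e.g. W3ACT-derived seed lists:
IN_SCOPE_SUFFIXES = (".uk",)

@app.get("/in-scope")
async def in_scope(url: AnyHttpUrl = Query(...)) -> bool:
    return url.host is not None and url.host.endswith(IN_SCOPE_SUFFIXES)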

n.b. this is similar to: ukwa/ukwa-heritrix#37

Enhancements to the screenshot API

Building on #1 ...

  • Support crawl-time screenshots as well as screenshots of archives? Not sure how; we could extend the PWID to include +original, e.g. urn:pwid:webarchive.org.uk:2020-10-10T00:00:00Z:page+original:https%253A%252F%252Fportico.bl.uk%252F
  • Include Memento-Datetime header in response so you can tell when the shot is from?
  • Update docs/README
  • Should parse/check PWID at the first pass through the API
  • Should also redirect to closest timestamp to reduce variation
  • Overlay UKWA logo via Cantaloupe configuration.
  • Also expose as oEmbed API?
  • Use iiif-prezi to create a Manifest API that generates IIIF Presentation API manifests for a URL over time (with some date range and max. N/min. date diff params).
  • Optimisation: reduce delay by e.g. hooking preview generation into pywb, so it's cached before use in a social card.
  • Publicise the approach and API etc.

Extending the API to expose crawl events as an RSS/Atom feed

At the level of individual URLs, expose CDX information as an RSS feed of crawl events, allowing users to be notified if a particularly interesting page is changed, e.g.:

/api/mementos/rss?url=http://example.com/

where (by default) only changes, i.e. crawls with different hashes, are reported.
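The deduplication logic is simple: emit a feed item only when the content hash differs from the previous capture. A sketch, assuming each CDX row carries a digest field:

def change_events(cdx_rows):
    # Yield only captures whose content hash differs from the previous one;
    # these become the items in the RSS feed.
    previous_digest = None
    for row in cdx_rows:
        if row["digest"] != previous_digest:
            yield row
            previous_digest = row["digest"]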

Add all safe and useful parameters to the CDX query API

The current CDX API only exposes four parameters:

async def lookup_url(
    url: AnyHttpUrl = Query(
        ...,
        title="URL to find.",
        description="URL to look for (will canonicalize the URL before running the query).",
        example='http://portico.bl.uk/'
    ),
    matchType: Optional[schemas.LookupMatchType] = Query(
        schemas.LookupMatchType.exact,
        title='Type of match to look for.'
    ),
    sort: Optional[schemas.LookupSort] = Query(
        schemas.LookupSort.default,
        title='Order to return results.'
    ),
    limit: Union[int, None] = Query(
        None,
        title='Number of matching records to return.'
    ),
):
    # Only put through allowed parameters:
    params = {
        'url': url,
        'matchType': matchType.value,
        'sort': sort.value,
        'limit': limit,
    }

The full back-end API offers more parameters (and more detailed documentation; see also #42):

https://nla.github.io/outbackcdx/api.html#operation/query

Not all parameters are safe/simple to add, so I propose we add:

  • closest
  • output
  • from
  • to
  • collapse / collapseToFirst
  • collapseToLast

In all cases, the FastAPI/Pydantic code/annotations can perform basic validation, after which the parameters can be copied and passed to the back-end API query.

The filter parameter is useful but potentially unsafe, as arbitrary regexes can be used in OutbackCDX. I'm not planning to expose that parameter directly, but in the future we might consider adding a non-regex version so that e.g. status-code filtering can be done, e.g. filter=!statuscode:429.

The fl parameter is not all that useful, and needs a custom parser, so it won't be implemented unless there is demand.
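A sketch of how the extra parameters might be declared (names and defaults are assumptions; from needs an alias, as it is a Python keyword):

from typing import Optional
from fastapi import Query

async def lookup_url_extended(
    # ... the existing url/matchType/sort/limit parameters as above ...
    closest: Optional[str] = Query(None, title='Timestamp to sort results against.'),
    from_ts: Optional[str] = Query(None, alias='from', title='Earliest timestamp to include.'),
    to_ts: Optional[str] = Query(None, alias='to', title='Latest timestamp to include.'),
    collapse: Optional[str] = Query(None, title='Field to collapse results on.'),
):
    ...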

Save API failing to work because it strips the double-slash

Requests like

https://beta.webarchive.org.uk/save/https%3A%2F%2F%2Fwww.bl.uk%2Fbritishlibrary%2F~%2Fmedia%2Fbl%2Fglobal%2Fabout%2520us%2Ffreedom%2520of%2520information%2Fpublication%2520scheme%2F4%2520decisions%2Fboard%2520meetings%2Fboard%2520meetings%25202018%2F180227-368%2Fblb%25201803.pdf

get a response like

{
  "result": {
    "ia": {
      "event": "save-page-now",
      "reason": "OK",
      "status": 200
    },
    "ukwa": {
      "event": "save-page-now",
      "reason": "Crawl Requested",
      "status": 201
    }
  },
  "url": "https:/www.bl.uk/britishlibrary/~/media/bl/global/about%20us/freedom%20of%20information/publication%20scheme/4%20decisions/board%20meetings/board%20meetings%202018/180227-368/blb%201803.pdf"
}

i.e. https:/www., so something is stripping one of the slashes.
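One plausible mechanism (an assumption, not a confirmed diagnosis): path normalisation collapses repeated slashes, which corrupts a URL embedded in the request path:

import posixpath

# normpath() collapses '//' into '/', mangling the embedded scheme:
print(posixpath.normpath("save/https://www.bl.uk/foo.pdf"))
# -> 'save/https:/www.bl.uk/foo.pdf'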

Save API needs to know the crawl scope

I just attempted to use the API to quickly crawl a new .co.uk site, but of course the frequent crawl only has the scope of the seeds it's seen (otherwise it'd be a domain crawl).

So, when popping URLs in through the API, the current permissible scope should be known, and if a URL is in scope, it should be marked as a seed when enqueuing it so that the scope is widened out.

Add Reconciliation API support for Archived Resources

As an additional mode of integration, and covering much of the utility of the now-broken Google Sheets add-on, we could support OpenRefine Reconciliation (see the API spec).

Starting with Archived Resources a.k.a. Mementos, users could pass in URLs and an optional target time, and this would be matched against our CDX. Later, this could be extended to cover other web archives, or other entities like Host or Target records, e.g. which Target does this URL belong to?

n.b. Can be manually tested using https://reconciliation-api.github.io/testbench/
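As a first step, the service would need to describe itself to clients. A sketch of the service manifest with placeholder values, loosely following the Reconciliation API spec:

# Placeholder values, structured per the Reconciliation service manifest:
RECONCILIATION_MANIFEST = {
    "versions": ["0.2"],
    "name": "UK Web Archive Mementos",
    "identifierSpace": "https://www.webarchive.org.uk/wayback/archive/",
    "schemaSpace": "https://www.webarchive.org.uk/",
    "defaultTypes": [{"id": "memento", "name": "Archived Resource"}],
}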

Switch to FastAPI and move documentation

FastAPI is a modern and widely-used framework for Python APIs with autogenerated OpenAPI documentation. This component could easily be switched to it, and used to generate the basic API docs. The detailed documentation, including examples etc., should be moved to ukwa-documentation, where it can be edited and localised without messing with the live service.

Collections API, add filter-by-organisation

We've had a request to make the nascent Collections API able to filter results by organisation. It's not 100% clear at what level this applies (top-level collections, collections, targets).
