conal-tuohy / troveproxy

A transforming proxy and harvester for the National Library of Australia's Trove API

License: Apache License 2.0

XSLT 44.95% XProc 54.49% Dockerfile 0.39% Shell 0.16%

humanities proxy-server troveaustralia xproc xslt ardc-cdl harvester

troveproxy's Issues

`@next` URLs in multi-category search results are broken

At the moment I've written code in TroveProxy to fix these broken URLs, and it seems to me that the Trove API actually could incorporate this same fix, so I'd like to be able to move this fix "upstream" to Trove, so that other people who aren't using TroveProxy don't experience these broken links.

The XML response includes <records> elements with next attributes whose values are URLs which include a single category parameter whose value is a list of category names, separated by a (URL-encoded) comma. If I replace that category parameter with one whose value is taken from the code attribute of the parent (i.e. <category>) element of the <records> element, then the resulting URL does work.

e.g. take the following query URL:
https://api.trove.nla.gov.au/v3/result?category=book&category=newspaper&q=water%20dragon&s=*&n=1&bulkHarvest=true

The result looks like this:

<response>
  <query>water dragon</query>
  <category code="book" name="Books &amp; Libraries">
    <records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
      <!-- omitted for brevity -->
    </records>
  </category>
  <category code="newspaper" name="Newspapers &amp; Gazettes">
    <records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
      <!-- omitted for brevity -->
    </records>
  </category>
</response>

Those "next" URLs are broken, but if I change them like so, they do appear to work correctly:

<response>
  <query>water dragon</query>
  <category code="book" name="Books &amp; Libraries">
    <records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
      <!-- omitted for brevity -->
    </records>
  </category>
  <category code="newspaper" name="Newspapers &amp; Gazettes">
    <records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=newspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
      <!-- omitted for brevity -->
    </records>
  </category>
</response>
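The rewrite shown in the two responses above can also be sketched outside XSLT. The following Python is only an illustration, separate from TroveProxy's own stylesheet; the function name `fix_next_url` is hypothetical. It drops every `category` parameter from a `next` URL and appends a single one taken from the parent category's code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fix_next_url(next_url: str, category_code: str) -> str:
    """Return next_url with all category parameters replaced by one category."""
    parts = urlsplit(next_url)
    # Decode the query string into (name, value) pairs, preserving order.
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Ditch any category parameters, then add the current category back.
    refined = [(name, value) for name, value in params if name != "category"]
    refined.append(("category", category_code))
    return urlunsplit(parts._replace(query=urlencode(refined)))


broken = (
    "https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper"
    "&q=water+dragon&n=1&bulkHarvest=true&s=AoEpMTAwMDI0NDE5"
)
fixed = fix_next_url(broken, "newspaper")
```

Note that this sketch moves the `category` parameter to the end of the query string, so the result differs in parameter order from the hand-edited example above.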

The code I'm using to fix these broken URLs is here:

TroveProxy/src/xslt/fix-trove-response.xsl, line 4 in 0778f71:

  <xsl:variable name="category-code" select="ancestor::category/@code"/>

TroveProxy/src/xslt/fix-trove-response.xsl, lines 13 to 23 in 0778f71:

  <xsl:variable name="parameters" select="$query => tokenize('&amp;')"/>
  <!-- throw out any category parameters that don't match current category -->
  <xsl:variable name="refined-parameters" select="
    (
      $parameters[substring-before(., '=') != 'category'], (: ditch any categories :)
      concat('category=', $category-code) (: add the current category back :)
    )
  "/>
  <xsl:attribute name="next" select="
    concat($base-uri, '?', string-join($refined-parameters, '&amp;'))
  "/>

Filter `key` parameters from request URIs

Move API authentication keys from request URIs (i.e. the key parameter) into HTTP X-API-KEY headers, in order to:

make URIs publishable without leaking authentication credentials
allow URIs to be canonical
allow URIs to be reused with other keys without editing the URI
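A rough sketch of the idea in Python, not the proxy's actual implementation: the hypothetical helper `move_key_to_header` strips the `key` parameter from a request URI and returns it as an `X-API-KEY` header instead. The key value `SECRET` is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def move_key_to_header(request_uri: str) -> tuple[str, dict[str, str]]:
    """Return (URI without key parameter, headers carrying the key)."""
    parts = urlsplit(request_uri)
    params = parse_qsl(parts.query, keep_blank_values=True)
    keys = [value for name, value in params if name == "key"]
    remaining = [(name, value) for name, value in params if name != "key"]
    # Only emit the header when the URI actually carried a key.
    headers = {"X-API-KEY": keys[0]} if keys else {}
    clean_uri = urlunsplit(parts._replace(query=urlencode(remaining)))
    return clean_uri, headers


uri, headers = move_key_to_header(
    "https://api.trove.nla.gov.au/v3/result?category=book&key=SECRET&q=water+dragon"
)
```

The returned URI no longer contains a credential, so it can be published or reused with a different key supplied in the header.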

Generate Citation metadata in CFF format

Add support for "cff" as another value for the proxy-metadata-format parameter
Modify harvester to download the cff file

Facets are not returned if `facet` parameter contains a comma-separated list

When querying the Trove API, if the value of the facet URL parameter is a comma-separated list of facet names, then facets are NOT returned in the response, e.g.
https://api.trove.nla.gov.au/v3/result?category=newspaper&facet=format,decade

The section on facets in the online documentation (the "Controlling the metadata returned" section) says:

"You can separate multiple values with commas"
https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide#parameters-available-when-searching

This does work for other parameters, but not for facet.
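One possible client-side workaround is to rewrite a comma-separated `facet` value into repeated `facet` parameters before sending the request. The sketch below shows only the URL rewriting; that the API honours repeated `facet` parameters is an assumption here, not something this issue confirms, and the function name `split_facet_lists` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_facet_lists(request_uri: str) -> str:
    """Expand facet=a,b into facet=a&facet=b, leaving other parameters alone."""
    parts = urlsplit(request_uri)
    expanded = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "facet":
            # One facet parameter per non-empty name in the list.
            expanded.extend(("facet", f) for f in value.split(",") if f)
        else:
            expanded.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(expanded)))


rewritten = split_facet_lists(
    "https://api.trove.nla.gov.au/v3/result?category=newspaper&facet=format,decade"
)
```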

Trove API invalidly returns two access-control-allow-origin headers

If a request is made to the Trove API containing an "Origin" request header, then the response includes two access-control-allow-origin headers, both with the value *. If a request is made without an Origin header, then a single access-control-allow-origin header is returned. However, requests to the API from a JS client in a browser will always have an Origin header, and because multiple Access-Control-Allow-Origin headers are not allowed, these requests will fail, making it impossible to call the Trove API from such a client, except by going through a proxy which can remove one of the supernumerary headers.

This is a Trove server error.
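The proxy-side workaround mentioned above (removing the extra header) might look roughly like this in Python. It is a sketch that treats response headers as a list of (name, value) pairs; the function name `dedupe_allow_origin` is hypothetical and it is not how TroveProxy itself is written.

```python
ALLOW_ORIGIN = "access-control-allow-origin"


def dedupe_allow_origin(headers: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first Access-Control-Allow-Origin header; pass others through."""
    seen = False
    result = []
    for name, value in headers:
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == ALLOW_ORIGIN:
            if seen:
                continue
            seen = True
        result.append((name, value))
    return result


upstream = [
    ("Content-Type", "application/xml"),
    ("access-control-allow-origin", "*"),
    ("Access-Control-Allow-Origin", "*"),
]
cleaned = dedupe_allow_origin(upstream)
```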

what does the "art" in "l-artType" signify?

What does "art" stand for? Article? Artifact? Art?

Generate RO-Crate description of output

Accept additional query parameters containing metadata, and generate an RO-Crate manifest.

Investigate implementation limits on Trove API queries

Word on the street is that there is a limitation to roughly a few kb for query URIs, and that there may be a separate limit on the number of logical disjunctions in the query.

See if we can determine these values empirically.
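One way to find such a limit empirically is a binary search over the query size. The sketch below assumes the limit is monotonic (every size up to the limit is accepted, every larger size is rejected) and takes the actual request as a caller-supplied `probe` function, which is hypothetical: it would build a query of the given size, send it, and report whether the API accepted it.

```python
from typing import Callable


def largest_accepted(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest n in [low, high] for which probe(n) is True.

    Assumes probe(low) is True and that probe is True up to some limit
    and False beyond it.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

The same search could be run twice with different probes: once varying the URI length, and once varying the number of OR-ed terms in the query.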

Make `eac-cpf` inclusion optional?

The eac-cpf inclusion does slow down processing and increase the size of the response (for responses which include people, at least).

Do we want to make this transclusion functionality optional? That is, do we want people to be able to query for people records without having the eac-cpf data transcluded?

Certainly if you were just after a list of names, then the entire eac-cpf would be massive overkill.

If it were optional, what should be the default? We could require a proxy-include-eac-cpf=true parameter to make it happen, or require proxy-include-eac-cpf=false to make it not happen.

Create short "coming soon" blurb

Tom H:

@Conal-Tuohy if you want to give me a short blurb for your bit, then I'll add it to the "coming soon" section next time I queue up changes to the page
https://researchcloud.slack.com/archives/C05AQ9WSBJ8/p1695340535306169

Generate RO-Crate metadata describing a Trove query

Allow any proxy-metadata-* parameter and pass them through to the RO-Crate-rendering stylesheet, so the user can add a free text description, their name, orcid, licence, etc, and have that turn up in the RO-Crate metadata object.

One of these parameters will be proxy-metadata-format which will (at first) have just one acceptable value: ro-crate, and which will cause the pipeline to pass the query request itself (rather than the query results) to a stylesheet which will then transform it into an RO-Crate metadata object. Later we could add other transformations to generate other metadata formats such as plain text citations, Zenodo metadata objects, etc.

Potentially one or more of these metadata objects could also get wodged into the TEI corpus document, too, inside elements, controlled by a proxy-metadata-embed parameter.

Set up XProc pipeline to accept proxy-metadata-* parameters and return query metadata.
Write stylesheet to produce a fairly complete RO-Crate metadata object.
Update the QueryBuilderForm front ends to include a UI for these parameters. Maybe a trove-metadata web component?
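As a rough illustration of the target output (in Python rather than the XSLT the pipeline would use), the sketch below builds a minimal RO-Crate 1.1 metadata object from a query URI and its parameters. The suffixes `name`, `description` and `license`, the use of `url` to record the query, and the function name are all assumptions for illustration; a fairly complete crate would need more properties than this.

```python
PREFIX = "proxy-metadata-"
# Hypothetical parameter suffixes copied straight onto the root dataset.
COPIED = ("name", "description", "license")


def ro_crate_for_query(query_uri: str, params: dict[str, str]) -> dict:
    """Build a minimal RO-Crate metadata object describing a Trove query."""
    metadata = {
        name[len(PREFIX):]: value
        for name, value in params.items()
        if name.startswith(PREFIX)
    }
    metadata.pop("format", None)  # selects the output format; not descriptive
    dataset = {"@id": "./", "@type": "Dataset", "url": query_uri}
    for field in COPIED:
        if field in metadata:
            dataset[field] = metadata[field]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            dataset,
        ],
    }
```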

Refactor initial spike to fully separate proxy and data transformation concerns

The stylesheet converting Trove XML to TEI XML should be usable outside the proxy framework.
Trove bug workarounds should be a separate pipeline step
Proxy should handle requests for multiple formats and perform an appropriate transformation only if needed

extend RESTfulness with additional url attributes

The v3 API has taken the approach of assigning absolute URIs to resources and including those URIs as the values of url attributes in various places. Some places are missing them, though, and the interface would benefit from extending this approach consistently throughout.

e.g. the resource at https://api.trove.nla.gov.au/v3/newspaper/titles?state=qld is a list of Queensland newspaper titles. Each title is represented by a <newspaper> element, e.g.

  <newspaper id="1055">
    <title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
    <state>Queensland</state>
    <issn>22051449</issn>
    <troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
    <startDate>1948-01-01</startDate>
    <endDate>1954-12-31</endDate>
  </newspaper>

The newspaper element has an id attribute, but if it were consistent with other parts of the API, it would also include a url attribute with the value https://api.trove.nla.gov.au/v3/newspaper/title/1055 e.g.

  <newspaper id="1055" url="https://api.trove.nla.gov.au/v3/newspaper/title/1055">
    <title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
    <state>Queensland</state>
    <issn>22051449</issn>
    <troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
    <startDate>1948-01-01</startDate>
    <endDate>1954-12-31</endDate>
  </newspaper>
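Until the API does this itself, a client or proxy could add the attribute. The Python sketch below derives the `url` value from the `id` attribute using the URI pattern given above; the function name `add_url_attributes` is hypothetical.

```python
import xml.etree.ElementTree as ET

TITLE_BASE = "https://api.trove.nla.gov.au/v3/newspaper/title/"


def add_url_attributes(root: ET.Element) -> ET.Element:
    """Give every <newspaper> element with an id a matching url attribute."""
    for newspaper in root.iter("newspaper"):
        title_id = newspaper.get("id")
        # Leave any url attribute the API already supplied untouched.
        if title_id and newspaper.get("url") is None:
            newspaper.set("url", TITLE_BASE + title_id)
    return root


doc = ET.fromstring(
    '<response><newspaper id="1055">'
    "<title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>"
    "</newspaper></response>"
)
add_url_attributes(doc)
```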

Create Atom crosswalk

Create TEI crosswalk

Convert <b> tags in <snippet> into actual elements

This is the same general issue as with the newspaper articleText elements, which contain escaped angle brackets where they are used for markup, but don't escape other angle brackets, or ampersands.
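A minimal sketch of the conversion in Python, assuming that once the response is parsed, the text content of a snippet contains literal `<b>` and `</b>` markers mixed with ordinary text. Only well-formed `<b>...</b>` pairs become elements; any other angle brackets or ampersands stay as text and are escaped on serialisation. The function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

BOLD = re.compile(r"<b>(.*?)</b>", re.DOTALL)


def snippet_to_element(text: str) -> ET.Element:
    """Turn 'a <b>hit</b> here' into a <snippet> element with real <b> children."""
    snippet = ET.Element("snippet")
    last = None  # most recently created <b> element, if any
    pos = 0
    for match in BOLD.finditer(text):
        plain = text[pos:match.start()]
        # Text before the first <b> is the snippet's text; later runs are tails.
        if last is None:
            snippet.text = plain
        else:
            last.tail = plain
        last = ET.SubElement(snippet, "b")
        last.text = match.group(1)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        snippet.text = rest
    else:
        last.tail = rest
    return snippet


element = snippet_to_element("the <b>water</b> dragon & 1 < 2")
serialised = ET.tostring(element, encoding="unicode")
```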

Fix handling of upstream (Trove) errors

Currently the Harvester app will not abort a harvest when Trove throttles the request (due to a missing 'key' parameter); instead it will blithely harvest the <c:error/> error documents returned by the TroveProxy.

TroveProxy should report upstream errors nicely.
Harvester should abort if an HTTP request fails, and report the error.
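The check the Harvester needs could be sketched as follows in Python. It assumes the `c` prefix is bound to the XProc step namespace (`http://www.w3.org/ns/xproc-step`), which is where XProc defines `c:error`; the `HarvestError` class and `check_response` function are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for c:error (the XProc step vocabulary).
C_ERROR = "{http://www.w3.org/ns/xproc-step}error"


class HarvestError(Exception):
    """Raised to abort a harvest when a request fails."""


def check_response(status: int, body: str) -> ET.Element:
    """Return the parsed document, or raise HarvestError on any failure."""
    if status >= 400:
        raise HarvestError(f"HTTP request failed with status {status}")
    root = ET.fromstring(body)
    # An error document can arrive with a success status, so inspect the root.
    if root.tag == C_ERROR:
        detail = "".join(root.itertext()).strip()
        raise HarvestError(f"upstream error document: {detail}")
    return root
```

A harvest loop would call `check_response` on each page before saving it, and stop at the first `HarvestError` rather than writing the error document to disk.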

How should a collection of Trove records be represented in TEI?

A collection of texts could be represented as a teiCorpus (conceptually, a collection of texts) or as a TEI/text/group (conceptually, a text consisting of a compilation of texts)

Varieties of Composite Text
[...]
In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
[...]
The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Default Text Structure.

From the Trove API people results, take the people@id value and substitute it for **[people@id]** in this SRU request:
http://www.nla.gov.au/apps/srw/search/peopleaustralia?query=oai.identifier+%253D+**[people@id]**&version=1.1&operation=searchRetrieve&recordSchema=urn%3Aisbn%3A1-931666-33-4&maximumRecords=10&startRecord=1&resultSetTTL=300&recordPacking=xml&recordXPath=&sortKeys=

We can then parse the output.

Trove's 'otherLimits' parameter is undocumented

Need to request documentation.

What does it do? Can it be useful?
