conal-tuohy / troveproxy
A transforming proxy and harvester for the National Library of Australia's Trove API
License: Apache License 2.0
At the moment I've written code in TroveProxy to fix these broken URLs, and it seems to me that the Trove API could incorporate the same fix. I'd like to move this fix "upstream" to Trove, so that people who aren't using TroveProxy don't experience these broken links.
The XML response includes <records> elements with next attributes whose values are URLs which include a single category parameter whose value is a list of category names, separated by a (URL-encoded) comma. If I replace that category parameter with one whose value is taken from the code attribute of the parent (i.e. <category>) element of the <records> element, then the resulting URL does work.
e.g. take the following query URL:
https://api.trove.nla.gov.au/v3/result?category=book&category=newspaper&q=water%20dragon&s=*&n=1&bulkHarvest=true
The result looks like this:
<response>
<query>water dragon</query>
<category code="book" name="Books & Libraries">
<records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&q=water+dragon&n=1&bulkHarvest=true&s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
<!-- omitted for brevity -->
</records>
</category>
<category code="newspaper" name="Newspapers & Gazettes">
<records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&q=water+dragon&n=1&bulkHarvest=true&s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
<!-- omitted for brevity -->
</records>
</category>
</response>
Those "next" URLs are broken, but if I change them like so, they do appear to work correctly:
<response>
<query>water dragon</query>
<category code="book" name="Books & Libraries">
<records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book&q=water+dragon&n=1&bulkHarvest=true&s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
<!-- omitted for brevity -->
</records>
</category>
<category code="newspaper" name="Newspapers & Gazettes">
<records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=newspaper&q=water+dragon&n=1&bulkHarvest=true&s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
<!-- omitted for brevity -->
</records>
</category>
</response>
The code I'm using to fix these broken URLs is here:
TroveProxy/src/xslt/fix-trove-response.xsl
Lines 13 to 23 in 0778f71
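The referenced XSLT is not reproduced here. As a rough illustration of the same rewrite (not the author's stylesheet), a Python sketch might look like the following. The function name fix_next_urls and the trimmed SAMPLE document are invented for this example, and the sample escapes its ampersands so that it parses as XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Trimmed, hypothetical sample based on the response shown above.
SAMPLE = """<response>
  <category code="book" name="Books &amp; Libraries">
    <records s="*" n="1" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEqc3UxMDAwMTE4Ng%3D%3D"/>
  </category>
</response>"""


def fix_next_urls(response_xml: str) -> str:
    """Make each records/@next name only its parent category's code."""
    root = ET.fromstring(response_xml)
    for category in root.iter("category"):
        code = category.get("code")
        for records in category.findall("records"):
            next_url = records.get("next")
            if not code or not next_url:
                continue
            parts = urlsplit(next_url)
            # Drop every existing category parameter, then add the parent's code.
            params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                      if k != "category"]
            params.insert(0, ("category", code))
            records.set("next", urlunsplit(parts._replace(query=urlencode(params))))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fix_next_urls(SAMPLE))
```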
Move API authentication keys from request URIs (i.e. the key parameter) into HTTP X-API-KEY headers:
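A minimal sketch of that move, in Python rather than the proxy's own pipeline language: the helper name move_key_to_header and the key value in the usage note are hypothetical, and the choice to let an existing X-API-KEY header win over the URI parameter is an assumption, not something the issue specifies.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def move_key_to_header(url: str,
                       headers: Optional[Dict[str, str]] = None) -> Tuple[str, Dict[str, str]]:
    """Strip the key parameter from the URI and carry it as an X-API-KEY header."""
    new_headers = dict(headers or {})
    parts = urlsplit(url)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "key":
            # Assumption: a header that is already present takes precedence.
            new_headers.setdefault("X-API-KEY", value)
        else:
            kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept))), new_headers
```

For example, calling it on https://api.trove.nla.gov.au/v3/result?key=abc123&category=book would give back the URI without the key parameter, plus a headers dictionary containing X-API-KEY: abc123.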
Add support for "cff" as another value for the proxy-metadata-format parameter
Modify harvester to download the cff file
When querying the Trove API, if the value of the facet URL parameter is a comma-separated list of facet names, then facets are NOT returned in the response, e.g.
https://api.trove.nla.gov.au/v3/result?category=newspaper&facet=format,decade
The section on facets in the online documentation (the Controlling the metadata returned section) says
You can separate multiple values with commas
https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide#parameters-available-when-searching
This does work for other parameters, but not for facet.
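One possible client-side or proxy-side workaround is to split a comma-separated facet value into repeated facet parameters before the request is sent. The sketch below shows only the URL rewrite; whether the API honours repeated facet parameters is an assumption that the report above does not confirm, and the function name expand_facet_lists is invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def expand_facet_lists(url: str) -> str:
    """Turn facet=a,b into facet=a&facet=b, leaving other parameters untouched."""
    parts = urlsplit(url)
    params = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "facet":
            # One facet parameter per name; empty names are skipped.
            params.extend(("facet", facet) for facet in value.split(",") if facet)
        else:
            params.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(params)))
```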
If a request is made to the Trove API containing an "Origin" request header, then the response includes two access-control-allow-origin headers, both with the value *. If a request is made without an Origin header, then a single access-control-allow-origin header is returned. However, requests to the API from a JS client in a browser will always have an Origin header, and because multiple Access-Control-Allow-Origin headers are not allowed, these requests will fail, making it impossible to call the Trove API from such a client, except by going through a proxy which can remove one of the supernumerary headers.
This is a Trove server error.
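The proxy-side cleanup amounts to keeping only the first occurrence of the duplicated header. A minimal sketch, assuming response headers are available as a list of (name, value) pairs; the function name and the single_valued parameter are hypothetical:

```python
from typing import Iterable, List, Tuple


def collapse_duplicate_headers(
    headers: Iterable[Tuple[str, str]],
    single_valued: Tuple[str, ...] = ("access-control-allow-origin",),
) -> List[Tuple[str, str]]:
    """Keep only the first occurrence of headers that must not repeat."""
    seen = set()
    result = []
    for name, value in headers:
        key = name.lower()  # header names are case-insensitive
        if key in single_valued:
            if key in seen:
                continue  # drop the supernumerary copy
            seen.add(key)
        result.append((name, value))
    return result
```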
What does "art" stand for? Article? Artifact? Art?
Accept additional query parameters containing metadata, and generate an RO-Crate manifest.
Word on the street is that query URIs are limited to roughly a few kB, and that there may be a separate limit on the number of logical disjunctions in a query.
See if we can determine these values empirically.
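One way to find such a limit empirically is a binary search over the size being tested. The sketch below assumes the limit is monotonic (everything up to some size is accepted, everything above is rejected) and takes a caller-supplied accepts function, which is hypothetical: in practice it would build a query of the given size, send it, and report whether the API accepted it.

```python
from typing import Callable


def largest_accepted(accepts: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest n in [low, high] for which accepts(n) is true.

    Assumes accepts is monotonic and that accepts(low) holds.
    """
    if not accepts(low):
        raise ValueError("even the smallest size was rejected")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if accepts(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

The same routine could be run twice: once with n as the URI length in bytes, and once with n as the number of OR clauses in the query.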
The eac-cpf inclusion does slow down processing and increase the size of the response (for responses which include people, at least).
Do we want to make this transclusion functionality optional? i.e. do we want people to be able to query for people records and not have the eac-cpf data transcluded?
Certainly if you were just after a list of names, then the entire eac-cpf would be massive overkill.
If it were optional, what should be the default? We could require a proxy-include-eac-cpf=true parameter to make it happen, or require proxy-include-eac-cpf=false to make it not happen.
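Either choice reduces to the same parameter check with a different default. A small sketch, where the function name is hypothetical and the default is left to the caller because the question above is still open:

```python
from typing import Mapping


def include_eac_cpf(params: Mapping[str, str], default: bool) -> bool:
    """Decide whether to transclude eac-cpf data for this request."""
    raw = params.get("proxy-include-eac-cpf")
    if raw is None:
        return default  # parameter absent: fall back to the chosen default
    # Assumption: only the literal value "true" (case-insensitive) enables it.
    return raw.strip().lower() == "true"
```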
Tom H:
@Conal-Tuohy if you want to give me a short blurb for your bit, then I'll add it to the "coming soon" section next time I queue up changes to the page
https://researchcloud.slack.com/archives/C05AQ9WSBJ8/p1695340535306169
Allow any proxy-metadata-* parameter and pass them through to the RO-Crate-rendering stylesheet, so the user can add a free text description, their name, orcid, licence, etc, and have that turn up in the RO-Crate metadata object.
One of these parameters will be proxy-metadata-format which will (at first) have just one acceptable value: ro-crate, and which will cause the pipeline to pass the query request itself (rather than the query results) to a stylesheet which will then transform it into an RO-Crate metadata object. Later we could add other transformations to generate other metadata formats such as plain text citations, Zenodo metadata objects, etc.
Potentially one or more of these metadata objects could also get wodged into the TEI corpus document, too, inside elements, controlled by a proxy-metadata-embed parameter.
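To make the idea concrete, here is a rough Python sketch of turning a query request plus its proxy-metadata-* parameters into an RO-Crate metadata object. It is not the planned stylesheet. The function name is hypothetical, and copying each parameter verbatim onto the root dataset under its suffix (e.g. proxy-metadata-description becomes description) is an assumption; a real mapping to RO-Crate properties would need to be designed.

```python
from typing import Any, Dict, Mapping

PREFIX = "proxy-metadata-"
# Parameters that steer the pipeline rather than describe the dataset.
CONTROL = {"proxy-metadata-format", "proxy-metadata-embed"}


def ro_crate_from_query(query_url: str, params: Mapping[str, str]) -> Dict[str, Any]:
    """Describe a query (not its results) as an RO-Crate metadata object."""
    root: Dict[str, Any] = {"@id": "./", "@type": "Dataset", "url": query_url}
    for name, value in params.items():
        if name.startswith(PREFIX) and name not in CONTROL:
            root[name[len(PREFIX):]] = value  # assumed one-to-one mapping
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            root,
        ],
    }
```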
proxy-metadata-* parameters and return query metadata.
trove-metadata web component?
The v3 API has taken the approach of assigning absolute URIs to resources, and including those URIs as the values of url attributes in various places, but some places are missing them, and the API could benefit from extending this approach consistently throughout the interface.
e.g. the resource at https://api.trove.nla.gov.au/v3/newspaper/titles?state=qld is a list of Queensland newspaper titles. Each title is represented by a <newspaper> element, e.g.
<newspaper id="1055">
<title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
<state>Queensland</state>
<issn>22051449</issn>
<troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
<startDate>1948-01-01</startDate>
<endDate>1954-12-31</endDate>
</newspaper>
The newspaper element has an id attribute, but if it were consistent with other parts of the API, it would also include a url attribute with the value https://api.trove.nla.gov.au/v3/newspaper/title/1055 e.g.
<newspaper id="1055" url="https://api.trove.nla.gov.au/v3/newspaper/title/1055">
<title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
<state>Queensland</state>
<issn>22051449</issn>
<troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
<startDate>1948-01-01</startDate>
<endDate>1954-12-31</endDate>
</newspaper>
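Until the API does this itself, a client or proxy could add the attribute. A minimal Python sketch, assuming the URL pattern given above holds for every title id; the function name add_title_urls is invented for this example.

```python
import xml.etree.ElementTree as ET

TITLE_BASE = "https://api.trove.nla.gov.au/v3/newspaper/title/"


def add_title_urls(xml_text: str) -> str:
    """Give every <newspaper id="..."> element a url attribute derived from its id."""
    root = ET.fromstring(xml_text)
    for newspaper in root.iter("newspaper"):
        title_id = newspaper.get("id")
        # Leave elements alone if they lack an id or already carry a url.
        if title_id and newspaper.get("url") is None:
            newspaper.set("url", TITLE_BASE + title_id)
    return ET.tostring(root, encoding="unicode")
```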
This is the same general issue as with the newspaper articleText elements, which contain escaped angle brackets where they are used for markup, but don't escape other angle brackets, or ampersands.
Currently the Harvester app will not abort a harvest when Trove throttles the request (due to a missing 'key' parameter); instead it will blithely harvest the <c:error/> error documents returned by the TroveProxy.
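The missing check could be as simple as inspecting the root element of each harvested document before saving it. The sketch below is in Python rather than the Harvester's own language, and it assumes the c prefix is bound to the XProc step namespace, which the issue does not state; the exception and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: c: is the conventional XProc step namespace.
XPROC_STEP_NS = "http://www.w3.org/ns/xproc-step"


class HarvestAborted(Exception):
    """Raised when the proxy returns an error document instead of results."""


def check_not_error(xml_text: str) -> ET.Element:
    """Return the parsed document, or abort if its root element is c:error."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{XPROC_STEP_NS}}}error":
        raise HarvestAborted("proxy returned an error document; stopping harvest")
    return root
```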
A collection of texts could be represented as a teiCorpus (conceptually, a collection of texts) or as a TEI/text/group (conceptually, a text consisting of a compilation of texts).
Varieties of Composite Text
[...]
In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
[...]
The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Default Text Structure.
The TroveProxy service may be behind another proxy with a different host component to the request URIs.
URIs published by the TroveProxy should reflect the hostname of the front-end proxy.
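One common way to do this is to have the front-end proxy pass its public host and scheme in X-Forwarded-Host and X-Forwarded-Proto headers, and to rewrite published URIs accordingly. That the front-end proxy sets these headers is an assumption, and the function name and hostnames below are hypothetical.

```python
from typing import Mapping
from urllib.parse import urlsplit, urlunsplit


def public_uri(request_uri: str, headers: Mapping[str, str]) -> str:
    """Rewrite a back-end request URI to use the front-end proxy's host and scheme."""
    lowered = {name.lower(): value for name, value in headers.items()}
    parts = urlsplit(request_uri)
    host = lowered.get("x-forwarded-host")
    proto = lowered.get("x-forwarded-proto")
    if host:
        # With chained proxies the header may hold a list; the first entry is the outermost.
        parts = parts._replace(netloc=host.split(",")[0].strip())
    if proto:
        parts = parts._replace(scheme=proto.split(",")[0].strip())
    return urlunsplit(parts)
```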
Reported to Trove as a bug, user-added characters are not included in the API output
From the Trove API people results, take the people@id value and substitute it into
http://www.nla.gov.au/apps/srw/search/peopleaustralia?query=oai.identifier+%253D+**[people@id]**&version=1.1&operation=searchRetrieve&recordSchema=urn%3Aisbn%3A1-931666-33-4&maximumRecords=10&startRecord=1&resultSetTTL=300&recordPacking=xml&recordXPath=&sortKeys=
Then we can parse the output.
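The substitution step can be sketched as a simple template fill. The template below is the URL quoted above with the placeholder swapped for a format field; the function name and the id used in the usage note are hypothetical, and parsing the response is left out because its structure is not described here.

```python
from urllib.parse import quote

SRU_TEMPLATE = (
    "http://www.nla.gov.au/apps/srw/search/peopleaustralia"
    "?query=oai.identifier+%253D+{people_id}"
    "&version=1.1&operation=searchRetrieve"
    "&recordSchema=urn%3Aisbn%3A1-931666-33-4"
    "&maximumRecords=10&startRecord=1&resultSetTTL=300"
    "&recordPacking=xml&recordXPath=&sortKeys="
)


def people_australia_url(people_id: str) -> str:
    """Build the lookup URL for one people@id value."""
    # Percent-encode the id so it cannot break out of the query parameter.
    return SRU_TEMPLATE.format(people_id=quote(people_id, safe=""))
```

For example, people_australia_url("1234") yields the URL above with 1234 in place of the placeholder.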
Need to request documentation.
What does it do? Can it be useful?