
troveproxy's People

Contributors

conal-tuohy, mraadgev


troveproxy's Issues

`@next` URLs in multi-category search results are broken

At the moment I've written code in TroveProxy to fix these broken URLs. It seems to me that the Trove API itself could incorporate the same fix, so I'd like to move it "upstream" to Trove, so that people who aren't using TroveProxy don't run into these broken links.

The XML response includes <records> elements with next attributes whose values are URLs which include a single category parameter whose value is a list of category names, separated by a (URL-encoded) comma. If I replace that category parameter with one whose value is taken from the code attribute of the parent (i.e. <category>) element of the <records> element, then the resulting URL does work.

e.g. take the following query URL:
https://api.trove.nla.gov.au/v3/result?category=book&category=newspaper&q=water%20dragon&s=*&n=1&bulkHarvest=true

The result looks like this:

<response>
  <query>water dragon</query>
  <category code="book" name="Books &amp; Libraries">
    <records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
      <!-- omitted for brevity -->
    </records>
  </category>
  <category code="newspaper" name="Newspapers &amp; Gazettes">
    <records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=book%2Cnewspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
      <!-- omitted for brevity -->
    </records>
  </category>
</response>

Those "next" URLs are broken, but if I change them like so, they do appear to work correctly:

<response>
  <query>water dragon</query>
  <category code="book" name="Books &amp; Libraries">
    <records s="*" n="1" total="3700" next="https://api.trove.nla.gov.au/v3/result?category=book&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEqc3UxMDAwMTE4Ng%3D%3D" nextStart="AoEqc3UxMDAwMTE4Ng==">
      <!-- omitted for brevity -->
    </records>
  </category>
  <category code="newspaper" name="Newspapers &amp; Gazettes">
    <records s="*" n="1" total="98345" next="https://api.trove.nla.gov.au/v3/result?category=newspaper&amp;q=water+dragon&amp;n=1&amp;bulkHarvest=true&amp;s=AoEpMTAwMDI0NDE5" nextStart="AoEpMTAwMDI0NDE5">
      <!-- omitted for brevity -->
    </records>
  </category>
</response>

The code I'm using to fix these broken URLs is here:

<xsl:variable name="category-code" select="ancestor::category/@code"/>

<xsl:variable name="parameters" select="$query => tokenize('&amp;')"/>
<!-- throw out any category parameters that don't match current category -->
<xsl:variable name="refined-parameters" select="
(
$parameters[substring-before(., '=') != 'category'], (: ditch any categories :)
concat('category=', $category-code) (: add the current category back :)
)
"/>
<xsl:attribute name="next" select="
concat($base-uri, '?', string-join($refined-parameters, '&amp;'))
"/>
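
For readers who don't work in XSLT, the same rewrite can be sketched in Python. This is only an illustration of the idea described above, not the TroveProxy code; the function name `fix_next_url` is made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fix_next_url(next_url: str, category_code: str) -> str:
    """Replace every `category` parameter in next_url with one category code.

    Hypothetical helper: mirrors the XSLT above by discarding all existing
    `category` parameters and appending the code of the enclosing category.
    """
    parts = urlsplit(next_url)
    # Keep every parameter except `category`, preserving the original order.
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "category"
    ]
    # Add the current category back as the only `category` parameter.
    params.append(("category", category_code))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

As in the XSLT, the single `category` parameter ends up at the end of the query string rather than in its original position.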

Filter `key` parameters from request URIs

Move API authentication keys from request URIs (i.e. the key parameter) into HTTP X-API-KEY headers:

  • make URIs publishable without leaking authentication credentials
  • allow URIs to be canonical
  • allow URIs to be reused with other keys without editing the URI
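
A rough sketch of that move, in Python for illustration only. The function name `move_key_to_header` is hypothetical, and if a URI carries more than one `key` parameter the sketch arbitrarily keeps the last one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def move_key_to_header(uri: str) -> tuple[str, dict[str, str]]:
    """Strip `key` parameters from a URI and return them as an X-API-KEY header.

    Hypothetical helper: returns the key-free URI plus the headers to send
    with the upstream request.
    """
    parts = urlsplit(uri)
    params = parse_qsl(parts.query, keep_blank_values=True)
    keys = [value for name, value in params if name == "key"]
    remaining = [(name, value) for name, value in params if name != "key"]
    # Assumption: if several keys are present, the last one wins.
    headers = {"X-API-KEY": keys[-1]} if keys else {}
    clean_uri = urlunsplit(parts._replace(query=urlencode(remaining)))
    return clean_uri, headers
```

The returned URI can be published or cached as-is, and a different key can be supplied just by changing the header.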

Facets are not returned if `facet` parameter contains a comma-separated list

When querying the Trove API, if the value of the facet URL parameter is a comma-separated list of facet names, then facets are NOT returned in the response, e.g.
https://api.trove.nla.gov.au/v3/result?category=newspaper&facet=format,decade

The section on facets in the online documentation (the "Controlling the metadata returned" section) says:

"You can separate multiple values with commas"
https://trove.nla.gov.au/about/create-something/using-api/v3/api-technical-guide#parameters-available-when-searching

This does work for other parameters, but not for facet.
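
One possible workaround in a proxy would be to expand the comma-separated value into repeated `facet` parameters before forwarding the request. The Python sketch below shows that rewrite only. It assumes the API honours repeated `facet` parameters, which this issue does not confirm, and the function name is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_facet_parameter(uri: str) -> str:
    """Expand `facet=a,b` into `facet=a&facet=b`, leaving other parameters alone.

    Assumption (unverified): the upstream API accepts repeated `facet`
    parameters as an alternative to a comma-separated list.
    """
    parts = urlsplit(uri)
    params = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "facet":
            # One parameter per facet name; skip empty names such as "a,,b".
            params.extend(("facet", facet) for facet in value.split(",") if facet)
        else:
            params.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(params)))
```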

Trove API invalidly returns two access-control-allow-origin headers

If a request is made to the Trove API containing an "Origin" request header, then the response includes two access-control-allow-origin headers, both with the value *. If a request is made without an Origin header, then a single access-control-allow-origin header is returned. However, requests to the API from a JS client in a browser will always have an Origin header, and because multiple Access-Control-Allow-Origin headers are not allowed, these requests will fail, making it impossible to call the Trove API from such a client, except by going through a proxy which can remove one of the supernumerary headers.

This is a Trove server error.
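
The proxy-side workaround amounts to dropping all but the first such header. A minimal Python sketch follows, assuming the response headers are available as a list of (name, value) pairs; the function name is hypothetical.

```python
def dedupe_allow_origin(headers: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the first Access-Control-Allow-Origin header.

    Header names are compared case-insensitively; every other header is
    passed through unchanged and in its original order.
    """
    seen = False
    result = []
    for name, value in headers:
        if name.lower() == "access-control-allow-origin":
            if seen:
                continue  # drop the duplicate header
            seen = True
        result.append((name, value))
    return result
```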

Investigate implementation limits on Trove API queries

Word on the street is that query URIs are limited to roughly a few kB in length, and that there may be a separate limit on the number of logical disjunctions in a query.

See if we can determine these values empirically.
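
One way to find such a limit empirically is a binary search over the size being tested (URI length, or number of disjunctions). The Python sketch below shows only the search. The `accepts` callback is a placeholder for whatever probe actually sends a request of a given size and reports whether Trove accepted it, and the search assumes the behaviour is monotonic: everything up to the limit succeeds and everything beyond it fails.

```python
from typing import Callable


def largest_accepted(accepts: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest n in [low, high] for which accepts(n) is true.

    Assumes accepts() is true up to some limit and false beyond it.
    Returns low - 1 if no value in the range is accepted.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if accepts(mid):
            best = mid        # mid works; try something larger
            low = mid + 1
        else:
            high = mid - 1    # mid fails; try something smaller
    return best
```

This needs about log2(high - low) probes, which keeps the number of requests made against the API small.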

Make `eac-cpf` inclusion optional?

The eac-cpf inclusion does slow down processing and increase the size of the response (for responses which include people, at least).

Do we want to make this transclusion functionality optional? That is, do we want people to be able to query for people records and not have the eac-cpf data transcluded?

Certainly if you were just after a list of names, then the entire eac-cpf would be massive overkill.

If it were optional, what should be the default? We could require a proxy-include-eac-cpf=true parameter to make it happen, or require proxy-include-eac-cpf=false to make it not happen.
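
Either way, the proxy's decision reduces to reading one parameter with a default. A small Python sketch, with the default left as an argument since that question is still open; the function name is made up.

```python
def include_eac_cpf(params: dict[str, str], default: bool = True) -> bool:
    """Decide whether to transclude eac-cpf data for this request.

    `params` holds the request's query parameters. The default applies when
    proxy-include-eac-cpf is absent; which default is right is undecided.
    """
    value = params.get("proxy-include-eac-cpf")
    if value is None:
        return default
    return value.strip().lower() == "true"
```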

Generate RO-Crate metadata describing a Trove query

Allow any proxy-metadata-* parameter and pass them through to the RO-Crate-rendering stylesheet, so the user can add a free text description, their name, orcid, licence, etc, and have that turn up in the RO-Crate metadata object.

One of these parameters will be proxy-metadata-format which will (at first) have just one acceptable value: ro-crate, and which will cause the pipeline to pass the query request itself (rather than the query results) to a stylesheet which will then transform it into an RO-Crate metadata object. Later we could add other transformations to generate other metadata formats such as plain text citations, Zenodo metadata objects, etc.

Potentially one or more of these metadata objects could also get wodged into the TEI corpus document, too, inside elements, controlled by a proxy-metadata-embed parameter.

  • Set up XProc pipeline to accept proxy-metadata-* parameters and return query metadata.
  • Write stylesheet to produce a fairly complete RO-Crate metadata object.
  • Update the QueryBuilderForm front ends to include a UI for these parameters. Maybe a trove-metadata web component?
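
The first step, separating the proxy-metadata-* parameters from the ones that belong to the Trove query, might look something like the Python sketch below. It is an illustration only: the real pipeline is XProc, the function name is made up, and `proxy-metadata-name` in the usage note is a hypothetical parameter.

```python
from urllib.parse import parse_qsl

PREFIX = "proxy-metadata-"


def split_metadata_parameters(query: str) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """Split a query string into proxy metadata and upstream parameters.

    Returns (metadata, upstream): metadata maps each proxy-metadata-* name,
    with the prefix removed, to its value; upstream keeps every other
    parameter in its original order.
    """
    metadata: dict[str, str] = {}
    upstream: list[tuple[str, str]] = []
    for name, value in parse_qsl(query, keep_blank_values=True):
        if name.startswith(PREFIX):
            metadata[name[len(PREFIX):]] = value
        else:
            upstream.append((name, value))
    return metadata, upstream
```

For a query string such as `category=book&proxy-metadata-format=ro-crate&proxy-metadata-name=Jane+Doe`, the metadata dictionary would carry `format` and `name`, ready to hand to whatever renders the RO-Crate object.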

Extend RESTfulness with additional `url` attributes

The v3 API has taken the approach of assigning absolute URIs to resources, and including those URIs as the values of url attributes in various places. Some places are missing them, though, and the interface could benefit from extending this approach consistently throughout.

e.g. the resource at https://api.trove.nla.gov.au/v3/newspaper/titles?state=qld is a list of Queensland newspaper titles. Each title is represented by a <newspaper> element, e.g.

  <newspaper id="1055">
    <title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
    <state>Queensland</state>
    <issn>22051449</issn>
    <troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
    <startDate>1948-01-01</startDate>
    <endDate>1954-12-31</endDate>
  </newspaper>

The newspaper element has an id attribute, but if it were consistent with other parts of the API, it would also include a url attribute with the value https://api.trove.nla.gov.au/v3/newspaper/title/1055 e.g.

  <newspaper id="1055" url="https://api.trove.nla.gov.au/v3/newspaper/title/1055">
    <title>Brisbane Telegraph (Qld. : 1948 - 1954)</title>
    <state>Queensland</state>
    <issn>22051449</issn>
    <troveUrl>https://nla.gov.au/nla.news-title1055</troveUrl>
    <startDate>1948-01-01</startDate>
    <endDate>1954-12-31</endDate>
  </newspaper>
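
A proxy could add the missing attribute itself. Here is a rough Python sketch using the standard library's ElementTree. It assumes the response elements are not in an XML namespace and that the title URI always follows the pattern shown above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed URI pattern, taken from the example above.
TITLE_BASE = "https://api.trove.nla.gov.au/v3/newspaper/title/"


def add_url_attributes(root: ET.Element) -> ET.Element:
    """Add a url attribute to every <newspaper> element that has an id.

    Elements that already carry a url attribute are left untouched.
    """
    for newspaper in root.iter("newspaper"):
        identifier = newspaper.get("id")
        if identifier and newspaper.get("url") is None:
            newspaper.set("url", TITLE_BASE + identifier)
    return root
```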

Convert <b> tags in <snippet> into actual elements

This is the same general issue as with the newspaper articleText elements, which contain escaped angle brackets where they are used for markup, but don't escape other angle brackets, or ampersands.
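
To make the conversion concrete, here is a Python sketch that turns a snippet's text into mixed content, treating only `<b>…</b>` pairs as markup and everything else (stray angle brackets, ampersands) as literal text. It assumes the tags are never nested or given attributes; the function name is made up.

```python
import re
import xml.etree.ElementTree as ET

# Only well-formed <b>...</b> pairs count as markup in this sketch.
BOLD = re.compile(r"<b>(.*?)</b>", re.DOTALL)


def snippet_to_element(text: str) -> ET.Element:
    """Build a <snippet> element whose <b> runs are real child elements."""
    snippet = ET.Element("snippet")
    last = None  # most recent <b> child; text after it goes in its tail
    pos = 0
    for match in BOLD.finditer(text):
        before = text[pos:match.start()]
        if last is None:
            snippet.text = (snippet.text or "") + before
        else:
            last.tail = (last.tail or "") + before
        last = ET.SubElement(snippet, "b")
        last.text = match.group(1)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        snippet.text = (snippet.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return snippet
```

Because the remaining text is stored as ordinary character data, the serializer escapes any leftover ampersands and angle brackets on output.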

Fix handling of upstream (Trove) errors

Currently the Harvester app will not abort a harvest when Trove throttles the request (due to a missing 'key' parameter); instead it will blithely harvest the <c:error/> error documents returned by the TroveProxy.

  • TroveProxy should report upstream errors nicely.
  • Harvester should abort if an HTTP request fails, and report the error.
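
The second point could be sketched as follows. This is Python for illustration only, not the Harvester's actual code: `fetch` stands in for whatever function performs the HTTP request and returns a status code and body, and `HarvestError` is a made-up exception type.

```python
from typing import Callable, Iterable


class HarvestError(Exception):
    """Raised when a harvest is aborted because a request failed."""


def harvest(urls: Iterable[str], fetch: Callable[[str], tuple[int, str]]) -> list[str]:
    """Fetch each URL in turn, aborting on the first HTTP error.

    `fetch` is a placeholder that returns (status_code, body). Any status
    of 400 or above stops the harvest and reports which request failed.
    """
    pages = []
    for url in urls:
        status, body = fetch(url)
        if status >= 400:
            raise HarvestError(f"HTTP {status} for {url}; aborting harvest")
        pages.append(body)
    return pages
```

This only helps if TroveProxy passes the upstream failure through as an HTTP error status, which is what the first point asks for.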

How should a collection of Trove records be represented in TEI?

A collection of texts could be represented as a teiCorpus (conceptually, a collection of texts) or as a TEI/text/group (conceptually, a text consisting of a compilation of texts). The TEI Guidelines discuss both options:

Varieties of Composite Text
[...]
In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
[...]
The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Default Text Structure.
