
dancer-plugin-catmandu-oai's Introduction

NAME

Dancer::Plugin::Catmandu::OAI - OAI-PMH provider backed by a searchable Catmandu::Store

SYNOPSIS

#!/usr/bin/env perl

use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;

Catmandu->load;
Catmandu->config;

my $options = {};

oai_provider '/oai', %$options;

dance;

DESCRIPTION

Dancer::Plugin::Catmandu::OAI is a Dancer plugin to provide OAI-PMH services for Catmandu::Store-s that support CQL (such as Catmandu::Store::ElasticSearch). Follow the installation steps below to set up your own OAI-PMH server.

REQUIREMENTS

In the examples below an ElasticSearch 1.7.2 server (https://www.elastic.co/downloads/past-releases/elasticsearch-1-7-2) will be used.

Follow the instructions below for a demonstration installation:

$ cpanm Dancer Catmandu::OAI Catmandu::Store::ElasticSearch

$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.2.zip
$ unzip elasticsearch-1.7.2.zip
$ cd elasticsearch-1.7.2
$ bin/elasticsearch
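
Once the Elasticsearch server is running you can check from Perl that it responds. A minimal sketch using the core HTTP::Tiny module (assuming the default port 9200):

use strict;
use warnings;
use HTTP::Tiny;

# Elasticsearch answers with a small JSON status document on its root URL
my $res = HTTP::Tiny->new->get('http://localhost:9200/');
die "Elasticsearch is not reachable (HTTP status $res->{status})\n" unless $res->{success};
print $res->{content};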

RECORDS

Records stored in the Catmandu::Store can be in any format. Preferably the format should be easy to convert into the mandatory OAI-DC format. At a minimum, each record must contain an identifier '_id' and a field containing a datestamp.

$ cat sample.yml
---
_id: oai:my.server.org:123456
datestamp: 2016-05-17T13:37:18Z
creator:
 - Musterman, Max
 - Jansen, Jan
 - Svenson, Sven
title:
 - Test record
...

CATMANDU CONFIGURATION

ElasticSearch requires a configuration file to map record fields to CQL terms. Below is a minimal configuration required to query for identifiers and datestamps in the ElasticSearch collection:

$ cat catmandu.yml
---
store:
  oai:
    package: ElasticSearch
    options:
      index_name: oai
      bags:
        data:
          cql_mapping:
            default_index: basic
            indexes:
              _id:
                op:
                  'any': true
                  'all': true
                  '=': true
                  'exact': true
                field: '_id'
              datestamp:
                op:
                  '=': true
                  '<': true
                  '<=': true
                  '>=': true
                  '>': true
                  'exact': true
                field: 'datestamp'
      index_mappings:
        publication:
          properties:
            datestamp: {type: date, format: date_time_no_millis}
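
With this catmandu.yml in place the CQL mapping can be tested directly from Perl. A minimal sketch (the query string is only an example):

use strict;
use warnings;
use Catmandu;

Catmandu->load;   # reads catmandu.yml from the current directory

my $bag  = Catmandu->store('oai')->bag('data');
# search the bag using a CQL query on the 'datestamp' index defined above
my $hits = $bag->search(cql_query => 'datestamp >= "2016-01-01T00:00:00Z"');
printf "found %d record(s)\n", $hits->total;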

IMPORT RECORDS

With the Catmandu configuration files in place records can be imported with the catmandu command:

# Drop the existing ElasticSearch 'oai' collection
$ catmandu drop oai

# Import the sample record
$ catmandu import YAML to oai < sample.yml

# Test if the records are available in the 'oai' collection
$ catmandu export oai
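
The same import can also be done from Perl with the Catmandu API. A small sketch, roughly equivalent to the commands above (note that delete_all empties only the 'data' bag, while 'catmandu drop oai' drops the whole store):

use strict;
use warnings;
use Catmandu;

Catmandu->load;

my $bag = Catmandu->store('oai')->bag('data');
$bag->delete_all;                                                  # start from an empty bag
$bag->add_many(Catmandu->importer('YAML', file => 'sample.yml'));  # import the sample record
$bag->commit;                                                      # make the records searchable
Catmandu->export($bag);                                            # dump the bag, like 'catmandu export oai'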

DANCER CONFIGURATION

The Dancer configuration file 'config.yml' contains basic information for the OAI-PMH plugin to work:

* store - In which Catmandu::Store are the metadata records stored
* bag - In which Catmandu::Bag of this 'store' are the records kept (use: 'data' as default)
* datestamp_field - Which field in the record contains a datestamp ('datestamp' in our example above)
* datestamp_index - Which CQL index should be used to find records within a specified date range (if not specified, the value from the 'datestamp_field' setting is used)
* repositoryName - The name of the repository
* uri_base - The full base URL of the OAI controller. To be used when behind a proxy server. When not set, this module relies on the Dancer request to provide its full URL. Use middleware like 'ReverseProxy' or 'Dancer::Middleware::Rebase' in that case.
* adminEmail - An administrative email address. Can be a string or an array of strings. This will be included in the Identify response.
* compression - A compression encoding supported by the repository. Can be a string or an array of strings. This will be included in the Identify response.
* description - An XML container that describes your repository. Can be a string or an array of strings. This will be included in the Identify response. Note that this module will try to validate the XML data.
* earliestDatestamp - The earliest datestamp available in the dataset as YYYY-MM-DDTHH:MM:SSZ. This will be determined dynamically if no static value is given.
* deletedRecord - The policy for deleted records. See also: https://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords
* repositoryIdentifier - A prefix to use in OAI-PMH identifiers
* cql_filter -  A CQL query to find all records in the database that should be made available to OAI-PMH
* default_search_params - set default arguments that get passed to every call to the bag's search method
* search_strategy - default is 'paginate', set to 'es.scroll' to avoid deep paging (Elasticsearch only)
* limit - The maximum number of records to be returned in each OAI-PMH request
* delimiter - The delimiter used when prefixing a record identifier with the repositoryIdentifier (use: ':' as default)
* sampleIdentifier - A sample identifier
* metadata_formats - An array of metadataFormats that are supported
    * metadataPrefix - A short string for the name of the format
    * schema - The URL of the XSD schema for this format
    * metadataNamespace - An XML namespace for this format
    * template - The path to a Template Toolkit file to transform your records into this format
    * fix - Optionally an array of one or more Catmandu::Fix-es or Fix files
* sets - An optional array of OAI-PMH sets and the CQL queries used to retrieve the records of each set from the Catmandu::Store
    * setSpec - A short string for the name of the set
    * setName - A longer description of the set
    * setDescription - An optional and repeatable container that may hold community-specific XML-encoded data about the set. Should be a string or an array of strings.
    * cql - The CQL query to find the records of this set in the Catmandu::Store
* xsl_stylesheet - Optional path to an XSL stylesheet
* template_options - An optional hash of configuration options that will be passed to Catmandu::Exporter::Template or Template.

Below is a sample minimal configuration for the 'sample.yml' demo above:

$ cat config.yml
charset: "UTF-8"
plugins:
  'Catmandu::OAI':
    store: oai
    bag: data
    datestamp_field: datestamp
    repositoryName: "My OAI DataProvider"
    uri_base: "http://oai.service.com/oai"
    adminEmail: [email protected]
    earliestDatestamp: "1970-01-01T00:00:01Z"
    cql_filter: "datestamp>1970-01-01T00:00:01Z"
    deletedRecord: persistent
    repositoryIdentifier: oai.service.com
    limit: 200
    delimiter: ":"
    sampleIdentifier: "oai:oai.service.com:1585315"
    metadata_formats:
      -
        metadataPrefix: oai_dc
        schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
        template: oai_dc.tt

METADATAPREFIX TEMPLATE

For each metadataPrefix a Template Toolkit file needs to exist that translates Catmandu::Store records into XML records. At least one Template Toolkit file should be made available to transform stored records into Dublin Core. The example below transforms 'sample.yml'-type records into Dublin Core:

$ cat oai_dc.tt
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
[%- FOREACH var IN ['title' 'creator' 'subject' 'description' 'publisher' 'contributor' 'date' 'type' 'format' 'identifier' 'source' 'language' 'relation' 'coverage' 'rights'] %]
    [%- FOREACH val IN $var %]
    <dc:[% var %]>[% val | html %]</dc:[% var %]>
    [%- END %]
[%- END %]
</oai_dc:dc>
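
A template can be previewed outside the OAI provider by pushing a record through Catmandu::Exporter::Template. A sketch (an absolute template path is used because relative paths may not resolve):

use strict;
use warnings;
use Catmandu;
use Cwd qw(getcwd);

my $exporter = Catmandu->exporter('Template', template => getcwd . '/oai_dc.tt');
# a record shaped like sample.yml above
$exporter->add({
    _id     => 'oai:my.server.org:123456',
    title   => ['Test record'],
    creator => ['Musterman, Max', 'Jansen, Jan', 'Svenson, Sven'],
});
$exporter->commit;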

START DANCER

If all the required files are available, then a Dancer application can be started. See the 'demo' directory of this distribution for a complete example:

$ ls
app.pl  catmandu.yml  config.yml  oai_dc.tt
$ cat app.pl
#!/usr/bin/env perl

use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;

Catmandu->load;
Catmandu->config;

my $options = {};

oai_provider '/oai', %$options;

dance;

# Start Dancer
$ perl ./app.pl

# Test queries:

$ curl "http://localhost:3000/oai?verb=Identify"
$ curl "http://localhost:3000/oai?verb=ListSets"
$ curl "http://localhost:3000/oai?verb=ListMetadataFormats"
$ curl "http://localhost:3000/oai?verb=ListIdentifiers&metadataPrefix=oai_dc"
$ curl "http://localhost:3000/oai?verb=ListRecords&metadataPrefix=oai_dc"

SEE ALSO

Dancer, Catmandu, Catmandu::Store

AUTHOR

Nicolas Steenlant, <nicolas.steenlant at ugent.be>

CONTRIBUTORS

Nicolas Franck, <nicolas.franck at ugent.be>

Vitali Peil, <vitali.peil at uni-bielefeld.de>

Patrick Hochstenbach, <patric.hochstenbach at ugent.be>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

dancer-plugin-catmandu-oai's Issues

current version not indexed on cpan

The current version 0.0303 is not indexed correctly on CPAN:

$ cpanm Dancer::Plugin::Catmandu::OAI
Dancer::Plugin::Catmandu::OAI is up to date. (0.0302)

You have changed permissions on CPAN; maybe something went wrong.

Error in documentation

In the synopsis:

use Dancer::Plugin::Catmandu::SRU;

should be:

use Dancer::Plugin::Catmandu::OAI;

Introduce option to add an OAI stylesheet

Introduce a configuration option to add e.g.

<?xml-stylesheet type='text/xsl' href='/oai.xsl' ?>

OR

Add this line by default, but make it robust in case oai.xsl is not present.

Specifying a fix for metadata format results in runtime error

Specifying a fix in the config:

...
    metadata_formats:
      -
        metadataPrefix: oai_dc
        ....
        fix:
          - publication_to_dc()

results in the following error (Perl v5.22.1) when fetching a record (ListRecords, GetRecord):
"Useless use of single ref constructor in void context" (Catmandu::Fix::_build_fixer).

The cause of the error seems to be that $format->{fix} is set to a Catmandu::Fix object (in Dancer/Plugin/Catmandu/OAI.pm, line 111):

$format->{fix} = Catmandu::Fix->new(fixes => $fix);

It is then passed as a fix parameter to Catmandu::Exporter::Template, in line 419:

my $exporter = Catmandu::Exporter::Template->new(
  template => $format->{template},
  file => \$metadata,
  fix => $format->{fix},
);

Commenting out line 111 in Dancer/Plugin/Catmandu/OAI.pm avoids the error and the fix is executed correctly.

Problem with datestamps and earliestDatestamp

According to the OAI specification:
"A repository must update the datestamp of a record if a change occurs, the result of which would be a change to the metadata part of the XML-encoding of the record. Such changes include, but are not limited to, changes to the metadata of the record, changes to the metadata format of the record, introduction of a new metadata format, termination of support for a metadata format, etc."

So in practice the earliestDatestamp will often end up being the time when one of the formats was last changed.

So let's say we change one of the formats for an OAI service of an existing database. As things are set up now in the OAI plugin, this would require re-indexing all the records in Elasticsearch, since the OAI datestamp field should be updated to correspond to the format change, and the re-indexing needs to be in place before the OAI service can serve the records in the modified format; otherwise selective harvesting (using from and until) will not work correctly. But there is a dilemma, since we can't reliably set the date of the format change until we make the switch in the OAI service. So we would either have to take the OAI service offline during the indexing, or we risk breaking selective harvesting for harvesters accessing the service during this time.

Solution

We can avoid re-indexing for format changes if we instead just update the datestamp in the response, on the fly for each record when needed. So if the datestamp is earlier than the earliestDatestamp, it is set to the earliestDatestamp in the OAI output. The query for selective harvesting also needs to be adjusted so it still works correctly. The from condition needs to be removed from the CQL query if the from date is equal to or earlier than the earliestDatestamp. If the until date is earlier than the earliestDatestamp, no results should be returned.
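
A rough sketch of that adjustment in Perl (the sub names are only illustrative, they are not part of the plugin):

# datestamp to expose in the OAI output for a single record
sub effective_datestamp {
    my ($record_datestamp, $earliest_datestamp) = @_;
    # ISO 8601 timestamps in the same format compare correctly as strings
    return $record_datestamp lt $earliest_datestamp ? $earliest_datestamp : $record_datestamp;
}

# adjust the harvesting range before it is turned into a CQL query;
# returns an empty list when the request can never match anything
sub adjust_range {
    my ($from, $until, $earliest_datestamp) = @_;
    return () if defined $until && $until lt $earliest_datestamp;
    undef $from if defined $from && $from le $earliest_datestamp;
    return ($from, $until);
}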

This should be easy to implement and also wouldn't break how it works now, assuming the record datestamps and earliestDatestamp are managed correctly.

Does this sound reasonable?

cannot find an option to make it respect the key_prefix

Hi everybody,
When I use the option "key_prefix: my_" when importing records, they get the following layout:

{
  "_index": "oai",
  "_type": "data",
  "_id": "14",
  "_score": 1,
  "_source": {
    "my_id": "14",
    ...

Unfortunately the records published via the dancer plugin have no id in their identifier (for the example above it should be "oai:gs.dmethbib.ethz.ch:14"):

<header>
  <identifier>oai:gs.dmethbib.ethz.ch:</identifier>
  <datestamp>2017-05-09T13:29:10.035Z</datestamp>
  <setSpec>gs</setSpec>
</header>

I think Dancer::Plugin::Catmandu::OAI is still looking for "_id" within "_source", but I cannot figure out how to configure it to use either the prefixed id "my_id" or the real "_id" that sits above "_source".
The config.yml seems to have no option for that.

Best,
Sven

More elements in the XML response for Identify

The following are allowed but can't be specified in the plugin:

  • multiple entries for adminEmail (only one can be specified currently)
  • compression (0 or more)
  • description (0 or more). This one is a bit trickier because there is no fixed structure. Maybe just specify another template that is included?

deep paging problem

Solr and ElasticSearch suffer from a deep paging problem:

cf. https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

This means: the deeper you get into the results, the slower the response.
The culprit is the sorting.

A better way is:

  • sort by _id
  • add range filter: _id:{LAST_ID TO *]
    only applied AFTER the first batch
  • start param remains at 0

The filter makes sure that previous records are not included in the hits
( "{" means "exclusive", "]" means "inclusive" ), so Solr never needs to
sort more than "limit" records.

This way the walltime does not increase rapidly, but remains stable.
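
Outside the plugin, and bypassing the CQL layer, the idea looks roughly like this against Elasticsearch directly (illustrative sketch only; the index and type names follow the demo configuration above):

use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new;
my $last_id;    # _id of the last hit from the previous batch, undef on the first batch

my %body = (
    size => 100,
    sort => [ { _id => 'asc' } ],
);
# after the first batch: exclude everything up to and including $last_id,
# playing the role of the _id:{LAST_ID TO *] range filter described above
$body{query} = { range => { _id => { gt => $last_id } } } if defined $last_id;

my $results = $es->search(index => 'oai', type => 'data', body => \%body);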

How can this filter be added while the OAI plugin remains ignorant of the bag implementation?
