pelagios / peripleo2 Goto Github PK

View Code? Open in Web Editor NEW

21.0 10.0 1.0 15.41 MB

The Pelagios Exploration Engine

License: Other

Scala 45.25% HTML 11.19% JavaScript 33.10% Groovy 0.04% Less 10.42%

linkeddata elasticsearch space time

peripleo2's People

Contributors

Stargazers

Watchers

Forkers

mdkmisc

peripleo2's Issues

First draft "result" data model

A first draft for the part of the data model that covers search result records (which can be "items" - such as objects, books, etc. - or places)

Search: place facet + georesolution

Along the same lines as in Peripleo 1, geodata is not directly connected to items. Instead, we'll facet along places, and resolve the facet counts.

ES querying: filter by image

Filter based on whether the item has any depictions associated with it.

Reliable parent/child ref for place references

Currently, place references are only created in the index if the annotation is to the URI that also happens to be the root_uri of the place in the index. If the URI is among the alternative place URIs, the reference is not created.

(P.S.: we really need this - the sooner the better - to cross-check the new Peripleo with the old version, e.g. using the "tetradrachm" search example.)

ES querying: filter by space

Bounding box or center + radius

Basic search API endpoint

Basic infrastructure for gazetteer upload

Hierarchical facets

Specifically with regard to data sources: the "old" Peripleo only allows filtering/display of the first level of hierarchy. At the moment, this specifically affects the nomisma Partner Objects and University of Graz datasets. In the future, it would also affect the Pelagios 3 Early Geospatial Documents dataset and OpenContext, since all are divided into sub-collections.

Note that this issue covers enabling hierarchical facets only, on an infrastructure/API level. It does not cover any UI features.

Ground overlay support

Make sure we can handle ground plan overlays in the data model. Should work for:

(JPG) image files
Pre-tiled image sets

In terms of the model, use the same approach as we did for the iDig integration, i.e. the index holds a metadata record for the overlay, plus a pointer to a file (base) path from which the map UI can fetch the overlay.

Explicitely represent gazetteers as Datasets

So that a gazetteer can show up as a specific search result, and so that people are able to search for "Pleiades", "DARE", etc. as a project.

Dataset import from VoID dump links

Search: filter by type

We'll need a whole bunch of filter options, but implement this one first so that we can easily pull gazetteer stats (type=PLACE) out of ElasticSearch.

Basic Auth Setup for Admin Area

Optimization: request caching

The requestCache parameter in Elastic4s doesn't really seem to have much effect. (Perhaps if we start trying with more data.) But we should think about caching at some point, in particular for the initial match-all query. It's potentially the heaviest query on the system, and every user will trigger it when opening the page - so well worth caching, even if it's the only caching we do.

We can use the standard play cache for this (and flush after every ingest action).

Streamlined JSON serialization in search results

Can we streamline the JSON serialization of top places in the search results vs. the internal representation in ElasticSearch? E.g. the following properties could be omitted for a (c)leaner result:

root_uri
item_type
Perhaps is_in_dataset

Also, we may want to limit to a single representative geometry (since that's all that's needed for mapping), unless specifically requested otherwise (e.g. through a verbose=true arg).

Progress reporting across dataset import process

As opposed to the current progress tracking framework - which tracks progress across a single process (import of a single dumpfile) - we need a framework that can track "compound import processes", i.e. the combination of multiple file downloads and imports (partially running in parallel, which complicates matters a bit more).

/gazetteers API path

To fetch information about gazetteers in the system + currently running imports. Use this to populate UI in issue #12.

Or should we treat that as a normal /dataset instead?

ElasticSearch Setup/Upgrade

Set up a basic app backed by ElasticSearch. The goal is to share the code for API & data ingest between Recogito 2 and Peripleo 2. However, since the time Recogito started, ElasticSearch has made two major releases.

Set up a basic installation based on the newest version of ElasticSearch (5.1.x), and create a fork of Recogito's API/ingest code that's compatible. Once we have done that, migrate the changes back to Recogito and update to the latest ES version there. (Data migration seems well-supported, although a bit tricky, using ElasticSearch 2.4.x as an intemediate step.)

Distinguish "modern" vs. "historical" names?

How should we do this in terms of data model? Should we do it at all?

OSMNames Gazetteer

Add GeoJSON conversion (or a direct crosswalk?) for the OSMNames gazetteer

Corresponds to pelagios/recogito2#341

Search: item type facet

Faceting by PLACE | OBJECT | PERSON | DATASET | PERIOD

Import dataset from VoID URL

Was available in Peripleo 1 already, so we can (clean up and) reuse from there.

iDig crosswalk

Make sure we have everything in place (conceptually & data-model-wise) so we can deal with iDIG dumps in the future.

ES querying: filter by place

Text context snippets

Due to the way data fulltext content is organized in Peripleo (through annotations), this should actually be simpler than would normally be the case with standard search highlighting. We'll simply need to pull the context field from the inner_hits in the item query.

ElasticSearch Migration Strategy

Figure out how we can go from ES 1.7.x to ES 5.1.x without losing data. Currently, the only type that seems to cause problems is annotation_history. In the worst case, we can erase that type in the production system (annotations themselves won't be lost). However, it seems possible to:

pull out existing versions into a JSON file
delete the mapping and replace it with an updated one that fulfills ES 2.4.x criteria
upload the annotation versions from the JSON

Initial test datasets

OpenContext
nomisma partner object
University of Graz
Sample literature from Recogito 2

Data model: reconcile is_in_dataset and source_gazetteer fields

So they are automatically treated equally in issue #22.

Admin login page design

Stick to the layout of Recogito, but make it visually distinct. For background, we can use one of the LINHD maps.

Derive basic "object taxonomy" for faceting

Nothing fancy, just a (two-level?) hierarchy that makes sense for the current data in Peripleo 1, so we have a basis for further discussion.

Data validation

This is more of planning issue (that will later be split into a range of implementation tickets). Anyways: we'll need a (public) validation endpoint where people can test their datasets for validity/compatibility with Peripleo. The following options need to be verified:

Note: a non-exhaustive list that will grow as we go along!

Syntax correctness

The aim is not to rebuild an RDF/(Geo)JSON validator per se. But wrap the errors thrown by the parser in some form of human-friendly UI output.

Explicitely check for things that don't break the parser. E.g. I encountered URIs without protocol prefix ("http://"). These were interpreted as relative URIs (technically fine), but - in that particular case - definitely not the way things were intended by the publisher. Issue warnings in such cases.

Required & recommended properties

Issue errors when the following properties are missing or incorrect:

title (applies to all - VoID dataset, place, objects)
publisher (dataset)
license (dataset)
homepage (dataset, object)
temporal (all) - invalid formats, timespans where start date is after end date
...

Issue warnings when the following properties are missing or incorrect:

description (applies to all)
temporal (all) - dates in the future? (Might still be intentional, when dealing with literature?)
...

Provide recommendations with regard to:

Having created and last_modified timestamps
...

Provide "did you know" hints:

Timestamps will show your items in time-filtered searches
You can add image links (with caption and separate license/creator info etc.)
...

Peripleo specifics

Issue an info when an entity (place, person, etc.) is referenced by URI that is not indexed in Peripleo. E.g. if an annotation references a place not in Peripleo, the reference will be discarded during import. (As the purpose of importing place references is to map them, it wouldn't make sense to keep it.)

ES querying: filter by date range

ElasticSearch querying: response settings

Implement a clean way to select what should appear in the search results:

time histogram
term aggregations
top places

Basic Admin Area Page Design

To be re-used in Recogito later (perhaps with different signature color).

ES querying: filter by dataset

Search: language facet

Search: time period facet

Since our items have date ranges (not just a single date), generating time histograms is not trivial. SOLR offers a special data type for this (DateRangeField), but ES doesn't. A viable approach is suggested here:

https://discuss.elastic.co/t/indexing-date-ranges/36524

https://gist.github.com/jpountz/cebb8452648c36099cee

Basic infrastructure for object dataset upload

Search: dataset hierarchy facet

Deleting datasets

Possible optimization of time histogram

The time histogram is currently computed using a scripted aggregation:

f = doc['temporal_bounds.from']
t = doc['temporal_bounds.to']
buckets = []
if (!(f.empty || t.empty))
   for (i=f.date.year; i<t.date.year; i+= $interval) { buckets.add(i) }
buckets;

The trick of generating a series of timestamps across the item's temporal bounds, at the interval of the histogram, could be done at indexing time, as well. This might make the query-time aggregation a faster.

We'll need to see if the effect is significant (especially for the initial match-all query), or whether it is negligible in practice. At least it's worth a try.

"Multi-level" gazetteers

How do we handle multiple levels of granularity in gazetteers?

Refactoring: upload button activation

That's now implemented as a client-side tag. Change to JS.

Search query 'schema'

We need to model the various options that are available for configuring the search query.

term filters (exclude/only)
- item_types
- categories
- datasets
- (is_part_of?)
- (license?)
- languages
- periods (the way this is handled may change later, as we might treat periods as 1st class entities)
range filters (from/to)
- temporal bounds
geo
- bounding box
- center + distance
aggregation options
- time histogram on/off
- term aggregations on/off (do we want more fine-grained control?)
- top places
- top people
parent/child filtering options
- referencing relations