peripleo2's Issues

First draft "result" data model

A first draft for the part of the data model that covers search result records, which can be "items" - such as objects, books, etc. - or places.
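
For orientation, a minimal sketch (Scala) of what such a record might look like - field names are placeholders, not the final model:

sealed trait ItemType
case object PLACE  extends ItemType
case object OBJECT extends ItemType

case class ResultItem(
  identifier:     String,             // canonical URI of the item or place
  itemType:       ItemType,
  title:          String,
  isInDataset:    Option[String],     // URI of the containing dataset, if any
  temporalBounds: Option[(Int, Int)]) // from/to year, if known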

Search: place facet + georesolution

Along the same lines as in Peripleo 1, geodata is not directly connected to items. Instead, we'll facet along places, and resolve the facet counts.
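
To illustrate: faceting along places boils down to a terms aggregation over the referenced place URIs, whose bucket keys are then resolved against the place index. A sketch with the raw ES query body wrapped in Scala (the field name place_uris is an assumption; in the app this would be built via Elastic4s):

// Counts per referenced place URI; bucket keys are resolved against
// the place index afterwards to obtain names and geometry
val placeFacetQuery = """
{
  "size": 0,
  "aggs": {
    "by_place": { "terms": { "field": "place_uris", "size": 20 } }
  }
}
"""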

Reliable parent/child ref for place references

Currently, place references are only created in the index if the annotation points to the URI that also happens to be the root_uri of the place in the index. If the URI is among the alternative place URIs instead, the reference is not created.

(P.S.: we really need this - the sooner the better - to cross-check the new Peripleo with the old version, e.g. using the "tetradrachm" search example.)
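
A sketch of the intended resolution step (hypothetical names, illustration only):

case class IndexedPlace(rootUri: String, alternativeUris: Seq[String])

// A reference matches if the annotated URI is either the root_uri or one of
// the alternative URIs; the reference is always normalized to the root_uri
def resolveToRootUri(annotatedUri: String, places: Seq[IndexedPlace]): Option[String] =
  places.collectFirst {
    case p if p.rootUri == annotatedUri || p.alternativeUris.contains(annotatedUri) =>
      p.rootUri
  }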

Hierarchical facets

Specifically with regard to data sources: the "old" Peripleo only allows filtering/display of the first level of hierarchy. At the moment, this specifically affects the nomisma Partner Objects and University of Graz datasets. In the future, it would also affect the Pelagios 3 Early Geospatial Documents dataset and OpenContext, since all are divided into sub-collections.

Note that this issue covers enabling hierarchical facets only, on an infrastructure/API level. It does not cover any UI features.
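
One possible encoding - an assumption, not a decision - is to index the dataset hierarchy as a materialized path, so that each level can be faceted on separately:

// e.g. DatasetPath(Seq("nomisma", "Partner Objects")), encoded with a
// separator character between levels
case class DatasetPath(segments: Seq[String]) {
  def encoded: String = segments.mkString("\u0000")
  def parent: Option[DatasetPath] =
    if (segments.size > 1) Some(DatasetPath(segments.dropRight(1))) else None
}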

Ground overlay support

Make sure we can handle ground plan overlays in the data model. Should work for:

  • (JPG) image files
  • Pre-tiled image sets

In terms of the model, use the same approach as we did for the iDig integration, i.e. the index holds a metadata record for the overlay, plus a pointer to a file (base) path from which the map UI can fetch the overlay.
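
A sketch of what such a metadata record could look like (placeholder names):

sealed trait OverlayFormat
case object SingleImage extends OverlayFormat // plain (JPG) image file
case object TiledImage  extends OverlayFormat // pre-tiled image set

case class GroundOverlay(
  title:    String,
  format:   OverlayFormat,
  basePath: String, // file (base) path the map UI fetches the overlay from
  bounds:   (Double, Double, Double, Double)) // minLon, minLat, maxLon, maxLat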

Search: filter by type

We'll need a whole bunch of filter options, but implement this one first so that we can easily pull gazetteer stats (type=PLACE) out of ElasticSearch.
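
For illustration, the raw ES query body for the PLACE case, wrapped in Scala (in the app this would go through Elastic4s):

// Restrict results to places via a term filter on item_type
val placesOnly = """
{
  "query": {
    "bool": {
      "filter": { "term": { "item_type": "PLACE" } }
    }
  }
}
"""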

Optimization: request caching

The requestCache parameter in Elastic4s doesn't really seem to have much effect. (Perhaps it will once we test with more data.) But we should think about caching at some point, in particular for the initial match-all query. It's potentially the heaviest query on the system, and every user triggers it when opening the page - so it's well worth caching, even if it's the only caching we do.

We can use the standard play cache for this (and flush after every ingest action).
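
A sketch of how this could look with Play's cache API - SearchService and SearchResults are hypothetical stand-ins:

import javax.inject.Inject
import play.api.cache.CacheApi
import scala.concurrent.duration._

case class SearchResults(total: Long)          // stand-in for the real result type
trait SearchService { def matchAll(): SearchResults }

class CachedSearch @Inject() (cache: CacheApi, search: SearchService) {

  private val MatchAllKey = "search.match-all"

  // Serve the initial match-all response from cache; compute only on a miss
  def initialResults(): SearchResults =
    cache.getOrElse[SearchResults](MatchAllKey, 10.minutes) {
      search.matchAll()
    }

  // Flush after every ingest action, so stale counts are never served
  def invalidate(): Unit = cache.remove(MatchAllKey)
}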

Streamlined JSON serialization in search results

Can we streamline the JSON serialization of top places in the search results vs. the internal representation in ElasticSearch? E.g. the following properties could be omitted for a (c)leaner result:

  • root_uri
  • item_type
  • Perhaps is_in_dataset

Also, we may want to limit to a single representative geometry (since that's all that's needed for mapping), unless specifically requested otherwise (e.g. through a verbose=true arg).
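
A sketch of the slim/verbose split using Play JSON (TopPlace is a hypothetical stand-in):

import play.api.libs.json._

case class TopPlace(rootUri: String, title: String, geometries: Seq[JsValue])

// Slim serialization by default; internal fields and the full geometry list
// only when e.g. verbose=true was passed
def topPlaceWrites(verbose: Boolean): Writes[TopPlace] = Writes { p =>
  val slim = Json.obj(
    "title"    -> p.title,
    // a single representative geometry is enough for mapping
    "geometry" -> p.geometries.headOption.getOrElse(JsNull))
  if (verbose)
    slim ++ Json.obj("root_uri" -> p.rootUri, "geometries" -> p.geometries)
  else
    slim
}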

Progress reporting across dataset import process

As opposed to the current progress tracking framework - which tracks progress across a single process (import of a single dumpfile) - we need a framework that can track "compound import processes", i.e. the combination of multiple file downloads and imports (partially running in parallel, which complicates matters a bit more).
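
One simple aggregation strategy - a sketch, not a design decision - is a weighted average over per-subtask progress, which stays well-defined even when subtasks run in parallel and report independently:

case class Subtask(weight: Double, progress: Double) // progress in [0.0, 1.0]

// Overall progress of a compound import: each download and each import step
// contributes proportionally to its weight
def compoundProgress(subtasks: Seq[Subtask]): Double = {
  val totalWeight = subtasks.map(_.weight).sum
  if (totalWeight == 0) 0.0
  else subtasks.map(t => t.weight * t.progress).sum / totalWeight
}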

/gazetteers API path

To fetch information about the gazetteers in the system, plus currently running imports. Use this to populate the UI in issue #12.

Or should we treat that as a normal /dataset instead?

ElasticSearch Setup/Upgrade

Set up a basic app backed by ElasticSearch. The goal is to share the code for API & data ingest between Recogito 2 and Peripleo 2. However, since Recogito started, ElasticSearch has gone through two major releases.

Set up a basic installation based on the newest version of ElasticSearch (5.1.x), and create a fork of Recogito's API/ingest code that's compatible with it. Once that's done, migrate the changes back to Recogito and update to the latest ES version there. (Data migration seems well-supported, although a bit tricky, using ElasticSearch 2.4.x as an intermediate step.)

iDig crosswalk

Make sure we have everything in place (conceptually & data-model-wise) so we can deal with iDig dumps in the future.

Text context snippets

Due to the way fulltext content is organized in Peripleo (through annotations), this should actually be simpler than standard search highlighting. We'll simply need to pull the context field from the inner_hits in the item query.
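
For illustration, the raw ES query body wrapped in Scala; the annotation type name and context field follow the description above, but are otherwise assumptions:

// Matching child annotations come back via inner_hits, so the snippet is
// simply their stored context field - no highlighter needed
val snippetQuery = """
{
  "query": {
    "has_child": {
      "type": "annotation",
      "query": { "match": { "context": "tetradrachm" } },
      "inner_hits": { "_source": [ "context" ] }
    }
  }
}
"""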

ElasticSearch Migration Strategy

Figure out how we can go from ES 1.7.x to ES 5.1.x without losing data. Currently, the only type that seems to cause problems is annotation_history. In the worst case, we can erase that type in the production system (annotations themselves won't be lost). However, it seems possible to:

  • pull out existing versions into a JSON file
  • delete the mapping and replace it with an updated one that fulfills ES 2.4.x criteria
  • upload the annotation versions from the JSON

Initial test datasets

  • OpenContext
  • nomisma Partner Objects
  • University of Graz
  • Sample literature from Recogito 2

Admin login page design

Stick to the layout of Recogito, but make it visually distinct. For background, we can use one of the LINHD maps.

Data validation

This is more of a planning issue (that will later be split into a range of implementation tickets). Anyway: we'll need a (public) validation endpoint where people can test their datasets for validity/compatibility with Peripleo; a sketch of a possible message model follows the lists below. The following aspects need to be verified:

Note: a non-exhaustive list that will grow as we go along!

Syntax correctness

The aim is not to rebuild an RDF/(Geo)JSON validator per se, but to wrap the errors thrown by the parser in some form of human-friendly UI output.

Explicitly check for things that don't break the parser. E.g. I encountered URIs without a protocol prefix ("http://"). These were interpreted as relative URIs (technically fine), but - in that particular case - definitely not what the publisher intended. Issue warnings in such cases.
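
A sketch of such a check, using java.net.URI:

import java.net.URI
import scala.util.Try

// Flags URIs that parse fine but lack a protocol prefix - technically valid
// relative URIs, but usually a publisher mistake worth warning about
def checkUri(str: String): Option[String] =
  Try(new URI(str)).toOption match {
    case None =>
      Some(s"Malformed URI: $str")
    case Some(uri) if uri.getScheme == null =>
      Some(s"URI without protocol prefix (interpreted as relative): $str")
    case _ => None
  }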

Required & recommended properties

Issue errors when the following properties are missing or incorrect:

  • title (applies to all - VoID dataset, place, objects)
  • publisher (dataset)
  • license (dataset)
  • homepage (dataset, object)
  • temporal (all) - invalid formats, timespans where start date is after end date
  • ...

Issue warnings when the following properties are missing or incorrect:

  • description (applies to all)
  • temporal (all) - dates in the future? (Might still be intentional, when dealing with literature?)
  • ...

Provide recommendations with regard to:

  • Having created and last_modified timestamps
  • ...

Provide "did you know" hints:

  • Timestamps will show your items in time-filtered searches
  • You can add image links (with caption and separate license/creator info etc.)
  • ...

Peripleo specifics

  • Issue an info message when an entity (place, person, etc.) is referenced by a URI that is not indexed in Peripleo. E.g. if an annotation references a place not in Peripleo, the reference will be discarded during import. (As the purpose of importing place references is to map them, it wouldn't make sense to keep them.)
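
Taken together, the categories above suggest a simple message model for the validation endpoint - a sketch with assumed names:

sealed trait Severity
case object Error          extends Severity // missing/incorrect required property
case object Warning        extends Severity // imports, but likely unintended
case object Recommendation extends Severity // optional improvement
case object Hint           extends Severity // "did you know"
case object Info           extends Severity // Peripleo specifics, e.g. discarded references

case class ValidationMessage(
  severity: Severity,
  property: Option[String], // e.g. "title", "temporal"; None for general messages
  message:  String)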

Possible optimization of time histogram

The time histogram is currently computed using a scripted aggregation:

// Doc values for the item's temporal bounds (may be empty)
f = doc['temporal_bounds.from']
t = doc['temporal_bounds.to']
buckets = []
if (!(f.empty || t.empty))
   // One bucket per $interval years; '<=' so spans shorter than one
   // interval still produce a bucket
   for (i = f.date.year; i <= t.date.year; i += $interval) { buckets.add(i) }
buckets;

The trick of generating a series of timestamps across the item's temporal bounds, at the interval of the histogram, could be done at indexing time as well. This might make the query-time aggregation faster.

We'll need to see if the effect is significant (especially for the initial match-all query), or whether it is negligible in practice. At least it's worth a try.
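
A sketch of the indexing-time variant:

// Precompute the bucket years once per item when indexing, store them as a
// numeric array field, and aggregate on that field with a plain (non-scripted)
// terms aggregation at query time
def histogramBuckets(fromYear: Int, toYear: Int, interval: Int): Seq[Int] =
  fromYear to toYear by interval

// e.g. histogramBuckets(100, 135, 10) == Seq(100, 110, 120, 130)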

Search query 'schema'

We need to model the various options that are available for configuring the search query; a sketch of a possible representation follows the list below.

  • term filters (exclude/only)

    • item_types
    • categories
    • datasets
    • (is_part_of?)
    • (license?)
    • languages
    • periods (the way this is handled may change later, as we might treat periods as 1st class entities)
  • range filters (from/to)

    • temporal bounds
  • geo

    • bounding box
    • center + distance
  • aggregation options

    • time histogram on/off
    • term aggregations on/off (do we want more fine-grained control?)
    • top places
    • top people
  • parent/child filtering options

    • referencing relations
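
A sketch of a possible representation (names and types are assumptions, and the periods handling may change as noted above):

case class TermFilter(values: Seq[String], exclude: Boolean = false)

// (is_part_of and license filters left out here, as they are still open questions)
case class SearchArgs(
  itemTypes:  Option[TermFilter] = None,
  categories: Option[TermFilter] = None,
  datasets:   Option[TermFilter] = None,
  languages:  Option[TermFilter] = None,
  periods:    Option[TermFilter] = None,
  from:       Option[Int] = None, // temporal bounds
  to:         Option[Int] = None,
  bbox:       Option[(Double, Double, Double, Double)] = None, // minLon, minLat, maxLon, maxLat
  center:     Option[(Double, Double, Double)] = None,         // lon, lat, distance
  timeHistogram:    Boolean = true,
  termAggregations: Boolean = true,
  topPlaces:        Boolean = true,
  topPeople:        Boolean = false)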

Task model

Add a 'label' field, so we can show some meaningful caption in progress UIs.

Basic Project Setup

Latest Play Framework version + ElasticSearch 2.4.

(ElasticSearch 5.x is not possible yet - see here)

Object identifiers

Currently, objects get random UUIDs assigned; existing identifiers get discarded. Use the object identifiers from the RDF dumps instead. (Note that the RDF dumps can have multiple identifiers, but at least one is always guaranteed to be there.)
