pelagios / peripleo2
The Pelagios Exploration Engine
License: Other
A first draft for the part of the data model that covers search result records (which can be "items" - such as objects, books, etc. - or places)
Along the same lines as in Peripleo 1, geodata is not directly connected to items. Instead, we'll facet along places, and resolve the facet counts.
Filter based on whether the item has any depictions associated with it.
Currently, place references are only created in the index if the annotation points to the URI that also happens to be the root_uri of the place in the index. If the URI is among the alternative place URIs, the reference is not created.
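One possible way to close this gap is to resolve every annotation URI through a lookup that covers both root and alternative URIs. The sketch below is only an illustration: the `alternative_uris` field name and both function names are assumptions, not part of the actual index schema.

```python
from typing import Dict, Iterable, Optional


def build_uri_lookup(places: Iterable[dict]) -> Dict[str, str]:
    """Map every known URI of a place (root or alternative) to its root_uri."""
    lookup: Dict[str, str] = {}
    for place in places:
        root = place["root_uri"]
        lookup[root] = root
        # "alternative_uris" is a hypothetical field name for the alias list
        for alt in place.get("alternative_uris", []):
            lookup[alt] = root
    return lookup


def resolve_reference(annotation_uri: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the root_uri the annotation refers to, or None if it is unknown."""
    return lookup.get(annotation_uri)
```

With such a lookup, an annotation that points to an alternative URI would still yield a place reference to the correct root record.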
(P.S.: we really need this - the sooner the better - to cross-check the new Peripleo with the old version, e.g. using the "tetradrachm" search example.)
Bounding box or center + radius
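As a rough sketch, both variants map onto standard ElasticSearch query DSL clauses (`geo_bounding_box` and `geo_distance`). The field name `representative_point` and the helper function names are assumptions made for illustration only.

```python
from typing import Dict

GEO_FIELD = "representative_point"  # hypothetical geo_point field name


def bbox_filter(min_lon: float, min_lat: float, max_lon: float, max_lat: float,
                field: str = GEO_FIELD) -> Dict:
    """Build a geo_bounding_box filter clause as a plain dict."""
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    return {"geo_bounding_box": {field: {
        "top_left": {"lat": max_lat, "lon": min_lon},
        "bottom_right": {"lat": min_lat, "lon": max_lon},
    }}}


def radius_filter(lon: float, lat: float, radius_km: int,
                  field: str = GEO_FIELD) -> Dict:
    """Build a geo_distance filter clause (center + radius) as a plain dict."""
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")
    return {"geo_distance": {"distance": f"{radius_km}km",
                             field: {"lat": lat, "lon": lon}}}
```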
Specifically with regard to data sources: the "old" Peripleo only allows filtering/display of the first level of hierarchy. At the moment, this specifically affects the nomisma Partner Objects and University of Graz datasets. In the future, it would also affect the Pelagios 3 Early Geospatial Documents dataset and OpenContext, since all are divided into sub-collections.
Note that this issue covers enabling hierarchical facets only, on an infrastructure/API level. It does not cover any UI features.
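To illustrate what hierarchical facet counts could look like at the API level, here is a minimal sketch. It assumes each item carries its dataset membership as a path from the top-level dataset down to the sub-collection. The path representation and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def hierarchical_facet_counts(paths: Iterable[Sequence[str]]) -> "Counter[Tuple[str, ...]]":
    """Count items at every level of the dataset hierarchy.

    Each path lists dataset identifiers from the root down to the
    sub-collection the item belongs to. An item counts once for each
    prefix of its path, so parent counts include all their children.
    """
    counts: Counter = Counter()
    for path in paths:
        for depth in range(1, len(path) + 1):
            counts[tuple(path[:depth])] += 1
    return counts
```

A client could then show only the first level, or drill down by filtering the keys on a given prefix.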
Make sure we can handle ground plan overlays in the data model. Should work for:
In terms of the model, use the same approach as we did for the iDig integration, i.e. the index holds a metadata record for the overlay, plus a pointer to a file (base) path from which the map UI can fetch the overlay.
So that a gazetteer can show up as a specific search result, and so that people are able to search for "Pleiades", "DARE", etc. as a project.
We'll need a whole bunch of filter options, but implement this one first so that we can easily pull gazetteer stats (type=PLACE) out of ElasticSearch.
The requestCache parameter in Elastic4s doesn't really seem to have much effect. (Perhaps that changes once we try with more data.) But we should think about caching at some point, in particular for the initial match-all query. It's potentially the heaviest query on the system, and every user will trigger it when opening the page - so well worth caching, even if it's the only caching we do.
We can use the standard play cache for this (and flush after every ingest action).
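The caching pattern itself is simple. The following language-neutral sketch (written in Python rather than against the Play cache API) shows the idea of caching one result and flushing it after each ingest. The class and method names are assumptions.

```python
from typing import Any, Callable, Optional


class MatchAllCache:
    """Caches the result of a single expensive query until flushed."""

    def __init__(self, run_query: Callable[[], Any]) -> None:
        self._run_query = run_query
        self._cached: Optional[Any] = None
        self._valid = False

    def get(self) -> Any:
        # Only hit the index when there is no valid cached result
        if not self._valid:
            self._cached = self._run_query()
            self._valid = True
        return self._cached

    def flush(self) -> None:
        """Call this after every ingest action so stale counts are dropped."""
        self._cached = None
        self._valid = False
```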
Can we streamline the JSON serialization of top places in the search results vs. the internal representation in ElasticSearch? E.g. the following properties could be omitted for a (c)leaner result:
root_uri
item_type
is_in_dataset
Also, we may want to limit to a single representative geometry (since that's all that's needed for mapping), unless specifically requested otherwise (e.g. through a verbose=true arg).
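A minimal sketch of such a trimmed serialization is shown below. It assumes the internal record stores its geometries in a list under a hypothetical `geometries` key and simply takes the first one as representative. How the representative geometry is really chosen is left open here.

```python
from typing import Any, Dict

# Properties named above as candidates for omission
INTERNAL_ONLY = ("root_uri", "item_type", "is_in_dataset")


def to_lean_place(place: Dict[str, Any], verbose: bool = False) -> Dict[str, Any]:
    """Return the search-result view of an internal place record."""
    if verbose:
        # Verbose mode returns the record unchanged (as a copy)
        return dict(place)
    lean = {k: v for k, v in place.items() if k not in INTERNAL_ONLY}
    # "geometries" is a hypothetical key; keep a single geometry for mapping
    geometries = lean.pop("geometries", [])
    if geometries:
        lean["geometry"] = geometries[0]
    return lean
```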
As opposed to the current progress tracking framework - which tracks progress across a single process (import of a single dumpfile) - we need a framework that can track "compound import processes", i.e. the combination of multiple file downloads and imports (partially running in parallel, which complicates matters a bit more).
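One way to think about a compound process is as a set of sub-tasks that each report their own progress, with the overall progress derived from them. The sketch below is an assumption-laden illustration (equal weighting of sub-tasks, hypothetical class and method names), using a lock so that parallel tasks can report safely.

```python
import threading
from typing import Dict, Iterable


class CompoundProgress:
    """Tracks progress across several sub-tasks that may run in parallel."""

    def __init__(self, task_ids: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._progress: Dict[str, float] = {t: 0.0 for t in task_ids}
        if not self._progress:
            raise ValueError("at least one sub-task is required")

    def update(self, task_id: str, fraction: float) -> None:
        """Record the progress of one sub-task as a fraction between 0 and 1."""
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        with self._lock:
            if task_id not in self._progress:
                raise KeyError(task_id)
            self._progress[task_id] = fraction

    def overall(self) -> float:
        """Unweighted mean over all sub-tasks (an assumption of this sketch)."""
        with self._lock:
            return sum(self._progress.values()) / len(self._progress)

    def is_complete(self) -> bool:
        with self._lock:
            return all(v >= 1.0 for v in self._progress.values())
```

In practice, downloads and imports would probably need different weights, since their durations differ.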
To fetch information about gazetteers in the system + currently running imports. Use this to populate UI in issue #12.
Or should we treat that as a normal /dataset instead?
Set up a basic app backed by ElasticSearch. The goal is to share the code for API & data ingest between Recogito 2 and Peripleo 2. However, since the time Recogito started, ElasticSearch has made two major releases.
Set up a basic installation based on the newest version of ElasticSearch (5.1.x), and create a fork of Recogito's API/ingest code that's compatible. Once we have done that, migrate the changes back to Recogito and update to the latest ES version there. (Data migration seems well-supported, although a bit tricky, using ElasticSearch 2.4.x as an intermediate step.)
How should we do this in terms of data model? Should we do it at all?
Add GeoJSON conversion (or a direct crosswalk?) for the OSMNames gazetteer
Corresponds to pelagios/recogito2#341
Faceting by PLACE | OBJECT | PERSON | DATASET | PERIOD
Was available in Peripleo 1 already, so we can (clean up and) reuse from there.
Make sure we have everything in place (conceptually & data-model-wise) so we can deal with iDig dumps in the future.
Due to the way fulltext content is organized in Peripleo (through annotations), this should actually be simpler than would normally be the case with standard search highlighting. We'll simply need to pull the context field from the inner_hits in the item query.
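For illustration, pulling the context snippets out of a search response could look roughly like this. The nesting (`hits.hits[].inner_hits.<name>.hits.hits[]._source`) follows the general shape of ElasticSearch responses, while the inner hit name `annotation` is an assumption.

```python
from typing import Any, Dict, List


def extract_contexts(response: Dict[str, Any],
                     inner_hit_name: str = "annotation") -> Dict[str, List[str]]:
    """Collect the context snippets of all inner hits, keyed by item id."""
    result: Dict[str, List[str]] = {}
    for hit in response.get("hits", {}).get("hits", []):
        inner = (hit.get("inner_hits", {})
                    .get(inner_hit_name, {})
                    .get("hits", {})
                    .get("hits", []))
        # Skip inner hits that have no context field in their source
        result[hit["_id"]] = [h["_source"]["context"] for h in inner
                              if "context" in h.get("_source", {})]
    return result
```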
Figure out how we can go from ES 1.7.x to ES 5.1.x without losing data. Currently, the only type that seems to cause problems is annotation_history. In the worst case, we can erase that type in the production system (annotations themselves won't be lost). However, it seems possible to:
So they are automatically treated equally in issue #22.
Stick to the layout of Recogito, but make it visually distinct. For background, we can use one of the LINHD maps.
Nothing fancy, just a (two-level?) hierarchy that makes sense for the current data in Peripleo 1, so we have a basis for further discussion.
This is more of a planning issue (that will later be split into a range of implementation tickets). Anyways: we'll need a (public) validation endpoint where people can test their datasets for validity/compatibility with Peripleo. The following options need to be verified:
Note: a non-exhaustive list that will grow as we go along!
The aim is not to rebuild an RDF/(Geo)JSON validator per se. But wrap the errors thrown by the parser in some form of human-friendly UI output.
Explicitly check for things that don't break the parser. E.g. I encountered URIs without protocol prefix ("http://"). These were interpreted as relative URIs (technically fine), but - in that particular case - definitely not the way things were intended by the publisher. Issue warnings in such cases.
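A check for the missing-prefix case could be as small as the sketch below. It uses Python's standard `urllib.parse.urlparse` and flags every URI whose scheme is not `http` or `https`. The function name and the wording of the warning are assumptions.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def uri_warnings(uris: Iterable[str]) -> List[str]:
    """Warn about URIs that parse fine but lack an http(s) protocol prefix."""
    warnings: List[str] = []
    for uri in uris:
        if urlparse(uri).scheme not in ("http", "https"):
            warnings.append(
                f"URI without http(s) prefix, will be read as relative: {uri}")
    return warnings
```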
Issue errors when the following properties are missing or incorrect:
title (applies to all - VoID dataset, place, objects)
publisher (dataset)
license (dataset)
homepage (dataset, object)
temporal (all) - invalid formats, timespans where start date is after end date
Issue warnings when the following properties are missing or incorrect:
description (applies to all)
temporal (all) - dates in the future? (Might still be intentional, when dealing with literature?)
Provide recommendations with regard to:
created and last_modified timestamps
Provide "did you know" hints:
Implement a clean way to select what should appear in the search results:
To be re-used in Recogito later (perhaps with different signature color).
Since our items have date ranges (not just a single date), generating time histograms is not trivial. SOLR offers a special data type for this (DateRangeField), but ES doesn't. A viable approach is suggested here:
The time histogram is currently computed using a scripted aggregation:
f = doc['temporal_bounds.from']
t = doc['temporal_bounds.to']
buckets = []
if (!(f.empty || t.empty))
for (i=f.date.year; i<t.date.year; i+= $interval) { buckets.add(i) }
buckets;
The trick of generating a series of timestamps across the item's temporal bounds, at the interval of the histogram, could be done at indexing time as well. This might make the query-time aggregation faster.
We'll need to see if the effect is significant (especially for the initial match-all query), or whether it is negligible in practice. At least it's worth a try.
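As a sketch of the index-time variant, the bucket list could be precomputed per item and stored in an extra field, so that the query only needs a plain terms or histogram aggregation over it. The function below mirrors the loop of the script (start year inclusive, end year exclusive). The function name is hypothetical, and a fixed interval chosen at indexing time is an assumption.

```python
from typing import List, Optional


def temporal_buckets(from_year: Optional[int], to_year: Optional[int],
                     interval: int) -> List[int]:
    """Years at which the item contributes to the time histogram.

    Same logic as the scripted aggregation: step from the start year
    up to (but not including) the end year in steps of `interval`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if from_year is None or to_year is None:
        return []
    return list(range(from_year, to_year, interval))
```

One trade-off is that the interval is then baked into the index, whereas the scripted aggregation can vary it per query.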
How do we handle multiple levels of granularity in gazetteers?
That's now implemented as a client-side tag. Change to JS.
We need to model the various options that are available for configuring the search query.
term filters (exclude/only)
range filters (from/to)
geo
aggregation options
parent/child filtering options
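To make the discussion more concrete, the options listed above could be grouped into a single arguments object along the following lines. All class and field names in this sketch are hypothetical and only meant as a basis for discussion.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TermMode(Enum):
    ONLY = "only"
    EXCLUDE = "exclude"


@dataclass
class TermFilter:
    field_name: str
    values: List[str]
    mode: TermMode = TermMode.ONLY


@dataclass
class RangeFilter:
    field_name: str
    from_value: Optional[int] = None
    to_value: Optional[int] = None


@dataclass
class BoundingBox:
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


@dataclass
class SearchArgs:
    """Everything a client can configure for one search request."""
    query: Optional[str] = None
    term_filters: List[TermFilter] = field(default_factory=list)
    range_filters: List[RangeFilter] = field(default_factory=list)
    bbox: Optional[BoundingBox] = None
    aggregations: List[str] = field(default_factory=list)
    # Parent/child filtering, e.g. restricting by properties of related records
    child_filters: List[TermFilter] = field(default_factory=list)
```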
Add a 'label' field, so we can show some meaningful caption in progress UIs.
E.g. to avoid object dates in the future
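A sanity check of this kind might look like the sketch below. It treats a start year after the end year as an error and dates in the future as a warning only. The function name, the year-based representation, and the message texts are assumptions.

```python
from datetime import date
from typing import List, Optional, Tuple


def check_timespan(start_year: int, end_year: int,
                   current_year: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a temporal range given in years."""
    if current_year is None:
        current_year = date.today().year
    errors: List[str] = []
    warnings: List[str] = []
    if start_year > end_year:
        errors.append(f"start ({start_year}) is after end ({end_year})")
    if start_year > current_year or end_year > current_year:
        # Only a warning: future dates may occasionally be intentional
        warnings.append("timespan extends into the future")
    return errors, warnings
```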
Latest Play Framework version + ElasticSearch 2.4.
(Elasticsearch 5.x is not possible yet - see here)
E.g. Bing, or perhaps configurable tile URLs? (Google probably not an option due to Terms of Service.)
Currently, objects get random UUIDs assigned; existing identifiers get discarded. Use object identifiers from RDF dumps instead. (Note that the RDF dumps can have multiple identifiers. But at least one is always guaranteed to be there.)
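A possible selection rule, sketched under the assumption that the first identifier listed in the dump serves as the primary one and the full list is kept alongside it (the function name is hypothetical):

```python
from typing import Iterable, List, Tuple


def choose_identifiers(uris: Iterable[str]) -> Tuple[str, List[str]]:
    """Pick a primary identifier and keep all distinct identifiers in order."""
    seen: List[str] = []
    for uri in uris:
        uri = uri.strip()
        if uri and uri not in seen:
            seen.append(uri)
    if not seen:
        # The dumps are expected to provide at least one identifier
        raise ValueError("at least one identifier expected")
    return seen[0], seen
```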