
periodo-data's Issues

CIDOC-CRM mapping

We need to map the terms we are using in our JSON-LD context to CIDOC-CRM terms, and serve this mapping.

Identifying aggregations of period concepts

We currently group period concepts into the same set if they share a common published source. We identify the set as a "concept scheme," which means nothing more than "an aggregation of concepts." But there are reasons other than sharing a source that we might want to aggregate period concepts. People using PeriodO to provide a period authority file for their own system may want to aggregate a selection of other people's concepts. This requires being able to aggregate period concepts into a scheme and to assign the scheme a long-term identifier. Even in cases when the users of the authority file are also the authors, we need to distinguish the two aggregations: Pleiades' "currently preferred" period authority file differs from the "authored by Pleiades" aggregation, as the latter may include deprecated period concepts.

So, we need to allow period concepts to be in multiple schemes. This would not be a problem except that currently we connect the bibliographic description of the published source to the concept scheme (under the assumption that the periods in a scheme all share the same source). If we allow concepts to belong to multiple schemes, we need to allow a scheme to contain concepts from different sources. This means we ought to attach the bibliographic description of the published source to the concept as well. This further implies that we need to assign URIs to "our" (i.e. not from Crossref or Worldcat) bibliographic descriptions, since now they may be objects of multiple statements (scheme -> source -> description and concept -> source -> description). We could use fragment identifiers for these, e.g. http://n2t.net/ark:/99152/p0fp7wv#source.

Question: are aggregations other than "same source" aggregations part of the main dataset? Do curators need to accept patches to create new aggregations? Remember that we are giving these stable URIs, which means history tracking, etc. too. So we shouldn't enter that commitment lightly. On the other hand, it does seem that we need a way to create shared, persistently identified schemes, otherwise people will just update a local copy and will have no incentive to keep the canonical dataset up to date.

To recap:

  • allow concepts to be in multiple schemes
  • attach source statements to individual period concepts
  • stop using blank nodes as objects of source statements
  • allow people to add and edit collections?

Multiple creators in a single value field

http://n2t.net/ark:/99152/p072r4q has multiple creators entered into a single field:

<http://n2t.net/ark:/99152/p072r4q> dcterms:source [ dcterms:creator [ foaf:name "Alex R. Knodell, Susan E. Alcock, Christopher A. Tuttle, Christian F. Cloke, Tali Erickson-Gini, Ceceilia Feldman, Gary O. Rollefson, Micaela Sinibaldi, Thomas M. Urban, Clive Vella" ] ] .

Need to check for other cases of this and fix them.
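A quick heuristic scan for other packed creator fields could look like this (a sketch; the segment-counting rule and the function name are mine, not part of any existing dataset tooling):

```python
def looks_like_multiple_creators(name):
    """Heuristically flag a foaf:name value that packs several people
    into one string. 'Surname, Given' style names contain at most one
    comma, so three or more comma-separated, capitalized segments
    suggest a list of distinct creators."""
    parts = [p.strip() for p in name.split(",")]
    return len(parts) >= 3 and all(p and p[0].isupper() for p in parts)

packed = "Alex R. Knodell, Susan E. Alcock, Christopher A. Tuttle"
single = "Knodell, Alex R."
```

Any flagged value would still need manual review before splitting, since some corporate author names legitimately contain commas.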

Needed: deterministic serialization of graph

Since our change system is tied to JSON Patch, I think the serialization should be JSON-LD. Maybe this would be achievable just by using a JSON-LD frame, but I've never quite figured out what frames are most useful for.
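At the JSON level, key-sorted serialization already gets most of the way there: two orderings of the same document produce identical bytes, which keeps JSON Patch diffs stable. A sketch (it does not address blank-node ordering in the underlying graph, which framing or skolemization would have to handle):

```python
import json

def canonical_json(doc):
    """Serialize a JSON document deterministically: sorted keys and
    fixed separators, so semantically identical documents are
    byte-identical."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

# Same document, different key order:
a = {"type": "skos:Concept", "label": "Iron Age"}
b = {"label": "Iron Age", "type": "skos:Concept"}
```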

Most periods are missing spatialCoverageDescription

Out of 1791 period definitions, 1207 have a blank field for spatial coverage description. I'm pretty sure that in the vast majority of cases, these periods have one single entry in spatial coverage (as in, one country). It would probably make sense to copy the text of the country name to the description.
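The copy could be done in one pass over the dataset; a sketch (key names follow my reading of the PeriodO JSON and should be checked against the actual structure):

```python
def fill_spatial_description(period):
    """If a period has exactly one spatial coverage entry and a blank
    description, copy the entry's label into the description.
    Mutates and returns the period dict."""
    coverage = period.get("spatialCoverage", [])
    if not period.get("spatialCoverageDescription") and len(coverage) == 1:
        period["spatialCoverageDescription"] = coverage[0]["label"]
    return period
```

Periods with multiple coverage entries are deliberately left untouched, since no single country name would be an accurate description there.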

CSV mapping

To map to a CSV output, we need to decide how to handle

  • Spatial coverage (I'm fine with just including the description)
  • Alternate labels
  • Localized labels

Minimize or eliminate use of blank nodes

For a variety of reasons, it is undesirable to have blank nodes. @rybesh, you pointed out problems relating to:

  • Error messages in the SHACL validator being unreadable when related to blank nodes

  • Forming certain sorts of SPARQL queries

Additionally, if we take the approach in #44, it is impossible to refer to blank nodes in an annotation.

We use blank nodes to represent start/stop resources, and for referring to specific pages within sources. We can probably address both those cases by just giving those resources URIs based off the identifier of the associated period.
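Fragment identifiers off the period's own ARK, as already proposed for source descriptions elsewhere in these issues, would cover all of these cases; a sketch (the #start and #stop fragment names are illustrative, not decided):

```python
def mint_fragment_uris(period_uri):
    """Derive stable URIs for a period's start/stop interval resources
    and its source description from the period identifier, replacing
    blank nodes with fragment identifiers."""
    return {
        "start": period_uri + "#start",
        "stop": period_uri + "#stop",
        "source": period_uri + "#source",
    }
```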

Chronostratigraphic periods

(via @atomrab)

Would you mind taking a look at the rdf and/or ttl files in this folder: https://utexas.box.com/s/wtzn309lqo1aosp84nylndn0zumft3ro and letting me know if we can ingest the 2014 version programmatically, so that I don't have to add all of these by hand? I feel like this shouldn't be too hard to line up with our model, at least for someone who can actually write scripts, and it would save a tremendous amount of time. The folder also includes half a dozen older versions of the chronostratigraphic chart, which could be really interesting to visualize (but for the moment, I'd settle with having the current version).

In case these aren't already obvious, here are some observations about the rdf and ttl files:

  1. The URIs, which do resolve properly, are in the form http://resource.geosciml.org/classifier/ics/ischart/Aeronian (though they resolve as eg http://vocabs.ands.org.au/repository/api/lda/csiro/international-chronostratigraphic-chart-2016/2016-12-v3/resource.html?uri=http://resource.geosciml.org/classifier/ics/ischart/Aeronian). These URIs, as far as I can tell, appear in the rdf representation but not in the ttl one (??).

  2. The date range is expressed in rdfs:comment as "older bound-" (= "start") and "younger bound-" (= "stop"), with a +/- uncertainty that can be incorporated into four-part dates. All these dates are in Ma (megaannum, i.e. millions of years ago), counted back from a "present" usually taken as 1950 (the Ma notation doesn't appear in the rdf/ttl, but it does in the pages that the URIs resolve to). So

    <rdfs:comment xml:lang="en">older bound-439 +/-1.8</rdfs:comment><rdfs:comment xml:lang="en">younger bound-436 +/-1.9</rdfs:comment>

    should be parsed as earliestStart: -440798050 (that is, 439 Ma plus 1.8 Ma, counted back from 1950) and latestStart: -437198050 (439 Ma minus 1.8 Ma).

  3. The alternate languages are expressed with two-character language codes, without script codes, but we could probably identify these manually for the non-Latin scripts (I know the Bulgarian is Cyrillic, but I can't identify the Chinese or Japanese character set off the top of my head).

  4. I think we can use "World" as spatial coverage, at least for a start -- I have a query in with Denné about this.

  5. There are sameAs relations with dbpedia entries here -- should we try to capture those, and if so, how? Although the concepts are the same, the dates are sometimes different (eg http://dbpedia.org/resource/Aptian has 113 +/-1 Ma as the end date, but the corresponding entry in the dataset has 112 +/-1 Ma).
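The conversion in point 2 can be written directly; a sketch (assumes "present" = 1950 and that the +/- widens the earliest bound and narrows the latest one, per the worked example above):

```python
def ma_bound_to_year(ma, uncertainty, earliest=True, present=1950):
    """Convert an ICS bound in Ma (millions of years before 'present')
    to a proleptic ISO year. The earliest reading of a bound adds the
    uncertainty; the latest reading subtracts it."""
    years_ago = (ma + uncertainty if earliest else ma - uncertainty) * 1_000_000
    return round(present - years_ago)

# "older bound-439 +/-1.8":
earliest_start = ma_bound_to_year(439, 1.8, earliest=True)
latest_start = ma_bound_to_year(439, 1.8, earliest=False)
```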

Use xsd:integer rather than xsd:gYear

Zero values, and values less than -9999 or greater than 9999, for xsd:gYear are not well-supported by RDF tools, despite what the spec says. We should change to using xsd:integer instead, which would (I think) turn our time:DateTimeDescriptions into time:GeneralDateTimeDescriptions.

New @base value for dataset

Change @base from http://n2t.net/ark:/99152/ to http://n2t.net/ark:/99152/p0 so that ids of periods and authorities in JSON-LD are "clean" i.e. don't include information about the ARK shoulder.

Additional datasets

We have agreed that the British Museum periodization, for which all the relevant information is in scope notes, will be entered by hand by Sarah using the client interface.

The following partners have not yet contributed their periodizations, all of which should probably be batch-imported from spreadsheets, if that's possible:

Deutsches Archäologisches Institut: we have the Arachne periodization (see in "source_docs" in the PeriodO thesauri dropbox folder), but it appears to lack actual dates or spatial references (though since records have those references, we might be able to ask them for a total dump of records with period terms, locations, and absolute date ranges, and extract those values). Wolfgang said that the Zenon periodization was more specific, but we haven't received it yet.

UCLA Encyclopedia of Egyptology: I missed a window with Willeke, who then went off into the field. I've added the preferred periodization that the DAI uses, which is still on the UEE website, but it may now have been superseded by an updated version. I'm waiting to hear from her to find out.

CLAROS: Sebastian Rahtz was responsive back in the spring, and I talked to him at the CAA, but he was on vacation when I wrote over the summer, and has not responded to an email since. It's not clear to me how CLAROS is using periods, in any case: "period" in the browser seems to mean only "date range", although their CRM mapping suggests they use period terms as well (so maybe they're reconciling them internally?).

I am also planning to contact Nick Croft to see if he's willing to share his RDF-expressed period gazetteer with us.

Language tags violate BCP47

@hcayless pointed out on Twitter that our language tags are out of conformance. There are two issues:

  1. BCP47 decrees that for languages with both an ISO 639-1 (2-letter) tag and an ISO 639-3 (3-letter) tag, the shortest one must be used. So we can't use deu for German, we have to use de.
  2. Although not strictly a part of the spec, BCP47 also discourages using script tags where they are unnecessary. So unless we expect to need to distinguish German sources written in the Fraktur script from those written in ordinary Latin script, we should drop the -latn from our tags. (In fact I'm not even sure we should have the script in there at all, for any of our language tags.)
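A normalization pass over the existing tags might look like this (the mapping tables are a minimal sketch covering only tags mentioned in these issues; a real fix should consult the IANA language subtag registry and its Suppress-Script fields):

```python
# Illustrative, incomplete tables -- assumptions, not a full registry.
ISO639_3_TO_1 = {"deu": "de", "eng": "en", "ell": "el", "sqi": "sq"}
REDUNDANT_SCRIPTS = {"de": "latn", "en": "latn", "sq": "latn"}

def normalize_tag(tag):
    """Prefer the 2-letter language code over the 3-letter one, and
    drop a script subtag that is the default for that language."""
    parts = tag.lower().split("-")
    lang = ISO639_3_TO_1.get(parts[0], parts[0])
    subtags = [s for s in parts[1:] if REDUNDANT_SCRIPTS.get(lang) != s]
    return "-".join([lang] + subtags)
```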

Originals of deleted duplicate LCSH periods need editing

We tried very hard to avoid these, but they crept in anyway, so Ryan will need to delete them. Here's the list:

Delete http://n2t.net/ark:/99152/p06c6g3h3wh (duplicate of http://n2t.net/ark:/99152/p06c6g3gfns)

Delete http://n2t.net/ark:/99152/p06c6g35pg5 (duplicate of http://n2t.net/ark:/99152/p06c6g3nnbs, incorrect statement about separate URI for Three Crowns' War -- the links are identical)

Delete http://n2t.net/ark:/99152/p06c6g3z9k9 (duplicate of http://n2t.net/ark:/99152/p06c6g3h)

Delete whichever is more recent of http://n2t.net/ark:/99152/p06c6g3h and http://n2t.net/ark:/99152/p06c6g35vgf

Delete http://n2t.net/ark:/99152/p06c6g3nkxb (duplicate of http://n2t.net/ark:/99152/p06c6g3f3dw)

Delete http://n2t.net/ark:/99152/p06c6g3z9b7 (duplicate of http://n2t.net/ark:/99152/p06c6g3bt2q, though after deletion add alternate Japanese label to the latter)

Delete http://n2t.net/ark:/99152/p06c6g3b46q (duplicate of http://n2t.net/ark:/99152/p06c6g3h2j9, though after deletion add alternate label to the latter)

Delete http://n2t.net/ark:/99152/p06c6g35rqq (duplicate of http://n2t.net/ark:/99152/p06c6g3rhfb, though latter needs to be updated to reflect revision of LCSH entry in 2017 which apparently removed some of earlier variants)

Delete http://n2t.net/ark:/99152/p06c6g3vkbm (duplicate of http://n2t.net/ark:/99152/p06c6g3szt6, though after deletion add alternate label to the latter)

Delete http://n2t.net/ark:/99152/p06c6g3g4sn (duplicate of http://n2t.net/ark:/99152/p06c6g34vjs)

19 periods missing structured descriptions of temporal coverage

Missing spatial coverage values from DBpedia (batch correct)

Some of our records are missing a spatial coverage value because the lookup list never included them (don't exist in DBpedia, or import failed?). We should add a country value to these before we move over to the bounding-box system, especially since we're going to pull in old values by mapping the DBpedia URIs (right?). They include:

Norway
Cambodia
South Korea (though they have North Korea, for some reason)
Moldova

If these don't exist in DBpedia at all, we should note periods with these spatial coverages and map them to the new set of geometries directly.

Label typos

I noticed that there are a couple of typos in the Fasti period list (English version). I don't want to correct these and lose the originals, since this would cause a mismatch in the URI values. Some of these will have versions in other languages as well (original languages, in most cases). So: should I a) add another column for original English, and correct the typos in a PeriodO label column? b) leave the typos alone for the moment? c) correct the typos in the current label_en column?

Question: what to do with identical LoC entries associated with different spatial coverage?

For the most part, the LoC period subject headings refer to one and only one country/spatial entity. In the case of periods for the Austro-Hungarian Empire, the LoC has two sets: one for Austria, the other for Hungary. But the periods themselves are otherwise identical. Elsewhere, we have simply added two countries to the coverage. But these periods have separate URIs in the online LoC. Do we simply produce two separate entries, one for Austria and one for Hungary, following the LoC exactly? Or do we make a single entry that maps to two nations and two URIs?

I assume the former, but I just wanted to check.

Representing curatorial descriptions as annotations

Here's a draft:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix periodo: <http://n2t.net/ark:/99152/p0v#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <#> .

<http://n2t.net/ark:/99152/p0zmdxzf369>
    a skos:Concept ;
    periodo:spatialCoverageDescription "Ras Shamra" ;
    dc:language "en" ;
    dcterms:language <http://lexvo.org/id/iso639-1/en> ;
    skos:altLabel "Pre-pottery Neolithic"@en ;
    skos:inScheme <http://n2t.net/ark:/99152/p0zmdxz> ;
    skos:prefLabel "Pre-pottery Neolithic" ;
    time:intervalStartedBy [
        skos:prefLabel "Ca. 7500 B.C.E." ;
    ] ;
    time:intervalFinishedBy [
        skos:prefLabel "Ca. 7000 B.C.E." ;
    ] .

:periodannot
    a oa:Annotation ;
    oa:motivatedBy oa:describing ;
    oa:hasTarget <http://n2t.net/ark:/99152/p0zmdxzf369> ;
    oa:hasBody [
        dcterms:spatial dbpedia:Ugarit ;
        time:intervalStartedBy [
            time:hasDateTimeDescription [
                time:year "-7499"^^xsd:gYear
            ]
        ] ;
        time:intervalFinishedBy [
            time:hasDateTimeDescription [
                time:year "-6999"^^xsd:gYear
            ]
        ]
    ] .

dbpedia:Ugarit
    skos:prefLabel "Ras Shamra" .

<http://n2t.net/ark:/99152/p0zmdxzf369.ttl>
    void:inDataset <http://n2t.net/ark:/99152/p0d> .

Question about spatial coverage at transitional moment

@rybesh, there are a number of LCSH headings that currently have sub-country spatial coverage descriptions and no spatial coverage. I was planning to go in and associate those with the larger countries, but it occurs to me that perhaps it would be better to wait and use these as test-cases for the bounding-box selection process?

Conversely, I have a number of LCSH "Byzantine Empire" coverage descriptions that correspond to very different imperial extents. Some have no spatial coverage, others have a standard but incomplete set of countries. I was planning to delete the spatial coverage values from the definitions that do have countries, in anticipation of an eventual bounding-box approach that could be calibrated to the extent of the empire in a given period -- or even pointed to a URI for an entity + shapefile for the Byzantine Empire in, say, 1100. Should I go ahead and strip the countries from the ones that have them, add countries to the ones that don't, or just sit back and wait for a new bounding-box alternative that will pull in the boundaries of the current countries?

Provenance issues

In the provenance graph we currently have, there are statements like this one:

"specializationOf": "http://n2t.net/ark:/99152/p086kj9kr9q",
"wasRevisionOf": {
    "id": "http://n2t.net/ark:/99152/p086kj9kr9q?version=0"
}

Couple questions:

  1. Does wasRevisionOf need to be a JSON object? Can it be a simple string, with the mapping to id done in the JSON-LD context?
  2. Should we have wasRevisionOf values for new assertions/collections? wasRevisionOf only makes sense to me when the new version was actually a revision of something that already existed.
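On question 1: yes, the wrapper object can go away. Declaring the term with "@type": "@id" in the JSON-LD context lets the JSON carry a plain string while the expanded RDF still treats the value as a resource. A sketch (the prov prefix binding is assumed):

```python
# JSON-LD context fragment, written as a Python dict for illustration.
context = {
    "prov": "http://www.w3.org/ns/prov#",
    "specializationOf": {"@id": "prov:specializationOf", "@type": "@id"},
    "wasRevisionOf": {"@id": "prov:wasRevisionOf", "@type": "@id"},
}

# With that context, the provenance record flattens to plain strings:
record = {
    "specializationOf": "http://n2t.net/ark:/99152/p086kj9kr9q",
    "wasRevisionOf": "http://n2t.net/ark:/99152/p086kj9kr9q?version=0",
}
```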

Also, we're not currently including type information (i.e. prov:Activity and prov:Entity). Was that on purpose? Are those implied by the relationships between things?

Update to period spreadsheet

I'd like to move our conversation about period data here and out of email, if no one objects. I'll start providing relevant updates and questions as we go.

First relevant update: Fasti period assertions now have URIs, and I've clarified that BP in their system means 2000, not 1950, as it does for C14 or prehistoric work. Please make sure you calculate those dates accordingly.

First question: once I've cleaned the Pleiades period list so that the dates are in separate fields, should I put those into the spreadsheet too, even though we don't have clear geographic coverage? Or should I wait until Tom Elliott and I can come up with some way to represent the coverage according to the locations of the sites where each of those terms is applied?

Places and geometries for spatial coverage

I've been working with the “spatial entity” picker UI that Bits Coop is building for us, and realizing that we need to do some more thinking about how we want to handle spatial entities in PeriodO. Originally, the idea was that we would use “modern countries.” That seems straightforward, until you realize that there is no agreed-upon list of modern countries. Even seemingly straightforward “countries” like France or Norway are not defined the same in the major gazetteers… And on top of that, we strayed from the “modern countries” idea, and also have some administrative regions within countries, ill-defined historical places, etc. Unfortunately, this means that we need to start maintaining our own place name + place geometry gazetteer, assembled from various different sources. I don't think there is any way around that, but I would like to do some thinking about how we can set a reasonable scope for that: something that is a happy medium between “these are the 195 modern countries that you can choose from, and that's it” and “any place you can imagine, we'll add it.”

Keeping in mind that the purpose of this is not to support sophisticated spatial reasoning but just to show and choose things on a low-resolution map, do either of you have any ideas about a sane way to scope the places we support?

Reduplication of labels in JSON

Lex caught this in the local collection Sarah made for the CHGIS periods. Some but not all of these periods have double alternate labels in the JSON (these do not appear in the client view). What is going on here, and how do we get rid of them if they're not visible for deletion? Should I have Sarah start over?

periodo-guide2-1449883467897.zip

Should spatialCoverageDescription be required?

Some entries in the dataset do not currently have a value for spatialCoverageDescription, which (if I understand correctly) is the spatial coverage explicitly defined within the source (i.e. not added by the curator).

als-latn language tag

A number of the Ariadne and FASTI definitions have the language tag als-latn on their preferred labels. als is the language code for Tosk Albanian, “the southern dialect group of the Albanian language, spoken by the ethnographic group known as Tosks.” @atomrab, can you verify that this is indeed the correct language tag, and not (as I suspect) sq, which is the language code for Albanian in general? Full list of affected definitions is below.

http://n2t.net/ark:/99152/p06v8w47jcw
http://n2t.net/ark:/99152/p06v8w48qxz
http://n2t.net/ark:/99152/p06v8w496hs
http://n2t.net/ark:/99152/p06v8w49hzs
http://n2t.net/ark:/99152/p06v8w49mp2
http://n2t.net/ark:/99152/p06v8w4bjgp
http://n2t.net/ark:/99152/p06v8w4bnsk
http://n2t.net/ark:/99152/p06v8w4br75
http://n2t.net/ark:/99152/p06v8w4d6vw
http://n2t.net/ark:/99152/p06v8w4hx4w
http://n2t.net/ark:/99152/p06v8w4jhjm
http://n2t.net/ark:/99152/p06v8w4kbt3
http://n2t.net/ark:/99152/p06v8w4m9zc
http://n2t.net/ark:/99152/p06v8w4mkjq
http://n2t.net/ark:/99152/p06v8w4nfqg
http://n2t.net/ark:/99152/p06v8w4rbc2
http://n2t.net/ark:/99152/p06v8w4s3k7
http://n2t.net/ark:/99152/p06v8w4sm9z
http://n2t.net/ark:/99152/p06v8w4vck4
http://n2t.net/ark:/99152/p06v8w4wpdc
http://n2t.net/ark:/99152/p06v8w4x58t
http://n2t.net/ark:/99152/p06v8w4xtf5
http://n2t.net/ark:/99152/p06v8w4xz8n
http://n2t.net/ark:/99152/p0qhb6623vz
http://n2t.net/ark:/99152/p0qhb662487
http://n2t.net/ark:/99152/p0qhb6626cm
http://n2t.net/ark:/99152/p0qhb662s6j
http://n2t.net/ark:/99152/p0qhb66357d
http://n2t.net/ark:/99152/p0qhb663rsj
http://n2t.net/ark:/99152/p0qhb664h77
http://n2t.net/ark:/99152/p0qhb6658kk
http://n2t.net/ark:/99152/p0qhb6674pv
http://n2t.net/ark:/99152/p0qhb667ddk
http://n2t.net/ark:/99152/p0qhb6687t3
http://n2t.net/ark:/99152/p0qhb6699r7
http://n2t.net/ark:/99152/p0qhb66ckx7
http://n2t.net/ark:/99152/p0qhb66d2kk
http://n2t.net/ark:/99152/p0qhb66dx47
http://n2t.net/ark:/99152/p0qhb66hmw3
http://n2t.net/ark:/99152/p0qhb66ht2h
http://n2t.net/ark:/99152/p0qhb66jzqd
http://n2t.net/ark:/99152/p0qhb66mvv8
http://n2t.net/ark:/99152/p0qhb66n9sq
http://n2t.net/ark:/99152/p0qhb66r4x9
http://n2t.net/ark:/99152/p0qhb66s5m4
http://n2t.net/ark:/99152/p0qhb66sp82
http://n2t.net/ark:/99152/p0qhb66tcgp
http://n2t.net/ark:/99152/p0qhb66tkjt
http://n2t.net/ark:/99152/p0qhb66vfdt
http://n2t.net/ark:/99152/p0qhb66whc9
http://n2t.net/ark:/99152/p0qhb66x9v4

Missing authors / creators / contributors in authority source metadata

This may be related to periodo/periodo-client#90. In at least one instance, a call to a Worldcat record with a clear set of authors, when adding a new collection, pulls in the title and the date but not the creators (http://www.worldcat.org/oclc/892462417). This is a problem if we're trying to make it easy to keep track of intellectual genealogies. I haven't tested to see if the problem is specific to this title, or a current problem with Worldcat titles in general.

ell-latn language tag

A number of the Ariadne definitions have the language tag ell-latn on their preferred labels, but these labels are in Greek script, not Latin. @atomrab, can you verify that these are errors? Full list below.

http://n2t.net/ark:/99152/p0qhb6628nh
http://n2t.net/ark:/99152/p0qhb662trj
http://n2t.net/ark:/99152/p0qhb662z9q
http://n2t.net/ark:/99152/p0qhb6633fn
http://n2t.net/ark:/99152/p0qhb6642m4
http://n2t.net/ark:/99152/p0qhb6643mc
http://n2t.net/ark:/99152/p0qhb66472n
http://n2t.net/ark:/99152/p0qhb664f27
http://n2t.net/ark:/99152/p0qhb664td2
http://n2t.net/ark:/99152/p0qhb665dff
http://n2t.net/ark:/99152/p0qhb665vpv
http://n2t.net/ark:/99152/p0qhb6668db
http://n2t.net/ark:/99152/p0qhb666rpb
http://n2t.net/ark:/99152/p0qhb666zg4
http://n2t.net/ark:/99152/p0qhb667873
http://n2t.net/ark:/99152/p0qhb667vht
http://n2t.net/ark:/99152/p0qhb6695gs
http://n2t.net/ark:/99152/p0qhb669r6t
http://n2t.net/ark:/99152/p0qhb66b989
http://n2t.net/ark:/99152/p0qhb66fv6h
http://n2t.net/ark:/99152/p0qhb66g38m
http://n2t.net/ark:/99152/p0qhb66hcfd
http://n2t.net/ark:/99152/p0qhb66hxr7
http://n2t.net/ark:/99152/p0qhb66jcv6
http://n2t.net/ark:/99152/p0qhb66jh4z
http://n2t.net/ark:/99152/p0qhb66jpqb
http://n2t.net/ark:/99152/p0qhb66jr4s
http://n2t.net/ark:/99152/p0qhb66k3z7
http://n2t.net/ark:/99152/p0qhb66k7q6
http://n2t.net/ark:/99152/p0qhb66kwv9
http://n2t.net/ark:/99152/p0qhb66mj5z
http://n2t.net/ark:/99152/p0qhb66p2j3
http://n2t.net/ark:/99152/p0qhb66p53z
http://n2t.net/ark:/99152/p0qhb66ps4x
http://n2t.net/ark:/99152/p0qhb66q6ff
http://n2t.net/ark:/99152/p0qhb66s4gn
http://n2t.net/ark:/99152/p0qhb66s75r
http://n2t.net/ark:/99152/p0qhb66sctf
http://n2t.net/ark:/99152/p0qhb66sm2q
http://n2t.net/ark:/99152/p0qhb66sspc
http://n2t.net/ark:/99152/p0qhb66szh6
http://n2t.net/ark:/99152/p0qhb66tdgx
http://n2t.net/ark:/99152/p0qhb66v39c
http://n2t.net/ark:/99152/p0qhb66v534
http://n2t.net/ark:/99152/p0qhb66vjx3
http://n2t.net/ark:/99152/p0qhb66vqkd
http://n2t.net/ark:/99152/p0qhb66wcdj
http://n2t.net/ark:/99152/p0qhb66wrq3
http://n2t.net/ark:/99152/p0qhb66x582
http://n2t.net/ark:/99152/p0qhb66x8n3
http://n2t.net/ark:/99152/p0qhb66xcnt
http://n2t.net/ark:/99152/p0qhb66xjj8
http://n2t.net/ark:/99152/p0qhb66z499

Documentation of data model

We need some documentation of the data model, including:

  • JSON structure
  • JSON-LD context
  • provenance

Documenting provenance involves indicating which properties have values that are original to the sources, and which have values that are our translations or parsing (so basically, label_en, the converted quantitative start and end dates, and the spatial_coverage_name, which we've parsed from spatial_coverage_label).

Perhaps we need a top-level object in our JSON-LD with various administrative things like rights info (CC0), last modified dates, pointer to previous version, authors, etc... and this could also document which statements are original and which are derived.

Alternate label language for periods

Am I correct that it is always assumed that "alternate labels" will be in English?

which means:

  • label is how the period was defined in the source
  • localizedLabel is how the period was translated in the source (?)
  • alternateLabels are created by the curator (?)

I suppose I don't understand the distinction between localizedLabel and alternateLabels, except that the latter hardcodes English and allows for multiple values.

Invalid gYear values

We have 48 instances of invalid gYear values in the canonical dataset. These values are things like 400 (should be 0400), -271 (should be -0271), and 0000 (there is no ISO year zero). Probably easiest to fix these as a batch programmatically, but before that we need to prevent it from happening in the first place (see periodo/periodo-client#118).
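The batch fix for the padding cases is mechanical; a sketch (year 0000 is returned as None for manual review, since it has no valid gYear target):

```python
import re

def fix_gyear(value):
    """Zero-pad a year to the four digits xsd:gYear requires.
    Input is the lexical form found in the data, e.g. '400', '-271'.
    Returns None for non-numeric input and for year zero, which
    does not exist in XSD 1.0."""
    m = re.fullmatch(r"(-?)(\d+)", value)
    if not m:
        return None
    sign, digits = m.groups()
    if int(digits) == 0:
        return None  # there is no ISO year zero in xsd:gYear
    return sign + digits.zfill(4)
```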

yearPublished is sometimes not a year

Currently we define yearPublished in our JSON-LD context as follows:

    "yearPublished": {
      "@type": "http://www.w3.org/2001/XMLSchema#gYear",
      "@id": "http://purl.org/dc/terms/issued"
    }

But when we get values for yearPublished from Crossref or OCLC, we use their values for the predicates dc:date and schema:datePublished. Usually these values are just years, but occasionally they are not (for example http://n2t.net/ark:/99152/p0323gx), resulting in invalid triples.

So, we need to either change our context so that the value type is less narrow (xsd:date rather than xsd:gYear) and rename the key accordingly, or we need to strip out months and days from our external sources.

Which do you prefer, @ptgolden?
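If we go with stripping, the reduction is a one-liner; a sketch (keeps yearPublished as xsd:gYear by truncating anything after a leading four-digit year):

```python
import re

def year_only(date_value):
    """Reduce a Crossref/OCLC date value to its leading year so it
    remains a valid xsd:gYear. Returns None if no year is found,
    flagging the record for manual review."""
    m = re.match(r"-?\d{4}", str(date_value))
    return m.group(0) if m else None
```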
