data2health / contributor-attribution-model Goto Github PK

A simple data model to represent contributions made by agents to research artifacts

CSS 3.78% JavaScript 15.99% HTML 80.23%

contributor-attribution-model's Issues

If/how to capture artifacts that mediate 'indirect' contributions?

Consider the utility of an attribute, hanging from an 'indirect' contribution, that lets us capture the intermediate artifact that was used in the creation of the artifact the contribution is actually about.

e.g. in the Example scenario here, this attribute would be used to indicate that Stacey's resource provider contribution toward a paper was realized through the creation of mouse1. Likewise, it would indicate that Kristi's contribution to the paper was realized via creation of mouse2, and Karen's via creation of dataset3.

In the context of a broader provenance model (e.g. PROV), the mediating entity could be represented by capturing the 'influence' / 'derivation' relationship between the upstream and downstream artifacts (as shown between the mouse and data set and publication in the diagram linked above).

Without this, the contribution toward the mouse has to be captured, as does the Influence link between paper and mouse. But this is beyond the scope of our contribution model, and it might be nice to provide a shorthand way to capture the artifacts that mediate indirect contributions and support 'transitive' credit.

In line 121 of the data example here I included 'mediatingEntity' as an exploratory property on ex:contribution007, that captures the identity of any such upstream entities through which the contribution to the target Entity is mediated. We should consider if this is useful.
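As a rough illustration of the idea, the sketch below models a contribution record that carries the exploratory mediatingEntity property. Only `mediatingEntity` and `ex:contribution007` come from this ticket; the class shape, the other field names, and the `ex:` identifiers for the agent and paper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contribution:
    # Hypothetical minimal shape; only 'mediatingEntity' is named in the ticket.
    id: str
    agent: str
    artifact: str
    role: str
    # Upstream artifacts through which the contribution to 'artifact' is mediated.
    mediatingEntity: List[str] = field(default_factory=list)

    def is_indirect(self) -> bool:
        """A contribution is treated as indirect when it names a mediating entity."""
        return bool(self.mediatingEntity)


# Stacey's resource provider contribution to the paper, realized through mouse1.
c = Contribution(
    id="ex:contribution007",
    agent="ex:stacey",
    artifact="ex:paper1",
    role="resource provider role",
    mediatingEntity=["ex:mouse1"],
)
print(c.is_indirect())
```

With this shorthand, a consumer can see which upstream artifact mediated the contribution without needing a separate contribution record for the mouse or an explicit influence link.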

Provide data examples, Competency Questions, Use Cases

This is a ticket for @kristiholmes (or others) to provide data examples and use cases from systems that might use the contribution data model.

Seeing examples of real data from these systems is very useful to inform and test the model.
Similarly, seeing a list of competency questions (specific examples of actual questions people would want to ask of the data) can be very informative.

Ticket #1 has a list of application use cases describing the types of systems that might use the model and generally what they would use it for. Here we are asking for actual examples of data from such systems, and/or examples of questions that users of these systems might want to ask of the data, or analyses they might want to perform using the data.

Capturing 'context' in which a contribution is made

Is the wasPerformedFor attribute in the Contribution object sufficient to capture requirements around 'context' in which a contribution is made?

This property indicates that the contribution was made on behalf of some other agent/organization. But from what I have heard, there may be more nuance we want to capture. @mellybelly please comment.

'Artifact' scope questions

In the Draft Spec document, Anne raised some interesting questions about how the model might be used to describe things like archeological artifacts, fossils, and biological specimens. Many such entities are naturally occurring, but altered by agents as they become research specimens. As such, attributes such as dateCreated may need additional qualification if they are to apply to these things.

We should consider if these types of entities are in scope, and if so how to adjust the model/documentation accordingly.

And on a related note, should we remove any constraints that the model is meant to cover only research/scholarly artifacts? I think this was the original scope, but it may be unnecessarily limiting.

Requirements for modeling 'Agent' and its Subtypes

We need to define at least minimal models for agents - including an abstract (non-instantiated) 'Agent' class, with concrete (instantiated) subclasses representing a 'Person', 'Organization', and 'Computational Agent'. Think about existing standards we want to use/recommend here - e.g. foaf or VIVO/openRIF for representation of a Person?

In the prototype data model I created placeholder Classes for each with some arbitrarily chosen properties. But these are entirely uninformed by any requirements - we need to better understand these to determine how complex/rich to make these objects in our model. Some CQs and use cases would really help here.

Capture Agent title/status at time of contribution?

Should we include an attribute on the Contribution object that lets us capture the professional position/title/status of an agent when the contribution was made? This would support queries like "Find papers with more than 4 contributions by graduate students"

Or should this be handled as part of the Agent model which we defer to implementations to define?
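To make the example query concrete, here is a minimal sketch assuming the status is stored directly on each Contribution. The attribute name `agentStatus`, the record layout, and all identifiers are hypothetical and not part of the spec.

```python
from collections import Counter

# Hypothetical records: 'agentStatus' is an assumed attribute name.
contributions = [
    {"artifact": "ex:paper1", "agent": f"ex:student{i}", "agentStatus": "graduate student"}
    for i in range(5)
] + [
    {"artifact": "ex:paper2", "agent": "ex:pi1", "agentStatus": "principal investigator"},
    {"artifact": "ex:paper2", "agent": "ex:student9", "agentStatus": "graduate student"},
]


def artifacts_with_more_than(records, status, threshold):
    """Return artifacts having more than `threshold` contributions by agents with `status`."""
    counts = Counter(r["artifact"] for r in records if r["agentStatus"] == status)
    return sorted(a for a, n in counts.items() if n > threshold)


# "Find papers with more than 4 contributions by graduate students"
result = artifacts_with_more_than(contributions, "graduate student", 4)
print(result)
```

If the status lived only on the Agent model instead, the same query would need the agent's status history to be resolved against the date of each contribution.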

How to accommodate for the absence of a Computational Agent class

In addition to Persons and Organizations, a third class of entities recognized to possess ‘agency’ is computational programs/algorithms, which are capable of driving artifact creation. We decided not to formally define a Computational Agent subtype in the CDM (but implementations may extend the model to do so as desired). This raises the question of how to describe cases where a piece of software generates an artifact (since we cannot directly attribute the software as a contributor). If this use case is not in scope, no further consideration is needed. If it is, read on . . .

One possibility is to bring back the Computational Agent class - so a computational agent could be asserted as the contributor. But several folks had decided this is not what we want in this case.

An alternate possibility is a pattern that requires two separate Contributions to be described:

Contribution1 connects the artifact of interest to the Person or Organization who ran the software - with a realizedRole of something like ‘software execution role’
Contribution2 that connects the software as an Artifact to the Agent(s) who created it (with one or more ‘software role’ from the CRO, e.g. 'software designer').

From this pattern, ‘transitive credit’ could be inferred/assigned to the software creators for computationally generated artifacts.

A couple issues/questions with this:

it is verbose and complicated to connect the software to the artifact
what roles should be assigned in the first contribution (proposed 'software execution role' - which would need to be added)
it requires some way to associate Contribution1 with the software executed by the Agent (again, the software is not the agent or the Artifact in this Contribution, so we would need a new attribute to link to the software). The previously proposed (but tabled) mediatingEntity could work? Or something like a secondaryParticipant attribute?
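The two-contribution pattern can be sketched as follows, assuming the tabled mediatingEntity attribute is used to link Contribution1 to the software. The dictionary layout, the agent and artifact identifiers, and the `transitive_credit` helper are illustrative assumptions; 'software execution role' is the proposed (not yet existing) role mentioned above.

```python
# Contribution1: the person who ran the software, linked to the generated artifact.
# Contribution2: the person who designed the software, linked to the software as an Artifact.
contributions = [
    {
        "agent": "ex:alice",
        "artifact": "ex:dataset1",
        "role": "software execution role",
        "mediatingEntity": "ex:toolX",  # assumed link from Contribution1 to the software
    },
    {
        "agent": "ex:bob",
        "artifact": "ex:toolX",
        "role": "software designer",
    },
]


def transitive_credit(artifact, records):
    """Agents who contributed to software that mediated a contribution to `artifact`."""
    mediators = {
        r["mediatingEntity"]
        for r in records
        if r["artifact"] == artifact and r.get("mediatingEntity")
    }
    return sorted({r["agent"] for r in records if r["artifact"] in mediators})


credited = transitive_credit("ex:dataset1", contributions)
print(credited)
```

The sketch also shows the verbosity concern: two records and an extra linking attribute are needed to say that a tool produced a dataset.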

@mellybelly please comment. I suspect this is an important use case that we need to be able to address (describing software as participating in artifact creation). But we could of course punt on this to next version.

Create a unified artifact type hierarchy?

We had previously decided that, for our initial release, we would not attempt to provide a formally unified model/hierarchy of artifact types that re-uses/aligns/maps terms from existing terminologies in this space into a logically consistent and cross-referenced structure. Rather, we would provide an informal recommended value set of types (possibly with mappings to a couple of established terminologies).

On the 8-15-19 Arch Attribution call, Kristi seemed convinced that a unified model/hierarchy of terms/concepts here would be a very valuable product of our work, and greatly enhance the value of our modeling framework. Others concurred about the value of this product, but raised concerns about the feasibility of providing this by the September v1 release target.

We should discuss the utility of such a deliverable (what value does it provide, for who/what use cases), and how to approach this in the short term (for the September v1 release) vs longer term (December, beyond).

Nicole, Lisa, and Kristi are working to assemble an overview of existing terminologies and efforts at unification/alignment that can help us make decisions here, and weigh the trade-off of effort required vs value added (see #14).

Name for the Contribution Data Model

In the documentation so far I have used CDM (Contribution Data Model) as a placeholder. But this is pretty generic. What do we want as a formal name (and acronym) for this work?

Collect CQs / use cases to inform modeling requirements

So far I have been developing the data model in somewhat of a vacuum - being informed by high-level directions on what to create, and personal assumptions based on limited experience in the domain of scholarly attribution and evaluation.

As I have started to formalize the model, I am finding areas where more concrete requirements would be very helpful. e.g. what attributes of the Agent or the Artifact would be useful to support queries and analyses? Wondering if/how people think we should go about this? Given the simplicity of this model, we'd want a fairly simple and focused effort here.

In nearly all other modeling projects, the collection of competency questions from stakeholders and potential users of the model has proved very useful to help define the scope, structure, and granularity of our models. I know that much outreach and requirements collection was done for the CRO itself - to help define the types of contribution roles that exist. While this proved very helpful for CRO development, it does not inform requirements I need to build a data model in which these contribution roles will be used.

So, just throwing this out there as a possibility for bottom-up requirements collection. But if you all (Melissa, Kristi, etc) feel that you have a handle on the requirements, I am happy to take top-down direction.

Data types related to dates to follow ISO 8601

I added reference to this standard for date, dateTime, and duration data types in the spec doc.
@diatomsRcool can you make sure the text and all the examples for these data types in the spec are indeed compliant with the ISO 8601 standard?
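A quick way to spot-check example values is sketched below. The function names are hypothetical, the date and dateTime checks accept only the subset of ISO 8601 that Python's standard library parses, and the duration pattern is a simplified assumption (whole numbers only, no fractional components), so this is a rough aid rather than a full conformance test.

```python
import re
from datetime import date, datetime

# Simplified ISO 8601 duration pattern: PnYnMnWnDTnHnMnS with integer components.
_DURATION = re.compile(
    r"^P(?!$)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+S)?)?$"
)


def is_iso_date(value: str) -> bool:
    """True if the standard library can parse `value` as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_iso_datetime(value: str) -> bool:
    """True if the standard library can parse `value` as an ISO 8601 date-time."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_iso_duration(value: str) -> bool:
    """True if `value` matches the simplified duration pattern above."""
    return bool(_DURATION.match(value))


print(is_iso_date("2019-08-15"), is_iso_date("8-15-19"))
print(is_iso_datetime("2019-08-15T14:30:00"))
print(is_iso_duration("P3M"), is_iso_duration("3 months"))
```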

Models for 'secondary' entities/types in the model

If/how should we provide models for/constraints on secondary object types in the model, including Locations, Methods, Funding Mechanisms?

1) Representation of the location where a contribution was made

Is this attribute necessary at all? If so, how to represent it? Free text? A code of some kind (zip code, country code, city code, etc)? Or a proper Location object (e.g. see the FHIR Location resource here)?
Or we could simply say this is out of scope and leave this up to implementers to determine how they want to represent this, and specify the model accordingly. This is what PROV seems to do . . . from https://www.w3.org/TR/prov-dm/#term-attribute-location:

"A location can be an identifiable geographic place (ISO 19112), but it can also be a non-geographic place such as a directory, row, or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark, and so forth. This document does not specify how to concretely express locations, but instead provide a mechanism to introduce locations, by means of a reserved attribute."

2) Representation of Plan/Method guiding a contribution to an artifact

Same general options as for Location - free text, or provide a proper Method object with attributes. Or leave to implementer to define how they want to represent this?

Application Use Cases

Starting a list of the types of applications that might implement the model to represent a particular type of contribution-related information, as a starting point for fleshing out more detailed use cases and requirements. All welcome to add to / refine this list (feel free to edit the comment directly).

Publishers capturing contributions of each author of a paper, e.g. Elsevier, PLOS, etc.
Curated Knowledgebases collecting info about how different curators contributed to an annotation/record as it matures through the system, e.g. Clinical Interpretation of Variants in Cancer (CIViC).
Research Profiling Systems capturing contributions of researchers to diverse research outputs/artifacts (pubs, presentations, grants, datasets, courses, patents, etc.), e.g. ORCID, VIVO.
Research Data Management Platforms tracking information about contributions to data objects they manage, e.g. Invenio.
Data Repositories capturing contributions to submitted data sets they catalog, e.g. Figshare, Dryad, etc.
Software Repositories capturing contributions to software artifacts tracked/developed in their systems, e.g. GitHub.

In these contexts/systems, the model will support the collection, provision, and exchange of detailed provenance metadata, display of this metadata to system users, and the ability to answer precise contribution-related queries and perform computational analysis to understand and predict patterns in the data.

Value Set Metamodel and Definitions

Value Sets are named collections of codes that are bound to specific attributes. They are an important part of the information model that helps standardize data collection for improved clarity, queryability, and interoperability. We will want to define a handful of these and recommend their use.

For this, we need a meta-model for describing their content. FHIR has an expansive one - we may want to base ours on some subset of this? See FHIR model here, and the VSD project here, based on the Value Set Definition Project here.

SEPIO also has a simple SKOS-based value set model, informed by the curation efforts of the ClinGen consortium - see here.

Should data types include 'enum' for restricted value sets?

Current data types listed in the docs include:

Simple Data Types
- string
- url
- code
- identifier
- class
- date
- dateTime
- duration
Complex Data Types
- Coding

Thinking about JSON Schema / OpenAPI types, are there any cases in which the allowed values would be constrained to a specific set of strings?
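For discussion, this is roughly what an 'enum'-style constraint looks like in code. The sketch uses the simple data type names listed above purely as a convenient closed set of strings; the class and function names are hypothetical, and nothing here implies the spec should enumerate these particular values.

```python
from enum import Enum


class SimpleDataType(str, Enum):
    """A closed set of allowed strings, analogous to a JSON Schema 'enum'."""
    STRING = "string"
    URL = "url"
    CODE = "code"
    IDENTIFIER = "identifier"
    CLASS = "class"
    DATE = "date"
    DATETIME = "dateTime"
    DURATION = "duration"


def parse_simple_data_type(value: str) -> SimpleDataType:
    # Enum lookup by value raises ValueError for anything outside the allowed set.
    return SimpleDataType(value)


print(parse_simple_data_type("dateTime"))
```

The difference from a Coding-based value set is that an enum is fixed in the schema itself, whereas a value set can be maintained and versioned outside it.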

Document existing research output type terminologies and harmonization efforts

This is a ticket for @nicolevasilevsky @kristiholmes @LisaOKeefe1 and @marijane
to start cataloging information about research output/artifact types. Specifically:

What terminologies/vocabularies/code sets/ontologies describing research output types currently exist (e.g. NISO types, NLM MeSH Object types, COAR types, Wikidata types, etc),
What work has been done to integrate/map/align across these efforts (e.g. the ROO, COAR-Wikidata type mappings, COAR-NLM/MeSH object type mappings)

A google doc has been started here as a place to initially capture this information.

This will help us evaluate how to approach a research output type value set or ontology in the short (September) and longer (December or beyond) term.

Capturing multiple roles in a Contribution

With respect to capturing multiple roles in a given Contribution object, the spec currently says the following:

A contribution connects a single agent to a single Artifact, but we allow for a single Contribution object to capture multiple roles played by the agent in generating the artifact. This pattern can be used to provide a more concise representation when an agent plays multiple roles, and data creators do not wish to capture details of each role played (i.e. how, when, and where each was realized). In cases where such details for each role are required, separate Contribution objects should be created for each role played.

Questions:

Is this clear to folks?
Is this what we want? Or should we say only one role per Contribution?
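To help compare the two options, the sketch below expands a concise multi-role Contribution into one Contribution per role. The `roles` and `role` keys, the role labels, and the identifiers are assumptions for illustration; contributionMadeBy and contributionMadeTo are the linking attribute names currently used in the model.

```python
def split_by_role(contribution: dict) -> list:
    """Expand a multi-role contribution into one single-role contribution per role."""
    base = {k: v for k, v in contribution.items() if k != "roles"}
    return [{**base, "role": role} for role in contribution["roles"]]


# Concise form: one agent, one artifact, several roles, no per-role details.
concise = {
    "contributionMadeBy": "ex:agent1",
    "contributionMadeTo": "ex:paper1",
    "roles": ["author role", "data curation role"],
}

expanded = split_by_role(concise)
for item in expanded:
    print(item)
```

Once expanded, each record can carry its own details (how, when, and where that role was realized), which is the case the spec says requires separate Contribution objects.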

'Identifier' specification in data model

At present the spec requires identifiers to be specified as curies in string form - see Information Model documentation here. But there are many other ways we might specify/constrain identifier representation. e.g.:

Do not require id to be a curie - just allow any string here but recommend using a curie with a namespace and reference component. This is simplest and least constraining. But if we want to support creation of rdf/linked data then curies/URIs are a requirement.
Require representing curies as structured objects, perhaps adopting the FHIR Identifier data type for this
Allowing both id and Identifier data types (where 'id' would allow for local identifiers without associated namespaces) - again, following FHIR lead here
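The difference between a CURIE string and a structured identifier can be sketched as below. The `Curie` type, the `parse_curie` helper, and the regular expression are assumptions for illustration; the pattern is deliberately simplified and is not the full CURIE syntax (for instance, it would also accept a plain URL such as an http address as prefix plus reference).

```python
import re
from typing import NamedTuple


class Curie(NamedTuple):
    prefix: str      # namespace component, e.g. "ex"
    reference: str   # local reference component


# Simplified pattern: a prefix, a colon, then a non-empty reference without whitespace.
_CURIE = re.compile(r"^(?P<prefix>[A-Za-z_][\w.-]*):(?P<reference>\S+)$")


def parse_curie(value: str) -> Curie:
    """Split a CURIE string into its parts, rejecting strings with no namespace."""
    match = _CURIE.match(value)
    if match is None:
        raise ValueError(f"not a CURIE: {value!r}")
    return Curie(match["prefix"], match["reference"])


print(parse_curie("ex:contribution007"))
```

Under the first option any string would be accepted and a check like this would only be advisory; under the second and third options the prefix and reference would be stored as separate fields to begin with.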

We should discuss the most practical approach, given the technical context in which the model will be implemented.

Specifying dates of contributions

The Contribution object model provides startDate and endDate attributes to allow precise reporting of the time period during which the contribution occurred. The recommendation is that implementations wishing to specify a single time can simply report the date and/or time that the contribution ended, using the endDate attribute. An alternative approach of specifying the same value as the start and end date is discouraged.

Thoughts on this? And how strong should the language be here (MUST / MUST NOT, or SHOULD / SHOULD NOT)?
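One way to picture the SHOULD NOT option is a validator that warns rather than rejects, as in the sketch below. The function name and message text are hypothetical; under a MUST NOT reading the same condition would raise an error instead of returning a warning.

```python
from datetime import date
from typing import List, Optional


def check_contribution_dates(start: Optional[date], end: Optional[date]) -> List[str]:
    """Return warnings for discouraged date patterns (SHOULD NOT semantics)."""
    warnings = []
    if start is not None and end is not None and start == end:
        # Discouraged: a single point in time should be reported via endDate alone.
        warnings.append("startDate equals endDate; report a single time using endDate only")
    return warnings


print(check_contribution_dates(None, date(2019, 8, 15)))
print(check_contribution_dates(date(2019, 8, 15), date(2019, 8, 15)))
```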

@mellybelly please comment.

ReadTheDocs theme for documentation

We will soon move the content of the Info Model spec here into a ReadtheDocs website.

I plan to use the Sphinx theme that efforts such as the GA4GH variant representation spec and Phenopackets have used. Please take a look at the theme used in these two RTD sites and confirm that this is acceptable.

Modeling of 'Influences' between Artifacts

The last round of review highlighted the need for a relationship between artifacts where one influenced the creation of the other - to enable inference of 'transitive credit' (a key use case for our model)

To support this, I simply defined a wasInfluencedBy attribute on the Artifact class, defined as "A different artifact that directly or indirectly influenced creation of the artifact of interest". It comes with the following implementation guidance in the spec:

The notion of an ‘Influence’ between two artifacts broadly covers scenarios where one is directly or indirectly used in the creation of another. It is based on the PROV notion of influence - but narrower in that it applies here only between two Artifacts.

Influences include derivations or transformations of material or informational content (e.g. a cell line being derived from a tumor specimen, incorporation of a jpg image into a blog post, a format translation from a JSON dataset to an RDF version of the dataset).

Influences also cover an artifact providing a source of information used to generate an entirely new artifact or conclusion (e.g. a dataset on ice core CO2 levels as evidence for an assertion about arctic climate change, a knockout mouse strain and a dataset from studies using it to measure gene-phenotype associations).

The CDM defines a single, broadly-scoped influencedBy attribute to cover all such scenarios. But implementations MAY define specializations of this attribute with more constrained meaning - e.g. derivedFrom, informedBy, providesEvidenceFor, etc.
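Inference of transitive credit over wasInfluencedBy links could look roughly like the sketch below. The graph contents, the identifiers, the assumption that the dataset was influenced by the mouse, and both helper functions are illustrative only; the traversal simply follows influence links upstream and collects the contributors of every artifact it reaches.

```python
# Hypothetical data: each artifact maps to the artifacts it wasInfluencedBy.
was_influenced_by = {
    "ex:paper1": ["ex:dataset3"],
    "ex:dataset3": ["ex:mouse1"],
    "ex:mouse1": [],
}

# Hypothetical direct contributors per artifact.
contributors = {
    "ex:paper1": ["ex:author1"],
    "ex:dataset3": ["ex:karen"],
    "ex:mouse1": ["ex:stacey"],
}


def upstream_artifacts(artifact, graph):
    """All artifacts reachable by following wasInfluencedBy links (cycle-safe)."""
    seen = set()
    stack = list(graph.get(artifact, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def transitive_contributors(artifact, graph, agents):
    """Agents who contributed to any upstream artifact of `artifact`."""
    upstream = upstream_artifacts(artifact, graph)
    return sorted({a for up in upstream for a in agents.get(up, [])})


print(transitive_contributors("ex:paper1", was_influenced_by, contributors))
```

Because the attribute covers both direct and indirect influence, implementations would need to decide whether to store only direct links and compute the closure like this, or to assert indirect links explicitly.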

@mellybelly please comment/provide feedback. And feel free to revise the spec directly.

Datatype specification for coded values

In the current spec, I opted to follow a FHIR Coding-like approach for representing coded values in the data (as opposed to a simpler but less informative 'code' approach, or a more complex and likely overkill 'CodeableConcept' approach). See full description of our approach in the Information Model doc here.

This Coding data type is similar in content/complexity to the OntologyClass type/class that we (e.g. in the Phenopackets spec here) and others have used to capture coded values in a way that includes additional metadata such as the human-readable label and source ontology.

We should discuss if this is the right level given the use cases and context of implementation for the contribution model.
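For comparison with a bare 'code' string, a Coding-like value might be sketched as below. The field names (`code`, `system`, `label`), the identifier, and the serialization helper are assumptions for illustration and may not match the Information Model doc.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Coding:
    # Assumed fields: the coded value, its source terminology, and a readable label.
    code: str
    system: Optional[str] = None
    label: Optional[str] = None


def to_json_dict(coding: Coding) -> dict:
    """Serialize a Coding, omitting metadata fields that were not provided."""
    return {k: v for k, v in asdict(coding).items() if v is not None}


role = Coding(code="ex:role001", label="software designer")
print(to_json_dict(role))
```

A plain 'code' approach would keep only the `code` string, while a CodeableConcept-style approach would wrap a list of such Coding values plus free text.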

Define relationship between ontologies supporting the Contribution Data Model (CDM)

Our semantic data model needs ontological mappings for its types and attributes, and ontology terms to use as value sets.

For a contribution role value set, we will use the terms that the CRO provides.
For artifact types we can assemble a recommended set of terms from diverse sources such as Wikidata and the COAR, and provide as a value set coded in several different formats for use by implementations. e.g. a plain text or tsv list, a spreadsheet, skos vocabulary, possibly even an owl file? But we will not pursue a formal OBO ontology for this space at this time.
For core data model classes and properties needed for a complete mapping to types and attributes in the information model and JSON-LD schema, we will need a separate ontology/owl file. This 'core data model' ontology can import the CRO, but formally remain a separate ontology.

Attributes linking Contributions to Artifact and Agent

The model includes attributes to link a Contribution object to the artifact that is the subject of the contribution, and to the agent performing the contribution. Two issues to consider:

Should both attributes be included?
What to name them? At present I have named them contributionMadeTo and contributionMadeBy. Other suggestions welcome.

Support emerging jams author export format

https://github.com/jam-schema/jams

@jcolomb can help brainstorm more.
