researchobject / ro-crate

Research Object Crate

Home Page: https://w3id.org/ro/crate/

License: Apache License 2.0

Python 58.70% Makefile 41.30%

linked-data research-object specification jsonld reproducibility

ro-crate's Issues

Use Case: Represent collections in a repository with data entities by reference

As a repository manager, I want to be able to represent collections in my repository using RO-Crate so that the repository can use RO-Crate consistently.

At UTS we have been working with the PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) team on a proof of concept that restructures the storage layer as RO-Crates housed in an OCFL repository. PARADISEC has "items", which are easily represented as RO-Crates, and "Collections", which aggregate items. We have run into a modelling issue about how to include external resources, i.e. the items that make up a collection: by referencing them from an RO-Crate we would be violating the 'self-contained' requirement.

Suggested Solution: The aggregation can be done with RepositoryCollection (pcdm:Collection) using a hasMember relation and/or a memberOf on the item (to avoid having to update collection records as they change membership).
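A minimal sketch of the suggested modelling, written as plain Python dicts. RepositoryCollection, hasMember and memberOf come from the suggestion above; the example.org identifiers, the function names and the choice of Dataset as the item type are assumptions made for illustration only.

```python
import json

# Hypothetical identifiers, for illustration only.
COLLECTION_ID = "https://example.org/repository/collection/NT1"
ITEM_IDS = [
    "https://example.org/repository/item/NT1-001",
    "https://example.org/repository/item/NT1-002",
]

def collection_entity(collection_id, item_ids):
    """Collection side: lists its members by reference (hasMember)."""
    return {
        "@id": collection_id,
        "@type": "RepositoryCollection",
        "hasMember": [{"@id": item_id} for item_id in item_ids],
    }

def item_entity(item_id, collection_id):
    """Item side: points back at its collection (memberOf), so the
    collection record need not change when membership changes."""
    return {
        "@id": item_id,
        "@type": "Dataset",  # assumed item type
        "memberOf": {"@id": collection_id},
    }

collection = collection_entity(COLLECTION_ID, ITEM_IDS)
items = [item_entity(item_id, COLLECTION_ID) for item_id in ITEM_IDS]
print(json.dumps([collection] + items, indent=2))
```

Either direction can be used on its own; keeping only memberOf on the items means a collection crate never has to be rewritten when an item is added.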

Use Case: relation to OAI-PMH

As a developer of a solution to make "research objects" identifiable from within an institutional archive repository, I want to know whether RO-Crate will propose a syntax for OAI-PMH sets, so that ROs can be found by selective harvesting, or whether this task is left to the different specialized communities.

Use Case: compatible with BagIt

As a repository manager, I want a well-defined way of creating a BagIt bag from an RO-Crate so that I can leverage existing tooling to track the fixity of the object payload and metadata over time.

See, for example, bagit-ro (https://github.com/ResearchObject/bagit-ro), BDBags and the way in which DataCrate distinguishes Working DataCrates from Bagged DataCrates
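As a rough illustration of the fixity side of this use case, the sketch below computes BagIt-style payload manifest lines (checksum followed by a path under data/) for the files of a crate. The function name, the in-memory file mapping and the decision to place the RO-Crate metadata file in the payload are assumptions; bagit-ro and DataCrate each make their own choices here.

```python
import hashlib

def payload_manifest(files):
    """Return BagIt-style manifest lines ("<sha256>  data/<path>") for a
    mapping of crate-relative paths to file contents (bytes)."""
    lines = []
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        lines.append(f"{digest}  data/{path}")
    return lines

# Hypothetical crate contents; here the metadata file travels in the payload.
crate = {
    "ro-crate-metadata.jsonld": b'{"@graph": []}',
    "results/table.csv": b"a,b\n1,2\n",
}
for line in payload_manifest(crate):
    print(line)
```

Written to manifest-sha256.txt, lines in this shape are what existing BagIt tooling re-checks over time.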

@stain, is this what you mean by the "json-in-Github" scenario?

Use Case: In-situ/on-the-fly manifests alongside 'payload' data

As a developer/researcher/data-steward, I want to be able to capture the manifest in the root of my git-repo/data-folder as I work on my code/data, so that my existing processes & folders aren't affected while I iteratively enhance the metadata and payload.

See, for example, CodeMeta, Frictionless Data, and the way in which DataCrate distinguishes Working DataCrates from Bagged DataCrates

This contrasts to the approach where some sort of wrapper folder/container is put around 'payload', e.g. BagIt.

Join the team (post here to be added)

Add a comment to this issue to join the https://github.com/ResearchObject/ro-crate team.

There are no obligations in being a contributor except abiding by our code of conduct.

Missing entity from workflow example, or spec needs clarifying.

Describe the bug
The workflow RO crate example refers to { "@id": "README.md" } in its hasPart, but there is no entity with that @id in the @graph. Is this valid?

URL
https://github.com/ResearchObject/ro-crate/blob/master/examples/workflow-0.2.0/ro-crate-metadata.jsonld#L79

Suggested fix
Add an entity for the README file, or clarify in the spec that this is a valid scenario.

Additional context
The spec mentions:

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property.

but it is not clear if the inverse is true.
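Whatever the spec ends up saying, the situation is easy to detect. The sketch below lists the hasPart references on the root dataset that have no entity of their own in the @graph; the function name and the trimmed-down graph are assumptions for illustration, and only direct hasPart links are checked.

```python
def dangling_parts(graph, root_id="./"):
    """Return the @id values referenced from the root's hasPart that have
    no entity of their own in the @graph."""
    by_id = {entity["@id"]: entity for entity in graph}
    parts = by_id[root_id].get("hasPart", [])
    if isinstance(parts, dict):  # tolerate a single reference
        parts = [parts]
    return [part["@id"] for part in parts if part["@id"] not in by_id]

# Trimmed-down stand-in for the workflow example discussed above.
graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "README.md"}, {"@id": "workflow/retropath.knime"}]},
    {"@id": "workflow/retropath.knime", "@type": "SoftwareSourceCode"},
]
print(dangling_parts(graph))
```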

Use-case template

As discussed in the first call, we need a template to capture use-cases

It may be enough to capture these as user stories, i.e.
As a < type of user >, I want < some goal > so that < some reason >.

However, we might also want to use a fuller use-case template. See, for example, the Use Case template used by the w3c Data Exchange Working Group: https://www.w3.org/2017/dxwg/wiki/Use_Case_Working_Space#.2F.2FUse_Case_template

cc @stain @ptsefton

Use Case: Export as RO-Crate

As a repository manager/user, I want to be able to losslessly export a given object in RO-Crate format so that I can work with it in a different environment (local PC/offline) or migrate/replicate it to another platform.

See the final recommendations of the Research Data Alliance Repository Interoperability Working Group. Having surveyed a variety of options (e.g. OAI-PMH, ResourceSync, etc.), this RDA group came to the conclusion that a simple but extensible, BagIt-based packaging format was the best candidate to provide interoperability across diverse repositories. The particularities of the RDA spec should be considered when addressing #13.

Use Case: Given a DOI, directly download & verify an RO-Crate

As a researcher (or aggregation service), when given a DataCite DOI, I'd like to be able to directly download & verify (e.g. by comparing checksums) an RO-Crate so that I can (for example) use it in a downstream workflow without having to parse the DOI landing page

Realizing this use case is beyond scope of this spec, but it should inform the spec and guidelines, so that RO-Crate can satisfy this use-case for third-parties. For discussion/examples of the need for direct download from a Persistent Identifier see:

Use Case: There's at least one easily accessed normalization tool to take arbitrary JSON-LD and get it into RO-Crate compliant JSON-LD

As a Research Software Engineer, I want to be able to build a DataCrate using code that suits my problem-space so that I can leave out the @context or use one that's convenient (in violation of #10), build an object tree that suits my domain (in violation of #9) and use singleton values rather than arrays for convenience (in violation of #22).

This means that we should have at least one command line tool and maybe an online "playground" that can normalize JSON-LD. (We have at least an alpha version of most of this in the CalcyteJs tool, and as that is JavaScript-based it would be possible to build an in-browser playground from this code reasonably cheaply.)

A Python version would also be easy to write as there is a JSON-LD library.

Use Case: Describe a tabular data file directly in RO-Crate metadata

As a researcher working with tabular data, I want to be able to define the columns (description, data-type, valid values/ranges, etc.), so that I can provide a structured data dictionary.

Approaches elsewhere:

Frictionless Data: Tabular Data Package, Tabular Data Resource, etc.
CSV on the Web which now has experimental support in Google Dataset Search - note this can use a side-car file
DataSpice, see example including schema.org
Psych-DS can describe tabular data in the root metadata file OR in sidecar JSON-LD files. Like DataSpice, in both cases this is done using schema.org, i.e. variableMeasured etc.
BIDS approach to tabular files - note this is a side-car file
Metatab

Use Case: Published Workflow

As a researcher, I want to publish the workflow that I used in a specific publication so that I can make this available alongside the publication so others can reproduce the exact analysis I did.

In case I've based myself on a reference workflow (see other Use Case), I want to refer to that one, but I will add my own data (links) and tweak parameters, reference data, ...

I want to be able to pull this RO-crate into the workflow system I used and have it populate it properly. In case of Galaxy this means: install workflow (and tools), data into data library (this can be scripted through the API). To make my life easy, I also want an easy way to export an RO-crate from that workflow system.

JSON-LD context should be licensed CC0?

Describe the bug
Currently, the license for the RO-Crate specification and this GitHub repository is the Apache License 2.0.

Should downstream users perhaps be allowed to copy-paste our JSON-LD Context into their own files without keeping it under the Apache License?

I suggest we re-license (only) the JSON-LD context (and perhaps examples) to public-domain CC0 so it can be copy-pasted freely and included in data of any license.

Whether we can legally do this depends on how we built the current list of schema.org predicates. Is it derived from the schema.org downloads/context? @ptsefton ?

URL
https://w3id.org/ro/crate/context

Suggested fix

License context.json as CC0 which does not require any attribution or re-licensing.

Additional context

AL2 is a wide-ranging permissive license that is friendly to both business and further open source use. However it does have some attribution requirements for redistribution that might be cumbersome:

You must give any other recipients of the Work or Derivative Works a copy of this License; and

..which does not mean they have to license things under Apache License, just give them a copy. But this can be quite a burden if you are just in the middle of making a ro-crate-metadata.jsonld - and can cause confusion if the RO-Crate is under a different license!

You must cause any modified files to carry prominent notices stating that You changed the files; and

This is quite tricky - as it means users can't edit the JSON-LD file without adding such notice.

You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

We have NOT got any such notices in the context.json - so nothing to retain. (This also means the context.json does not reference back to the "copy of this license".)

We DO have such notices in the markdown file - in fact it is there twice as we also want to present this in the generated HTML on the website.

(This means if someone copies the ro-crate HTML/Markdown spec they have to keep that license/attribution info there)

If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

We do not currently have a NOTICE file, so clause 4 does not apply, which would have required its attributions to be re-distributed to any derived work. This clause is more useful for software as the attributions within NOTICE have to be preserved all the way into final compiled products (even inside your DVD player).

(continued) You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

This would currently apply to people extending context.json - that is their own copyright/license applies to whatever they add/modify - however this license still applies to the "original" bits.

A complicating matter is that JSON does not permit file comments, so if we WERE to have a license header in it, we would have to do it similar to this which would then look confusing if embedded within the JSON of an RO-Crate.

But I would argue there is minimal intellectual property in our JSON-LD context, and we just want people to get on with their lives, so we can instead license that file only as public-domain CC0 and make that clear on the website.

Trim down Organization departments

We don't need to show affiliation to department as it complicates the examples and organizational identifiers.

Use Case: Link to and re-use existing metadata (sidecar) files

As a researcher who already uses a defined metadata standard/serialisation, I want to be able to reuse this metadata and have RO-Crate express its relationship to the dataset as a whole and/or parts of the dataset (a directory or set of files), so that I can easily layer RO-Crate on top of my existing workflow and community/disciplinary norms

Examples:
I have an existing side-car file for describing tabular data and don't want to have to sync that into RO-Crate syntax, but want RO-Crate to express the relationship between the tabular data file and the sidecar file, see BIDS handling of tabular data

See, in contrast, #27 where variable-level metadata is captured directly in RO-Crate metadata

Use case: Use flattened JSON-LD

As a non-professional developer or Research Software Engineer, I want to be able to rely on JSON-LD manifests being in Flattened JSON format so that I don't have to traverse an arbitrarily structured data structure with some properties included hierarchically and some by reference.

Context for this: There are libraries for processing JSON-LD in Python and Javascript/Node.js but they DO NOT have features for traversing data relationships, only high level operations like "Flatten".

Replaces #4.
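To make the difference concrete, here is a deliberately simplified sketch of what flattening does: nested nodes that carry an @id are moved into a flat index and replaced by references. It ignores @context handling and blank nodes entirely, so it is not a substitute for the "Flatten" operation of a real JSON-LD library; the function name and sample data are illustrative.

```python
def flatten(node, graph=None):
    """Move nested nodes that carry an @id into a flat {id: entity} index,
    leaving {"@id": ...} references behind. Simplified: no @context
    handling, and nested nodes without an @id stay inline."""
    if graph is None:
        graph = {}

    def visit(value):
        if isinstance(value, list):
            return [visit(item) for item in value]
        if isinstance(value, dict):
            entity = {key: visit(item) for key, item in value.items()}
            if "@id" in entity and len(entity) > 1:
                # Merge with anything already recorded for this @id.
                graph.setdefault(entity["@id"], {}).update(entity)
                return {"@id": entity["@id"]}
            return entity
        return value

    visit(node)
    return graph

nested = {
    "@id": "./",
    "name": "Example dataset",
    "contentLocation": {
        "@id": "http://sws.geonames.org/9884115/",
        "name": "Kilburn Building, Manchester, UK",
    },
}
flat = flatten(nested)
print(list(flat))
```

After flattening, every property value is either a literal or a reference, so consumers never need to handle both the inline and the by-reference form of the same entity.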

Align with CodeMeta

We should have a CodeMeta dual Research Object example, especially as CodeMeta is now being picked up by GitHub, Zenodo, etc. and it heavily uses http://schema.org/

It might be that the Workflow example can be adapted for that as arguably a workflow is also software.

Q: Should codemeta.json file be a valid manifest.jsonld alternative by adding some extra RO terms, or would both manifests live side by side? (generating one from another)

Use Case: Properties from multiple contexts

As a developer with limited understanding of JSON-LD, I want to use properties from multiple sources so that I can augment, for example, schema.org properties with those from ORE.

I think it would be easiest to develop, read and maintain if all properties apart from those from a single context were prefixed.

Working on a new use case, hopefully to be completed soon

Can't make it to our next call, but will work on this and share something asap.

As a type of user, I want some goal so that some reason.

Change the name to RO-Crate?

We have to settle on the name of our spec and its spelling. Is it ROLite, RO Lite, RO-Lite, Research Object Lite?

In discussion with @eocarragain and @ptsefton we are not sure if "Lite" gives the right message for anyone outside the RO community (who therefore also does not know what is "RO").

One suggestion is that as we are intending to build on both Research Objects and DataCrate the name could reflect both origins.

How about RO-Crate? Some properties of a crate are that it is not closed, you can see through its edges, and you can put anything inside it, even something large.

(I'm afraid I would want the - in there to avoid sounding like the ROC-rate - which would upset the economists)

Use Case: Provide a DOI for the spec

As an editor of this specification, I want there to be a DOI so that the spec can be cited, and so that it can be found even if the hosting platform (e.g. GitHub) goes away.

Stian has made a DOI and deposited a version of the spec in Zenodo - but this resolves to PDF - is this what we want to do? Or should we have a DOI that resolves directly to the RO-Crate website, which would create a governance problem - who will mind that DOI over time?

should hyperlink concepts to schema.org

In https://github.com/ResearchObject/ro-crate/blob/master/docs/0.3-DRAFT/index.md#workflows-and-scripts

accessibilityAPI should link to https://schema.org/accessibilityAPI and so forth

Agree a set of related reference/target formats

We should identify related projects in this area (other than DataCrate). This will help:

inform and validate ROLite spec against real examples in the wild;
provide examples of how new domain-specific formats could benefit from ROLite as an extensible, generic base-specification
potentially lead to alignment, collaboration, adoption

See #2 (WholeTale) and #3 as examples.

There are a large number of such projects (see https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWyP8QggkE/edit?usp=sharing), so we should agree early which subset are relevant references for ROLite work

additionalType needs to be namespaced

Describe the bug

Adding WorkflowSketch etc to the @context is nice, but does not actually work with additionalType as the @id is interpreted in JSON-LD as a relative path.

URL
https://researchobject.github.io/ro-crate/0.2-DRAFT/#workflows-and-scripts

Suggested fix

We have three options: use a prefix so that the value is expanded by the @context, as with the earlier wfdesc:Workflow (which exposes the different workflows); use the full URL; or move Workflow, Script and WorkflowSketch to @type: [array], whose values are always expanded by the @context instead of being treated as relative paths.

Additional context

{
  "@context": ["https://w3id.org/ro/crate/0.2-DRAFT/context", {"@base": "http://example.com/crate-1/"}],
  "@id": "workflow/retropath.knime",
  "@type": "SoftwareSourceCode",
  "additionalType": {"@id": "Workflow"},
  "name": "RetroPath Knime workflow",
  "description": "KNIME implementation of RetroPath2.0 workflow",
  "creator": {"@id": "#thomas"},
  "programmingLanguage": {"@id": "#knime"},
  "license": "https://spdx.org/licenses/BSD-2-Clause.html",
  "potentialAction": {
    "@type": "ActivateAction",
    "instrument": {"@id": "#knime"}
  }
}

gives the triples in JSON-LD playground

Subject	Predicate	Object
_:b0	http://schema.org/instrument	http://example.com/crate-1/#knime
_:b0	http://www.w3.org/1999/02/22-rdf-syntax-ns#type	http://schema.org/ActivateAction
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/additionalType	http://example.com/crate-1/Workflow
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/creator	http://example.com/crate-1/#thomas
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/description	KNIME implementation of RetroPath2.0 workflow
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/license	https://spdx.org/licenses/BSD-2-Clause.html
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/name	RetroPath Knime workflow
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/potentialAction	_:b0
http://example.com/crate-1/workflow/retropath.knime	http://schema.org/programmingLanguage	http://example.com/crate-1/#knime
http://example.com/crate-1/workflow/retropath.knime	http://www.w3.org/1999/02/22-rdf-syntax-ns#type	http://schema.org/SoftwareSourceCode

the intention was for Workflow to be expanded by the @context to http://purl.org/ro/wfdesc#Workflow - this works if you use wfdesc:Workflow
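The behaviour can be mimicked in a few lines of Python. This is only a simplified model of how a JSON-LD processor treats the value inside {"@id": ...}, not the real expansion algorithm: a known prefix is replaced, and everything else is resolved against @base. The function name is made up for this sketch; the base and the wfdesc namespace are the ones quoted above.

```python
from urllib.parse import urljoin

BASE = "http://example.com/crate-1/"
PREFIXES = {"wfdesc": "http://purl.org/ro/wfdesc#"}

def expand_id(value, base=BASE, prefixes=PREFIXES):
    """Simplified model of how an {"@id": value} reference is expanded:
    a known prefix is replaced, anything else is resolved against @base.
    Plain terms such as "Workflow" are NOT looked up in the context here."""
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + suffix
    return urljoin(base, value)

print(expand_id("Workflow"))         # http://example.com/crate-1/Workflow
print(expand_id("wfdesc:Workflow"))  # http://purl.org/ro/wfdesc#Workflow
print(expand_id("#knime"))           # http://example.com/crate-1/#knime
```

The first line of output matches the unwanted additionalType triple in the table above; the second is what was intended.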

Tables for MUST, SHOULD etc in spec are too hard to complete

The MUST / SHOULD tables are too complicated to complete in time for a release.

Some reasons:

The Data citation use case is not well specified and will require development to inter-operate with DataCite software
The Google Dataset search integration is untested - we don't have data to advise people of what to put in RO-Crates.
The institutional repository case will vary by institution so we can't do that as part of the spec.

Suggested fix

I suggest we remove the tables and replace with a simple list of MUST have elements for a minimal RO-Crate - then start working on Implementation guides for people who want to be DataCite or Google Dataset search compliant.

My Suggestion for MUST:

@type MUST be (at least) schema:Dataset
@id MUST be the string './' (with the RO-Crate Metadata File Descriptor present as well)
name - So humans can identify the crate
datePublished - To tell different versions apart
license - So that people, including systems administrators and those cleaning up after the creator has gone, know what they are allowed to do with the data
contactPoint so there is an email address.

A future version of this specification will have implementation guides for data publishers which explain how to create citable datasets.

The following properties are SHOULDs - to assist in distributing data

author SHOULD appear at least once, referencing a Contextual Entity of @type: Person or Organization

And where there is publishing support available:

identifier - Such as DOI, or a URL - but this depends on repository infrastructure
distribution - Link to a downloadable version of the data
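A short list like this is also straightforward to check mechanically. The sketch below tests a root dataset entity against the MUST items proposed in this issue; it reflects this proposal only, not a released version of the specification, and the function name and sample entity are illustrative.

```python
REQUIRED = ("name", "datePublished", "license", "contactPoint")

def missing_musts(root):
    """Return a list of problems for a root dataset entity, following the
    MUST list proposed in this issue (not a released specification)."""
    problems = []
    types = root.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Dataset" not in types:
        problems.append("@type must include Dataset")
    if root.get("@id") != "./":
        problems.append("@id must be './'")
    problems += [f"missing {key}" for key in REQUIRED if key not in root]
    return problems

root = {"@id": "./", "@type": "Dataset", "name": "Example crate"}
print(missing_musts(root))
```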

Use Case: Allow the use of variant file names for manifest/catalog/index/readme

As a data manager, I want to be able to choose what I call the manifest JSON-LD and the root HTML page so that I can make datasets that appeal to my community and avoid name-collisions with existing files - eg use index.html for most datasets but allow for a variant where the payload is a website containing index.html files.

Implementation note: The manifest/catalog file and HTML root can be specified in a "magic" DataCrate or RO-Crate.json file.

Issue: path/contentUrl only allowed on MediaObject

As noted in the spec, schema.org currently only allows contentUrl (aliased as path in RO-Crate) on MediaObject objects. Effectively, no RO-Crate will be valid schema.org.

Do we actually need path? Can we use the @id to capture the path within the RO-Crate?

should directory data entities id end in /?

In the current Directory File Entity we say @id SHOULD end in /, e.g.

Directory File Entity

<schema property	constraints	Valid RO-Crate	Citation Use-case (DataCite)	JISC RDSS	Data discovery (Google Dataset Search)
`@type`	MUST be `Dataset`
`@id`	MUST be a URI Path relative to the RO Crate root; SHOULD end with `/`	Y		Y

For example:

{
  "@id": "photos/",
  "@type": "Dataset",
  "name": "Photos of Gibraltar from 1950 till 1975",
  "about": {"@id": "http://dbpedia.org/resource/Gibraltar"}
}

So should we detail this ending in / a bit more, or rather recommend the opposite? The idea of ending with / is to mirror what happens in browsers, e.g. https://data.research.uts.edu.au/examples/ro-crate/0.2/farms_to_freeways/data/files/431 (@id: "data/files/431") will redirect to https://data.research.uts.edu.au/examples/ro-crate/0.2/farms_to_freeways/data/files/431/ (@id: "data/files/431/").

JSON-LD processing should not remove the trailing /. In fact, the two identifiers will be treated as different (RDF 1.1 clarified that the same string-equality logic applies to all URIs once they are made absolute).
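The effect of the trailing slash can be seen with ordinary URL resolution, which is what browsers and JSON-LD processors apply to relative references. In this small sketch the crate location and file name are hypothetical.

```python
from urllib.parse import urljoin

ROOT = "https://example.org/crate/"  # hypothetical crate location

with_slash = urljoin(ROOT, "photos/")
without_slash = urljoin(ROOT, "photos")

# Plain string comparison: the two identifiers differ.
print(with_slash == without_slash)          # False

# Resolving a file name relative to each shows why the slash matters.
print(urljoin(with_slash, "beach.jpg"))     # https://example.org/crate/photos/beach.jpg
print(urljoin(without_slash, "beach.jpg"))  # https://example.org/crate/beach.jpg
```

Without the trailing slash the last path segment is treated as a file name and dropped during resolution, so children of the directory resolve to the wrong place.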

Use Case: linking to related URLs from a dataset

As a researcher, I want to be able to provide a link to a related website, publication, vocabulary or other resource which gives contextual information for the dataset I'm publishing.

Our research data catalogue has fields for these, and they're currently expressed using the JSON-LD "citation" property, which is very general.

Use Case: Validate JSON and get valid JSON-LD

As a developer of software I want to validate my RO-Crate JSON output so that I can be sure it is valid according to the spec, without having to learn and test for each of the underlying technologies like JSON-LD, URIs and schema.org.

In particular JSON Schema is popular in JSON community, with a plethora of tooling available.

A challenge of validating a graph structure in JSON-LD using JSON Schema is the lack of hierarchy. With our use of @graph and flattened-compacted there is a more predictable structure which is easier to validate, at least for each block (given their @type).

So the idea here is not that every RO-Crate JSON would be validated against such a schema, but that if you followed the schema, your JSON would be valid; you would be making JSON-LD "by accident".

As further semantic checks might be needed (for instance the @id of an author should point to a neighbouring block that has a name), it might be possible to use RDF Shapes in SHACL or ShEx. While these are closer to the graph nature and much more natural for this use case, they are not as developer-friendly, as they are further removed from the syntax used to make the JSON-LD. Thus a developer would not easily understand the RDF Shapes errors caused by, for instance, nesting a JSON element at the wrong level.

Some related approaches that play along the duality of JSON Schemas, JSON-LD and RDF Shapes:

Use Case: Values in JSON-LD are predictable (eg always arrays, never scalars)

As a Research Software Engineer, I want the DataCrate manifest to be easy to navigate and parse so that I don't have to write helper functions such as working out whether a value is a scalar or an array.

This means that values should always be arrays apart from the @id.
So rather than:

"@graph": [
    {
      "@id": "/",
      "path": "/",
      "@type": "Dataset",
      "Description": "This data set doesn't really exist"
    }

This should be:

"@graph": [
    {
      "@id": "/",
      "path": ["/"],
      "@type": ["Dataset"],
      "Description": ["This data set doesn't really exist"]
    }

This is in a family with #9 and #10 - designed to make implementation simple.
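For comparison, this is the kind of helper a consumer has to write when values may be either scalars or arrays, and the normalisation a producer could apply once instead. The function names are illustrative.

```python
def as_list(value):
    """Wrap a scalar in a list; leave lists untouched."""
    return value if isinstance(value, list) else [value]

def normalise(entity):
    """Return a copy of an entity in which every value except @id is a list."""
    return {key: value if key == "@id" else as_list(value)
            for key, value in entity.items()}

entity = {"@id": "/", "path": "/", "@type": "Dataset",
          "Description": "This data set doesn't really exist"}
print(normalise(entity))
```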

Missing schema:author in Workflow Example

Describe the bug
According to the 0.2 spec, schema:author must appear at least once.

In the workflow-0.2, no author is defined.

URL
https://github.com/ResearchObject/ro-crate/blob/master/examples/workflow-0.2.0/ro-crate-metadata.jsonld

Suggested fix
Add
"author": {"@id": "#thomas"} or equivalent to the file.

Missing in v1: Distribution property and Download

Describe the bug

The DataCrate Spec mentions the distribution property, pointing to a DataDownload Contextual entity. This is used by Google Dataset search. We lost this somewhere in the editing process.

DataCrates which are packaged for distribution SHOULD:

Have a DOI as an identifier.
Use the DOI URL as the @id.
Include the DOI without a URL as an identifier.
Link to a DataDownload using the distribution property.
{
    "@id": "https://doi.org/10.4225/59/59672c09f4a4b",
    "@type": "Dataset",
    "citation": {
        "@id": "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0181020"
    },
    "path": "data/",
    "datePublished": "2017",
    "description": "Palliative care planning for nursing home residents with advanced dementia is often suboptimal. This study compared effects of facilitated case conferencing (FCC) with usual care (UC) on end-of-life care",
     
    "identifier": [
        "https://doi.org/10.4225/59/59672c09f4a4b",
        "doi.org/10.4225/59/59672c09f4a4b"
    ],
    "name": "Data files associated with the manuscript:Effects of facilitated family case conferencing for advanced dementia: A cluster randomised clinical trial",
    "distribution": [
        {
            "@id": "https://data.research.uts.edu.au/examples/v1.0/timluckett.zip"
        }
    ]
}

{
    "@id": "https://data.research.uts.edu.au/examples/v1.0/timluckett.zip",
    "contentUrl": "https://data.research.uts.edu.au/examples/v1.0/timluckett.zip",
    "@type": "DataDownload",
    "encodingFormat": "zip"
}

URL
Provide link to RO-Lite page or source code most related to the bug.

Suggested fix

Add in the example above.

Additional context

See the DataCrate spec (which RO-Crate replaces).

Use Case: Archive MIAME compliant RNA-sequencing data

As a data steward responsible for managing and preserving data generated in my lab, I want to ensure raw data generated by a sequencing experiment (gzipped FastQ files, typically a few GB per sample) is stored securely in my university RDM system, with all the relevant metadata, so that I can submit the data to ArrayExpress (the designated ELIXIR repository) with the click of a button.

My archiving solution (https://viaa.be/en) expects by default BagIt containers with a BagInfo.txt file containing DataCite metadata. I want to add additional metadata (the MIAME standard used by ArrayExpress).

Use Case: Save NMR data associated with a chemical structure

As a coordinator of the NMReDATA Initiative, which aims at making NMR assignment data of small molecules public, I want to determine how to make our "NMR records" RO-Crate compliant so that they can be findable in institutional archive repositories, Zenodo, etc.

How are diverse specialized communities going to propose their specific format/schema to the RO-crate community?

Use Case: Preserve order of creators

As a Researcher, I want the order of creators to be preserved so that academic norms for assigning credit are assumed.

This could be implemented using @list - if we define the context:

"author": {
"@id": "http://schema.org/author",
"@container": "@list"
}

I suggest we consider doing this for ALL terms - and also the @graph if that's legal to ensure that the Root Dataset can be placed first - potentially speeding up processing time for implementers.

IMPLEMENTATION NOTE: Most naive implementers are likely to treat JSON-LD as plain JSON, in which author (and other) lists are arrays whose order is preserved. Using @list should ensure that this expectation is also met when round-tripping into other formats such as RDF.

Use Case: @context is always a simple "label": "URI" with no namespaces or indirection

As a non-professional developer or Research Software Engineer, I want to be able to rely on the @context always being a simple {"label": "URI"} with no namespaces or indirection, so that I don't have to understand and write code to deal with all the different ways that @context can be expressed.

Context for this: There are libraries for processing JSON-LD in Python and Javascript/Node.js but they DO NOT have features for resolving context labels, only high level operations like "Flatten".

NOTE: DataCrate already does it this way.

Consider using flattened JSON-LD

When writing the DataCrate spec I started using the same JSON-LD style as used here, where properties are in-line.

{
    "name": "Dataset of repository sizes in CWL Viewer",
    "description": "This is a simple dataset showing the size in bytes of the repositories examined by https://view.commonwl.org/ since September 2018",
    "keywords": "CWL, repository, example",
    "temporalCoverage": "2018-09/2019-01",
    "contentLocation": {
        "@id": "http://sws.geonames.org/9884115/",
        "name": "Kilburn Building, Manchester, UK"
    }
}

However, this is difficult to code against; to get the content location on the first mention you look for a contentLocation property, but on subsequent mentions you have to find the item by @id - but this is non-trivial to code.

We decided that the format should be optimised for ease of coding for linked-data-naive programmers. If you use flattened JSON-LD then it is trivial to index the @graph into a dictionary by ID and to traverse the @graph.

In DataCrate we also noted that we didn't expect hand-authoring to be much of a thing - and it's pretty easy to code tools that generate flattened JSON-LD.

Of course, if there were good libraries for using JSON-LD and traversing the @graph then this would not matter, but as far as I know there are not.
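To show how little code the flattened form needs, here is a small sketch of the "index by ID, then traverse" pattern mentioned above. The sample graph reuses the content location from the example; the root @id, the Place type and the resolve helper are assumptions added for illustration.

```python
graph = [
    {"@id": "./", "@type": "Dataset",
     "name": "Dataset of repository sizes in CWL Viewer",
     "contentLocation": {"@id": "http://sws.geonames.org/9884115/"}},
    {"@id": "http://sws.geonames.org/9884115/", "@type": "Place",  # assumed type
     "name": "Kilburn Building, Manchester, UK"},
]

# Index every entity by its @id once...
by_id = {entity["@id"]: entity for entity in graph}

def resolve(reference):
    """Follow an {"@id": ...} reference; return None if it is not described."""
    return by_id.get(reference["@id"])

# ...then following a property is two dictionary lookups.
place = resolve(by_id["./"]["contentLocation"])
print(place["name"])  # Kilburn Building, Manchester, UK
```

The same lookup works no matter how many entities mention the location, which is exactly the case that is awkward with in-line nesting.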

Some questions from a specific use case

Hi all,

I'm working on the use case described in #20
Basically I want to wrap a reference workflow obtained from a Galaxy instance so the main file would be a .ga file here named "workflow_Galaxy.ga".
Besides this file I will include a CWL-abstract version of it in a file named here as "workflow_abstract.cwl". This would be more like an extra metadata file as it is not really executable or interpretable by software.
Those are the 2 payload files included in the RO-crate and would both be in the root dir.
I built a template for the ro-crate-metadata.jsonld file of this crate:

{
  "@context": "https://w3id.org/ro/crate/1.0/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.jsonld",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.0"},
      "about": {"@id": "./"}
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "hasPart": [
        {
          "@id": "workflow_Galaxy.ga"
        },
        {
          "@id": "workflow_abstract.cwl"
        }
      ]
    },
    {
      "@id": "workflow_Galaxy.ga",
      "@type": ["File", "SoftwareSourceCode", "Workflow"],
      "contentSize": "****Fill at runtime***",
      "description": "Workflow description in Galaxy format 2",
      "encodingFormat": "text/yaml",
      "programmingLanguage": {"@id": "https://galaxyproject.org/"}
    },
    {
      "@id": "workflow_abstract.cwl",
      "@type": ["File", "Workflow"],
      "contentSize": "****Fill at runtime***",
      "description": "Workflow description in CWL-abstract format",
      "encodingFormat": "text/yaml",
      "programmingLanguage": {"@id": "https://w3id.org/cwl/v1.1/"}
    },
    {
      "@id": "#history-01",
      "@type": "CreateAction",
      "object": { "@id": "workflow_Galaxy.ga" },
      "name": "Workflow file created",
      "endTime": "2020-01-27",
      "agent": { "@id": "human agent responsible for this" },
      "instrument": { "@id": "https://usegalaxy.be" },
      "actionStatus": { "@id": "http://schema.org/CompletedActionStatus" }
    },
    {
      "@id": "https://usegalaxy.be",
      "@type": "SoftwareApplication",
      "name": "The Belgian Galaxy instance",
      "url": "http://usegalaxy.be",
      "version": "2020-01-27"
    }
  ]
}

But I still have a few specific questions that I couldn't find answered in the specification:

Is it OK to use the encodingFormat property for the workflow file(s)? The specification refers to the programmingLanguage property to describe the software that creates/runs the workflow, but I think it's also useful to define the format itself if possible (YAML in the case of Galaxy format 2).
I would like to represent the fact that the workflow file was created in a specific server instance (in this case usegalaxy.be) but could, in theory, be run on any server running Galaxy. Is it correct to have different entities for each? Or, in the case of web services, should I create a single entity with the software name (Galaxy) and the specific URL of the instance as a property?
Also for web services, would it make sense to use the date when the service was used as a version?
Would it make sense to add the "SoftwareSourceCode" type to the abstract CWL file? It's not really executable/interpretable by any software.

Hope someone can help me with these details.
Thanks,
Ignacio

Issue: revisiting self-containment of data entities

Following discussions at Open Repositories, the current spec reads (with added emphasis):

At the basic level, an RO-Crate is a collection of files represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root. Self-containment is a core principle of RO-Crate, i.e. that all Dataset files and relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. However the RO-Crate MAY also reference external resources which are stored or accessed separately, via URIs, e.g. because these cannot be included for practical or legal reasons.

I suggest we change this to:

An RO-Crate is a collection of files and folders represented as a schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc.

Self-containment is a core principle of RO-Crate, i.e. that all files and folders that make up the RO-Crate are contained in or under the RO-Crate Root. For this reason, all Data Entities described in the RO-Crate Metadata File using the hasPart property MUST be referenced with a relative path. Note, for some use-cases, some RO-Crate files may be stored in external locations with mechanisms provided to re-compose an RO-Crate when needed; however, from RO-Crate's perspective all Data Entities are local. For example, if using RO-Crate with the BagIt specification, the fetch.txt file can be used for this purpose.

The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root. Self-description is a core principle of RO-Crate, i.e. that all relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. The RO-Crate Metadata section below describes specific requirements for describing Data Entities and Contextual Entities, including which properties of externally referenced Contextual Entities should appear in the RO-Crate Metadata File.
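To illustrate how the proposed relative-path rule for hasPart could be checked mechanically, here is a rough sketch in standard-library Python. It is not part of any RO-Crate tooling; the function names, the example graph, and the example.org URL are assumptions made for illustration, and "relative" is taken to mean "no URI scheme, no host, and no leading slash".

```python
from urllib.parse import urlparse


def is_relative_path(identifier):
    """True if the identifier has no scheme, no host and is not root-anchored."""
    parsed = urlparse(identifier)
    return not parsed.scheme and not parsed.netloc and not identifier.startswith("/")


def non_local_parts(graph, root_id="./"):
    """Return the hasPart @ids of the root dataset that are not relative paths."""
    index = {entity["@id"]: entity for entity in graph}
    parts = index[root_id].get("hasPart", [])
    if isinstance(parts, dict):  # a single part may be given without a list
        parts = [parts]
    return [part["@id"] for part in parts if not is_relative_path(part["@id"])]


# Hypothetical graph: one local file and one externally referenced file.
graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        "hasPart": [
            {"@id": "workflow_Galaxy.ga"},
            {"@id": "https://example.org/big-file.bin"},
        ],
    }
]

print(non_local_parts(graph))
```

Under the proposed wording, any identifier this reports would have to be brought under the RO-Crate Root, for example by a fetch mechanism, or be described with a relation other than hasPart.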

I think it makes sense to pull apart files/folders (self-containment) from metadata (self-description).

For self-containment (files/folders), removing the option of referencing external files/folders makes RO-Crate a lot simpler to explain and work with. I think it is a good example of where we should be opinionated and constrain the scope of RO-Crate rather than leave things open. From a Research Object perspective, it makes RO-Crates more explicitly like RO-Bundles/bagit-ro (focused on packaging) rather than the general RO model (which can aggregate content from anywhere). Mechanisms like fetch.txt in BagIt get around some of the 'practical' reasons for referencing external resources, e.g. duplication of large files, and we can illustrate these in implementation guidance.

The issue of access control ("legal reasons") is trickier. One option would be to treat this kind of content as related to, but not a component part of, the RO-Crate. For example, if you want to refer to and describe external content that has some relevance to the RO-Crate, we could use something like pcdm:hasRelatedObject rather than schema:hasPart:

pcdm:hasRelatedObject - Links to a related Object that is not a component part, such as an object representing a donor agreement or policies that govern the resource.

At least this would draw a clear line between strict component parts and external content.

Use Case: RO-Crate as a mini website with discoverable metadata

As a developer/researcher, I want to publish a set of files, dataset and/or visualisation with a user-friendly web interface (and CSS/JS to support this) which is also a RO-Crate with high-quality machine-readable metadata.

Two examples:

an interactive d3.js visualisation of a humanities dataset (evolution of university faculties) with accompanying text
a website to make documents available to industry collaborators without requiring them to navigate a technical-looking HTML interface to download them

"path" is mapped to "contentUrl" in default context


Use Case: Reference workflow

As a bioinformatician, I want to make available the fantastic analysis workflow I developed to solve a specific scientific problem, so that others can take it and run it (in Galaxy in my case) using the included example data, and assess whether it 1) runs on their system and 2) seems to do what they expect.

For Galaxy we have an example: based on a workflow (Galaxy's own .ga format), we can install all tools (need to be wrapped according to Galaxy's best practices) and data needed to run the workflow in a Docker container (or any Galaxy instance).
https://github.com/ELIXIR-Belgium/BioContainers_for_training/tree/master/Galaxy_training_container

This contains metadata structured based on work done for training materials (https://training.galaxyproject.org).

We are also working on a Nextflow PoC according to the same concept.

Make overriding ("inheritance") of properties more explicit

After some discussions with @stain at an IBISBA meeting:

The only explicit mentioning that I could find about overriding properties is the attribution:

In ROLite, if a file does not list a creator, and is within the Research Object's folders, its creator can reasonably be assumed to be the creator of the containing research object. However, where appropriate, the Research Object manifest allows overriding with more precise attribution per resource.

It is unclear whether this applies only to attribution or also to other properties such as license.
Does an @id referring to a folder propagate the properties to all the files that are contained in the referred to folder (unless a more specific thing (sorry, lacking the word) is available)?
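To make the question concrete, one possible reading of such "inheritance" is sketched below in Python. This is an assumption for discussion, not behaviour defined by the specification: a property is taken from the entity itself, otherwise from the nearest described parent folder, otherwise from the root. The function name `effective_property` and the example entities are hypothetical.

```python
import posixpath


def effective_property(index, entity_id, prop, root_id="./"):
    """Look up prop on the entity, then on enclosing folders, then on the root.

    This is one hypothetical propagation rule, not one taken from the spec.
    """
    entity = index.get(entity_id, {})
    if prop in entity:
        return entity[prop]
    folder = posixpath.dirname(entity_id.rstrip("/"))
    while folder:
        candidate = index.get(folder + "/", {})
        if prop in candidate:
            return candidate[prop]
        folder = posixpath.dirname(folder)
    return index.get(root_id, {}).get(prop)


# Hypothetical entities, already indexed by @id.
index = {
    "./": {"@id": "./", "license": "CC-BY-4.0"},
    "data/": {"@id": "data/", "license": "CC0-1.0"},
    "data/raw/a.csv": {"@id": "data/raw/a.csv"},
    "README.md": {"@id": "README.md"},
}

print(effective_property(index, "data/raw/a.csv", "license"))
print(effective_property(index, "README.md", "license"))
```

Whether a rule like this should apply to every property, or only to attribution, is exactly what the text would need to state explicitly.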

Question: Why does the crate need to be flattened?

Hi,

I've been using RO-Crate to define objects in an OCFL filesystem and I've been wondering why the crates need to be flattened.

From my limited experience of working with ro-crates and json-ld I see that flattening is something that the user could do if they wanted to work with a crate in that form. Perhaps I'm wrong but in my experience the system that creates a crate (e.g. my code) first builds something like:

{
  "@context": "https://raw.githubusercontent.com/ResearchObject/ro-crate/master/docs/0.2-DRAFT/context.json",
  "@graph": [
    {
      "@id": "./",
      "@type": [
        "Dataset",
        "RepositoryObject"
      ],
      "author": {
        "name": "A Person",
        "givenName": "A",
        "lastName": "Person"
      },

---- snip ----

As a human I can easily parse this information and understand it. Furthermore, as a developer who is going to ingest this data into Elasticsearch - a document-oriented search service - I can just pass it as is and get on with the job.

By comparison - the flattened crate:

{
  "@context": "https://raw.githubusercontent.com/ResearchObject/ro-crate/master/docs/0.2-DRAFT/context.json",
  "@graph": [
    {
      "@id": "./",
      "@type": [
        "Dataset",
        "RepositoryObject"
      ],
      "additionalType": "item",
      "author": {
        "@id": "_:b0"
      },
---- snip ----
  },
    {
      "@id": "_:b0",
      "givenName": "A",
      "name": "A Person"
    }
  ]
}

imposes a greater mental load (I need to follow the id links to see what author resolves to) to parse the information and also means that I have to reconstruct the object in order to use it with Elasticsearch.

So - if it's easy for anyone using a crate to flatten and compact the object (which it is), and given that the crate is likely to be used by code which can flatten and compact it at will, why does the spec require this to happen up-front?

It seems to me that not enforcing the flattening results in a best of both worlds system. A human reading a crate would see a fairly easy to read JSON object whilst a consumer (code) could flatten the crate as it desires.
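For reference, the reconstruction step mentioned above (re-embedding referenced entities so the root can be indexed as a single document) can be sketched in plain Python without a JSON-LD library. This is only an illustration under simple assumptions: the function names `embed` and `nest` are hypothetical, the graph mirrors the flattened example, and only bare {"@id": ...} references are expanded.

```python
def embed(index, value, seen=frozenset()):
    """Recursively replace bare {"@id": ...} references with the full entity."""
    if isinstance(value, list):
        return [embed(index, item, seen) for item in value]
    if isinstance(value, dict):
        ref = value.get("@id")
        # Expand a reference only once per branch so that cycles terminate.
        if set(value) == {"@id"} and ref in index and ref not in seen:
            value = index[ref]
            seen = seen | {ref}
        return {key: embed(index, item, seen) for key, item in value.items()}
    return value


def nest(graph, root_id="./"):
    """Return the root entity with referenced entities embedded in place."""
    index = {entity["@id"]: entity for entity in graph}
    return embed(index, index[root_id], frozenset({root_id}))


# The flattened example from above, reduced to its two entities.
graph = [
    {
        "@id": "./",
        "@type": ["Dataset", "RepositoryObject"],
        "additionalType": "item",
        "author": {"@id": "_:b0"},
    },
    {"@id": "_:b0", "givenName": "A", "name": "A Person"},
]

document = nest(graph)
print(document["author"]["name"])
```

The flattened graph itself is left untouched, so a consumer can keep both forms side by side.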

Change CSS/theme to wider view or ReadTheDocs style

Describe the bug

The RO Crate 1.0 specification page has quite a narrow page due to the GitHub Pages style. While it looks nice it can be a bit of a long scroll, and the narrow margin is not so well suited for some of the JSON-LD examples.

Ironically if viewed on a small device in portrait mode, the empty white space left column disappears and the then wider text is much easier to read.

URL
https://w3id.org/ro/crate/1.0/

..but then there won't be logo and link back to the front page etc.

A floating table of contents that updates as you scroll would be a nice thing to have on the side.

I am not sure if I am suggesting to put a hard-coded HTML like the above, or to change the Jekyll style for the whole RO-Crate site (which would affect all pages).

Align with WholeTale format

The WholeTale project is defining a Tale format using BDBag and Bagit-RO manifest.json - so RO Lite should align with what they have concluded so far. @kylechard and @craig-willis are involved.

See craig-willis/bdbag-water-tale#1 where I try to align this with RO Lite, and two alternative extremes:

all schema.org: https://gist.github.com/stain/2673f4c920a86d1b257cc0a696b32df4 (RO and bdbag tools do not know this yet) - no ORE aggregates :-(
soft schema: prefix usage (schema: prefix only): https://gist.github.com/stain/93686e8e557f13e8edccc15a767e8499 (still works in https://search.google.com/structured-data/testing-tool/)

One interesting take I got was to use http://schema.org/dataset as an upper property, making the ResearchObject also be a boring http://schema.org/DataCatalog of 1 http://schema.org/Dataset.
This split might make more sense than the current direct Dataset approach which quickly becomes inflexible.

Remove links to external examples

Describe the bug

We have a dependency on some external examples.

URL

https://github.com/ResearchObject/ro-crate/blob/master/docs/0.3-DRAFT/index.md

Suggested fix

I will remove the example links; we should not have dependencies like that in the spec.


Use Case: Describe/include software containers

As an open science researcher, I want to provide Docker/Singularity container images so that others can reliably reproduce my results or reuse the same software.

This implies that the container images and their recipes (e.g. Dockerfile) should be included in the RO-Crate and typed as such, so users know they can be executed.

It is also desirable to use tooling to expand the description with a list of the dependencies installed in the container; this will help provide lightweight software citations.

Related efforts to align with:

Use Case: Post to & fully populate a repository record

As a data creator, I want to be able to post/upload an RO-Crate to a repository and have it populate the repository metadata, so that I can automate my processes and don't have to re-key slight variations of the same information again, and again, and again.

This could be accomplished in a number of ways, e.g.:

A given repository supports a deposit protocol like Sword (v1, v2, v3), AND accepts/understands RO-Crate format so that the repository can parse and map the package to its own internal representation (requires adoption by the repository)
A given repository adds support for RO-Crate to its bespoke REST API, (e.g. Zenodo, Figshare) so that the repository can parse and map the package to its own internal representation (requires adoption by the repository)
A given repository adds support for RO-Crate when integrating with other platforms, e.g. the Zenodo and Figshare integration with Github hooks, but without having to manually populate the metadata in the repository (requires adoption by the repository)
A separate tool/library which can parse an RO-Crate and map the RO-Crate representation either to a generic Sword package or to the specific API of a given repository (does not require adoption by the repository, but the tool(s)/libraries must be maintained etc.)
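The last option, a separate mapping tool, could look roughly like the sketch below. The target record layout (`title`, `description`, `creators`, `files`) is entirely hypothetical and is not the schema of Sword, Zenodo or Figshare; the function name and the example graph are likewise assumptions made for illustration.

```python
def to_deposit_record(graph, root_id="./"):
    """Map the root dataset of a flattened crate to a hypothetical flat record."""
    index = {entity["@id"]: entity for entity in graph}
    root = index[root_id]

    def as_list(value):
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    def name_of(ref):
        # Follow an {"@id": ...} reference when the entity is described locally.
        if isinstance(ref, dict):
            entity = index.get(ref.get("@id"), ref)
        else:
            entity = {"name": ref}
        return entity.get("name", entity.get("@id"))

    return {
        "title": root.get("name"),
        "description": root.get("description"),
        "creators": [name_of(author) for author in as_list(root.get("author"))],
        "files": [part["@id"] for part in as_list(root.get("hasPart"))],
    }


# Hypothetical crate content used only to exercise the mapping.
graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        "name": "Example dataset",
        "description": "A small example",
        "author": {"@id": "#alice"},
        "hasPart": [{"@id": "data.csv"}],
    },
    {"@id": "#alice", "@type": "Person", "name": "Alice Example"},
    {"@id": "data.csv", "@type": "File"},
]

record = to_deposit_record(graph)
print(record)
```

A real tool would need one such mapping per target repository, which is where the maintenance cost noted above comes from.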

Note, there is a whole other discussion about how the repository treats the upload, i.e. does it maintain the original RO-Crate as the Archival Information Package, or does it simply parse and unpack what it needs and discard the rest of the RO-Crate structure, or both. Sword v3 is a useful starting point for thinking through these scenarios, but they are beyond the scope of the RO-Crate spec itself.
