On 2022-09-21 we debated how to actually form the identifiers. Like, is there a <code


Identifier construction: To prefix or not to prefix (ga4gh/refget)

Comments (11)

nsheff commented on August 21, 2024

On 2022-10-19, there was some support for including the type prefix, but not the namespace prefix (ga4gh). To define the type prefixes, we'd just re-use the array names. If we did it this way, it would look like this:

Level 0

seqcol.a6748aa0f6a1e165f871dbed5e54ba62

Level 1

{
  "lengths": "lengths.4925cdbd780a71e332d13145141863c1",
  "names": "names.ce04be1226e56f48da55b6c130d45b94",
  "sequences": "sequences.3b379221b4d6ea26da26cec571e5911c"
}

Level 2

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "SQ.76f9f3315fa4b831e93c36cd88196480",
    "SQ.d5171e863a3d8f832f0559235987b1e5",
    "SQ.b9b1baaa7abf206f6b70cf31654172db"
  ]
}

At level 2, would we want to add in the ga4gh namespace because that would be necessary for the lookup for refget 2.0? If so, you'd end up with this, which would lose consistency:

  "sequences": [
    "ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480",
    "ga4gh:SQ.d5171e863a3d8f832f0559235987b1e5",
    "ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db"
  ]
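The two conventions discussed above (type prefix only, or namespace plus type prefix) can be sketched as a small helper. This is only an illustration: the function name `with_prefixes` and its parameters are hypothetical, and the digests are the example values from this comment.

```python
from typing import Optional


def with_prefixes(digest: str, type_prefix: str, namespace: Optional[str] = None) -> str:
    """Attach a type prefix, and optionally a namespace, to a bare digest.

    Hypothetical helper for illustration; not part of any specification.
    """
    identifier = f"{type_prefix}.{digest}"
    # The namespace is joined with a colon, as in "ga4gh:SQ.<digest>".
    return f"{namespace}:{identifier}" if namespace else identifier


# Level 1: re-use the array names as type prefixes, no namespace.
level1_digests = {
    "lengths": "4925cdbd780a71e332d13145141863c1",
    "names": "ce04be1226e56f48da55b6c130d45b94",
    "sequences": "3b379221b4d6ea26da26cec571e5911c",
}
level1 = {name: with_prefixes(d, name) for name, d in level1_digests.items()}

# Level 2 sequence identifier, with and without the ga4gh namespace.
seq_plain = with_prefixes("76f9f3315fa4b831e93c36cd88196480", "SQ")
seq_namespaced = with_prefixes("76f9f3315fa4b831e93c36cd88196480", "SQ", namespace="ga4gh")
```

The inconsistency in question is visible here: only the sequence identifiers would carry a namespace, while the array-level identifiers would not.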

from refget.

andrewyatz commented on August 21, 2024

Level 2 is the bit that worries me since ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db is the identifier and it is our domain knowledge that allows us to know we need to add ga4gh: before it is valid. I wonder if there is a missing component in the schema level where a namespace can be specified and that really means you have to add the following namespace onto the identifier before it is a valid identifier?


sveinugu commented on August 21, 2024

Level 2 is the bit that worries me since ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db is the identifier and it is our domain knowledge that allows us to know we need to add ga4gh: before it is valid. I wonder if there is a missing component in the schema level where a namespace can be specified and that really means you have to add the following namespace onto the identifier before it is a valid identifier?

According to the CURIE Syntax document:

A host language MAY declare a default prefix value, or MAY provide a mechanism for defining a default prefix value. In such a host language, when the prefix is omitted from a CURIE, the default prefix value MUST be used. Conversely, if such a language does not define a default prefix value mechanism and does not define a set of reserved values, CURIEs MUST NOT be used without a leading prefix and colon.

Not 100% sure whether a service-related schema such as ours would qualify as a "host language", but if so we seem to be free to define our own mechanism for defining a default prefix value.

I googled my way to the specification of the UHF Hypermedia Format (UHF), which makes use of default CURIE prefixes and is also similar to our use case as it is basically a JSON schema or "format".

I am really only arguing that we can omit the prefix and still state that the values are CURIEs. Any automated usage must still extract our default prefix in a custom way, as the CURIE syntax document does not seem to define a canonical method for providing the default prefix in an automated fashion.

In the end, I suggest we contact identifiers.org or other relevant entities to get their view of the issue.

@andrewyatz For clarity, does the refget standard specify that the endpoints require the prefix to be available or is it optional?


andrewyatz commented on August 21, 2024

GA4GH-compliant refget instances in v2 will accept GA4GH identifiers of the format ga4gh:SQ.XXXXXXXXX..., md5 checksums, or namespace:identifier constructs such as insdc:CM000663.2. The prefix is seen as non-optional.
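A rough sketch of how a server might tell these three input forms apart. The regular expressions and the function name `classify_identifier` are assumptions made for illustration; they are not taken from the refget specification.

```python
import re

# Assumed patterns, for illustration only (not from the refget specification).
GA4GH_RE = re.compile(r"^ga4gh:SQ\.[A-Za-z0-9_-]+$")      # ga4gh:SQ.<digest>
MD5_RE = re.compile(r"^[0-9a-fA-F]{32}$")                  # bare md5 checksum
NAMESPACED_RE = re.compile(r"^[A-Za-z0-9_.]+:.+$")         # namespace:identifier


def classify_identifier(identifier: str) -> str:
    """Return which of the three accepted forms an input string matches."""
    if GA4GH_RE.match(identifier):
        return "ga4gh"
    if MD5_RE.match(identifier):
        return "md5"
    if NAMESPACED_RE.match(identifier):
        return "namespaced"
    return "unrecognized"


print(classify_identifier("ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480"))
print(classify_identifier("76f9f3315fa4b831e93c36cd88196480"))
print(classify_identifier("insdc:CM000663.2"))
print(classify_identifier("SQ.76f9f3315fa4b831e93c36cd88196480"))
```

Under these assumed patterns, a bare `SQ.` identifier without the `ga4gh:` namespace falls into none of the accepted forms, which is the behaviour described as "non-optional" above.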


sveinugu commented on August 21, 2024

A nice blog post about CURIEs and why we need them, as background: https://cthoyt.com/2021/09/14/curies.html


nsheff commented on August 21, 2024

Some summary from today's discussion:

2 questions posted by Tim:

1. Do we want what is going into the serialization to be the same thing that we expose to the public? Or do we not care about this level of consistency?
   - What we make available publicly is a lot easier to change in the future. We could always change prefixes later. In contrast, if we change what's in the digest, that messes stuff up.
2. If you don't necessarily require the same thing that is digested, is there much value in adding a lot of unnecessary characters to what you digest?

It seems we were approaching consensus that we could offer API endpoints that behave both ways: either they give exactly the string that was digested, if requested, or they give a more information-rich version. In fact, if we include non-digested arrays, then by definition the server will be serving up data that is different from exactly what is digested. Maybe it would be nice to have a flag or endpoint or option to get the exact digested string, though.

So, a thought experiment is:

for internal stuff (seqcol entities), we digest only digests, not identifiers (no prefixes or type prefixes)
for external identifiers, like refget identifiers, we accept them as strings at face value
for sequence digest arrays specifically, we're following the ga4gh specification, so we'd expect these to be complete identifiers, with both namespace and type prefixes. But really, this is not specified by seqcol, which specifies no additional constraints

So this leads to a few next questions:

what do we want to accept in the API? with or without prefixes?
what does the server serve? the output provided to the user. Do we have to say that these strings have to be prefixed with something? When we return things, do we include these prefixes? Or do we make it user-controlled through query parameters or something?


sveinugu commented on August 21, 2024

Great writeup, @nsheff!

I only want to add some comments regarding the Refget v2 digest. I think we also agreed that the Refget v2 digest isn’t actually a CURIE, even though it looks very much like one. This was surprising to me and I think it has also been a cause of misunderstandings lately.

From the CURIE syntax document:

CURIEs are an abbreviation for strings which are intended to represent IRIs (as defined by the IRI production in [IRI]), but checking that intent is not part of CURIE conformance. The intended IRI is constructed by concatenating the prefix binding with the reference part, if any. There MUST be a prefix binding for the prefix (or the default prefix, if the prefix is absent) in scope. The concatenation of the prefix value associated with a CURIE and its reference MUST be an IRI (as defined by the IRI production in [IRI]).

So for the refget v2 digest to be a CURIE, say ga4gh:SQ.a63c69dcd…, it should be possible to replace the "ga4gh" part with an IRI prefix and produce a valid IRI that would resolve into the concept that the CURIE represents, here the sequence itself. But since the ga4gh namespace is mandatory input for the refget endpoint, this is not possible.

Example:

Say you host a refget v2 server with the main endpoint available at (sorry, I did not bother looking up the actual endpoint name requirements in refget v2):

https://my.refgetserver.net/refget/

Then if ga4gh:SQ.a63c69dcd… was a CURIE, one should be able to replace the namespace with the endpoint IRI, and get a working IRI:

https://my.refgetserver.net/refget/SQ.a63c69dcd…

However, this leaves out the namespace from the input to the endpoint, contrary to what Refget v2 requires, according to @andrewyatz (#37 (comment)).
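The argument can be made concrete with a minimal CURIE expansion, which per the CURIE syntax is just the concatenation of the prefix binding and the reference. The function name `expand_curie` is hypothetical, the endpoint is the made-up one from the example above, and the full digest is borrowed from earlier in this thread.

```python
def expand_curie(curie: str, prefix_map: dict) -> str:
    """Expand a CURIE by concatenating the prefix binding with the reference.

    Raises ValueError if there is no colon and KeyError if the prefix has
    no binding. Default prefixes are deliberately not handled in this sketch.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"no prefix in {curie!r}")
    return prefix_map[prefix] + reference


# Hypothetical binding of the ga4gh prefix to the example refget endpoint.
prefix_map = {"ga4gh": "https://my.refgetserver.net/refget/"}

iri = expand_curie("ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db", prefix_map)
print(iri)
```

The expanded IRI ends in `SQ.b9b1baaa7abf206f6b70cf31654172db`; the `ga4gh:` part has been consumed by the expansion, so the endpoint never sees the namespace it requires.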

I think it is unfortunate that the Refget v2 digest quacks like a duck without being a duck (but perhaps a swan?… 😁). Even if the standard does not state that the digest is a CURIE, it looks very much like one. I understand the ship has sailed in Refget v2 on this, sadly.

I think another thing we were nearing consensus on was that we would probably want to raise an issue to a higher power in GA4GH on what to use for the namespace of a seqcol CURIE identifier?

I would argue that using just ga4gh would make the refget v2 digest look even more like a CURIE and thus generate even more confusion. One possibility could be to instead include a type prefix in the namespace prefix, e.g.:

ga4gh.seqcol:6bc72cdf

This is not uncommon for CURIEs, e.g. ega.study: and ega.dataset:.

Including some variant of a seqcol prefix on both sides of the colon is, I suppose, also a possibility:

ga4gh.seqcol:sc.6bc72cdf


sveinugu commented on August 21, 2024

Just wanted to concretize some of my thoughts after today's meeting and the decision to not include any prefixes in the serializations (except the Refget one):

Digests vs identifiers

For me, the decision was made based on a clear separation of concerns between the:

digest, which represents a particular content
identifier, which represents a particular concept

Two different concepts should have different identifiers, even if the contents are the same.

A way to clearly separate these concerns is to not include any prefixes at all in the digests. This is in essence what I believe we decided on today.

About identifiers

Regarding the identifiers, I think we should discern between locally and globally unique identifiers (Reference: "Unique, persistent identifiers" FAIR Cookbook). Identifiers should also be persistent and machine-resolvable. Identifiers could be full URIs, for instance using persistent URLs, or they could be represented as CURIEs (see the FAIR Cookbook recipe or the above-mentioned blog post).

Suggestion for top-level seqcol identifiers

Syntax

So I have the following simple suggestion for relating globally unique identifiers in the form of CURIEs with the top level digests:

ga4gh.seqcol:<digest>

e.g.

ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi

Globally vs locally unique

If we remove the prefix, we get a locally unique identifier, which in this case is equal to the digest. Following the conceptual framework from the CURIE syntax, this can be viewed as defining, in the context of a seqcol server, that the "default prefix value" is ga4gh.seqcol. In the context of a seqcol server, a top-level digest then also functions as a locally unique identifier and is furthermore also a valid CURIE!

Similarly, when others are making use of the seqcol identifiers in other contexts, they could in the same way define ga4gh.seqcol as the default prefix for the particular field holding the seqcol identifier. In such cases, the top-level-digest would still be a valid CURIE.

In conclusion: In the specification, we can basically say that a seqcol identifier is a CURIE, constructed according to the above syntax, and that the default prefix for a seqcol server is ga4gh.seqcol. One would not need to say anything about how the identifier should be used elsewhere; typing it as a CURIE would make sure of proper usage.

Note: A consequence of defining ga4gh.seqcol as the default prefix is that we might want the endpoints to also allow the user to specify the identifier WITH the prefix. Since the default prefix for a CURIE is only considered in the cases where the prefix is not present, it might be natural to make it optional to specify the prefix. Restricting the endpoints to only allow CURIEs without the default prefix will remove the possibility for later extending support to other prefixes, should we want to do that. We have anyway discussed having the prefix as optional just to be nice to the user.
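A minimal sketch of what "optional default prefix" could mean at an endpoint, assuming the suggested `ga4gh.seqcol` default. The function names `to_local_id` and `to_global_id` are hypothetical, and rejecting other prefixes is one possible policy, not something decided in this thread.

```python
DEFAULT_PREFIX = "ga4gh.seqcol"  # the default prefix suggested above


def to_local_id(identifier: str, default_prefix: str = DEFAULT_PREFIX) -> str:
    """Accept an identifier with or without the default prefix; return the bare digest."""
    prefix, sep, reference = identifier.partition(":")
    if not sep:
        # No prefix given: the default prefix applies, the input is already local.
        return identifier
    if prefix != default_prefix:
        # Assumed policy: other prefixes are not supported (yet).
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return reference


def to_global_id(identifier: str, default_prefix: str = DEFAULT_PREFIX) -> str:
    """Return the globally unique form, adding the default prefix if it is missing."""
    return f"{default_prefix}:{to_local_id(identifier, default_prefix)}"


print(to_local_id("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi"))
print(to_global_id("ya7YJT-8kndreP6UamO9v20BZIPacuCi"))
```

Keeping the prefix check in one place leaves room for accepting additional prefixes later without changing the callers.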

Resolving the CURIE identifiers to URIs

In a CURIE resolution service, such as identifiers.org or N2T, one could e.g. provide the following mappings:

ga4gh.seqcol -> https://www.ncbi.nlm.nih.gov/seqcol/collection/
ga4gh.seqcol -> https://www.ebi.ac.uk/ga4gh/collection/

Resolving the ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi CURIE to the list

https://www.ncbi.nlm.nih.gov/seqcol/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
https://www.ebi.ac.uk/ga4gh/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
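The one-to-many resolution described above could look roughly like this. The registry dictionary and the function name `resolve_curie` are hypothetical and do not reflect how identifiers.org or N2T are actually implemented; the URLs are the illustrative ones from the mappings above.

```python
# Hypothetical registry: one prefix may map to several provider URL prefixes.
REGISTRY = {
    "ga4gh.seqcol": [
        "https://www.ncbi.nlm.nih.gov/seqcol/collection/",
        "https://www.ebi.ac.uk/ga4gh/collection/",
    ],
}


def resolve_curie(curie: str, registry: dict = REGISTRY) -> list:
    """Resolve a CURIE to every URL registered for its prefix (empty list if none)."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"no prefix in {curie!r}")
    return [base + reference for base in registry.get(prefix, [])]


for url in resolve_curie("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi"):
    print(url)
```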

Suggestion for second-level seqcol identifiers

So what about possible identifiers for concepts represented by arrays (second level)?

I suggest the following syntax:

ga4gh.seqcol:<array name>.<digest>

e.g.

ga4gh.seqcol:lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o

CURIE resolution services would then resolve this identifier into e.g.:

https://www.ncbi.nlm.nih.gov/seqcol/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
https://www.ebi.ac.uk/ga4gh/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o

Whether the endpoints would accept that identifier or not is up to the implementation.
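For implementations that do want to accept both forms, the suggested syntaxes can be split into their parts as sketched below. The names `SeqcolId` and `parse_seqcol_id` are hypothetical. The sketch assumes the digest itself never contains a dot, which holds for base64url-style digests like the examples above.

```python
from typing import NamedTuple, Optional


class SeqcolId(NamedTuple):
    prefix: str
    array_name: Optional[str]  # None for a top-level identifier
    digest: str


def parse_seqcol_id(curie: str) -> SeqcolId:
    """Split '<prefix>:<digest>' or '<prefix>:<array name>.<digest>' into parts."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"no prefix in {curie!r}")
    # Split on the last dot; with no dot, array_name is empty and digest is the whole reference.
    array_name, _, digest = reference.rpartition(".")
    return SeqcolId(prefix, array_name or None, digest)


print(parse_seqcol_id("ga4gh.seqcol:lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o"))
print(parse_seqcol_id("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi"))
```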

Note on persistent URLs

One could also later provide a mapping to a persistent URL scheme if there is a need for that, e.g.:

http://purl.org/ga4gh/seqcol/ya7YJT-8kndreP6UamO9v20BZIPacuCi

(BTW: I found this ga4gh domain under the Internet Archive-governed PURL system. It seems to have been registered by the GA4GH-Pedigree-Standard, helpfully using the top-level domain directly...)


nsheff commented on August 21, 2024

In discussions in November and December 2022, we divided this issue into 2 related issues:

1. Should we prefix things internally?
2. Should we prefix the final level 0 digest in what we refer to as the "seqcol identifier"?

For the first, we have an agreement: we do not include the ga4gh prefix, or type prefixes. This is codified in PR #42.

The second is kind of a spinoff question, which I believe is still under debate.


andrewyatz commented on August 21, 2024

Following a 1:1 discussion I had with Nathan (apologies for not being in the meeting yesterday from the start), we think there is a good course of action. We also believe that, due to the misnaming of name-spaced identifiers as CURIEs, we have conflated retrieval of an entity by its ID and the data required to resolve such an identifier.

1. Change refget to accept non-prefixed identifiers, i.e. SQ.nnnn, which I think was discussed in previous refget meetings as a sensible extension (since SQ. is unique).
2. Suggest that things should not be prefixed internally (the change in point 1 allows seqcol sequence ids to resolve to a sequence).
3. Talk to the VRS group about their use of CURIE. Allowing refget to sit in a halfway house would allow VRS to continue to work, as enforcing a change of not resolving namespaced identifiers in refget would be a major issue for them.


ahwagner commented on August 21, 2024

We discussed this in the GKS leads call this week. A few takeaways from the discussion:

1. There is no requirement that CURIEs are locatable. URIs cover URLs, URNs, and other URI types. AFAIK only URLs need be locatable, but CURIEs are not limited to URLs only. I've always thought of VRS object identifiers as URNs.
2. @larrybabb thinks we should have gone the <namespace>.<type_prefix>:<digest> route. I agree with him.
3. Following from 2, I don't think there's any reason the ga4gh namespace or SQ. prefix need be stored in refget. I actually think it is somewhat awkward to do this inside VRS objects, since we also strip those components when computing nested VRS digests that contain nested identifiable objects.
4. Unrelated, it would be great if refget could move to just one digest scheme, but @andrewyatz rightly pointed out that this would be breaking for the CRAM spec. Though I would still push refget to consider a major version release at some point that is TRUNC512 only, leaving older versions available for use with CRAM, etc.
5. I would like the VR team to work under a shared identifier / digest paradigm with refget, and assume that @larrybabb and @andreasprlic feel similarly, but would encourage them to chime in here too.

