Comments (11)
On 2022-10-19, there was some support for including the type prefix, but not the namespace prefix (ga4gh). To define the type prefixes, we'd just re-use the array names. If we did it this way, it would look like this:
Level 0
seqcol.a6748aa0f6a1e165f871dbed5e54ba62
Level 1
{
"lengths": "lengths.4925cdbd780a71e332d13145141863c1",
"names": "names.ce04be1226e56f48da55b6c130d45b94",
"sequences": "sequences.3b379221b4d6ea26da26cec571e5911c"
}
Level 2
{
"lengths": [
"1216",
"970",
"1788"
],
"names": [
"A",
"B",
"C"
],
"sequences": [
"SQ.76f9f3315fa4b831e93c36cd88196480",
"SQ.d5171e863a3d8f832f0559235987b1e5",
"SQ.b9b1baaa7abf206f6b70cf31654172db"
]
}
At level 2, would we want to add in the ga4gh namespace because that would be necessary for the lookup for refget 2.0? If so, you'd end up with this, which would lose consistency:
"sequences": [
"ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480",
"ga4gh:SQ.d5171e863a3d8f832f0559235987b1e5",
"ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db"
]
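As a rough illustration of the type-prefix idea above, this Python sketch derives the prefixed level 1 values from bare digests by re-using the array names. The helper names `add_type_prefixes` and `strip_type_prefixes` are hypothetical and not part of any seqcol implementation; the digests are the ones from the example.

```python
def add_type_prefixes(level1: dict[str, str]) -> dict[str, str]:
    """Prefix each bare digest with the name of the array it belongs to."""
    return {name: f"{name}.{digest}" for name, digest in level1.items()}


def strip_type_prefixes(prefixed: dict[str, str]) -> dict[str, str]:
    """Inverse operation: drop the '<array name>.' prefix where present."""
    return {
        name: value.removeprefix(f"{name}.") for name, value in prefixed.items()
    }


# Bare level 1 digests, taken from the example in this comment.
bare = {
    "lengths": "4925cdbd780a71e332d13145141863c1",
    "names": "ce04be1226e56f48da55b6c130d45b94",
    "sequences": "3b379221b4d6ea26da26cec571e5911c",
}

prefixed = add_type_prefixes(bare)
print(prefixed["lengths"])  # lengths.4925cdbd780a71e332d13145141863c1
```

Because the prefix is just the array name, it can be added or removed without any lookup table, which is what makes this variant cheap to expose or hide at the API level.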
from refget.
Level 2 is the bit that worries me, since ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db is the identifier and it is our domain knowledge that allows us to know we need to add ga4gh: before it is valid. I wonder if there is a missing component at the schema level where a namespace can be specified, meaning that you have to add that namespace onto the identifier before it is a valid identifier?
According to the CURIE Syntax document:
A host language MAY declare a default prefix value, or MAY provide a mechanism for defining a default prefix value. In such a host language, when the prefix is omitted from a CURIE, the default prefix value MUST be used. Conversely, if such a language does not define a default prefix value mechanism and does not define a set of reserved values, CURIEs MUST NOT be used without a leading prefix and colon.
Not 100% sure whether a service-related schema such as ours would qualify as a "host language", but if so we seem to be free to define our own mechanism for defining a default prefix value.
I googled my way to the specification of the UHF Hypermedia Format (UHF), which makes use of default CURIE prefixes and is also similar to our use case as it is basically a JSON schema or "format".
I am really only arguing that we can omit the prefix and still state that the values are CURIEs. Any automated usage must still extract our default prefix in a custom way, as the CURIE syntax document does not seem to define a canonical method for providing the default prefix in an automated fashion.
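To make the default-prefix idea a bit more concrete, here is a minimal Python sketch of how a consumer might interpret a value that may or may not carry a prefix. The helper `split_curie` and the choice of `ga4gh` as the default are assumptions for illustration only; the CURIE syntax document does not seem to define a canonical way to obtain the default prefix, so it is simply passed in as an argument here.

```python
from typing import Optional


def split_curie(value: str, default_prefix: Optional[str] = None) -> tuple[str, str]:
    """Split a CURIE into (prefix, reference), falling back to a default prefix.

    If the value has no colon and no default prefix is available, it cannot be
    interpreted as a CURIE, so a ValueError is raised.
    """
    prefix, sep, reference = value.partition(":")
    if sep:
        return prefix, reference
    if default_prefix is None:
        raise ValueError(f"{value!r} has no prefix and no default prefix is defined")
    return default_prefix, value


# Both spellings are interpreted identically once a default prefix is declared.
print(split_curie("ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480", default_prefix="ga4gh"))
print(split_curie("SQ.76f9f3315fa4b831e93c36cd88196480", default_prefix="ga4gh"))
```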
In the end, I suggest we contact identifiers.org or other relevant entities to get their view of the issue.
@andrewyatz For clarity, does the refget standard specify that the endpoints require the prefix to be available or is it optional?
GA4GH-compliant refget instances in v2 will accept GA4GH identifiers of the format ga4gh:SQ.XXXXXXXXX..., md5 checksums, or namespace:identifier constructs such as insdc:CM000663.2. The prefix is seen as non-optional.
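For readers who want to see the three accepted input forms side by side, here is a hedged Python sketch that sorts an input string into one of them. The function `classify_refget_id` and the regular expressions are my own loose approximations, not the grammar from the refget specification, so treat the patterns as assumptions.

```python
import re

# Loose, assumed patterns; the refget specification is the authority on the real grammar.
GA4GH_RE = re.compile(r"ga4gh:SQ\.[A-Za-z0-9_-]+")
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")
NAMESPACED_RE = re.compile(r"[A-Za-z0-9_.]+:\S+")


def classify_refget_id(value: str) -> str:
    """Return which of the three described input forms a string looks like."""
    if GA4GH_RE.fullmatch(value):
        return "ga4gh"
    if MD5_RE.fullmatch(value):
        return "md5"
    if NAMESPACED_RE.fullmatch(value):
        return "namespaced"
    return "unrecognized"


for candidate in (
    "ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480",
    "76f9f3315fa4b831e93c36cd88196480",
    "insdc:CM000663.2",
    "SQ.76f9f3315fa4b831e93c36cd88196480",
):
    print(candidate, "->", classify_refget_id(candidate))
```

Note that under these assumed patterns the bare `SQ.` form falls through as unrecognized, which mirrors the statement that the prefix is non-optional.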
A nice blog post about CURIEs and why we need them, as background: https://cthoyt.com/2021/09/14/curies.html
Some summary from today's discussion:
2 questions posted by Tim:
- Do we want what is going into the serialization to be the same thing that we expose to the public? Or do we not care about this level of consistency?
  - What we make available publicly is a lot easier to change in the future. We could always change prefixes later. In contrast, if we change what's in the digest, that messes stuff up.
- If you don't necessarily require the same thing that is digested, is there much value in adding a lot of unnecessary characters to what you digest?
It seems we were approaching consensus that we could offer API endpoints that behave both ways: either they give exactly the string that was digested, if requested, or they give a more information-rich version. In fact, if we include non-digested arrays, then by definition the server will be serving up data that is different from exactly what is digested. Maybe it would be nice to have a flag or endpoint or option to get the exact digested string, though.
So, a thought experiment is:
- for internal stuff (seqcol entities), we digest only digests, not identifiers (no prefixes or type prefixes)
- for external identifiers, like refget identifiers, we accept them as strings at face value
- for sequence digest arrays specifically, we're following the ga4gh specification, so we'd expect these to be complete identifiers, with both namespace and type prefixes. But really, this is not specified by seqcol, which specifies no additional constraints
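The thought experiment above can be sketched in Python as follows. This is only an illustration under stated assumptions: the digest function is a sha512t24u-style truncated digest (SHA-512, first 24 bytes, base64url), and `canonical_bytes` uses sorted, compact JSON as a stand-in for whatever canonical serialization the specification settles on. The names `canonical_bytes` and `digest_collection` are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def canonical_bytes(value) -> bytes:
    """Assumed stand-in for the canonical serialization: compact, key-sorted JSON."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def digest_collection(level2: dict[str, list[str]]) -> tuple[dict[str, str], str]:
    """Digest each array, then digest the mapping of bare array digests."""
    # Internal digests carry no namespace or type prefix.
    level1 = {name: sha512t24u(canonical_bytes(values)) for name, values in level2.items()}
    level0 = sha512t24u(canonical_bytes(level1))
    return level1, level0


level2 = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    # External identifiers are accepted as strings at face value.
    "sequences": [
        "SQ.76f9f3315fa4b831e93c36cd88196480",
        "SQ.d5171e863a3d8f832f0559235987b1e5",
        "SQ.b9b1baaa7abf206f6b70cf31654172db",
    ],
}

level1, level0 = digest_collection(level2)
print(len(level0))  # 32
```

The point of the sketch is that only bare digests feed into the next level, so any prefix added for presentation can change later without changing the digests.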
So this leads to a few next questions:
- what do we want to accept in the API? with or without prefixes?
- what does the server serve? the output provided to the user. Do we have to say that these strings have to be prefixed with something? When we return things, do we include these prefixes? Or do we make it user-controlled through query parameters or something?
Great writeup, @nsheff!
I only want to add some comments regarding the Refget v2 digest. I think we also agreed that the Refget v2 digest isn’t actually a CURIE, even though it looks very much like one. This was surprising to me and I think it has also been a cause of misunderstandings lately.
From the CURIE syntax document:
CURIEs are an abbreviation for strings which are intended to represent IRIs (as defined by the IRI production in [IRI]), but checking that intent is not part of CURIE conformance. The intended IRI is constructed by concatenating the prefix binding with the reference part, if any. There MUST be a prefix binding for the prefix (or the default prefix, if the prefix is absent) in scope. The concatenation of the prefix value associated with a CURIE and its reference MUST be an IRI (as defined by the IRI production in [IRI]).
So for the refget v2 digest to be a CURIE, say ga4gh:SQ.a63c69dcd…, it should be possible to replace the "ga4gh" part with an IRI prefix and produce a valid IRI that would resolve into the concept that the CURIE represents, here the sequence itself. But since the ga4gh namespace is mandatory input for the refget endpoint, this is not possible.
Example:
Say you host a refget v2 server with the main endpoint available at (sorry, I did not bother looking up the actual endpoint name requirements in refget v2):
https://my.refgetserver.net/refget/
Then if ga4gh:SQ.a63c69dcd… were a CURIE, one should be able to replace the namespace with the endpoint IRI and get a working IRI:
https://my.refgetserver.net/refget/SQ.a63c69dcd…
However, this leaves out the namespace from the input to the endpoint, contrary to what Refget v2 requires, according to @andrewyatz (#37 (comment)).
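The mismatch can be shown with a few lines of Python that apply the CURIE rule of concatenating the prefix binding with the reference. The function `expand_curie` is a hypothetical helper, the endpoint is the made-up one from this example, and the digest is left truncated exactly as written above.

```python
def expand_curie(curie: str, bindings: dict[str, str]) -> str:
    """Expand a CURIE by concatenating its prefix binding with the reference."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in bindings:
        raise KeyError(f"no prefix binding in scope for {curie!r}")
    return bindings[prefix] + reference


# Hypothetical binding: the made-up refget endpoint from the example.
bindings = {"ga4gh": "https://my.refgetserver.net/refget/"}
curie = "ga4gh:SQ.a63c69dcd…"  # digest left truncated, as in the example

iri = expand_curie(curie, bindings)
print(iri)                   # https://my.refgetserver.net/refget/SQ.a63c69dcd…
print(iri.endswith(curie))   # False: the namespace is gone from the endpoint input
```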
I think it is unfortunate that the Refget v2 digest quacks like a duck without being a duck (but perhaps a swan?… 😁). Even if the standard does not state that the digest is a CURIE, it looks very much like one. I understand the ship has sailed in Refget v2 on this, sadly.
I think another thing we were nearing consensus on was that we would probably want to raise an issue to a higher power in GA4GH on what to use for the namespace of a seqcol CURIE identifier?
I would argue that using just ga4gh would make the refget v2 digest look even more like a CURIE and thus generate even more confusion. One possibility could be to instead include a type prefix in the namespace prefix, e.g.:
ga4gh.seqcol:6bc72cdf
This is not uncommon for CURIEs, e.g. ega.study: and ega.dataset:.
Including some variant of a seqcol prefix on both sides of the colon is, I suppose, also a possibility:
ga4gh.seqcol:sc.6bc72cdf
Just wanted to concretize some of my thoughts after today's meeting and the decision to not include any prefixes in the serializations (except the Refget one):
Digests vs identifiers
For me, the decision was made based on a clear separation of concerns between the:
- digest, which represents a particular content
- identifier, which represents a particular concept
Two different concepts should have different identifiers, even if the contents are the same.
A way to clearly separate these concerns is to not include any prefixes at all in the digests. This is in essence what I believe we decided on today.
About identifiers
Regarding the identifiers, I think we should discern between locally and globally unique identifiers (Reference: "Unique, persistent identifiers" FAIR Cookbook). Identifiers should also be persistent and machine-resolvable. Identifiers could be full URIs, for instance using persistent URLs, or they could be represented as CURIEs (see the FAIR Cookbook recipe or the above-mentioned blog post).
Suggestion for top-level seqcol identifiers
Syntax
So I have the following simple suggestion for relating globally unique identifiers in the form of CURIEs with the top level digests:
ga4gh.seqcol:<digest>
e.g.
ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi
Globally vs locally unique
If we remove the prefix, we get a locally unique identifier, which in this case is equal to the digest. Following the conceptual framework from the CURIE syntax, this can be viewed as defining, in the context of a seqcol server, that the "default prefix value" is ga4gh.seqcol. In the context of a seqcol server, a top-level digest then also functions as a locally unique identifier and is furthermore also a valid CURIE!
Similarly, when others are making use of the seqcol identifiers in other contexts, they could in the same way define ga4gh.seqcol as the default prefix for the particular field holding the seqcol identifier. In such cases, the top-level digest would still be a valid CURIE.
In conclusion: In the specification, we can basically say that a seqcol identifier is a CURIE, constructed according to the above syntax, and that the default prefix for a seqcol server is ga4gh.seqcol. One would not need to say anything about how the identifier should be used elsewhere; typing it as a CURIE would ensure proper usage.
Note: A consequence of defining ga4gh.seqcol as the default prefix is that we might want the endpoints to also allow the user to specify the identifier WITH the prefix. Since the default prefix for a CURIE is only considered in the cases where the prefix is not present, it might be natural to make it optional to specify the prefix. Restricting the endpoints to only allow CURIEs without the default prefix would remove the possibility of later extending support to other prefixes, should we want to do that. We have anyway discussed having the prefix as optional just to be nice to the user.
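A minimal Python sketch of what "prefix optional at the endpoint" could look like, assuming the suggested `ga4gh.seqcol` default prefix. The helpers `to_local` and `to_global` are hypothetical names, and rejecting other prefixes is just one possible policy, chosen here to keep the example short.

```python
DEFAULT_PREFIX = "ga4gh.seqcol"  # the default prefix suggested in this comment


def to_local(identifier: str) -> str:
    """Return the locally unique form (the bare digest), accepting either spelling."""
    prefix, sep, reference = identifier.partition(":")
    if not sep:
        return identifier  # no prefix given, so the default prefix is implied
    if prefix != DEFAULT_PREFIX:
        # One possible policy; support for other prefixes could be added later.
        raise ValueError(f"unsupported prefix {prefix!r}")
    return reference


def to_global(identifier: str) -> str:
    """Return the globally unique CURIE form, adding the default prefix if absent."""
    return f"{DEFAULT_PREFIX}:{to_local(identifier)}"


print(to_local("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi"))
print(to_global("ya7YJT-8kndreP6UamO9v20BZIPacuCi"))
```

Both spellings normalize to the same digest for lookup, so the choice of what to return to the user stays a presentation decision.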
Resolving the CURIE identifiers to URIs
In a CURIE resolution service, such as identifiers.org or N2T, one could e.g. provide the following mappings:
ga4gh.seqcol -> https://www.ncbi.nlm.nih.gov/seqcol/collection/
ga4gh.seqcol -> https://www.ebi.ac.uk/ga4gh/collection/
The service would then resolve the ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi CURIE to the list:
https://www.ncbi.nlm.nih.gov/seqcol/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
https://www.ebi.ac.uk/ga4gh/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
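The mapping above amounts to a one-to-many prefix table. The following Python sketch shows that logic only; it says nothing about how identifiers.org or N2T are actually implemented, and the `PROVIDERS` table and `resolve` function are hypothetical, using the example URLs from this comment.

```python
# Hypothetical provider table, using the example URL prefixes from this comment.
PROVIDERS = {
    "ga4gh.seqcol": [
        "https://www.ncbi.nlm.nih.gov/seqcol/collection/",
        "https://www.ebi.ac.uk/ga4gh/collection/",
    ],
}


def resolve(curie: str, providers: dict[str, list[str]] = PROVIDERS) -> list[str]:
    """Resolve a CURIE to every URL registered for its prefix."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"{curie!r} has no prefix to resolve")
    return [base + reference for base in providers.get(prefix, [])]


for url in resolve("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi"):
    print(url)
```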
Suggestion for second-level seqcol identifiers
So what about possible identifiers for concepts represented by arrays (second level)?
I suggest the following syntax:
ga4gh.seqcol:<array name>.<digest>
e.g.
ga4gh.seqcol:lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
CURIE resolution services would then resolve this identifier into e.g.:
https://www.ncbi.nlm.nih.gov/seqcol/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
https://www.ebi.ac.uk/ga4gh/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
Whether the endpoints would accept that identifier or not is up to the implementation.
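For illustration, here is how a client could take apart a second-level identifier of the suggested form in Python. The `ArrayIdentifier` type and `parse_array_identifier` function are hypothetical, and the split on the last dot assumes the digest itself never contains a dot (which holds for base64url-style digests like the one in the example).

```python
from typing import NamedTuple


class ArrayIdentifier(NamedTuple):
    prefix: str
    array_name: str
    digest: str


def parse_array_identifier(curie: str) -> ArrayIdentifier:
    """Parse '<prefix>:<array name>.<digest>' into its three components."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"{curie!r} has no prefix")
    # Split on the last dot, assuming the digest contains no dots.
    array_name, dot, digest = reference.rpartition(".")
    if not dot or not array_name or not digest:
        raise ValueError(f"{curie!r} is not of the form <array name>.<digest>")
    return ArrayIdentifier(prefix, array_name, digest)


parsed = parse_array_identifier("ga4gh.seqcol:lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o")
print(parsed.array_name)  # lengths
print(parsed.digest)      # kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
```

A top-level identifier has no dot in its reference part, so this parser rejects it, which gives a simple way to tell the two levels apart.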
Note on persistent URLs
One could also later provide a mapping to a persistent URL scheme if there is a need for that, e.g.:
http://purl.org/ga4gh/seqcol/ya7YJT-8kndreP6UamO9v20BZIPacuCi
(BTW: I found this ga4gh domain under the Internet Archive-governed PURL system. It seems to have been registered by the GA4GH-Pedigree-Standard, helpfully using the top-level domain directly...)
In discussions in November and December 2022, we divided this issue into 2 related issues:
- Should we prefix things internally?
- Should we prefix the final level 0 digest in what we refer to as the "seqcol identifier?"
For the first, we have an agreement: we do not include the ga4gh prefix, or type prefixes. This is codified in PR #42.
The second is kind of a spinoff question, which I believe is still under debate.
Following a 1:1 discussion I had with Nathan (apologies for not being in yesterday's meeting from the start), we think there is a good course of action. We also believe that, due to the misnaming of name-spaced identifiers as CURIEs, we have conflated retrieval of an entity by its ID and the data required to resolve such an identifier.
- Change refget to accept non-prefixed identifiers, i.e. SQ.nnnn, which I think was discussed in previous refget meetings as a sensible extension (since SQ. is unique)
- Suggest that things should not be prefixed internally (the change in point 1 allows seqcol sequence ids to resolve to a sequence)
- Talk to the VRS group about their use of CURIEs. Allowing refget to sit in a halfway house would allow VRS to continue to work, as enforcing a change of not resolving namespaced identifiers in refget would be a major issue for them
We discussed this in the GKS leads call this week. A few takeaways from the discussion:
- There is no requirement that CURIEs are locatable. URIs cover URLs, URNs, and other URI types. AFAIK only URLs need be locatable, but CURIEs are not limited to URLs only. I've always thought of VRS object identifiers as URNs.
- @larrybabb thinks we should have gone the <namespace>.<type_prefix>:<digest> route. I agree with him.
- Following from 2, I don't think there's any reason the ga4gh namespace or SQ. prefix need be stored in refget. I actually think it is somewhat awkward to do this inside VRS objects, since we also strip those components when computing nested VRS digests that contain nested identifiable objects.
- Unrelated, it would be great if refget could move to just one digest scheme, but @andrewyatz rightly pointed out that this would be breaking for the CRAM spec. Though I would still push refget to consider a major version release at some point that is TRUNC512 only, leaving older versions available for use with CRAM, etc.
- I would like the VR team to work under a shared identifier / digest paradigm to refget, and assume that @larrybabb and @andreasprlic feel similarly, but would encourage them to chime in here too.