
Comments (13)

nsheff commented on August 21, 2024

"or in some cases not available."

We've discussed this at length in the past: names can always just become the integers 1..N, which is basically what you'd do in a computer anyway if you didn't have names. Similarly, lengths can easily be computed from sequences. And if you have neither sequences nor lengths, then you don't have a sequence collection.

So, there is no situation where names/lengths cannot be computed.

Keep in mind: we're not forcing anyone to use this structure for use case X -- this is just what you would construct to compute a digest. In the case of a CRAM file where you have a digest for each sequence and you want to compute a seqcol digest, you could (see the sketch after this list):

  • use names 1:N
  • compute lengths of sequences
  • construct canonical seqcol form
  • compute digest
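A minimal sketch of those steps, assuming the GA4GH sha512t24u digest (SHA-512 truncated to 24 bytes, base64url-encoded) and a simplified stand-in for the spec's canonical JSON serialization; the exact canonicalization rules and attribute set should be checked against the current spec:

```python
import base64
import hashlib
import json

def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")

def canonical(obj) -> bytes:
    # Simplified stand-in for canonical JSON (the spec uses RFC 8785):
    # no whitespace, sorted keys, UTF-8 bytes.
    return json.dumps(obj, separators=(",", ":"), sort_keys=True).encode("utf-8")

def seqcol_digest(sequences: list[str], lengths: list[int]) -> str:
    # sequences: per-sequence digests (e.g. from the CRAM header);
    # lengths: computed from the underlying sequences (step 2).
    # Step 1: invent names 1..N when the source has none.
    names = [str(i) for i in range(1, len(sequences) + 1)]
    # Step 3: canonical seqcol form, with each attribute array digested
    # independently ("level 1") and collected into one object.
    level1 = {
        "lengths": sha512t24u(canonical(lengths)),
        "names": sha512t24u(canonical(names)),
        "sequences": sha512t24u(canonical(sequences)),
    }
    # Step 4: the top-level digest is the digest of that object.
    return sha512t24u(canonical(level1))
```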

So, it is possible. The thing is, you probably wouldn't want to do this, because you'd really want to use a collection whose identifier was computed with the names from some provider -- then you'd just ignore the names. Making names optional does not change the fact that you still wouldn't want to do this.

I recall weeks of discussion on this topic, where we eventually concluded that it is a bigger problem to make names optional.

To me, the above reinforces that it makes a lot of sense to have names be mandatory -- it also guides the user away from doing something that wouldn't be a great decision from an interoperability perspective.


sveinugu commented on August 21, 2024

@raskoleinonen I do think it would be useful to get a bit more detail on the use case. The way I understood it, the context makes computation of the lengths difficult. Could you explain exactly why? Perhaps one way of doing this could be to look up the level-1 digest of the sequences in some seqcol service and receive the corresponding lengths array (and possibly names) in return, given, of course, a point in the future when seqcol is more widely adopted.
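For illustration, such a lookup might look like the sketch below; the service URL and both endpoint shapes are assumptions made for this example, not the spec's actual routes:

```python
import requests

BASE = "https://seqcol.example.org"           # hypothetical seqcol service
seq_digest = "<level-1 digest of sequences>"  # placeholder value

# Hypothetical reverse lookup: find a collection whose 'sequences'
# attribute has this digest, then read its lengths (and names).
hits = requests.get(
    f"{BASE}/list/collections",
    params={"attribute": "sequences", "attribute_digest": seq_digest},
    timeout=30,
).json()
collection = requests.get(f"{BASE}/collection/{hits['results'][0]}", timeout=30).json()
lengths = collection["lengths"]
names = collection.get("names")
```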

@nsheff I have never really liked the idea of just naming the sequences with their indices. I don't understand why one would want to do that, as it basically reduces interoperability. You can always add an index-based names array on the user's side if there is a need for it. There is no extra information content in such a names array, and including it just makes the overall digest less interoperable. One could argue that there is information in the fact that there are no names in the collection that point to anything external, but this fact could in any case be represented by a single binary value, and the most natural choice to me would be whether the names array is present or not. The way I remember it, agreeing with this decision was a compromise on my part, a "choose your battles" decision. My intuition has always told me that if you do not have any names, you should not invent any.

The lengths array is, to me, very different. While it can always be computed from the sequences array, this is often cumbersome. Basically, I think the lengths array represents a broadly useful summary of the "level 3" contents of the sequence collection, i.e. the refget-expanded sequences. Making lengths required was to me a way to give a certain weight to the end-user experience. Here, my intuition tells me that being able to depend on the lengths always being present can be very useful downstream. In many cases, the lengths themselves can be used to map between different patches of the same reference genome, at least for the primary sequences. Knowing at a glance the sizes of the sequences of the genome would be very useful in many contexts, and I am honestly surprised that this is not already part of the CRAM headers, as it is in e.g. GFF3. I would think having this available in a CRAM file would be great for e.g. quality control.
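As a toy illustration of that mapping idea (a hypothetical helper, only sound when lengths are unique within each collection; real pipelines would compare sequence digests instead):

```python
# Map sequence names between two patches of the same assembly, using
# lengths as the join key. Purely illustrative; two sequences with
# equal length would silently collide here.
def map_names_by_length(names_a, lengths_a, names_b, lengths_b):
    by_length = {length: name for name, length in zip(names_b, lengths_b)}
    return {
        name: by_length[length]
        for name, length in zip(names_a, lengths_a)
        if length in by_length
    }

# e.g. map_names_by_length(["chr1"], [248956422], ["1"], [248956422])
# returns {"chr1": "1"} (248956422 is the GRCh38 chr1 length)
```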

So to me, requiring the lengths array to always be present is a way to push some of the cumbersomeness from the end users to the data providers. Hence, "this is cumbersome" is not really a counter-argument; it is meant to be a bit cumbersome. That is, in my mind, a feature of seqcol.


raskoleinonen commented on August 21, 2024

I suggest that there be two levels of the specification: 'core' and 'community'. The 'core' level describes the scope and format of sequence collections, the terminology, and how digests are computed, and it lists the standard fields with their definitions. The decision of which fields are required and which are inherent would be in the 'community' part of the specification.

This would allow the 'core' to be approved without discussing which fields are required and/or inherent in different contexts. My intuition tells me that marking any standard fields in the 'core' as required and/or inherent is a significant distraction from approving the 'core' as a GA4GH standard. For example, my opinion is that lengths is neither inherent nor required. I suggest that opinions about field usage should not have an impact on the 'core' specification.
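As a hypothetical illustration of that split (the field names follow the seqcol draft, but the core/community layering and the specific values below are illustrative, not the current spec):

```python
# 'Core': defines the standard fields and their types, but says
# nothing about which are required or inherent.
CORE_SCHEMA = {
    "description": "A collection of biological sequences.",
    "type": "object",
    "properties": {
        "names": {"type": "array", "items": {"type": "string"}},
        "lengths": {"type": "array", "items": {"type": "integer"}},
        "sequences": {"type": "array", "items": {"type": "string"}},
    },
}

# 'Community': a profile layered on top of the core, where the
# required/inherent decisions would live under this proposal.
COMMUNITY_PROFILE = {
    **CORE_SCHEMA,
    "required": ["names", "lengths"],
    "inherent": ["names", "sequences"],  # attributes that feed the digest
}
```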


sveinugu commented on August 21, 2024

@raskoleinonen I believe the overall idea in your suggestion is very much in line with the consensus of the group. I think the main point where we differ is that we have defined the required and inherent properties of the main three fields as part of the 'core' specification, because the choices here have consequences for core interoperability. I don't believe it is controversial to leave the specification of which of the other fields are required and inherent to a 'community' part; in fact, this is exactly what we have been thinking. So the question here is not really about the core/community split as such (though the way this is organised/presented can of course be improved), but whether the core specification shall enforce the inherent/required properties of names and lengths. We have debated that at length and settled it with an ADR, a decision this issue is now reopening.


sveinugu commented on August 21, 2024

I can also add that the discussion has mostly been about the names array. As far as I can remember, there have been no real disagreements on the lengths array, and I follow @nsheff in that without the restrictions on the lengths array, the seqcol specification turns into something very different from what we have spent a lot of effort creating. Thinking more about this, not requiring lengths destroys much of the point of the specification for me, as it is very much about harmonising the coordinate-system-oriented community with the sequence-oriented community, and the lengths array is all about that.

EDIT: Even though I am not fully happy with requiring the names array, I agree with @nsheff that if making names mandatory is what is needed to also keep the lengths array mandatory, then we made the right decision.


nsheff commented on August 21, 2024

Just another thought here:

If you have neither names nor lengths, and you just have sequences, then what you want is just the level 1 sequences array digest.

That's going to be part of the main digests.

I guess the question is this: do we want to make it so that you can define a top-level digest with that same content? (just the sequences array?)

If we do this, then there will be 2 digests that refer to that content: the top-level digest and the level 1 digest, since that's basically a sequence collection with only a single attribute (sequences).

The problem here is just that now, nothing is required. So an implementation can't even rely on the fact that there are unique identifiers (names) for the sequences in the collection.

I guess I still argue that for this use case, users should use level 1 digests. There's a pretty major downside to this, though: these would not be compatible with the APIs in the same way as top-level digests.

The major advantage of doing it this way, though, is that those digests will perfectly match the sequence digests of other collections that do have names/lengths, which is nice.
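Concretely, reusing the helper functions from the sketch in the earlier comment: the level-1 sequences digest depends only on the sequences array, so it is identical across collections that differ in names/lengths:

```python
# Placeholder refget digests; any collection containing exactly these
# sequences yields the same level-1 'sequences' digest, with or
# without names/lengths attributes.
seqs = ["<refget digest 1>", "<refget digest 2>"]
sequences_digest = sha512t24u(canonical(seqs))
```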

But I'm open to more discussion on this topic.


nsheff commented on August 21, 2024

Also: remember, for a particular use case, someone could just say "I'm not using the canonical minimal schema. I'm using my own schema, and in my schema, names and lengths are not required". That's fine. You'd just lose interoperability with other implementations that do require them; that's the cost.


nsheff commented on August 21, 2024

In recent discussions, we've moved a bit away from even having a "core schema" -- so now the schema is depicted as an example.

I think this should probably solve the issue for all parties, because you can still be "seqcol compatible" even if using a completely different schema -- which means you can define your own set of required/inherent attributes.

This comes at the cost of interoperability -- but that seems to be what is wanted. So.


andrewyatz commented on August 21, 2024

Having re-read this, @nsheff, I agree that moving away from a "core schema" is the right thing to do. I would suggest the schema isn't positioned as an example but as an exemplar, and I recommend maintaining the semantic meanings of attributes if someone does diverge from it.


sveinugu commented on August 21, 2024

My fear is that we are throwing the baby out with the bathwater by removing the concept of the core schema. This is basically what we have spent years discussing and have landed on, and it is an important outcome of that work. To me, the core schema IS the main contribution of the standard; it is the thing that might fix the issues we have been facing for 20 years, and we should not remove it from the standard lightly.


andrewyatz commented on August 21, 2024

I certainly understand that aspect. However, there is broader utility in the seqcol spec beyond sequences, wherever you have a central immutable concept that can be represented as a hash (sequences, container images, annotation releases). It feels like there has to be a way we can describe the core spec without abandoning the major improvement of the schema.

Also, just something I've noted: the BAM spec states that the @SQ:SN and @SQ:LN headers are both required attributes. CRAM's spec suggests CRAM header blocks are the SAM header with the addition that @SQ:M5 is required. The original ticket included a line about CRAM files which lack sequence lengths, but I am not sure how that is possible.


nsheff commented on August 21, 2024

I agree with you @sveinugu -- but basically it seems like people don't want interoperability; they want to solve the one use case they have in the simplest way possible, which by necessity eliminates the ability to use the same thing in other contexts. In the end, people who value interoperability can still agree on a shared schema, as we have done. I'm hopeful that if enough momentum builds around that shared schema, it could become a community resource where it's worth the extra effort for people outside that central community to buy into the interoperable network built on the core schema. But at the beginning, lacking such an existing network, it's a harder sell. And even if we do build up such a community around an interoperable schema, that doesn't prevent the standard from also being useful in a silo.


nsheff commented on August 21, 2024

With the recent changes, including the attribute endpoint, things have changed for this core-schema issue.

@sveinugu what's your position now on what should be required/inherent in the core schema?

I guess it seems that making lengths no longer required/inherent makes more sense now.

