While implementing the k-means analytic I came across a design choice over the sample

Structure of sample mask for gene pair clusters (kinc, closed)

systemsgenetics commented on August 29, 2024

Structure of sample mask for gene pair clusters

from kinc.

Comments (7)

4ctrl-alt-del commented on August 29, 2024

I'm sorry, I do not follow what you are saying. Are you talking about the sample name list in the header? If so, that takes a trivial amount of storage space.

If you are talking about the sample mask for each gene pair, it must be of a fixed size. That is an assumption made in the CCM data plugin, and a variable size would cause it to break. If your analytic can be faster by ignoring certain samples, you can still do that and then simply flag those samples as such in the mask for the CCM.
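To make the fixed-size approach concrete, here is a minimal sketch of how an analytic could skip NaN samples during clustering and still write one flag per sample. The flag values, the function name `build_mask`, and the argument layout are assumptions made for illustration; they are not taken from the CCM plugin.

```python
import math

# Hypothetical flag values; the real CCM encoding may differ.
IN_CLUSTER = 1
NOT_IN_CLUSTER = 0
MISSING = 9

def build_mask(x, y, labels, cluster):
    """Return one flag per sample, so the mask length always equals len(x).

    x, y    : expression values of the two genes across all samples
    labels  : cluster label of each clean (non-NaN) sample, in sample order
    cluster : the cluster this mask describes
    """
    clean_labels = iter(labels)
    mask = []
    for a, b in zip(x, y):
        if math.isnan(a) or math.isnan(b):
            # The analytic ignored this sample; flag it instead of dropping it.
            mask.append(MISSING)
        elif next(clean_labels) == cluster:
            mask.append(IN_CLUSTER)
        else:
            mask.append(NOT_IN_CLUSTER)
    return mask
```

The analytic only ever clusters the clean samples, but every mask it writes has the same length, which is what a fixed-size entry requires.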

bentsherman commented on August 29, 2024

Yes, I'm referring to the sample mask. In that case the clustering analytics are definitely not writing the masks correctly, but I think I can see how to correct it.

That being said, I think this question still warrants consideration: if the cluster matrix doesn't assume a constant size and only includes values for non-NaN samples in each mask, then we could make the cluster matrix smaller (I haven't tried to estimate how big they will be at this point). On the other hand, analytics that read the cluster matrix will have to filter out NaN samples again before applying the sample mask. So it seems like there is a memory / speed trade-off here. Does that make sense?

bentsherman commented on August 29, 2024

K-means and GMM analytics both write the sample mask properly now, and it looks like the cluster matrix is on the order of several GB, so I don't think we need to worry about this trade-off.

spficklin commented on August 29, 2024

I'm re-opening this to make sure I understand your suggestion... how would you know which samples were NA? Do you keep a sample mask somewhere else?

bentsherman commented on August 29, 2024

As an example, consider how a correlation analytic like Spearman would read the cluster matrix to determine which samples to include in each correlation:

If the cluster matrix includes flags for all samples, then Spearman just reads each sample mask directly from the cluster matrix. It does not have to scan the expression matrix for NA samples.
Alternatively, if the cluster matrix only includes flags for clean samples (meaning that sample masks would vary in size across gene pairs), Spearman would have to scan the expression matrix, build a mask of "clean" samples, and zip that up with the sample mask from the cluster matrix to get the true sample mask.

So in the second case, the sample mask you're wondering about is implicit in the expression matrix.
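The second case can be sketched as follows. This is only an illustration of the "zip" step under assumed conventions: `expand_mask` is a hypothetical helper, the compact mask is taken to hold one flag per clean sample in sample order, and a flag of 1 is assumed to mean "in the cluster".

```python
import math

def expand_mask(x, y, compact_mask):
    """Rebuild a per-sample boolean mask from a variable-size cluster mask.

    x, y         : expression values of the two genes across all samples
    compact_mask : flags for the clean (non-NaN) samples only
    """
    # Step 1: rescan the expression data to recover the implicit NaN mask.
    clean = [not (math.isnan(a) or math.isnan(b)) for a, b in zip(x, y)]
    if sum(clean) != len(compact_mask):
        raise ValueError("compact mask does not match the number of clean samples")

    # Step 2: walk both masks together; a flag is consumed only for clean samples.
    flags = iter(compact_mask)
    return [is_clean and next(flags) == 1 for is_clean in clean]
```

The extra scan in step 1 is the speed cost of the smaller matrix: every reader pays it for every gene pair, whereas with full-size masks the reader uses the stored flags directly.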

4ctrl-alt-del commented on August 29, 2024

Interesting idea, Ben, but the issue is with how I structured the sparse gene-pair data objects.

I lay them out as basically an array of entries, with the first part of each entry being the two gene indexes and cluster ID, which together identify it as a unique number. To quickly search through the array I divide and conquer: I look at the value in the middle, then continue in the lower half or upper half depending on whether what I am looking for is less than or greater than the one in the middle.

For this to work, each entry in the array must be a fixed size so the search knows how to find the nth entry in the array. Variable masks would violate that and break the array structure.
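A minimal sketch of why the search depends on fixed-size entries, using a flat byte buffer. The layout here (an 8-byte integer key followed by a mask of `N_SAMPLES` bytes) and the names `ENTRY`, `pack_entries`, and `find` are assumptions for illustration, not the actual layout of the gene pair classes.

```python
import struct

N_SAMPLES = 4  # assumed number of samples, identical for every entry

# Hypothetical entry layout: 8-byte key, then a mask of exactly N_SAMPLES bytes.
ENTRY = struct.Struct(f"<q{N_SAMPLES}s")

def pack_entries(entries):
    """Pack (key, mask_bytes) pairs, already sorted by key, into one buffer."""
    return b"".join(ENTRY.pack(key, mask) for key, mask in entries)

def find(buf, key):
    """Binary search for key; return its mask bytes, or None if absent."""
    lo, hi = 0, len(buf) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        # Entry n starts at n * ENTRY.size. This offset arithmetic is only
        # valid because every entry has the same size.
        k, mask = ENTRY.unpack_from(buf, mid * ENTRY.size)
        if k == key:
            return mask
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

With variable-size masks the offset of entry n could no longer be computed as n times the entry size, so the structure would need a separate offset index or a different layout before a search like this could work.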

I could change the structure to make a variable mask size work, but as I said, I would have to go back and completely redo the base gene pair class that the cluster matrix and correlation matrix inherit from. As of right now it would not work.

bentsherman commented on August 29, 2024

Yeah, and I don't think the small memory benefit is worth all of that refactoring, especially now that I can see that the CCMs are reasonable in size. I think the way you've implemented those classes is fine.
