
Comments (7)

4ctrl-alt-del avatar 4ctrl-alt-del commented on August 29, 2024

I'm sorry, I do not follow what you are saying. Are you talking about the sample name list in the header? If so, that takes a trivial amount of storage space.

If you are talking about the sample mask for each gene pair, it must be of a fixed size. That is an assumption made in the CCM data plugin, and a variable size would cause it to break. If your analytic can be faster by ignoring certain samples, you can still do that and then simply flag those samples as such in the mask for the CCM.

from kinc.

bentsherman avatar bentsherman commented on August 29, 2024

Yes I'm referring to the sample mask. In that case the clustering analytics are definitely not writing the masks correctly, but I think I can see how to correct it.

That being said, I think this question still warrants consideration: if the cluster matrix doesn't assume a constant size and only includes values for non-NaN samples in each mask, then we could make the cluster matrix smaller (I haven't tried to estimate how big they will be at this point). On the other hand, analytics that read the cluster matrix would have to filter out NaN samples again before applying the sample mask. So it seems like there is a memory / speed trade-off here. Does that make sense?


bentsherman avatar bentsherman commented on August 29, 2024

K-means and GMM analytics both write the sample mask properly now, and it looks like the cluster matrix is on the order of several GB, so I don't think we need to worry about this trade-off.


spficklin avatar spficklin commented on August 29, 2024

I'm re-opening this to make sure I understand your suggestion... how would you know which samples were NA? Do you keep a sample mask somewhere else?


bentsherman avatar bentsherman commented on August 29, 2024

As an example, consider how a correlation analytic like Spearman would read the cluster matrix to determine which samples to include in each correlation:

  • If the cluster matrix includes flags for all samples, then Spearman just reads each sample mask directly from the cluster matrix. It does not have to scan the expression matrix for NA samples.
  • Alternatively, if the cluster matrix only includes flags for clean samples (meaning that sample masks would vary in size across gene pairs), Spearman would have to scan the expression matrix, make a sample mask of "clean" samples, and kind of zip that up with the sample mask from the cluster matrix to have the true sample mask.

So in the second case, the sample mask you're wondering about is implicit in the expression matrix.
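To make the two cases concrete, here is a minimal Python sketch (KINC itself is not written this way). The mask encoding is an assumption for illustration only: 1 means the sample is in the cluster, 0 means it is not, and -1 marks a missing sample in the fixed-size case. The function names are hypothetical.

```python
import math

def mask_from_full(full_mask):
    """Case 1: one flag per sample, so the mask is read directly."""
    return [flag == 1 for flag in full_mask]

def mask_from_compact(gene_x, gene_y, compact_mask):
    """Case 2: flags exist only for clean samples, so the expression
    values must be rescanned to find out which samples those are."""
    clean = [not (math.isnan(x) or math.isnan(y))
             for x, y in zip(gene_x, gene_y)]
    flags = iter(compact_mask)
    # Consume one compact flag per clean sample; missing samples are excluded.
    return [is_clean and next(flags) == 1 for is_clean in clean]

nan = float("nan")
gene_x = [1.0, nan, 2.0, 3.0, 4.0]
gene_y = [2.0, 1.0, nan, 5.0, 6.0]

full_mask = [1, -1, -1, 0, 1]   # five flags, one per sample
compact_mask = [1, 0, 1]        # three flags, clean samples only

same = mask_from_full(full_mask) == mask_from_compact(gene_x, gene_y, compact_mask)
```

Both functions end up with the same sample selection; the compact form stores fewer flags but pays for an extra pass over the expression data on every read.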


4ctrl-alt-del avatar 4ctrl-alt-del commented on August 29, 2024

Interesting idea, Ben, but the issue is with how I structured the sparse gene pair style data objects:

I lay them out as basically an array of entries, each entry starting with the two gene indexes and cluster id, which together identify it as a unique number. To quickly search through the array I divide and conquer, looking at the value in the middle and then at the lower half or upper half depending on whether what I am looking for is less than or greater than the one in the middle.

For this to happen each entry in the array must be a fixed size so it knows how to find the nth entry in the array. Variable masks would violate that and break the array structure.
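As a rough illustration of why the fixed size matters, here is a minimal Python sketch of binary search over packed fixed-size records. The record layout (two 32-bit gene indexes, a one-byte cluster id, then one signed byte per sample), the sample count, and all names are assumptions made for the example, not the actual KINC file format.

```python
import struct

NUM_SAMPLES = 4  # hypothetical; every record carries exactly this many flags
# Little-endian, no padding: gene_x, gene_y, cluster id, then the mask bytes.
RECORD = struct.Struct(f"<IIB{NUM_SAMPLES}b")

def pack_entries(entries):
    """Pack (gene_x, gene_y, cluster, mask) tuples, sorted by their key."""
    ordered = sorted(entries, key=lambda e: e[:3])
    return b"".join(RECORD.pack(gx, gy, c, *mask) for gx, gy, c, mask in ordered)

def read_entry(buf, n):
    """Entry n starts at n * RECORD.size because every record has the same size."""
    fields = RECORD.unpack_from(buf, n * RECORD.size)
    return fields[:3], list(fields[3:])

def find_entry(buf, key):
    """Binary search for key = (gene_x, gene_y, cluster); return its mask or None."""
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        mid_key, mask = read_entry(buf, mid)
        if mid_key == key:
            return mask
        if mid_key < key:
            lo = mid + 1
        else:
            hi = mid
    return None

buf = pack_entries([
    (2, 5, 0, [1, 0, 1, -1]),
    (0, 3, 1, [0, 1, 1, 1]),
    (0, 3, 0, [1, 1, 0, 0]),
])
found = find_entry(buf, (0, 3, 1))
missing = find_entry(buf, (9, 9, 9))
```

With variable-length masks the offset `n * RECORD.size` would no longer locate entry n, so the search would need a separate offset index or a different container.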

I could change the structure to make a variable mask size work, but as I said, I would have to go back and completely redo the base gene pair class that the cluster matrix and correlation matrix inherit from. As of right now it would not work.


bentsherman avatar bentsherman commented on August 29, 2024

Yeah, and I don't think the small memory benefit is worth all of that refactoring, especially now that I can see that the CCMs are reasonable in size. I think the way you've implemented those classes is fine.

