gcp's People

Contributors

bricoletc, leoisl

Forkers

bricoletc

gcp's Issues

10,000 simulations model the coverage distribution for the correct allele with some error (propagates to the genotype confidence modelling)

This library is now integrated into pandora, and I have been testing it to check that the implementation is correct. I have found some bugs, which have been fixed. This next one is not actually a bug, but more of an imprecision that is up for discussion.

To use this library, we have to simulate genotype confidences. In pandora we do so by simulating coverages of correct and incorrect alleles. Coverage of correct alleles is simulated with a negative binomial or binomial model, with the parameters automatically chosen by pandora after mapping reads. Coverage of incorrect alleles is simulated using only a binomial model with n = mean_kmer_covg and p = error rate (not sure if this needs a change based on the model used for the correct coverages... this is what Minos does).
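The simulation step described above can be sketched as follows. This is a minimal illustration and not pandora's code: `CoverageModel`, `SimulatedCoverage` and `simulate_coverages` are hypothetical names, and the negative binomial is restricted to an integer number of successes because that is what `std::negative_binomial_distribution` accepts.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// One simulated site: coverage on the correct and on the incorrect allele.
struct SimulatedCoverage {
    unsigned int correct;
    unsigned int incorrect;
};

// Hypothetical parameter bundle; the field names are illustrative only.
struct CoverageModel {
    unsigned int nb_r;            // negative binomial: number of successes (integer here)
    double nb_p;                  // negative binomial: success probability
    unsigned int mean_kmer_covg;  // n of the binomial model for incorrect alleles
    double error_rate;            // p of the binomial model for incorrect alleles
};

std::vector<SimulatedCoverage> simulate_coverages(const CoverageModel& model,
                                                  std::size_t nb_simulations,
                                                  unsigned int seed) {
    assert(model.nb_r > 0 && model.nb_p > 0.0 && model.nb_p <= 1.0);
    assert(model.error_rate >= 0.0 && model.error_rate <= 1.0);

    std::mt19937 rng(seed);
    // Correct allele: negative binomial coverage.
    std::negative_binomial_distribution<unsigned int> correct_covg(model.nb_r, model.nb_p);
    // Incorrect allele: binomial with n = mean_kmer_covg and p = error rate.
    std::binomial_distribution<unsigned int> incorrect_covg(model.mean_kmer_covg,
                                                            model.error_rate);

    std::vector<SimulatedCoverage> simulations;
    simulations.reserve(nb_simulations);
    for (std::size_t i = 0; i < nb_simulations; ++i) {
        const unsigned int correct = correct_covg(rng);
        const unsigned int incorrect = incorrect_covg(rng);
        simulations.push_back(SimulatedCoverage{correct, incorrect});
    }
    return simulations;
}
```

Each simulated pair would then be passed to the genotyper to obtain one simulated genotype confidence.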

Taking as an example the negative binomial model for the coverage of correct alleles (the default for pandora), we can see that 10k simulations (the default in this library) do not model the probability density function of the coverage well (the parameters chosen for the negative binomial are realistic, taken from one of our samples in the evaluation). This is the pdf curve:

image

This is the pdf + 1k simulations curve:

image

pdf + 10k simulations curve (the default):

image

pdf + 100k simulations curve:
image

pdf + 1M simulations curve:
image

pdf + 10M simulations curve:
image

If the coverages for the correct alleles are not well modelled, then the genotype confidences are not either, and the percentiles will thus be a bit imprecise (and more variable from execution to execution).

On my machine, 1M simulations take 3 seconds and 10M take 32 seconds, so I will change the default value for this lib to 1M, but pandora will use 10M (I think the precision gain is worth this small slowdown).
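The effect of the number of simulations can be checked numerically as well as visually. The sketch below is an illustration under assumptions, not part of the library: it compares the simulated frequencies with the exact negative binomial pmf and reports the largest gap over the observed coverage values. The function names and the parameters used in the usage note are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Exact pmf of the number of failures before r successes, i.e. the same
// parameterisation as std::negative_binomial_distribution(r, p).
double negative_binomial_pmf(unsigned int x, unsigned int r, double p) {
    const double xd = x;
    const double rd = r;
    const double log_binom = std::lgamma(xd + rd) - std::lgamma(xd + 1.0) - std::lgamma(rd);
    return std::exp(log_binom + rd * std::log(p) + xd * std::log1p(-p));
}

// Largest absolute gap between simulated frequencies and the exact pmf,
// measured over the range of values that were actually drawn.
double max_abs_pmf_error(unsigned int r, double p, std::size_t nb_simulations,
                         unsigned int seed) {
    assert(r > 0 && p > 0.0 && p < 1.0 && nb_simulations > 0);
    std::mt19937 rng(seed);
    std::negative_binomial_distribution<unsigned int> covg(r, p);

    std::vector<std::size_t> counts;
    for (std::size_t i = 0; i < nb_simulations; ++i) {
        const unsigned int x = covg(rng);
        if (x >= counts.size()) counts.resize(x + 1, 0);
        ++counts[x];
    }

    double worst = 0.0;
    for (std::size_t x = 0; x < counts.size(); ++x) {
        const double freq = static_cast<double>(counts[x]) / static_cast<double>(nb_simulations);
        const double exact = negative_binomial_pmf(static_cast<unsigned int>(x), r, p);
        worst = std::max(worst, std::abs(freq - exact));
    }
    return worst;
}
```

Calling `max_abs_pmf_error(5, 0.1, n, seed)` for n = 1k, 10k, 100k and 1M should show the gap shrinking roughly with the square root of n, which is the usual Monte Carlo rate.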

Use factory method to construct Genotypers

From original @mbhall88 message:

I think you could use a factory constructor to side-step needing to know how the user's Genotyper is constructed? So, we just require that the Genotype supplied implements a factory constructor, say Genotyper::from_stuff(stuff)?

This is a far better approach than implicitly hoping that the user will do this through the constructor.
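A minimal sketch of the factory idea, assuming a static member function in place of `Genotyper::from_stuff(stuff)`. `ToyGenotyper`, `from_coverages`, `genotype_confidence` and `simulate_confidences` are hypothetical names chosen for the example, and the confidence formula is a placeholder.

```cpp
#include <cassert>
#include <vector>

struct SimulatedCoverage {
    unsigned int correct;
    unsigned int incorrect;
};

// Toy user-side genotyper. The library only relies on the static factory
// and on genotype_confidence(); the constructor stays private.
class ToyGenotyper {
public:
    static ToyGenotyper from_coverages(const SimulatedCoverage& covg) {
        return ToyGenotyper(covg);
    }

    // Placeholder confidence: absolute coverage difference between alleles.
    double genotype_confidence() const {
        return correct_ > incorrect_ ? static_cast<double>(correct_ - incorrect_)
                                     : static_cast<double>(incorrect_ - correct_);
    }

private:
    explicit ToyGenotyper(const SimulatedCoverage& covg)
        : correct_(covg.correct), incorrect_(covg.incorrect) {}

    unsigned int correct_;
    unsigned int incorrect_;
};

// Library side: any Genotyper type works as long as it provides the factory.
// A type without from_coverages() fails to compile here.
template <typename Genotyper>
std::vector<double> simulate_confidences(const std::vector<SimulatedCoverage>& coverages) {
    std::vector<double> confidences;
    confidences.reserve(coverages.size());
    for (const auto& covg : coverages) {
        confidences.push_back(Genotyper::from_coverages(covg).genotype_confidence());
    }
    assert(confidences.size() == coverages.size());
    return confidences;
}
```

With this shape the library never needs to know which constructor arguments the user's genotyper takes. It only needs the agreed factory signature.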

Ownership

Thought it would be best to raise this as early as possible, but given this will be used/developed throughout the lab, maybe best to transfer to the lab organisation?

Use templates instead of void * in Simulator/Genotyper

That will be safer, as type checking will be done at compile time instead of at runtime.
It can also simplify the Simulator interface, as destroy_simulation_data() won't be needed anymore (the destructor of the template type will be called automatically).
It will also spare the user from having to deal with void pointers.
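A sketch of what the templated interfaces might look like. Only `Simulator`, `Genotyper` and `destroy_simulation_data()` come from the issue text. The method names, `CoveragePair`, `FixedSimulator` and `DifferenceGenotyper` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The simulation data type is a template parameter, so there are no void*
// casts and no destroy_simulation_data(): values clean up after themselves.
template <typename SimulationData>
class Simulator {
public:
    virtual ~Simulator() = default;
    virtual SimulationData generate_simulation_data() = 0;
};

template <typename SimulationData>
class Genotyper {
public:
    virtual ~Genotyper() = default;
    virtual double genotype_confidence(const SimulationData& data) const = 0;
};

template <typename SimulationData>
std::vector<double> simulate_confidences(Simulator<SimulationData>& simulator,
                                         const Genotyper<SimulationData>& genotyper,
                                         std::size_t nb_simulations) {
    assert(nb_simulations > 0);
    std::vector<double> confidences;
    confidences.reserve(nb_simulations);
    for (std::size_t i = 0; i < nb_simulations; ++i) {
        // The temporary is destroyed automatically at the end of each iteration.
        const SimulationData data = simulator.generate_simulation_data();
        confidences.push_back(genotyper.genotype_confidence(data));
    }
    return confidences;
}

// Toy user types (hypothetical) to show the interfaces in use.
struct CoveragePair {
    unsigned int correct;
    unsigned int incorrect;
};

class FixedSimulator : public Simulator<CoveragePair> {
public:
    CoveragePair generate_simulation_data() override {
        return CoveragePair{10u + calls_++, 2u};  // deterministic, for the example only
    }

private:
    unsigned int calls_ = 0;
};

class DifferenceGenotyper : public Genotyper<CoveragePair> {
public:
    double genotype_confidence(const CoveragePair& data) const override {
        return data.correct > data.incorrect
                   ? static_cast<double>(data.correct - data.incorrect)
                   : static_cast<double>(data.incorrect - data.correct);
    }
};
```

Pairing a simulator with a genotyper that expects a different data type is rejected by the compiler, which is the safety gain over `void *`.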

Modelling through cumulative distribution function

Moving Zam's idea from Slack to here:

Zam:
hmm...
guys....
once we get down to this level, isn't the function we're implementing, just getting the cumulative distribution function from a probability density?
the prob density is our input, the distribution of values the gt conf takes, call this f (edited)
the cdf (call it c) is given by
c(x)=area under f from -infinity to x.
ie probability that f is <=x
here's a boost implementation
https://www.boost.org/doc/libs/master/libs/math/doc/html/math_toolkit/dist_ref/dists/empirical_cdf.html
Don't think this really helps as we dont want a boost dependency, and i dont see other implementations of this,
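For reference, an empirical cdf over simulated values needs only the standard library. The sketch below is one possible way to do it without boost, not the library's actual implementation, and `EmpiricalCdf` is a hypothetical name. It sorts the simulated genotype confidences once and answers each c(x) query with a binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

class EmpiricalCdf {
public:
    explicit EmpiricalCdf(std::vector<double> samples) : sorted_(std::move(samples)) {
        assert(!sorted_.empty());
        std::sort(sorted_.begin(), sorted_.end());
    }

    // c(x): fraction of simulated values that are <= x.
    double operator()(double x) const {
        const auto first_greater = std::upper_bound(sorted_.begin(), sorted_.end(), x);
        return static_cast<double>(first_greater - sorted_.begin()) /
               static_cast<double>(sorted_.size());
    }

private:
    std::vector<double> sorted_;
};
```

Multiplying c(x) by 100 gives the percentile of a genotype confidence x among the simulated ones. Construction costs O(n log n) and each query costs O(log n).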
