leoisl / gcp
Genotype confidence percentile header-only C++ implementation
This library is now integrated in pandora, and I have been testing it to ensure the implementation is correct. I have found and fixed some bugs so far. This next one is not actually a bug, but more of an imprecision that is up for discussion.
To use this library, we have to simulate genotype confidences. In pandora we do so by simulating coverages of correct and incorrect alleles. Coverages of correct alleles are simulated with negative binomial or binomial models, with the parameters chosen automatically by pandora after mapping reads. Coverages of incorrect alleles are simulated using only a binomial model with n = mean_kmer_covg and p = error rate (I am not sure whether this needs to change based on the model used for the correct coverages; this is what Minos does).
Taking as an example the negative binomial model for the coverage of correct alleles (the default for pandora), we can see that 10k simulations (the default in this library) does not model the probability density function of the coverage well (the negative binomial parameters are realistic, taken from one of our samples in the evaluation). This is the pdf curve:
This is the pdf + 1k simulations curve:
This is the pdf + 10k simulations curve (the default):
If the coverages for the correct alleles are not well modelled, then the genotype confidences aren't either, and the percentiles will thus be somewhat imprecise (and more variable from execution to execution).
On my machine, 1M simulations take 3 seconds and 10M take 32 seconds, so I will change the default value for this library to 1M, but pandora will use 10M (I think the precision gain is worth this small slowdown).
From @mbhall88's original message:
I think you could use a factory constructor to side-step needing to know how the user's Genotyper is constructed. So, we just require that the Genotyper supplied implements a factory constructor, say Genotyper::from_stuff(stuff)?
This is a far better approach than implicitly hoping that the user will do this through the constructor.
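The factory-constructor idea could be sketched as follows. The only contract the library imposes is that the user's type provides a static factory method; the names here (`from_data`, `SimulatedData`, `MyGenotyper`) are hypothetical, standing in for whatever GCP actually passes:

```cpp
#include <vector>

// Hypothetical data the library would hand to the user's factory.
struct SimulatedData {
    std::vector<double> correct_coverages;
    std::vector<double> incorrect_coverages;
};

// A user-supplied genotyper: the library never calls its constructor
// directly, only the static factory method.
struct MyGenotyper {
    double confidence;
    static MyGenotyper from_data(const SimulatedData& data) {
        // Toy confidence for illustration: correct minus incorrect coverage.
        return MyGenotyper{data.correct_coverages[0] -
                           data.incorrect_coverages[0]};
    }
};

// The library-side code only assumes Genotyper::from_data(data) exists.
template <typename Genotyper>
double simulate_one_confidence(const SimulatedData& data) {
    return Genotyper::from_data(data).confidence;
}
```

This keeps the user's construction logic entirely on their side: GCP never needs to know which arguments their real constructor takes.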
We could have a Model class and create a few predefined Models, and allow users to extend this class if they need a different Model for whatever genotyping they are doing. This conflicts a bit with #6: if we do not need to simulate, there is no need for a model.
np.random.negative_binomial can be replaced by std::negative_binomial_distribution: http://www.cplusplus.com/reference/random/negative_binomial_distribution/
np.random.binomial can be replaced by std::binomial_distribution: http://www.cplusplus.com/reference/random/binomial_distribution/
The best replacement I've found for stats.rankdata() is https://sites.google.com/site/jivsoft/Home/compute-ranks-of-elements-in-a-c---array-or-vector
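As an alternative to the linked page, a `stats.rankdata()` equivalent is short enough to write directly. A minimal sketch mimicking scipy's default tie handling (method="average", 1-based ranks); this is an illustrative stand-in, not the library's actual code:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Minimal stand-in for scipy.stats.rankdata (method="average"):
// returns 1-based ranks; tied values get the average of their positions.
std::vector<double> rankdata(const std::vector<double>& values) {
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    std::vector<double> ranks(n);
    std::size_t i = 0;
    while (i < n) {
        // find the run of equal values starting at sorted position i
        std::size_t j = i;
        while (j + 1 < n && values[order[j + 1]] == values[order[i]]) ++j;
        // positions i..j (0-based) share the average 1-based rank
        const double avg_rank = (i + j) / 2.0 + 1.0;
        for (std::size_t k = i; k <= j; ++k) ranks[order[k]] = avg_rank;
        i = j + 1;
    }
    return ranks;
}
```

For example, `rankdata({3, 1, 4, 1})` gives `{3, 1.5, 4, 1.5}`, matching scipy's output for the same input.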
Thought it would be best to raise this as early as possible, but given this will be used/developed throughout the lab, maybe best to transfer to the lab organisation?
That will be safer, as type checking will be done at compile time instead of runtime.
It can also simplify the Simulator interface, as destroy_simulation_data() won't be needed anymore (the destructor of the template type will be called automatically).
It will also spare the user from having to deal with void pointers.
One option is to ask the client itself to simulate and genotype, with GCP being responsible only for producing a mapping between confidence and percentile, plus the capacity to interpolate for confidences it has not seen before. The problem is that simulations should be closely coupled to genotyping, and each tool can be very particular about how it genotypes.
Moving Zam's idea from Slack to here:
Zam:
hmm...
guys....
once we get down to this level, isn't the function we're implementing just getting the cumulative distribution function from a probability density?
the prob density is our input, the distribution of values the gt conf takes; call this f
the cdf (call it c) is given by
c(x) = area under f from -infinity to x,
i.e. the probability that a value drawn from f is <= x
here's a boost implementation
https://www.boost.org/doc/libs/master/libs/math/doc/html/math_toolkit/dist_ref/dists/empirical_cdf.html
I don't think this really helps, as we don't want a Boost dependency, and I don't see other implementations of this.
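Since the samples are simulated anyway, a Boost-free empirical CDF is only a sort plus a binary search per query. A minimal sketch of Zam's c(x) = P(value <= x) over simulated confidences (class name is illustrative):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Boost-free empirical CDF over simulated confidences:
// c(x) = fraction of samples <= x. Sort once; each query is O(log n).
class EmpiricalCdf {
public:
    explicit EmpiricalCdf(std::vector<double> samples)
        : samples_(std::move(samples)) {
        std::sort(samples_.begin(), samples_.end());
    }

    // fraction of samples <= x
    double operator()(double x) const {
        const auto it = std::upper_bound(samples_.begin(), samples_.end(), x);
        return static_cast<double>(it - samples_.begin()) /
               static_cast<double>(samples_.size());
    }

    // the percentile this library ultimately wants, in [0, 100]
    double percentile(double x) const { return 100.0 * (*this)(x); }

private:
    std::vector<double> samples_;
};
```

For confidences between two observed samples this is a step function; linear interpolation between neighbouring sorted samples could be layered on top if smoother percentiles are wanted.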