leoisl / gcp
Genotype confidence percentile header-only C++ implementation
This library is now integrated in pandora, and I have been testing it to ensure the implementation is correct. I have found and fixed some bugs so far. This next one is not actually a bug, but more of an imprecision that is up for discussion.
To use this library, we have to simulate genotype confidences. In pandora we do so by simulating coverages of correct and incorrect alleles. Coverages of correct alleles are simulated with negative binomial or binomial models, with the parameters chosen automatically by pandora after mapping reads. Coverages of incorrect alleles are simulated using only a binomial model with n = mean_kmer_covg and p = error rate (I am not sure whether this needs to change based on the model used for the correct coverages; this is what Minos does).
Taking as an example the negative binomial model for the coverage of correct alleles (the default for pandora), we can see that 10k simulations (the default in this library) does not model the probability density function of the coverage well (the negative binomial parameters are realistic, taken from one of our samples in the evaluation). This is the pdf curve:
This is the pdf + 1k simulations curve:
This is the pdf + 10k simulations curve (the default):
If the coverages for the correct alleles are not well modelled, then the genotype confidences aren't either, and the percentiles will thus be somewhat imprecise (and more variable from execution to execution).
On my machine, 1M simulations take 3 seconds and 10M take 32 seconds, so I will change the default value for this library to 1M, but pandora will use 10M (I think the precision gain is worth this small slowdown).
From @mbhall88's original message:
I think you could use a factory constructor to side-step needing to know how the user's Genotyper is constructed. So, we just require that the Genotyper supplied implements a factory constructor, say Genotyper::from_stuff(stuff)?
This is a far better approach than implicitly hoping that the user will do this through the constructor.
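The factory-constructor idea could be sketched as follows. The only contract the library imposes is that the user's type provides a static factory method; the names here (`from_data`, `SimulatedData`, `MyGenotyper`) are hypothetical, standing in for whatever GCP actually passes:

```cpp
#include <vector>

// Hypothetical data the library would hand to the user's factory.
struct SimulatedData {
    std::vector<double> correct_coverages;
    std::vector<double> incorrect_coverages;
};

// A user-supplied genotyper: the library never calls its constructor
// directly, only the static factory method.
struct MyGenotyper {
    double confidence;
    static MyGenotyper from_data(const SimulatedData& data) {
        // Toy confidence for illustration: correct minus incorrect coverage.
        return MyGenotyper{data.correct_coverages[0] -
                           data.incorrect_coverages[0]};
    }
};

// The library-side code only assumes Genotyper::from_data(data) exists.
template <typename Genotyper>
double simulate_one_confidence(const SimulatedData& data) {
    return Genotyper::from_data(data).confidence;
}
```

This keeps the user's construction logic entirely on their side: GCP never needs to know which arguments their real constructor takes.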
We could have a Model class and create a few predefined Models, and allow users to extend this class if they need a different Model for whatever genotyping they are doing. This conflicts a bit with #6: if we do not need to simulate, there is no need for a model.
np.random.negative_binomial can be replaced by std::negative_binomial_distribution: http://www.cplusplus.com/reference/random/negative_binomial_distribution/
np.random.binomial can be replaced by std::binomial_distribution: http://www.cplusplus.com/reference/random/binomial_distribution/
The best replacement I've found for stats.rankdata() is https://sites.google.com/site/jivsoft/Home/compute-ranks-of-elements-in-a-c---array-or-vector
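As an alternative to the linked page, a `stats.rankdata()` equivalent is short enough to write directly. A minimal sketch mimicking scipy's default tie handling (method="average", 1-based ranks); this is an illustrative stand-in, not the library's actual code:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Minimal stand-in for scipy.stats.rankdata (method="average"):
// returns 1-based ranks; tied values get the average of their positions.
std::vector<double> rankdata(const std::vector<double>& values) {
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    std::vector<double> ranks(n);
    std::size_t i = 0;
    while (i < n) {
        // find the run of equal values starting at sorted position i
        std::size_t j = i;
        while (j + 1 < n && values[order[j + 1]] == values[order[i]]) ++j;
        // positions i..j (0-based) share the average 1-based rank
        const double avg_rank = (i + j) / 2.0 + 1.0;
        for (std::size_t k = i; k <= j; ++k) ranks[order[k]] = avg_rank;
        i = j + 1;
    }
    return ranks;
}
```

For example, `rankdata({3, 1, 4, 1})` gives `{3, 1.5, 4, 1.5}`, matching scipy's output for the same input.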
Thought it would be best to raise this as early as possible, but given this will be used/developed throughout the lab, maybe best to transfer to the lab organisation?
That will be safer, as type checking will be done at compile time instead of runtime.
It can also simplify the Simulator interface, as destroy_simulation_data() won't be needed anymore (the destructor of the template type will be called automatically).
It will also spare the user from having to deal with void pointers.
One option is to ask the client itself to simulate and genotype, with GCP being responsible only for producing a mapping between confidence and percentile, plus the capacity to interpolate for confidences it has not seen before. The problem is that simulations should be closely coupled to genotyping, and each tool can be very particular about how it genotypes.
Moving Zam's idea from Slack to here:
Zam:
hmm...
guys....
once we get down to this level, isn't the function we're implementing just getting the cumulative distribution function from a probability density?
the prob density is our input, the distribution of values the gt conf takes; call this f
the cdf (call it c) is given by
c(x) = area under f from -infinity to x,
i.e. the probability that a value drawn from f is <= x
here's a boost implementation
https://www.boost.org/doc/libs/master/libs/math/doc/html/math_toolkit/dist_ref/dists/empirical_cdf.html
I don't think this really helps, as we don't want a Boost dependency, and I don't see other implementations of this.
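Since the samples are simulated anyway, a Boost-free empirical CDF is only a sort plus a binary search per query. A minimal sketch of Zam's c(x) = P(value <= x) over simulated confidences (class name is illustrative):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Boost-free empirical CDF over simulated confidences:
// c(x) = fraction of samples <= x. Sort once; each query is O(log n).
class EmpiricalCdf {
public:
    explicit EmpiricalCdf(std::vector<double> samples)
        : samples_(std::move(samples)) {
        std::sort(samples_.begin(), samples_.end());
    }

    // fraction of samples <= x
    double operator()(double x) const {
        const auto it = std::upper_bound(samples_.begin(), samples_.end(), x);
        return static_cast<double>(it - samples_.begin()) /
               static_cast<double>(samples_.size());
    }

    // the percentile this library ultimately wants, in [0, 100]
    double percentile(double x) const { return 100.0 * (*this)(x); }

private:
    std::vector<double> samples_;
};
```

For confidences between two observed samples this is a step function; linear interpolation between neighbouring sorted samples could be layered on top if smoother percentiles are wanted.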