There are two other issues with phase_parent() that just came to my attention.
Issue 1: See the following errors:
set.seed(13567)
GBS.array <- sim.array(size.array=50, numloci=100, hom.error=0.02, het.error=0.8, rec=0.25, selfing=0.1, imiss=0.5, misscode=3)
GBS.array <- get_true_GBS(GBS.array)
probs <- error_mx(hom.error=0.02, het.error=0.8, imiss=0.5)
system.time(phase <- phase_parent(GBS.array, win_length=2, join_length=2, verbose=F))
user system elapsed
0.31 0.00 0.31
set.seed(13567)
GBS.array <- sim.array(size.array=50, numloci=500, hom.error=0.02, het.error=0.8, rec=0.25, selfing=0.1, imiss=0.5, misscode=3)
GBS.array <- get_true_GBS(GBS.array)
probs <- error_mx(hom.error=0.02, het.error=0.8, imiss=0.5)
system.time(phase <- phase_parent(GBS.array, win_length=2, join_length=2, verbose=F))
>>> join chunks [ 1 and 2, total:2] ...
Error in log(tem) : non-numeric argument to mathematical function
Timing stopped at: 2.02 0 2.02
The only difference between the two runs is the numloci, and I think that whenever there is one or more heterozygous sites that cannot be phased, phase_parent() automatically results in an error. When we increase numloci, we start to get more simulations that result in heterozygous sites that cannot be phased.
Issue 2: Phasing runtime
I originally thought that phasing is quick, but I forgot that I was dealing with small number of sites. When I extrapolated the number of sites to 500,000, the number of hours seems astronomical. See following:
win_length=2, size_array=50, numloci=100, time=0.98s, extrapolate into real data: 1.5hours
win_length=5, size_array=50, numloci=100, time=6.89s, extrapolate into real data: 11.5hours
win_length=10, size_array=50, numloci=100, time=434s, extrapolate into real data: 30days
The only win_length that seems feasible to me is 2. But I am not sure if win_length=2 is good enough.
Thanks,
CJ