evoldoers / historian Goto Github PK

View Code? Open in Web Editor NEW

8.0 8.0 2.0 2.4 MB

Reconstruction of phylogenetic insertion-deletion histories

License: BSD 3-Clause "New" or "Revised" License

Perl 8.86% C++ 86.58% C 1.75% Makefile 1.58% Python 0.18% JavaScript 0.92% HTML 0.12%

historian's People

Contributors

Stargazers

Watchers

Forkers

bioinfotools elixie212

historian's Issues

Using a guide tree creates faulty reconstruction results

I am trying to reconstruct the ancestors of a publicly-available HIV sequence data set called target using Historian by feeding it a guide MSA and a guide tree. However, when I provide my guide tree, Historian creates a problematic output.

Here's an overview of what I've tried as input to Historian already
(target is the data set causing problems):

Guide MSA and guide alignment on a completely unrelated data set (F1):
historian r -guide F1_MSA.fasta -tree F1.tree -output fasta > F1_recon.txt
...which works fine and provides normal output: F1_recon.txt
Raw sequence file OR guide MSA file with no guide tree
historian r target.fasta -output fasta > target_recon.fasta
historian r -guide target_MSA.fasta -output fasta > target_recon.fasta
...which generates normal output too, like the reconstruction file shown below
Guide MSA file with a guide tree
historian r -guide target_MSA.fasta -tree target.tree -output fasta > target_recon.fasta
...which creates a problematic reconstruction file (first 4 sequences only): target_recon.txt
...and aligns like this in Aliview:

Tree: target_tree.txt
MSA: target_MSA.txt

Any advice or help would be greatly appreciated.

Historian refuses to run with simple command:

On current master, I get:

$ historian a.fasta
You can't fix both tree and alignment when doing MCMC - you must sample one of them!

Update models to use GGI continuous-time approximators

The goal is to update the HMMs of Historian (ML and MCMC components) to use the systematic approximation to the indel model described here: https://academic.oup.com/genetics/article/216/4/1187/6065876

The primary difficulties in doing this are as follows:

The Pair HMM of Holmes 2020 (representing evolution along a single branch) is fully connected, whereas Historian assumes that the D->I transition probability is zero. This has knock-on ramifications for the composite HMMs used for parent-sibling triads (two branches), and grandparent-parent-sibling tetrads (three branches) (although the tetrads are never used directly, instead we use Redelings-Suchard kernels to propose moves)
- If(?) there are any parts of the code that group insertions before deletions (implicitly assuming that the I->D transition is allowed but the D->I transition isn't), these may need to be updated
- Alternatively (and probably better), when calculating likelihoods of gaps that include both deletions and insertions, use combinatoric formulae to calculate likelihood of gap summed over I/D ordering
Historian currently assumes that the transition probabilities of the Pair HMM can be computed in closed form; instead a numerical approximation (eg Runge-Kutta RK4) is needed
The current parameter-fitting method of Historian uses a (fudged) EM method to compute the indel rates; instead we will need a gradient ascent approach based on autodiff of RK4, and methods that accumulate indel counts will need to be adapted to tally transition counts sorted by branch length

(A limited strategy may be to leave the current inference code in place for the ML alignment/reconstruction phase (and associated HMMs), and update only the MCMC and parameter-fitting code.
Viability of this strategy is unclear though, it seems a little risky from a correctness standpoint.)

For a full update, at a bare minimum, the following code will need to be updated:

model.h, model.cpp
- ProbModel::transProb
- IndelCounts, EventCounts
pairhmm.h, pairhmm.cpp
- This is the triad HMM (parent + siblings)
- Needs a careful analysis of the states and transition probabilities
- There is already some approximation here, it may be acceptable to ignore some transition paths
forward.h, forward.cpp
- Dynamic programming using the triad HMM
sampler.h, sampler.cpp
- Dynamic programming using the branch & triad HMMs
refiner.h, refiner.cpp

One question is... would it be better to build a generic, parallelizable, MCMC-only system using Machine Boss?

Pros of doing in in Machine Boss:

it would be cleaner
it would be generic (multiple models), thus a better fit with the research direction of investigating richer models by transducer coarse-graining
it could be designed to be parallelizable
addressing a narrower algorithm (MCMC) avoids getting bogged down in fixing historian bugs like #4
historian does not have a big user base and the code is a little messy, so maintaining it doesn't make a lot of sense

Cons of doing it in Machine Boss:

complete rewrites take time
MCMC without summing over substitutions mixes more slowly
there may be lots of annoying edge cases for generic version (eg null cycles) that can be heuristically avoided in specific version

Prune and regraft moves should reattach to any parent closer to the root

Not just parents that are in range defined by current parent (which violates ergodicity)

See https://www.researchgate.net/figure/Figure-A1-A-Subtree-pruning-and-regrafting-algorithm-to-update-the-gene-tree-topology_fig3_10954521

Use N/X instead of * for wildcards

Originally requested by @bredelings in #6

Fast placement of sequence into existing alignment/tree

Add test for -ancseq ancestral sequence reconstruction option

Actually two tests:

In reconstruction mode, it should return the most likely sequence
In MCMC mode, it should sample ancestral sequences

Large data set crashing

Hi there,

I've been using Historian to reconstruct the ancestral sequences within multiple different sequence data sets. My largest data set, which contains roughly 1000 sequences, has been unable to complete a run with Historian. The error message is displayed below with a high verbosity setting, simply stating "Killed" when it crashes.

I was first wondering whether this is expected when handling a large data set like this one. Is Historian able to handle sequence sets of this size?

If this failure isn't expected, would you know of a way to retrieve more information about the problem? Or know of possible adjustments to try?

Thanks in advance,

John

Ancestor reconstructions

I'm very impressed at what historian can do! It seems like it can/should replace PRANK, at the very least. I will make sure to include it in any future benchmarks.

For the ancestral sequence reconstructions, two thoughts:

maybe you could use N/X instead of * for an unknown letter? I think that * can mean "stop codon" in some cases, and I don't think many people in bioinformatics use *
how hard would it be to sample actually letters instead of N/X? In mcmc mode where you write a number of different alignments, that could be useful.

Benchmark MCMC

Linking needs `-lpthread` on newer systems

On my desktop, but not the remote cluster, I get this:

clang++ -std=c++11 -g -O3 -I/usr/include  -Isrc -c -o obj/jcrna.o src/jcrna.cpp
clang++ -lstdc++ -lm -lgsl -lgslcblas -lm  -lz -o bin/historian obj/historian.o obj/alignpath.o obj/ctok.o obj/dayhoff.o obj/diagenv.o obj/ECMrest.o obj/ECMunrest.o obj/fastseq.o obj/forward.o obj/gamma.o obj/gason.o obj/jc.o obj/jcrna.o obj/jones.o obj/jsonutil.o obj/knhx.o obj/lg.o obj/logger.o obj/logsumexp.o obj/memsize.o obj/model.o obj/nexus.o obj/optparser.o obj/pairhmm.o obj/presets.o obj/profile.o obj/quickalign.o obj/recon.o obj/refiner.o obj/sampler.o obj/seqgraph.o obj/simulator.o obj/span.o obj/stockholm.o obj/sumprod.o obj/tree.o obj/util.o obj/wag.o 
/usr/bin/ld: obj/logger.o: in function `std::recursive_timed_mutex::_M_clocklock(int, timespec const&)':
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/mutex:336: undefined reference to `pthread_mutex_clocklock'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Historian crashes if one of the sequences is 0-length

Hi Ian,

I hope all is well with you!

If I include zero-length sequences in the alignment, I get the following:

Abort: Zero forward likelihood even in the absence of guide alignment constraints - this is not good
Stack trace:
historian() [0x51302e]
historian() [0x4a7aaa]
historian() [0x4afa9b]
historian() [0x418079]
historian() [0x417dc4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7ffb37193d0a]
historian() [0x406b5a]
terminate called without an active exception
Aborted

I'm doing some polishing of genome-to-genome multiple-alignments ("pan-genome alignments") by realigning chunks, and historian does quite well on these, I think with -careful. Sometimes these chunks do have zero-length sequences.

No pressure, just logging the issue.

-BenRI