Giter VIP home page Giter VIP logo

historian's People

Contributors

ihh avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

historian's Issues

Using a guide tree creates faulty reconstruction results

I am trying to reconstruct the ancestors of a publicly-available HIV sequence data set called target using Historian by feeding it a guide MSA and a guide tree. However, when I provide my guide tree, Historian creates a problematic output.

Here's an overview of what I've tried as input to Historian already
(target is the data set causing problems):

  1. Guide MSA and guide alignment on a completely unrelated data set (F1):
    historian r -guide F1_MSA.fasta -tree F1.tree -output fasta > F1_recon.txt
    ...which works fine and provides normal output: F1_recon.txt

  2. Raw sequence file OR guide MSA file with no guide tree
    historian r target.fasta -output fasta > target_recon.fasta
    historian r -guide target_MSA.fasta -output fasta > target_recon.fasta
    ...which generates normal output too, like the reconstruction file shown below
    target_recon_guidemsa_notree

  3. Guide MSA file with a guide tree
    historian r -guide target_MSA.fasta -tree target.tree -output fasta > target_recon.fasta
    ...which creates a problematic reconstruction file (first 4 sequences only): target_recon.txt
    ...and aligns like this in Aliview:
    target_alignment_withtree

Tree: target_tree.txt
MSA: target_MSA.txt

Any advice or help would be greatly appreciated.

Update models to use GGI continuous-time approximators

The goal is to update the HMMs of Historian (ML and MCMC components) to use the systematic approximation to the indel model described here: https://academic.oup.com/genetics/article/216/4/1187/6065876

The primary difficulties in doing this are as follows:

  1. The Pair HMM of Holmes 2020 (representing evolution along a single branch) is fully connected, whereas Historian assumes that the D->I transition probability is zero. This has knock-on ramifications for the composite HMMs used for parent-sibling triads (two branches), and grandparent-parent-sibling tetrads (three branches) (although the tetrads are never used directly, instead we use Redelings-Suchard kernels to propose moves)
    • If(?) there are any parts of the code that group insertions before deletions (implicitly assuming that the I->D transition is allowed but the D->I transition isn't), these may need to be updated
    • Alternatively (and probably better), when calculating likelihoods of gaps that include both deletions and insertions, use combinatoric formulae to calculate likelihood of gap summed over I/D ordering
  2. Historian currently assumes that the transition probabilities of the Pair HMM can be computed in closed form; instead a numerical approximation (eg Runge-Kutta RK4) is needed
  3. The current parameter-fitting method of Historian uses a (fudged) EM method to compute the indel rates; instead we will need a gradient ascent approach based on autodiff of RK4, and methods that accumulate indel counts will need to be adapted to tally transition counts sorted by branch length

(A limited strategy may be to leave the current inference code in place for the ML alignment/reconstruction phase (and associated HMMs), and update only the MCMC and parameter-fitting code.
Viability of this strategy is unclear though, it seems a little risky from a correctness standpoint.)

For a full update, at a bare minimum, the following code will need to be updated:

One question is... would it be better to build a generic, parallelizable, MCMC-only system using Machine Boss?

Pros of doing in in Machine Boss:

  • it would be cleaner
  • it would be generic (multiple models), thus a better fit with the research direction of investigating richer models by transducer coarse-graining
  • it could be designed to be parallelizable
  • addressing a narrower algorithm (MCMC) avoids getting bogged down in fixing historian bugs like #4
  • historian does not have a big user base and the code is a little messy, so maintaining it doesn't make a lot of sense

Cons of doing it in Machine Boss:

  • complete rewrites take time
  • MCMC without summing over substitutions mixes more slowly
  • there may be lots of annoying edge cases for generic version (eg null cycles) that can be heuristically avoided in specific version

Large data set crashing

Hi there,

I've been using Historian to reconstruct the ancestral sequences within multiple different sequence data sets. My largest data set, which contains roughly 1000 sequences, has been unable to complete a run with Historian. The error message is displayed below with a high verbosity setting, simply stating "Killed" when it crashes.

Screenshot from 2019-07-09 13-05-30

I was first wondering whether this is expected when handling a large data set like this one. Is Historian able to handle sequence sets of this size?

If this failure isn't expected, would you know of a way to retrieve more information about the problem? Or know of possible adjustments to try?

Thanks in advance,

John

Ancestor reconstructions

I'm very impressed at what historian can do! It seems like it can/should replace PRANK, at the very least. I will make sure to include it in any future benchmarks.

For the ancestral sequence reconstructions, two thoughts:

  • maybe you could use N/X instead of * for an unknown letter? I think that * can mean "stop codon" in some cases, and I don't think many people in bioinformatics use *
  • how hard would it be to sample actually letters instead of N/X? In mcmc mode where you write a number of different alignments, that could be useful.

Linking needs `-lpthread` on newer systems

On my desktop, but not the remote cluster, I get this:

clang++ -std=c++11 -g -O3 -I/usr/include  -Isrc -c -o obj/jcrna.o src/jcrna.cpp
clang++ -lstdc++ -lm -lgsl -lgslcblas -lm  -lz -o bin/historian obj/historian.o obj/alignpath.o obj/ctok.o obj/dayhoff.o obj/diagenv.o obj/ECMrest.o obj/ECMunrest.o obj/fastseq.o obj/forward.o obj/gamma.o obj/gason.o obj/jc.o obj/jcrna.o obj/jones.o obj/jsonutil.o obj/knhx.o obj/lg.o obj/logger.o obj/logsumexp.o obj/memsize.o obj/model.o obj/nexus.o obj/optparser.o obj/pairhmm.o obj/presets.o obj/profile.o obj/quickalign.o obj/recon.o obj/refiner.o obj/sampler.o obj/seqgraph.o obj/simulator.o obj/span.o obj/stockholm.o obj/sumprod.o obj/tree.o obj/util.o obj/wag.o 
/usr/bin/ld: obj/logger.o: in function `std::recursive_timed_mutex::_M_clocklock(int, timespec const&)':
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/mutex:336: undefined reference to `pthread_mutex_clocklock'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Historian crashes if one of the sequences is 0-length

Hi Ian,

I hope all is well with you!

If I include zero-length sequences in the alignment, I get the following:

Abort: Zero forward likelihood even in the absence of guide alignment constraints - this is not good
Stack trace:
historian() [0x51302e]
historian() [0x4a7aaa]
historian() [0x4afa9b]
historian() [0x418079]
historian() [0x417dc4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7ffb37193d0a]
historian() [0x406b5a]
terminate called without an active exception
Aborted

I'm doing some polishing of genome-to-genome multiple-alignments ("pan-genome alignments") by realigning chunks, and historian does quite well on these, I think with -careful. Sometimes these chunks do have zero-length sequences.

No pressure, just logging the issue.

-BenRI

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.