Giter VIP home page Giter VIP logo

libpll's Introduction

libpll

Build Status License

Introduction

The aim of this project is to implement a versatile high-performance software library for phylogenetic analysis. The library should serve as a lower-level interface of PLL (Flouri et al. 2015) and should have the following properties:

  • open source code with an appropriate open source license.
  • 64-bit multi-threaded design that handles very large datasets.
  • easy to use and well-documented.
  • SIMD implementations of time-consuming parts.
  • as fast or faster likelihood computations than RAxML (Stamatakis 2014).
  • fast implementation of the site repeats algorithm (Kobert 2017).
  • functions for tree visualization.
  • bindings for Python.
  • generic and clean design.
  • Linux, Mac, and Microsoft Windows compatibility.

Compilation instructions

Currently, libpll requires that GNU Bison and Flex are installed on the target system. On a Debian-based Linux system, the two packages can be installed using the command

apt-get install flex bison

The library also requires that a GNU system is available as it uses several functions (e.g. asprintf) which are not present in the POSIX standard. This, however will change in the future in order to have a more portable and cross-platform library.

The library can be compiled using either of the following two ways.

Cloning the repo Clone the repo and bild the executable and documentation using the following commands.

git clone https://github.com/xflouris/libpll.git
cd libpll
./autogen.sh
./configure
make
make install    # as root, otherwise run: sudo make install

When using the cloned repository version, you will also need autoconf, automake and libtool installed. On a Debian-based Linux system, the packages can be installed using the command

sudo apt-get install autotools-dev autoconf libtool

The library will be installed on the operating system's standard paths. For some GNU/Linux distributions it might be necessary to add that standard path (typically /usr/local/lib) to /etc/ld.so.conf and run ldconfig.

Microsoft Windows compatibility was tested with a cross-compiler and seems to work out-of-the-box using MingW.

Available functionality

libpll currently implements the General Time Reversible (GTR) model (Tavare 1986) which can be used for nucleotide and amino acid data. It supports models of variable rates among sites, the Inv+Γ (Gu et al. 1995) and has functions for computing the discretized rate categories for the gamma model (Yang 1994). Furthermore, it supports several methods for ascertainment bias correction (Kuhner et al. 2000, McGill et al. 2013, Lewis 2011, Leaché et al. 2015). Additional functionality includes tree visualization, functions for parsimony (minimum mutation cost) calculation and ancestral state reconstruction using Sankoff's method (Sankoff 1975, Sankof and Rousseau 1975). The functions for computing partials, evaluating the log-likelihood and updating transition probability matrices are vectorized using both SSE3, AVX and AVX2 instruction sets.

Documentation

Please refer to the wiki page.

Usage examples

Please refer to the wiki page and/or the examples directory.

libpll license and third party licenses

The libpll code is currently licensed under the GNU Affero General Public License version 3. Please see LICENSE.txt for details.

libpll includes code from several other projects. We would like to thank the authors for making their source code available.

libpll includes code from GNU Compiler Collection distributed under the GNU General Public License.

Code

The code is written in C with some parts written using in-line assembler and intrinsic functions.

File Description
compress.c Functions for compressing alignment into site patterns.
core_derivatives_avx2.c AVX2 vectorized core functions for computing derivatives of the likelihood function.
core_derivatives_avx.c AVX vectorized core functions for computing derivatives of the likelihood function.
core_derivatives.c Core functions for computing derivatives of the likelihood function.
core_derivatives_sse.c SSE vectorized core functions for computing derivatives of the likelihood function.
core_likelihood_avx2.c AVX2 vectorized core functions for computing the log-likelihood.
core_likelihood_avx.c AVX vectorized core functions for computing the log-likelihood.
core_likelihood.c Core functions for computing the log-likelihood, that do not require partition instances.
core_likelihood_sse.c SSE vectorized core functions for computing the log-likelihood.
core_partials_avx2.c AVX2 vectorized core functions for updating vectors of conditional probabilities (partials).
core_partials_avx.c AVX vectorized core functions for updating vectors of conditional probabilities (partials).
core_partials.c Core functions for updating vectors of conditional probabilities (partials).
core_partials_sse.c SSE vectorized core functions for updating vectors of conditional probabilities (partials).
core_pmatrix_avx2.c AVX2 vectorized core functions for updating transition probability matrices.
core_pmatrix_avx.c AVX vectorized core functions for updating transition probability matrices.
core_pmatrix.c Core functions for updating transition probability matrices.
core_pmatrix_sse.c SSE vectorized core functions for updating transition probability matrices.
derivatives.c Functions for computing derivatives of the likelihood function.
fasta.c Functions for parsing FASTA files.
fast_parsimony_avx2.c AVX2 fast unweighted parsimony functions.
fast_parsimony_avx.c AVX fast unweighted parsimony functions.
fast_parsimony.c Non-vectorized fast unweighted parsimony functions.
fast_parsimony_sse.c SSE fast unweighted parsimony functions.
gamma.c Functions related to Gamma (Γ) function and distribution.
hardware.c Hardware detection functions.
lex_rtree.l Lexical analyzer for parsing newick rooted trees.
lex_utree.l Lexical analyzer for parsing newick unrooted trees.
likelihood.c Functions ofr computing the log-likelihood of a tree given a partition instance.
list.c (Doubly) Linked-list implementations.
maps.c Character mapping arrays for converting sequences to the internal representation.
models.c Model parameters related functions.
output.c Functions for output in terminal (i.e. conditional likelihood arrays, probability matrices).
parse_rtree.y Functions for parsing rooted trees in newick format.
parse_utree.y Functions for parsing unrooted trees in newick format.
parsimony.c Parsimony functions.
partials.c Functions for updating vectors of conditional probabilities (partials).
phylip.c Functions for parsing phylip files.
pll.c Functions for setting PLL partitions (instances).
random.c Re-entrant multi-platform pseudo-random number generator.
rtree.c Rooted tree manipulation functions.
utree.c Unrooted tree manipulation functions.
utree_moves.c Functions for topological rearrangements on unrooted trees.
utree_svg.c Functions for SVG visualization of unrooted trees.

Bugs

The source code in the master branch is thoroughly tested before commits. However, mistakes may happen. All bug reports are highly appreciated. You may submit a bug report here on GitHub as an issue, or you could send an email to [email protected].

libpll core team

  • Tomáš Flouri
  • Diego Darriba
  • Kassian Kobert
  • Mark T. Holder
  • Alexey Kozlov
  • Alexandros Stamatakis

Acknowledgements

Special thanks to the following people for patches and suggestions:

  • Frederick Matsen
  • Ben Redelings
  • Andreas Tille
  • Ziheng Yang

Contributing to libpll

Please read the section Contributing to libpll of the wiki.

References

  • Flouri T., Izquierdo-Carrasco F., Darriba D., Aberer AJ, Nguyen LT, Minh BQ, von Haeseler A., Stamatakis A. (2015) The Phylogenetic Likelihood Library. Systematic Biology, 64(2): 356-362. doi:10.1093/sysbio/syu084

  • Gu X., Fu YX, Li WH. (1995) Maximum Likelihood Estimation of the Heterogeneity of Substitution Rate among Nucleotide Sites. Molecular Biology and Evolution, 12(4): 546-557.

  • Kobert K., Stamatakis A., Flouri T. (2017) Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations. Systematic Biology, 66(2): 205-217. doi:10.1093/sysbio/syw075

  • Leaché AL, Banbury LB, Felsenstein J., de Oca ANM, Stamatakis A. (2015) Short Tree, Long Tree, Right Tree, Wrong Tree: New Acquisition Bias Corrections for Inferring SNP Phylogenies. Systematic Biology, 64(6): 1032-1047. doi:10.1093/sysbio/syv053

  • Lewis LO. (2001) A Likelihood Approach to Estimating Phylogeny from Discrete Morphological Character Data. Systematic Biology, 50(6): 913-925. doi:10.1080/106351501753462876

  • Sankoff D. (1975) Minimal Mutation Trees of Sequences. SIAM Journal on Applied Mathematics, 28(1): 35-42. doi:10.1137/0128004

  • Sankoff D, Rousseau P. (1975) Locating the Vertices of a Steiner Tree in Arbitrary Metric Space. Mathematical Programming, 9: 240-246. doi:10.1007/BF01681346

  • Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9): 1312-1313. doi:10.1093/bioinformatics/btu033

  • Tavaré S. (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. American Mathematical Sciety: Lectures on Mathematics in the Life Sciences, 17: 57-86.

  • Yang Z. (2014) Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution, 39(3): 306-314. doi:10.1007/BF00160154

libpll's People

Contributors

amkozlov avatar ddarriba avatar matsen avatar mtholder avatar pierrebarbera avatar xflouris avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libpll's Issues

Add pattern weights vector to partition instance

  • Add a vector pattern_weights of pattern weights to pll_partition_t structure
  • Add a pll_set_pattern_weights() function
  • Modify pll_compute_root_loglikelihood and pll_compute_edge_loglikelihood() to accommodate pattern_weights

Setup a wiki with current libpll documentation

Setup wiki documentation for the current milestone:

  • Basics about libpll. Partition structure, how to setup a partition.
  • Explain how tip vectors should look like in a typical phylogenetic analysis (given a solid and a degenerate character)
  • The two ways of setting tip vectors (set clv manually or pass sequence and map)
  • How to parse FASTA files (complete or partial parse)
  • How to parse newick files (rooted and unrooted)
  • How to setup frequencies, rate heterogeneity, substitution parameters
  • Explain what is a rate matrix (R), an instantaneous rate matrix (Q-matrix) and a transition probability matrix (P-matrix). Show all the gory details of creating a p-matrix (normalization at diagonal etc). Give a description of the GTR.
  • How to compute the p-matrices with the library (update_prob_matrices) given the branch lengths
  • How to compute the CLV vectors (pll_update_partials)
  • How to evaluate the log likelihood of a rooted tree (pll_compute_root_loglikelihood)
  • How to evaluate the log likelihood of an unrooted tree (pll_compute_edge_loglikelihood)
  • How to account for invariant sites
  • Document all protein models, give references
  • Document the demo examples

Change of data types

Perhaps now it would be a good idea to decide the data types of the following pll_partition_t variables:

type Variable Role Suggested type
int tips Number of tip sequences unsigned long
int clv_buffers Number of CLV buffers (for inner nodes) unsigned long
int states Number of states int
int sites Number of sites in partition unsigned long
int rate_matrices Number of different rate matrices that will be used unsigned int
int prob_matrices Number of probability matrices to allocate unsigned long
int rate_cats Number of rate categories unsigned int
int scale_buffers Number of scalers unsigned long
unsigned int ** scale_buffer Scalers unsigned long **
int * invariant Indicates whether a site is invariant or not unsigned long *
int * eigen_decomp_valid Indicates whether the eigen decomposition for a pmatrix is valid int *

My justification for the proposed changes is that int (or unsigned int) could be an insufficient range for large phylogenomic datasets. Changing tips, clv_buffers, sites, prob_matrices and scale_buffers would enable the analysis of datasets comprising of more than 4,294,967,295 sites.
The drawback is that the memory requirements will double.

Also, if the proposed changed are done, it would also make sense to change scale_buffers, invariant to unsigned long ** and unsigned long * respectively, because the new data type (unsigned long) would then be of the same size (8 bytes) as a CLV entry (double), and this would improve vectorization efficiency.

What do you think?

Implement numerical scaling

My idea is to test several strategies, although we should discuss them before implementing.

  • Option for disabling scaling completely
  • Option for scaling always
  • Option for scaling 'sometimes'

Functionality for customized traversal of rooted and unrooted trees

Update pll_rtree_traverse and pll_utree_traverse in a way to allow a customized traversal using a callback function. For example, to be able to perform a partial postorder traversal based on the validity of CLV at inner nodes. That is if nodes are correctly oriented, then the subtree rooted at those nodes is not traversed.

profiling of CLV and P-matrix updates

We should do profiling runs for full tree traversals to determine how much time is spent in P-matrix and CLV updates and if it would be worthwhile to further optimize the p matrix calcs.

Implement a FASTA parser

Implement two functions. One for reading a complete FASTA file in memory, and another for iteratively getting the reads. Will copy the functions from SALT.

Potential bug with odd number of states

PMatrices are allocated according to the padded number of states:

partition->pmatrix[i] = pll_aligned_alloc(states_padded * states_padded *
                                              rate_cats * sizeof(double),
                                              partition->alignment);

However, when computing the likelihood it does not account for the padding. It works well for 4 and 20 states, but it might be a problem if padding is applied.

pmatrix = partition->pmatrix[matrix_index];
for (i = 0; i < partition->rate_cats; ++i)
{
  for (j = 0; j < partition->states; ++j)
  {
     ....
     pmatrix += states;
  }
}

We should check if this problem happens somewhere else in the code.

Getting initial branch length estimate

I also wondered if branch length optimization can be accelerated, if, instead of initial default branch lengths, we compute a good initial value for each branch separately using parsimony

Implement a method for using the least amount of CLV vectors

Implement a pll_update_operations_minimal that takes a list of operations and an integer N and re-assigns parent_clv_index, child1_clv_index and child2_clv_index of all nodes in order to use the minimal amount C of CLVs, which is, in the worst case, log(n)+2.

Another useful function would be: Given a tree, and not requiring a partition instance, compute the minimal number C, such that this amount may be passed later on to the pll_partition_create function, in case we are working with very large datasets.

tip lookup table

I believe that implementing separate, sepcifically optimized functions for the tips should yield huge performance improvements, keep in mind that CLVs above the tips make up for about 50% of all CLV updates

Fix and document lg4x example

Several issues:

  • instead of the source code there is a binary executable
  • missing documentation (overall description) as a README.md file in the example dir.

low low level PLL

Offer core LLPLL functions that do not require passing a partition data type as argument, but just C arrays and variables, this will facilitate integration with ExaML.

Makefile doesn't work on OS X

The Makefile doesn't work on Mac OS X because -soname is not a valid compiler option. I think, it should be substituted with -install_name on OS X (if that's possible).

Work on highly efficient parallelization

Extend/adapt partition datatype such that it can store the overall length of the partition and also the number of sites it does computations on, for instance, when the partition has been split up across one or more processors.

Vectorization & brief code review

Just a few comments on the vectorization if you want to make it generic:

  1. Keep in mind that for a truly generic implementation there might be data-types that are not multiples of the vector width, e.g., a 7-state model (this exists for RNA secondary structure models!)
  2. Thus, you may have to use uload for vector words and also need to add additional loops outside the vectorized loops for handling those indexes that don't fit into a vector word.
  3. Alternatively, you can of course use padding, but that may be more complicated to implement

Test case for multi-platform compatibility

It would be nice to have test cases for Win{32,64} and Mac.

For example, something like ./runtest.py win64, which will use a cross-compiler (i.e. mingw) to build a PE file.

Per-category scalers

If the shape of the Gamma distribution is low enough, it might happen that the likelihood scores of different categories require different scaling. That means that at some point it is very likely that some per-category likelihoods go under precision. We can keep one individual scaler for each site and category (thus this requires 'c' times more memory for the scalers, where 'c' is the number of different categories), and the likelihood computation is modified as follows:

codecogseqn
s_{i,j} is the scaler for site 'i' and category 'j'.
K is the likelihood scaling threshold

Apart from the additional memory required, which is not excessive compared to the CLVs, we need to check whether each term with scaling involved in the sum (i.e., s_{i,j} > s_{i,min}) may or may not go out of precision. If we assume that K is fixed to 2^256, then a scaling >= 2 is enough for discarding the term. However, we might want to implement this in a more generic way and decide which would be the highest supported difference in the scalers. For example, we could change the scaling threshold at some point for testing purposes.

Problem when scaling is required but scaling is disabled.

A segmentation fault occurs when scaling is disabled for the parent node, i.e. operations[i].parent_scaler_index = PLL_SCALE_BUFFER_NONE, in which case the scaler array is set to NULL. If, however, scaling is required, the function update_partials() will attempt to fill the scaler array.

pll_traverse_utree

Depending on the starting node we pass, pll_traverse_utree will set the tip CLV indices in the operations structure on a visit-first order (post-order traversal). Therefore, the indices may (and will) not correspond to the indices used in pll_set_tip_states or pll_set_tip_clv when setting tip CLVs using sequences.

Proposed solution

Add a CLV index variable in the pll_utree_t structure for tip nodes, which must be filled, for instance when setting tip CLVs from a tree traversal. The function pll_traverse_utree will then uses those CLV indices, instead of (wrongly) enumerating on a first-visit basis.

Range based function to update a CLV

A CLV update function, that only updates between indices i and j, where 0 <= i < j < num_sites

Motivation:
Suppose you're calculating the inner node CLV between two leafs with sequences

-  -  -  -  -  A  T  A  G  C  T  -  -  -  -  -  - 
-  -  -  -  -  A  T  T  G  C  T  A  -  -  -  -  - 
0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

then ideally we would like to skip the computation for the outer parts where neither sequence has valid data.

performance

Compare performance with PLL, right now, looking at the code (nicely written by the way) I can still not believe that the performance differences are so small ....

Thread-based parallelization

I am just working on developing a PThreads-version of ExaML.

In this context I observed the following:

  1. We mainly need to implement a function analogous to the MPI allreduce operation in PThreads.
  2. Thinking of auto-tuned likelihood implementations, we may also have to consider aut-tuned barrier implementations. The reason is that barrier performance and types required will probably be very different if we just have one partition or several partitions that are mostly distributed monolithically to threads.

This could be a nice technical paper in the context of a master's thesis.

Set custom tip clv

  1. extend pll_set_tip_states() with a map parameter that will specify the conversion of characters to states. Currently the maps map_nt and map_aa are implicitely used if states=4 and states=20 respectively.
  2. add a function pll_set_tip_clv() which can be used instead of pll_set_tip_states for setting the tip clv manually.

Implement simple examples

  • Evaluating the log-likelihood of a rooted tree
  • Evaluating the log-likelihood of an unrooted tree
  • Perhaps one of them with DNA data, the other with AA
  • Show invariant sites usage
  • Include parsing of fasta/newick files
  • Show both methods for setting the tip CLVs.

Alternative branch length optimization approaches

for testing purposes

a) Parallel Newton-Raphson optimization

Instead of iteratively optimize and apply the branch lengths, optimize each of them and store the optimized value (proposal) in a buffer. Once all branches have been optimized, applied them and perform a full traversal.

The advantage of this approach is that branches can be optimized in parallel and in an arbitrary order, leading always to the same branch lengths (reproducible result regardless how they are computed). There is no need to update CLVs/scalers between branches, so the optimization of the set of branches in a single core is faster for each iteration. However, the convergence of the branch length optimization process will probably take a longer time, and perhaps it is prone to get stuck in local optima easier, or to end with a lower global likelihood.

b) Multiple-branch BFGS

Optimize simultaneously small subsets of branches (e.g., 5 to 10) using BFGS. Instead of approximating the gradients, they can be calculated using the first derivative.

Protein models validation

For the 2000 taxon dataset it seems that for some models we are not getting the correct log-likelihood. Specifically:

DAYHOFF

Program log-likelihood
phyml -588223.285675
libpll -588219.889941
PLL -588222.723049

DCMUT

Program log-likelihood
phyml -587850.513078
libpll -587846.694946
PLL -587849.020145
PAUP -587849.11689

CPREV

Program log-likelihood
phyml -575729.707167
libpll -575646.393357
PLL -575728.228283

VT

Program log-likelihood
phyml -572919.022689
libpll -572916.091308
PLL -572913.456739

BLOSUM62

Program log-likelihood
phyml -560908.378240
libpll -560904.810505
PLL -560901.666187

MTZOA

Program log-likelihood
phyml (not computed yet)
libpll -577035.462869
PLL -577322.089243

PMB

Program log-likelihood
phyml (not computed yet)
libpll -563036.633537
PLL -563034.893054

HIVB

Program log-likelihood
phyml -621655.329921
libpll -621650.754625
PLL -621634.016406

JTTDCMUT

Program log-likelihood
phyml (not computed yet)
libpll -577930.954592
PLL -577875.933972
PAUP -577931.07336

testing framework

Please add a wiki documentation section with the steps on how to run the testing framework. It's always a painful thing to make it run, especially with the increasing number of header files.

should code be defensive about compilation with -DNDEBUG ?

Or is the library intended to only be compiled with asserts enabled?

The really anal approach is to write:

assert(root != NULL);

as:

assert(root != NULL);
if (root == NULL)
{
  *some code here*
}

IMHO, this is too anal. Just thought I'd ask...

change in API naming

The following API function names will be modified

Old New
pll_create_partition pll_partition_create
pll_destroy_partition pll_partition_destroy
pll_parse_newick_rtree pll_rtree_parse_newick
pll_destroy_rtree pll_rtree_destroy
pll_show_ascii_rtree pll_rtree_show_ascii
pll_write_newick_rtree pll_rtree_export_newick
pll_traverse_rtree pll_rtree_traverse
pll_query_rtree_tipnames pll_rtree_query_tips with a change in functionality such that pointers to tips are returned instead of the labels
pll_parse_newick_utree pll_utree_parse_newick
pll_destroy_utree pll_utree_destroy
pll_show_ascii_utree pll_utree_show_ascii
pll_write_newick_utree pll_utree_export_newick
pll_traverse_utree pll_utree_traverse
pll_query_utree_tipnames pll_utree_query_tips with a change in functionality such that pointers to tips are returned instead of the labels

Implement site weights for each partition

Add an array of integers to the partition structure which will hold the frequencies of each site. It will be initialized to all ones at the pll_partition_create() call, and then the user will be able to set it by calling pll_set_site_weights().

The prototype of the function should be:

PLL_EXPORT void pll_set_site_weights(pll_partition_t * partition,
                                     const unsigned int * weights);

BFGS pitfalls

Hi guys,

Check the latest RAxML commit (optimizeModel.c) where I fixed a numerical issue in BFGS:

stamatak/standard-RAxML@9a02452

You need to be very careful about specifying the lower bounds for the variables you wish to optimize, otherwise you'll spend a day debugging at least. The lower bounds should not be smaller than the error margin!

Alexis

model-optimize example

Three things

  • missing documentation (overall description) as a README.md file in the example dir.
  • keep consistent directory structures (examples dont have their own src).
  • currently broken with the TIP-TIP addition.

Need to pass a map to the pll_create_partition. However, the map is decided after the partition is created. I guess it's a simple fix to move the switch deciding the map before the pll_create_partition.

But i'm not well oriented in this example, I guess you can do it quickly @ddarriba

Implement protein models

Set the matrices for the popular models and append an interface function for setting the parameters in the partition.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.