Giter VIP home page Giter VIP logo

treecluster's Introduction

TreeCluster

TreeCluster is a tool that, given a tree T (Newick format) and a distance threshold t, finds the minimum number of clusters of the leaves of T such that some user-specified constraint is met in each cluster. The user can also specify a branch support threshold s such that no pair of leaves in any cluster can be connected by branches with support less than or equal to s. Note that all leaves given a cluster of -1 are singletons, meaning they did not cluster with any other leaves (i.e., each leaf with a cluster of -1 is in its own cluster).

TreeCluster was motivated by Cluster Picker.

The default method is "Max Clade" (see Clustering Methods). There is no explicit default distance threshold, but because Cluster Picker recommends a distance threshold of 0.045 and because the same objective function is optimized by both Cluster Picker and TreeCluster "Max Clade", we currently recommend 0.045 as well.

Note that TreeCluster can run within seconds even on ultra-large datasets, so it may make sense to use a range of thresholds and determine the appropriate choice based on the results. We intend to develop non-parametric methods of TreeCluster clustering in the future.

Installation

TreeCluster can be installed using pip:

sudo pip install treecluster

If you are using a machine on which you lack administrative powers, TreeCluster can be installed locally using pip:

pip install --user treecluster

Usage

usage: TreeCluster.py [-h] [-i INPUT] -t THRESHOLD [-s SUPPORT] [-m METHOD] [-tf THRESHOLD_FREE]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input Tree File (default: stdin)
  -o OUTPUT, --output OUTPUT
                        Output File (default: stdout)
  -t THRESHOLD, --threshold THRESHOLD
                        Length Threshold (default: None)
  -s SUPPORT, --support SUPPORT
                        Branch Support Threshold (default: -inf)
  -m METHOD, --method METHOD
                        Clustering Method (options: avg_clade, leaf_dist_avg, leaf_dist_max, leaf_dist_min, length, length_clade, max, max_clade,
                        med_clade, root_dist, single_linkage, single_linkage_cut, single_linkage_union, sum_branch, sum_branch_clade) (default: max_clade)
  -tf THRESHOLD_FREE, --threshold_free THRESHOLD_FREE
                        Threshold-Free Approach (options: argmax_clusters) (default: None)
  -v, --verbose         Verbose Mode (default: False)
  --version             Display Version (default: False)

Example Files and Helper Scripts

To help users, we have provided example files in the example directory, and we have provided some helper scripts that implement common clustering-related tasks in the helper_scripts directory:

Clustering Methods

  • Avg Clade: Cluster the leaves such that the following conditions hold for each cluster:

    1. The average pairwise distance between leaves in the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The leaves in the cluster must define a clade in T
    • For a tree with n leaves, this algorithm is O(n) in the worst case
    • If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
  • Leaf Dist Avg: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the average root-to-tip distance

    1. Branches with support less than or equal to s are simply treated as infinitely long
    2. For a tree with n leaves, this algorithm is O(n) in the worst case
  • Leaf Dist Max: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the furthest leaf from the root

    1. Branches with support less than or equal to s are simply treated as infinitely long
    2. For a tree with n leaves, this algorithm is O(n) in the worst case
  • Leaf Dist Min: Cluster the leaves by cutting the tree at t distance away from the bottom of the tree, where "bottom" is defined as the closest leaf to the root

    1. Branches with support less than or equal to s are simply treated as infinitely long
    2. For a tree with n leaves, this algorithm is O(n) in the worst case
  • Length: Cluster the leaves such that the following conditions hold for each cluster:

    1. The cluster does not contain any edges above length t
    2. Leaves cannot be connected by branches with support less than or equal to s
    • For a tree with n leaves, this algorithm is O(n) in the worst case
  • Length Clade: Cluster the leaves such that the following conditions hold for each cluster:

    1. The cluster does not contain any edges above length t
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The leaves in the cluster must define a clade in T
    • For a tree with n leaves, this algorithm is O(n) in the worst case
    • If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
  • Max: Cluster the leaves such that the following conditions hold for each cluster:

    1. The maximum pairwise distance between leaves in the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    • For a tree with n leaves, this algorithm is O(n) in the worst case
  • Max Clade: Cluster the leaves such that the following conditions hold for each cluster:

    1. The maximum pairwise distance between leaves in the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The leaves in the cluster must define a clade in T
    • For a tree with n leaves, this algorithm is O(n) in the worst case
    • If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
  • Med Clade: Cluster the leaves such that the following conditions hold for each cluster:

    1. The median pairwise distance between leaves in the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The leaves in the cluster must define a clade in T
    • For a tree with n leaves, this algorithm is O(n² log n) in the worst case
    • If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error
  • Root Dist: Cluster the leaves by cutting the tree at t distance away from the root

    • Branches with support less than or equal to s are simply treated as infinitely long
    • For a tree with n leaves, this algorithm is O(n) in the worst case
  • Single Linkage: Cluster the leaves such that the following conditions hold:

    1. For any two leaves u and v, if the distance between u and v is at most t, they must be in the same cluster
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The number of clusters is maximized
    • For a tree with n leaves, this algorithm is O(n) in the worst case
  • Sum Branch: Cluster the leaves such that the following conditions hold for each cluster:

    1. The total branch length of the spanning tree connecting the leaves of the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    • For a tree with n leaves, this algorithm is O(n) in the worst case
  • Sum Branch Clade: Cluster the leaves such that the following conditions hold for each cluster:

    1. The total branch length of the spanning tree connecting the leaves of the cluster is at most t
    2. Leaves cannot be connected by branches with support less than or equal to s
    3. The leaves in the cluster must define a clade in T
    • For a tree with n leaves, this algorithm is O(n) in the worst case
    • If verbose mode is enabled (-v), the clades defined by the clusters will be printed to standard error

Threshold-Free Approaches

  • Argmax Clusters: Choose the threshold that maximizes the number of non-singleton clusters over all thresholds from 0 to t
    • Currently, for the sake of speed, only every 0.001 threshold is tested (i.e., 0, 0.001, 0.002, ..., t)

Requirements

Citing TreeCluster

If you use TreeCluster in your work, please cite:

Balaban M, Moshiri N, Mai U, Jia X, Mirarab S (2019). "TreeCluster: Clustering biological sequences using phylogenetic trees." PLoS ONE. 14(8):e0221068. doi:10.1371/journal.pone.0221068

treecluster's People

Contributors

balabanmetin avatar niemasd avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

treecluster's Issues

I am facing some issue

I am getting an error which I am not able to understand " File /usr/local/lib/python3.8/dist-packages/treeswift/Tree.py", line 1455, in read_tree_newick raise RuntimeError("Failed to parse string as Newick: %s"%ts)

Please help me troubleshoot this.

Thanks

ClusterNumber of -1?

Hi,

Thank you for developing TreeCluster. When performing clustering, what does a ClusterNumber of -1 describe?

Thanks,

Ammar

Explanation of hreshold free approach?

Hi there,

I'm interested in trying out the threshold-free approach of TreeCluster, but I can't seem to figure out what the command should be. Could you please assist?

Running TreeCluster normally, I've successfully used the following:

TreeCluster.py -t 0.0001 -i mytree.tre

Using the threshold-free approach, I've tried:

TreeCluster.py -i mytree.tre  --threshold_free

Output: "TreeCluster.py: error: argument -tf/--threshold_free: expected one argument"

TreeCluster.py -i mytree.tre  --threshold_free 5

Output: "TreeCluster.py: error: the following arguments are required: -t/--threshold"

TreeCluster.py -i mytree.tre  --threshold_free 5 -t 0.0001

Output: "AssertionError: ERROR: Invalid threshold-free approach: 5"

I'm stumped. What's the correct usage? Intuitively, I'd assume a threshold-free approach wouldn't require setting a threshold.

Thanks

max_clade issue

I notice that function min_clusters_threshold_max_clade calls treeswift.resolve_polytomies. Given the tree,
(A:2,B:2,C:1);
TreeCluster.py -t 3 -i <( echo "(A:2,B:2,C:1)" )
SequenceName ClusterNumber
A -1
B 1
C 1
This breaks the clade (A,B,C) into 2 clusters. On the other hand, if you comment out line 415, the result is,
SequenceName ClusterNumber
A -1
B -1
C -1
of which each cluster is a clade.

New functionality requested for TreeCluster

Please add some functionality to provide information on how DIFFERENT each cluster is from one another. Currently, the program only identifies taxa that are within a certain distance from one another but does not reveal if other groups would be pulled in with just a small change in bootstrap/distance.

Newick tree file with new lines

Greetings,

I'm using MAFFT to output Newick Tree files for my sequences, but MAFFT seems to output the Newick files with multiple new lines / carriage returns and TreeCluster seems to choke on it (see traceback below). My only work around is to open the raw Newick output and use find and replace to re-organize the file so everything is on one line (rather than many lines by deleting the returns). I don't know if this was intended but it seems like it would be better a beginner (like me) if one could go straight from the MAFFT tree output into TreeCluster unless I'm making a mistake somewhere in TreeCluster (or MAFFT which I realize you don't support... but it is widely used in the field so I'd accept any suggestions). Thanks!

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/treeswift/Tree.py", line 1443, in read_tree_newick
while parse_label or ts[i] not in {':', ',', ';', ')', '['}:
IndexError: string index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/TreeCluster.py", line 605, in
trees.append(read_tree_newick(l))
File "/usr/local/lib/python3.10/dist-packages/treeswift/Tree.py", line 1452, in read_tree_newick
raise RuntimeError(f"Failed to parse string as Newick: {ts}")
RuntimeError: Failed to parse string as Newick: 1_Sequence_0001

Definition of branch support

Hi Dr.Moshiri,

Thanks for developing this tool. It's very useful for our work (defining clusters). However, the definition of branch support threshold (-s) is not clear to me. You described that "no pair of leaves in any cluster can be connected by branches with support less than or equal to s". Does this mean every branch (or node) within the defined cluster must have the support more than s? Or it means only ancestral branch (or node) of the defined cluster must have the support more than s. For our study, we need only the ancestral node to have a high support.

Thank you in advance for clarifying,

Nakarin

arg_max optimisation

Hi,
thank you for the wonderful program,
I sucsessfully use its algorithms rewritten in C++ in my work.
I would like to draw your attention to the fact that the argmax_clusters function works very inefficiently with methods "xxx_clade". Each such method, at first, calculates the required distances when traversing the tree without using the "threshold" parameter, and then the another algorithm (the same in all "xxx_clade" methods) performs clustering with this parameter.
If you separate the distance calculation (one times) and clustering in cycle with "threshold" stepping of these methods in the argmax_clusters, you will get at least double/triple acceleration of this function, and when using the min_clusters_threshold_med_clade method, this acceleration is even difficult to estimate.
If you find time for such optimization, users will be very grateful to you.

Instalation issue

Hello, sorry for the inconvenience but I'm having issues installing TreeCluster. After running "sudo pip install treecluster" I'm getting the following error.

File "/usr/local/bin/TreeCluster.py", line 213
    print("%s;" % root.newick(), file=stderr)
                                     ^
SyntaxError: invalid syntax

I'm using phython Python 3.8.6 | packaged by conda-forge.
Thanks

TreeCluster : is it only for binary trees ?

Hey,
I was wondering if the algorithm is only usable for binary trees, and if so is there a possibility to generalize it to all kind of trees (or at least rooted ones).
Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.