agbsvdquartets's Introduction

This repository contains all scripts used in our paper, A Comparative Study of SVDquartets and Other Coalescent-Based Species Tree Estimation Methods.

##Simulated Datasets

The original, unprocessed simulated datasets used in this study were obtained from the following sources:

  1. The 11-taxon datasets M1, M2, M3, and M4, with varying levels of ILS, were obtained from http://www.cs.utexas.edu/~bayzid/files/10-taxon.tar.bz

  2. The 15-taxon dataset with a pectinate model species tree was obtained from http://www.cs.utexas.edu/users/phylo/datasets/weighted-binning-datasets.html under the "15-taxon datasets" link.

  3. The 37-taxon mammalian simulated dataset with AD=18% was obtained from https://www.ideals.illinois.edu/handle/2142/55319 under the link "Sequence Alignments and Trees for Mammalian 2X for Mirarab et. al."

##Linux Executables for Species Tree Estimation Methods

The Linux executables for ASTRAL [1], NJst [2], FastTree [3], and RAxML [4] are in the phylogenetic_tools folder. The Linux executable for PAUP* [5] is in the src-pipelines folder.

##Scripts for Running Species Tree Estimation Methods on the Simulated Datasets

The files in each "pipeline-" folder are a combination of shell scripts and qsub scripts for the UIUC campus cluster.

agbsvdquartets's People

Contributors

agupta0905, syadu1988, j-chou

agbsvdquartets's Issues

missing wqmc species trees

no tree outputted for species 13 svd_wqmc_s_tree_25_10_F.trees
cat: svd_wqmc_s_tree_25_10_F.trees: No such file or directory
no tree outputted for species 23 svd_wqmc_s_tree_25_1_D.trees
cat: svd_wqmc_s_tree_25_1_D.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_C.trees
cat: svd_wqmc_s_tree_25_1_C.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_D.trees
cat: svd_wqmc_s_tree_25_1_D.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_E.trees
cat: svd_wqmc_s_tree_25_1_E.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_F.trees
cat: svd_wqmc_s_tree_25_1_F.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_G.trees
cat: svd_wqmc_s_tree_25_1_G.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_5_F.trees
cat: svd_wqmc_s_tree_25_5_F.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_10_D.trees
cat: svd_wqmc_s_tree_25_10_D.trees: No such file or directory

converting svd scores to weights. scheme H

Sorry, I posted on the wrong github page.

Shashank and I were talking about the SVD-scores-to-weights conversion, and Shashank suggested the following scheme:
Let R = 1/e^(svd_max - svd_med). Then
weight(q_min) = e^(R*(1 - svd_min/svd_overall_max))
weight(q_med) = e^(R*(1 - svd_med/svd_overall_max))
weight(q_max) = e^(R*(1 - svd_max/svd_overall_max))
where svd_overall_max is the global maximum SVD score.
This will fix the problems with svd_min = 0 and prevent the weights from being too large or too small. Let's call this new scheme H. In the Python file, we will need a list of all SVD scores and must compute the max of this list before assigning any weights to the quartets.
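
For reference, scheme H as described above could be sketched in Python roughly as follows. This is only an illustration, not the code in the repository: the function names `scheme_h_weights` and `weigh_all` are hypothetical, and the sketch assumes each four-taxon subset comes with exactly three non-negative SVD scores and that the global maximum is positive.

```python
import math


def scheme_h_weights(scores, overall_max):
    """Scheme-H weights for the three quartet scores of one four-taxon subset.

    Weights are returned in the order (q_min, q_med, q_max).
    """
    if overall_max <= 0:
        # Assumption: the scheme is undefined when every score is zero.
        raise ValueError("overall_max must be positive")
    svd_min, svd_med, svd_max = sorted(scores)
    # R = 1 / e^(svd_max - svd_med)
    r = 1.0 / math.exp(svd_max - svd_med)
    return tuple(
        math.exp(r * (1.0 - s / overall_max))
        for s in (svd_min, svd_med, svd_max)
    )


def weigh_all(score_triples):
    """Compute the global maximum first, then weight every subset."""
    overall_max = max(s for triple in score_triples for s in triple)
    return [scheme_h_weights(triple, overall_max) for triple in score_triples]
```

Under those assumptions R lies in (0, 1] and each factor 1 - svd/svd_overall_max lies in [0, 1], so every weight falls between 1 and e, which matches the goal of keeping the weights from getting too large or too small.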

problem running taxon-relabeler on 15-taxon dataset

Hey guys, so I've converted the fasta files in the 15-taxon dataset using a script Ruth pointed out to me online. I'm now trying to sample and relabel the alignments, but I get the following errors when I run the script run_pipeline_cluster.sh:

Traceback (most recent call last):
  File "src-pipeline/taxon_relabeler.py", line 72, in <module>
    processFilesTaxon(inp_folder,out_folder,dict_file)
  File "src-pipeline/taxon_relabeler.py", line 57, in processFilesTaxon
    new_dna_string+=new_taxon+'\t'+str(dna[old_taxon])+'\n'
  File "/usr/local/python/2.7.8/lib/python2.7/site-packages/dendropy/dataobject/char.py", line 1168, in __getitem__
    raise KeyError(label)
KeyError: '0'
cp: cannot stat `/home/jedchou1/scratch/AGBsvdquartets/data/sim1/S_relabeled_tree.trees': No such file or directory

The format of the 15-taxon dataset is as follows:
On the cluster, I have a folder scratch/AGBsvdquartets/data.
Inside this data folder are 10 folders called sim1,sim2,...,sim10 as well as a file called taxa_dict.txt which just has the taxa names A,B,C,...,O and their corresponding numbers 1,2,3...,15 separated by a tab character.
Inside each sim folder are 1000 fasta alignments (1.fasta,2.fasta,...), 1000 phylip alignments (1.phy,2.phy,...), and a file called s_tree.trees which contains the true species tree in newick format.

Do you guys know what's causing this error?
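
One way to narrow down a KeyError like this is to check which labels from taxa_dict.txt actually appear as sequence names in an alignment. The sketch below is not part of the repository's scripts: the helper names are hypothetical, it reads the FASTA headers directly instead of going through DendroPy, and it assumes the dictionary file has two tab-separated columns as described above.

```python
def read_fasta_labels(path):
    """Return the sequence labels (header lines without '>') of a FASTA file."""
    labels = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                labels.append(line[1:].strip())
    return labels


def read_dict_columns(path):
    """Return the two tab-separated columns of the taxon dictionary as lists."""
    first, second = [], []
    with open(path) as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                first.append(parts[0])
                second.append(parts[1])
    return first, second


def missing_labels(expected, fasta_path):
    """List the expected labels that do not occur in the alignment."""
    present = set(read_fasta_labels(fasta_path))
    return [label for label in expected if label not in present]
```

Calling `missing_labels` once with each column of the dictionary shows which column, if either, matches the names in the alignment. A label such as '0' that appears in neither column and not in the alignment would point at the relabeling script rather than at the data.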

Need to replace reciprocal score conversion before weekend

Dear SVD Group:

Shashank called me today to talk about preparing for his class presentation, and his questions about the theory behind this method led us to some important realizations that affect the corrections we should make to the score conversions before we run things this weekend.

In particular, we should not use the function 1/x to convert the SVD scores for wQMC, because the theory predicts that if the frequencies in the data are close enough to the true probability distribution, the SVD score for the best split will be zero! I.e., x = 0 is something we should expect to deal with.

So exp(-x), or exp(-x^p) for some p > 1 (either of which equals 1 when x = 0 and is decreasing in x), would be a good model function to replace 1/x with before the experiments you all were planning to run over the weekend take place.

I believe Jed had mentioned trying this earlier as a conversion and it got lost in the shuffle.

Another thing to expect is that if the frequencies in the gene data are not good enough approximations to the true probability distribution, it may be the case that all three scores are zero. We need to decide what to do in this case. Shashank suggested that perhaps we could give equal weight to all but reduce it somehow to reflect that we had no confidence in any of them.

  1. In the case where there is only one zero score and two nonzero scores, our mapping function should of course give the highest value to the quartet with the zero score.
  2. If all three scores are zero, the approximation of frequencies to probabilities is not very good, and we should give a very low weight to all three quartets on a four-taxon subset before feeding them to wQMC.
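
A minimal Python sketch of the conversion and the two cases above might look like the following. The function name `exp_weights` and the default `low_weight=1e-3` are assumptions made for illustration only, since no specific low weight has been agreed on.

```python
import math


def exp_weights(scores, p=1.0, low_weight=1e-3):
    """Map the three SVD scores of a four-taxon subset to wQMC weights.

    Uses exp(-x^p), which equals 1 at x = 0 and decreases as x grows.
    """
    # Case 2: all scores are zero, so no quartet is trusted. Every quartet
    # gets the same small weight. `low_weight` is a placeholder value.
    # Real data may need a tolerance instead of an exact comparison.
    if all(s == 0 for s in scores):
        return [low_weight] * len(scores)
    # Case 1 and the general case: a zero score maps to the maximum weight 1.
    return [math.exp(-(s ** p)) for s in scores]
```

For example, `exp_weights([0.0, 1.0, 2.0])` gives the zero-score quartet a weight of 1 and the other two strictly smaller weights, while `exp_weights([0.0, 0.0, 0.0])` returns three equal low weights.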

Please do not hesitate to contact me today or tomorrow about this, and I am going to post this as an issue to Ashu's github too.

Best
Ruth
