agupta0905 / agbsvdquartets

Objective-C 6.31% C 1.40% HTML 4.70% Shell 28.94% Python 24.64% Perl 34.01%

agbsvdquartets's Introduction

This repository contains all scripts used in our paper, A Comparative Study of SVDquartets and Other Coalescent-Based Species Tree Estimation Methods.

## Simulated Datasets

The original, unprocessed simulated datasets used in this study were obtained from the following sources:

The 11-taxon datasets M1, M2, M3, and M4, with varying levels of ILS, were obtained from http://www.cs.utexas.edu/~bayzid/files/10-taxon.tar.bz
The 15-taxon dataset with a pectinate model species tree was obtained from http://www.cs.utexas.edu/users/phylo/datasets/weighted-binning-datasets.html under the "15-taxon datasets" link.
The 37-taxon mammalian simulated dataset with AD=18% was obtained from https://www.ideals.illinois.edu/handle/2142/55319 under the link "Sequence Alignments and Trees for Mammalian 2X for Mirarab et al."

## Linux Executables for Species Tree Estimation Methods

The Linux executables for ASTRAL [1], NJst [2], FastTree [3], and RAxML [4] are in the phylogenetic_tools folder. The Linux executable for PAUP* [5] is in the src-pipelines folder.

## Scripts for Running Species Tree Estimation Methods on the Simulated Datasets

The files in each "pipeline-" folder are a combination of shell scripts and qsub scripts for the UIUC campus cluster.

agbsvdquartets's Issues

15-taxon dataset processing

The 15-taxon dataset (caterpillar model species tree generated with MCcoal) (http://www.cs.utexas.edu/~phylo/datasets/binning-response/) has gene alignments in fasta format. We need to convert these to phylip format before running the other processing scripts. Ruth, I think you mentioned dendropy has this functionality?
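For reference, the conversion itself is small enough to sketch without any library. The following is a minimal, dependency-free illustration and not the dendropy route mentioned above; the function names are hypothetical, and it assumes each fasta file holds one already-aligned gene and that relaxed sequential phylip (one `label sequence` row per taxon) is acceptable to the downstream scripts.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (label, sequence) pairs."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            header = line[1:].split()
            # Keep only the first token of the header as the taxon label.
            label, chunks = (header[0] if header else ""), []
        elif label is not None:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def to_phylip(records):
    """Render aligned records as relaxed sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned to a common length")
    lines = ["%d %d" % (len(records), lengths.pop())]
    lines.extend("%s %s" % (label, seq) for label, seq in records)
    return "\n".join(lines) + "\n"


def convert_file(fasta_path, phylip_path):
    """Convert one FASTA alignment file to a PHYLIP file (hypothetical helper)."""
    with open(fasta_path) as handle:
        records = read_fasta(handle.read())
    with open(phylip_path, "w") as handle:
        handle.write(to_phylip(records))
```

Something like `convert_file("1.fasta", "1.phy")` could then be looped over the files in each replicate folder. If the other processing scripts expect strict 10-character phylip names, the row formatting would need to change.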

Need to run tests on full dataset

Jed, can you run the tests on the full dataset on the cluster? Just make sure that the model folders inside the data folder have the full data.

missing wqmc species trees

no tree outputted for species 13 svd_wqmc_s_tree_25_10_F.trees
cat: svd_wqmc_s_tree_25_10_F.trees: No such file or directory
no tree outputted for species 23 svd_wqmc_s_tree_25_1_D.trees
cat: svd_wqmc_s_tree_25_1_D.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_C.trees
cat: svd_wqmc_s_tree_25_1_C.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_D.trees
cat: svd_wqmc_s_tree_25_1_D.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_E.trees
cat: svd_wqmc_s_tree_25_1_E.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_F.trees
cat: svd_wqmc_s_tree_25_1_F.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_1_G.trees
cat: svd_wqmc_s_tree_25_1_G.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_5_F.trees
cat: svd_wqmc_s_tree_25_5_F.trees: No such file or directory
no tree outputted for species 27 svd_wqmc_s_tree_25_10_D.trees
cat: svd_wqmc_s_tree_25_10_D.trees: No such file or directory

converting svd scores to weights. scheme H

Sorry, I posted on the wrong github page.

Shashank and I were talking about the svd scores to weights conversion, and Shashank suggested the following scheme:
Let R=1/e^(svd_max-svd_med). Then
weight(q_min)=e^(R*(1-svd_min/svd_overall_max))
weight(q_med)=e^(R*(1-svd_med/svd_overall_max))
weight(q_max)=e^(R*(1-svd_max/svd_overall_max))
where svd_overall_max is the global maximum svd score.
This will fix problems with svd_min = 0 and prevent the weights from being too large or too small. Let's call this new scheme H. In the python file we will need to build a list of all svd scores and compute the max of this list before assigning any weights to the quartets.
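A minimal sketch of scheme H as described above, in Python. The function names and the input layout (one triple of svd scores per four-taxon subset) are assumptions made for illustration and are not taken from the repository's scripts. The sketch also assumes scores are nonnegative and that at least one score is positive, since the formula divides by svd_overall_max.

```python
import math


def scheme_h_weights(triple, overall_max):
    """Weights for the three quartets on one four-taxon subset (scheme H)."""
    if overall_max <= 0:
        # The formula divides by the global maximum; this case is left undecided.
        raise ValueError("svd_overall_max must be positive")
    s_min, s_med, s_max = sorted(triple)
    r = math.exp(-(s_max - s_med))  # R = 1 / e^(svd_max - svd_med)
    # Weights are returned in the same order as the input scores.
    return [math.exp(r * (1.0 - s / overall_max)) for s in triple]


def weigh_all(score_triples):
    """Compute the global maximum first, then assign weights to every triple."""
    overall_max = max(s for triple in score_triples for s in triple)
    return [scheme_h_weights(triple, overall_max) for triple in score_triples]
```

Under these assumptions R lies in (0, 1] and 1 - svd/svd_overall_max lies in [0, 1], so every weight falls between 1 and e, which is the bounded range the scheme is meant to give.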

converting SVD scores to WQMC scores errors

Some files are giving errors (division by zero) while converting scores from SVD to WQMC, especially for small k. We need to look into that.

problem running taxon-relabeler on 15-taxon dataset

Hey guys, so I've converted the fasta files in the 15-taxon dataset using a script Ruth pointed out to me online. I'm now trying to sample and relabel the alignments, but I get the following errors when I run the script run_pipeline_cluster.sh:

Traceback (most recent call last):
  File "src-pipeline/taxon_relabeler.py", line 72, in <module>
    processFilesTaxon(inp_folder,out_folder,dict_file)
  File "src-pipeline/taxon_relabeler.py", line 57, in processFilesTaxon
    new_dna_string+=new_taxon+'\t'+str(dna[old_taxon])+'\n'
  File "/usr/local/python/2.7.8/lib/python2.7/site-packages/dendropy/dataobject/char.py", line 1168, in __getitem__
    raise KeyError(label)
KeyError: '0'
cp: cannot stat `/home/jedchou1/scratch/AGBsvdquartets/data/sim1/S_relabeled_tree.trees': No such file or directory

The format of the 15-taxon dataset is as follows:
On the cluster, I have a folder scratch/AGBsvdquartets/data.
Inside this data folder are 10 folders called sim1,sim2,...,sim10 as well as a file called taxa_dict.txt which just has the taxa names A,B,C,...,O and their corresponding numbers 1,2,3...,15 separated by a tab character.
Inside each sim folder are 1000 fasta alignments (1.fasta,2.fasta,...), 1000 phylip alignments (1.phy,2.phy,...), and a file called s_tree.trees which contains the true species tree in newick format.

Do you guys know what's causing this error?
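The `KeyError: '0'` says the alignment has no taxon labeled `0`. One way to narrow this down is to compare the entries of taxa_dict.txt against the labels actually present in an alignment. The sketch below is a hypothetical diagnostic, not part of taxon_relabeler.py; it assumes the dictionary rows are `name<TAB>number` and that the phylip files are sequential, with the label as the first token of each row.

```python
def load_taxa_dict(text):
    """Parse tab-separated 'name<TAB>number' rows into a list of pairs."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            name, number = line.split("\t")
            rows.append((name.strip(), number.strip()))
    return rows


def phylip_labels(text):
    """Return taxon labels from sequential PHYLIP text (first token per row)."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Skip the header row that holds the taxon and site counts.
    return [line.split()[0] for line in lines[1:]]


def unmatched(rows, labels):
    """For each dictionary column, list the entries absent from the alignment."""
    present = set(labels)
    return {
        "names": [name for name, _ in rows if name not in present],
        "numbers": [number for _, number in rows if number not in present],
    }
```

If one column matches the alignment labels completely and `0` appears in neither column, the label `0` is presumably being produced inside the relabeling script itself (for example by zero-based numbering), which would be the next place to look.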

Need to replace reciprocal score conversion before weekend

Dear SVD Group:

Shashank called me today to talk about preparing for his class presentation, and his questions about the theory behind this method led us to some important realizations that affect corrections we should make to the score conversions before we run things this weekend.

In particular, we should not use the function 1/x to convert the SVD scores for wQMC, because the theory predicts that if the frequencies in the data are close enough to the true probability distribution, the SVD score for the best split will be zero! That is, x = 0 is something we should expect to deal with.

So, exp(-x) or exp(-x^p) for some p > 1 (which is 1 when x = 0 and is decreasing) would be a good model function to replace 1/x with before the experiments you all were planning to run over the weekend take place.

I believe Jed had mentioned trying this earlier as a conversion and it got lost in the shuffle.

Another thing to expect is that if the frequencies in the gene data are not good enough approximations to the true probability distribution, it may be the case that all three scores are zero. We need to decide what to do in this case. Shashank suggested that perhaps we could give equal weight to all but reduce it somehow to reflect that we had no confidence in any of them.

In the case where there is only one zero and two nonzero scores, our mapping function of course should give the highest value to the quartet with the zero score.
If all three scores are zero, the approximation of frequencies to probabilities is not very good, and we should give a very low weight to all three quartets on a four-taxon subset before feeding them to wQMC.
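The cases above can be sketched in Python as follows. This is only an illustration: the function name, the `tol` threshold for treating a score as zero, and the `no_confidence_weight` value are all hypothetical, since the actual low weight for the all-zero case still needs to be decided.

```python
import math


def scores_to_weights(scores, p=1.0, tol=0.0, no_confidence_weight=1e-3):
    """Map the three SVD scores on a four-taxon subset to wQMC weights.

    Uses exp(-x^p) instead of 1/x, so a score of zero is well defined and
    receives the highest weight (1.0).
    """
    if any(s < 0 for s in scores):
        raise ValueError("SVD scores are expected to be nonnegative")
    if all(s <= tol for s in scores):
        # All scores are zero: no confidence in any quartet, so give the
        # same low weight to each (placeholder value, to be decided).
        return [no_confidence_weight] * len(scores)
    return [math.exp(-(s ** p)) for s in scores]
```

With one zero and two nonzero scores, the zero-score quartet gets weight 1.0 and the others get strictly smaller weights, and no division is involved at any point.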

Please do not hesitate to contact me today or tomorrow about this, and I am going to post this as an issue to Ashu's github too.

Best
Ruth
