willpearse / phylogenerator
Automated Phylogeny Generation for Ecologists
License: Other
Just making sure I can run it ahead of the workshop...
I can't believe it's been over 3 years since I last looked at this! How time flies!
ross@ross-envy:~/workspace/pearsepg/phyloGenerator-master$ ./setupLinux.py /home/ross/workspace/pearsepg/BEASTv1.8.4/bin /home/ross/workspace/pearsepg/prank/bin/ /home/ross/workspace/pearsepg/pathd8 /home/ross/workspace/pearsepg/phylocom-4.2/src /home/ross/workspace/pearsepg/metal-linux64-1.1 /home/ross/workspace/pearsepg/trimAl/source /home/ross/workspace/pearsepg/clustalofolder/ /home/ross/workspace/pearsepg/standard-RAxML-master
Linux configuration script for phyloGenerator
Pass, as additional command line arguments, the path where you've downloaded all your files
and *which folder contains BEAST*
e.g., './setupLinux.py /home/will/phyloGenerator /home/will/phyloGenerator/BEAST\ v1.7.4'
Make sure all programs are executable - if unsure, make them so
e.g., 'chmod +x NAMEOFPROGRAM'
The resulting 'requires' folder must contain only output from this script
Do not leave source code from the programs phyloGenerator uses in the same folder
as phyloGenerator.py, or (for safety) in your output 'working directory'
This will cause obscure-looking errors from phyloGenerator!
Checking and configuring external programs
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/mafft': File exists
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/muscle': File exists
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/prank': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/clustalo': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/metal': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/trimal': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/phylomatic': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/PATHd8': File exists
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/raxml': Permission denied
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/beast': File exists
Checking Python libraries
ln: failed to create symbolic link '/home/ross/workspace/pearsepg/phyloGenerator-master/requires/treeannotator': File exists
CONGRATULATIONS!
phyloGenerator is setup. You should now be able to run it by typing './phyloGenerator.py'
ross@ross-envy:~/workspace/pearsepg/phyloGenerator-master$ ./phyloGenerator.py
bash: ./phyloGenerator.py: Permission denied
ross@ross-envy:~/workspace/pearsepg/phyloGenerator-master$ chmod +x phyloGenerator.py
ross@ross-envy:~/workspace/pearsepg/phyloGenerator-master$ ./phyloGenerator.py
Traceback (most recent call last):
File "./phyloGenerator.py", line 33, in <module>
import dendropy#To drop tips...
ImportError: No module named dendropy
ross@ross-envy:~/workspace/pearsepg/phyloGenerator-master$ ./phyloGenerator.py
Traceback (most recent call last):
File "./phyloGenerator.py", line 33, in <module>
import dendropy#To drop tips...
ImportError: No module named dendropy
I keep getting stuck at this step no matter what I try. My internet was definitely working all the time throughout this:
ross@ross-envy:~/workspace/pearsepg$ ./phyloGenerator-master/phyloGenerator.py
Welcome to phyloGenerator! Let's make a phylogeny!
---Please go to http://willpearse.github.com/phyloGenerator for help
---Written by Will Pearse ([email protected])
This program is easier to use with a wider console window
Mac/Linux: Drag the edge of your terminal window with the mouse
PC: Right click the command prompt icon, select properties,
click the 'layout' tab, and increase 'screen buffer'
and 'window' widths to at least '160'
When downloading sequence data, you will see warnings relating to
'missing DTD files. Do not be alarmed; this is normal, and will
have no effect on your output.
Please input a 'stem' name to act as a prefix to all output (e.g., 'stemName_phylogeny.tre')
Stem name: dog
Please input an *existing* directory for all your output
(hit enter to use /home/ross/workspace/pearsepg
Working directory (/home/ross/workspace/pearsepg):
Please enter the gene(s) you want to use (e.g., 'COI' for cytochrome oxidase one')
Specify 'aliases' (alternate names) by listing them after the main name using '-'
Gene names may contain spaces, or (as with the command line) they can be replaced with '_'
e.g., COI-this_is_an_alis-this is also an alias
If you wish to use the defaults for your taxa, please enter 'plant', 'invertebrate', or 'vertebrate' instead
Each gene on a separate line, and an empty line to continue
plant
DNA INPUT
If you already have DNA sequences in a FASTA file, please enter its location
If you have more than one set of sequences, please separate the file locations with commas
Otherwise, hit enter to continue
File locations:
No DNA loaded
DNA DOWNLOAD
Please enter the location of the list of species for which you want to build a phylogeny
Each species must be on a new line
/home/ross/workspace/pearsepg/phyloGenerator-master/chinaspp.txt
6098 species loaded.
Please enter a valid email address to download sequence data from GenBank
Email: [email protected]
To use the referenceDownload method, enter locations of sequence files (on separate lines), finishing with an empty line.
Just hit enter to perform a standard search (this is probably the option you're looking for).
refDownload:
Searching for: Acanthus ebracteatus
!!!Server error checking (((Acanthus ebracteatus[Organism]) AND rbcL[Gene]) NOT partial [Title]) NOT genome [Title] - retrying...
!!!!!!Unreachable. Returning nothing.
Traceback (most recent call last):
File "./phyloGenerator-master/phyloGenerator.py", line 4122, in <module>
main()
File "./phyloGenerator-master/phyloGenerator.py", line 4033, in main
currentState.loadGenBank()
File "./phyloGenerator-master/phyloGenerator.py", line 2333, in loadGenBank
self.sequences, self.genes = findGenes(self.speciesNames, self.genes, seqChoice=self.seqChoice, verbose=True, download=True, thorough=True, targetNoGenes=self.nGenes, spacer=self.spacer, delay=self.delay, taxonIDs=self.taxonIDs)
File "./phyloGenerator-master/phyloGenerator.py", line 476, in findGenes
sequence, _ = sequenceDownload(speciesList[i], geneNames[k], noSeqs=noSeqs, includePartial=includePartial, includeGenome=includeGenome, seqChoice=seqChoice, download=download, thorough=thorough, retMax=retMax, taxonID=taxonIDs)
File "./phyloGenerator-master/phyloGenerator.py", line 339, in sequenceDownload
seq = dwnSeq(includeGenome=False, includePartial=False, gene=gene)
File "./phyloGenerator-master/phyloGenerator.py", line 332, in dwnSeq
if int(firstSearch['Count']):
TypeError: tuple indices must be integers, not str
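The final frame suggests that `firstSearch` was a tuple rather than a dictionary-like search result when `['Count']` was read. One possible cause, and this is only an assumption since the code around line 332 is not shown here, is that the search helper returns an empty tuple after printing "Unreachable. Returning nothing." A minimal sketch of a defensive count check follows; `hit_count` is a hypothetical helper, not a function in phyloGenerator.

```python
def hit_count(search_result):
    """Return the number of hits in a search result, or 0 if the search failed.

    Assumption: a successful search yields a mapping with a 'Count' entry,
    and a failed search yields something else (for example an empty tuple).
    """
    try:
        return int(search_result["Count"])
    except (TypeError, KeyError, ValueError):
        # TypeError: the result is a tuple or None and cannot be indexed by name.
        # KeyError: the mapping has no 'Count' entry.
        # ValueError: 'Count' is present but is not a number.
        return 0
```

With a guard like this, `if hit_count(firstSearch):` would treat an unreachable server as "no sequences found" instead of crashing, at the cost of hiding the network failure from the caller.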
Helps with sub-species searches for some users
Something weird happened when a Mac user went integratedBootstrap --> PATHD8 (and potentially with BEAST as well). It sounds like a [] issue with BioPython, but I need to check.
I got the following from a reviewer:
The command-line interface, while fairly well-designed, is still a potential problem for some users. An equivalent web browser-based interface would help and should be feasible: use Python's built-in CGI server (SimpleHTTPServer) to serve pages locally, and use the webbrowser module to load 'http://localhost' when launching the program. (Note that I'm not suggesting the authors host a public web server themselves.) Assuming you intend to maintain and improve phyloGenerator, I encourage you to look into doing this for a future release.
If you have strong views about the terminal-based interface, please let me know. I'm going to try and implement this, although it may well be at the expense of the terminal-based interface (i.e., I might not keep both going).
Cheers,
Will
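A rough sketch of what the reviewer describes, for anyone curious how small the starting point is. It assumes Python 3, where the server lives in `http.server` (the reviewer's `SimpleHTTPServer` is the Python 2 name), and the page content, `StartPage`, `start_local_ui` and `launch` are all placeholders rather than anything in phyloGenerator.

```python
import threading
import webbrowser
from http.server import BaseHTTPRequestHandler, HTTPServer


class StartPage(BaseHTTPRequestHandler):
    """Serve one placeholder page; a real interface would render forms here."""

    def do_GET(self):
        body = b"<html><body><h1>phyloGenerator</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def start_local_ui(port=0):
    """Start the server on localhost only; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), StartPage)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True  # do not keep the program alive just for the server
    thread.start()
    return server


def launch():
    """Start the local server and open it in the user's default browser."""
    server = start_local_ui()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    webbrowser.open(url)
    return server
```

Binding to 127.0.0.1 keeps the page private to the user's own machine, which matches the reviewer's note that nobody is being asked to host a public server. Call `server.shutdown()` when the session ends.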
The read-me file currently says:
Install Python >=2.6; Numpy and SciPy for Python; Biopython >=2.5.
The current version of Biopython is 1.60 (one, sixty), so something is wrong with that.
Also, having looked at the code, it will not run under Python 3 as it stands; for instance, it uses print statements. Saying Python >=2.6 is therefore potentially confusing, as some users might try it under Python 3. I would suggest saying "install Python 2.6 or 2.7" (Python 2.7 is the final Python 2.x release).
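Beyond fixing the readme, the script could refuse to start under an unsupported interpreter. This is a sketch only: the supported set below is taken from the suggestion above, and `is_supported` and `require_supported_python` are hypothetical names.

```python
import sys

# Assumption: only the versions suggested above are supported.
SUPPORTED = ((2, 6), (2, 7))


def is_supported(version_info=None):
    """Return True if the (major, minor) version is in the supported set."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED


def require_supported_python():
    """Exit with a readable message instead of a SyntaxError further down."""
    if not is_supported():
        found = "%d.%d" % tuple(sys.version_info[:2])
        sys.exit("This program needs Python 2.6 or 2.7; found Python " + found)
```

Note that a check like this only helps if the file it lives in parses under Python 3, so it would need to sit in a small launcher that contains no print statements.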
A user has requested this; I will try to get round to it!
It would be nice if the program attempted to save something on exit if there's an error. Admittedly, there shouldn't be an error (...!...) but it would still be nice if it tried.
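One way this could look, as a sketch under assumptions: the program's state is picklable, and each stage is wrapped by a helper such as the hypothetical `run_with_crash_dump` below. `BaseException` is caught so that Ctrl-C during a long external run is covered too.

```python
import pickle


def run_with_crash_dump(step, state, dump_path):
    """Run step(state); if it fails for any reason, save state before re-raising.

    Assumption: state can be pickled. The original exception is re-raised
    unchanged so the usual traceback is still shown.
    """
    try:
        return step(state)
    except BaseException:
        with open(dump_path, "wb") as handle:
            pickle.dump(state, handle)
        raise
```

The dump contains whatever the state held at the moment of failure, so it could later be reloaded with `pickle.load` to resume or to attach to a bug report.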
...the downloads are different python files (minorly), and integrate the build python scripts
It would be nice if pG came with APGIII built into it, or at least a reference on the website...
...would be nice if pG wrote out on crashes as much as it could (e.g., after someone escapes a prank run...)
I keep getting error messages when attempting to unzip the zip file. Wonder if you can try making another one, or making a tarball instead?
...can hang. Sometimes this is GenBank, but sometimes it's pG and there's definitely a more intelligent way of handling this...
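A common pattern for the GenBank side is a bounded retry with an increasing pause. The sketch below is generic: `call_with_retries` is a hypothetical helper, and it only helps with calls that fail, so a call that truly hangs would also need a timeout of its own.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() up to `attempts` times, pausing longer after each failure.

    Only IOError/OSError (which cover most network failures) are retried;
    the last failure is re-raised so the caller can report it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (IOError, OSError):
            if attempt == attempts:
                raise
            sleep(wait)
            wait *= backoff
```

The `sleep` parameter exists so the pauses can be replaced in tests; in normal use the default is fine.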
...can't hurt! Thanks Rampal!
...so change the readme!
pG will attempt to align empty genes (i.e., genes with no DNA data). It shouldn't do this, as (quite legitimately) alignment programs don't like it!
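A sketch of the filter, assuming sequences are held as a list of per-species rows indexed by gene (the `self.sequences[i][k]` layout that appears in tracebacks elsewhere on this page). `genes_with_data` is a hypothetical name.

```python
def genes_with_data(sequences, gene_names):
    """Return the gene names for which at least one species has a sequence.

    Assumption: sequences[i][k] is the sequence for species i and gene k,
    with None or an empty value meaning "no data".
    """
    keep = []
    for k, gene in enumerate(gene_names):
        # Short rows are tolerated: a missing column counts as no data.
        if any(len(row) > k and row[k] for row in sequences):
            keep.append(gene)
    return keep
```

Only the genes returned here would then be passed to the alignment programs.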
e.g., < 1000 rbcL
in trim mode
HT to Ben Warren (again!)
...doesn't seem to have matK on a Mac?
A user has reported that the order of species placed inside a large polytomy in their phylogeny is reflected in the branching order of those species in their output phylogeny.
I'm currently investigating; I'm not sure what could be causing their error, but would be grateful if anyone experiencing similar issues could send me their input files.
Thanks very much,
Will
...likely by stripping out scipy as you only use it for ~2 things, none of which are that key/hard to write yourself!
COI is not always stored as COI in GenBank - need a way to have aliases of genes when searching/trimming...
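A sketch of how aliases might feed into the search term. It assumes the 'name-alias-alias' input style with '_' standing in for spaces, and it ORs the names together in the `[Gene]` field of an Entrez query; `parse_gene_spec` and `gene_clause` are hypothetical helpers, and the alias names in the usage note are just examples.

```python
def parse_gene_spec(spec):
    """Split 'COI-cox1-cytochrome_oxidase_I' into a main name and aliases."""
    names = [part.replace("_", " ").strip() for part in spec.split("-")]
    names = [name for name in names if name]
    if not names:
        raise ValueError("empty gene specification")
    return names[0], names[1:]


def gene_clause(spec):
    """Build an OR-ed [Gene] clause covering the main name and every alias."""
    main, aliases = parse_gene_spec(spec)
    terms = []
    for name in [main] + aliases:
        if " " in name:
            name = '"%s"' % name  # quote multi-word names so they stay together
        terms.append("%s[Gene]" % name)
    return "(" + " OR ".join(terms) + ")"
```

For example, `gene_clause("COI-cox1")` gives `(COI[Gene] OR cox1[Gene])`, which could replace the single `rbcL[Gene]`-style term in the existing search string. A gene name that itself contains a hyphen would need a different separator.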
Using my own fasta file I get this error.
The names look like this:
Anacanthocoris_striicornis
Except one is like this:
Diaphorina_citri diaci_nymph_66660000040632
Not sure which it is failing on.
ERROR
........
Other modes: 'reload', 'trim', 'replace', 'merge'. Hit enter to continue.
DNA Editing (delete):
Traceback (most recent call last):
File "phyloGenerator.py", line 3730, in <module>
main()
File "phyloGenerator.py", line 3680, in main
currentState.renameSequences()
File "phyloGenerator.py", line 3267, in renameSequences
if self.sequences[i][k]:
IndexError: list index out of range
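Whether the header containing a space is what triggers this is not confirmed, but it is easy to check which names in a FASTA file differ from the rest. The sketch below uses plain string handling rather than any phyloGenerator or BioPython code, and `headers_with_whitespace` is a hypothetical helper.

```python
def headers_with_whitespace(fasta_text):
    """Return the FASTA header names that contain internal whitespace."""
    flagged = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if len(name.split()) > 1:
                flagged.append(name)
    return flagged
```

Running this over the input file and renaming whatever it reports (for instance, replacing the space with '_') would at least rule this explanation in or out.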
Some of your more technically inclined potential users would be concerned at the lack of unit tests in the repository. The provision of test data is a good first step - and could be the basis of a test suite.
Once you have a basic test script that returns zero on success and non-zero on error, it could be used for automated testing. If you are not already familiar with TravisCI and its excellent GitHub integration, I would suggest looking into it (http://travis-ci.org/). This would require the Linux binaries to be available as Debian/Ubuntu packages, or as simple-to-download 32-bit Linux binaries.
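As an illustration of the zero/non-zero convention, here is a minimal sketch built on the standard `unittest` module. The function under test, `read_species_list`, is a made-up stand-in; a real suite would import functions from phyloGenerator and use the bundled test data.

```python
import sys
import unittest


def read_species_list(text):
    """Hypothetical function under test: one species per non-blank line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


class SpeciesListTest(unittest.TestCase):
    def test_blank_lines_are_ignored(self):
        text = "Quercus robur\n\n  Acer campestre  \n"
        self.assertEqual(read_species_list(text),
                         ["Quercus robur", "Acer campestre"])


def run_tests():
    """Run the suite and return 0 on success, 1 on any failure or error."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SpeciesListTest)
    result = unittest.TextTestRunner(verbosity=1).run(suite)
    return 0 if result.wasSuccessful() else 1
```

A script ending in `sys.exit(run_tests())` then gives a continuous-integration service the exit status it needs.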
I'd like to use your program to obtain family and order classification for around 2500 bird species across most of the 143 families identified by Sibley & Ahlquist (1990). I already have classification to genus level but anything else time saving would be fantastic!
Any help would be great!
Alistair Baxter
When doing something big (like thorough downloading), it might be an idea to have an 'undo' button.
That would probably be hard to write, but you could have a 'backup' option where the internal state is deep-copied to new pG.sequence slots, and then you could 'revert' back to them as needed.
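The backup/revert idea can be sketched in a few lines with `copy.deepcopy`. The class below is illustrative only: `EditableState` and its attributes are assumptions, not the real pG state object.

```python
import copy


class EditableState(object):
    """Toy state holder with a single-level backup."""

    def __init__(self, sequences):
        self.sequences = sequences
        self._backup = None

    def backup(self):
        """Store an independent copy of the current sequences."""
        self._backup = copy.deepcopy(self.sequences)

    def revert(self):
        """Restore the last backup, leaving the backup itself untouched."""
        if self._backup is None:
            raise RuntimeError("no backup has been made")
        self.sequences = copy.deepcopy(self._backup)
```

Copying again on `revert` means the same backup can be restored more than once. For thousands of species the extra copy costs memory, which is the main trade-off of this approach.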
If a genus doesn't exist in NCBI, there's a chance the species name might be found somewhere else. This could lead to weird THOROUGH replacements (Eric's diatoms!)
Set a list of 'select this, then this' -type options, which would also make a nice tutorial
Hi Will,
Would it be possible to use taxon IDs, as they are more specific, instead of species names?
Thanks,
Dom
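For reference, NCBI's Entrez search accepts taxon IDs in the organism field as `txid<ID>[Organism]`, so the change to the search term could be small. A sketch, with `organism_term` as a hypothetical helper:

```python
def organism_term(species_name, taxon_id=None):
    """Build the organism part of a search term.

    Uses the NCBI taxon ID when one is given, otherwise the species name.
    """
    if taxon_id is not None:
        return "txid%d[Organism]" % int(taxon_id)
    return "%s[Organism]" % species_name
```

For example, `organism_term("Homo sapiens", 9606)` returns `txid9606[Organism]`, while omitting the ID falls back to the name-based term used at present.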
...Could potentially be a Newick format problem in BioPython, but a user (unchecked) has had issues with constraint trees using this
PATHd8 seems to be causing problems for some people - it's not running.
File "phyloGenerator.py", line 3790, in <module>
main()
File "phyloGenerator.py", line 3764, in main
currentState.rateSmooth()
File "phyloGenerator.py", line 3260, in rateSmooth
success = PATHd8()
File "phyloGenerator.py", line 3161, in PATHd8
self.smoothPhylogeny = rateSmooth(self.phylogeny, sequenceLength=length)
File "phyloGenerator.py", line 1686, in rateSmooth
with open(tempPATHd8Output, 'r') as tempFile:
IOError: [Errno 2] No such file or directory: 'tempPATHd8Output'
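The IOError only says that the output file is missing, which most likely means PATHd8 never ran or failed before writing anything. A check like the sketch below (the helper name `read_required_output` is hypothetical) would turn that into a message that points at the external program.

```python
import os


def read_required_output(path, program):
    """Read an external program's output file, or explain why it is missing."""
    if not os.path.isfile(path):
        raise RuntimeError(
            "%s did not produce %r. Check that %s is installed, is executable, "
            "and ran without errors." % (program, path, program)
        )
    with open(path, "r") as handle:
        return handle.read()
```

Called as `read_required_output('tempPATHd8Output', 'PATHd8')`, it behaves like the existing `open` when the file is there and fails with a clearer hint when it is not.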
Yes, my alignment was terrible... matk interspersed with rbcl willy-nilly but even with garbage in / garbage out it should still run through the pipeline, right? Any idea what might be happening here? Have I not installed RAxML correctly perhaps?
DNA ALIGNMENT
Choose one alignment method ('muscle', 'mafft', 'clustalo', 'prank'), or...
'everything' - all four and compare their outputs
'quick' - do only the first three
Return will use MAFFT; prank is very slow!
DNA Alignment (default - mafft):
Starting alignment...
...aligning gene no. 1
......with MAFFT
Alignment complete!
ALIGNMENT CHECKING
Gene: rbcL
ID  Alignment  Length  Med. Gaps  SD Gaps  Min-Max Gaps    Med. Gap Frac.  M-M Gap Frac.  Warn?
0   mafft      1764    800.0      266.41   249.0 - 1232.0  0.45            0.141-0.698    !!!!
'output' - write out alignments. I recommend you look at your alignment before continuing
'DNA' - return to DNA editing stage
'align' - return to alignment stage, discarding current alignments.
'trimal' - automatically trim your sequences using trimAl
'raxml=X' - run X RAxML runs for each alignment, and calculate the R-F distances between the trees and alignments (slow)
'metal' - calculate SSP distances between alignments using metal
'clustal-x2' - open the Clustal-X2 website to download this alignment viewer
TIPS:
*_If the column 'Warn?' has '!!!' in it, BEWARE! Your alignment likely has problems._
Bad sequences cause bad alignments. Be careful in the DNA check stage, and return there now if necessary
'output' your alignments, and open them in something like Clustal-X2. You will immediately see sequences that should be RELOADed or TRIMMed
When downloading Clustal, make sure you get the graphical Clustal-X2, not the command line version
Hit enter to continue and choose one final alignment per gene
Alignment Checking:output
...output written!
Alignment Checking:
CONSTRAINT TREE
I recommend you use a constraint tree with this program
'newick' - supply your own constraint tree
'phylomatic' - use Phylomatic to generate a tree
'taxonomy' - download the NCBI taxonomy for your species (does not generate a constraint tree)
Warning: Phylomatic can trim the end off species names, causing conflicts with phyloGenerator that are hard to detect. Rooted phylogenies are not valid constraints.
Otherwise, press enter to continue without a constraint tree.
TIPS:
If you choose 'taxonomy', it will be written out to your working directory now. Use that to make a constraint tree!
If you have access to a reference phylogeny, try using Phylomatic
A constraint tree makes your phylogeny much more likely to be right. Use one!
Constraint Method: taxonomy
Creating a 'taxonomy' for your species from GenBank
...lineages found!
Constraint Method:
...Continuing without constraint tree
PHYLOGENY BUILDING
You can either build a maximum likelihood tree ('raxml') or a Bayesian tree ('beast')
If unsure, hit enter to use RAxML - using BEAST safely will require some knowledge of phylogenetics
Phylogeny Building (default raxml):
...using RAxML...
RAXML:
'integratedBootstrap=X' - conduct X number of bootstraps and a thorough ML search in one run (!)
'restart=X' - conduct X number of full ML searches (!)
'partitions' - concatenate all genes into a single partition (not the default)
Specify multiple options with hyphens (e.g., 'restart=5-partitions'), but do not mix options marked with '(!)'
Hit enter to conduct one search
TIPS:
The integrated boostrap method is fast, and gives confidence intervals on your tree, and a value of 1000 is probably more than adequate for most trees
Phylogeny Building (RAxML - default 1 search): integratedBootstrap=1000
Traceback (most recent call last):
File "./phyloGenerator.py", line 3790, in <module>
main()
File "./phyloGenerator.py", line 3756, in main
currentState.phylogen()
File "./phyloGenerator.py", line 3122, in phylogen
raxmlSetup('')
File "./phyloGenerator.py", line 2984, in raxmlSetup
self.phylogeny = RAxML(align, method=self.phylogenyMethods+'localVersion', constraint=self.constraint, timeout=999999, partitions=partitions)
File "./phyloGenerator.py", line 904, in RAxML
os.remove(each)
OSError: [Errno 21] Is a directory: 'standard-RAxML'
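The last frame shows `os.remove(each)` being called on 'standard-RAxML', which is a directory, so the clean-up after RAxML appears to be picking up the RAxML source folder sitting in the working directory. Moving that folder elsewhere (as the setup script's warning about source code advises) should avoid the crash. On the code side, the clean-up could skip anything that is not a regular file. A sketch, with `remove_temp_files` as a hypothetical helper:

```python
import os


def remove_temp_files(paths):
    """Delete regular files (and symlinks); return the paths that were skipped.

    Directories and paths that no longer exist are left alone rather than
    raising, so one stray folder cannot abort the whole run.
    """
    skipped = []
    for each in paths:
        if os.path.isfile(each) or os.path.islink(each):
            os.remove(each)
        else:
            skipped.append(each)
    return skipped
```

Returning the skipped paths makes it possible to warn the user that an unexpected folder matched the clean-up pattern.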