The geoglossomycetes-genomics-workflow from tinamelie

Sections:

Assembly

Extracting target genome from metagenome

Gene Prediction

Phylogenomics workflow

Assembly

Trim reads with Trimmomatic v0.39 with the following parameters

LEADING:3 TRAILING:15 SLIDINGWINDOW:4:10 MINLEN:36

Assembly with SPAdes v3.11.1 using the -meta flag.

Extracting target genome from metagenome with an ESOM

Modified and emended from the work out of Jill Banfield's lab. Greg Dick et al. 2009 Genome Biology. And Vincent Denef's version of the print_tetramer_freqs.pl perl script has been used.

Remove small contigs from your input assembly (scaffolds.fasta from SPAdes output) with remove_small_contigs.pl. Contigs below 300 - 1000bp can be removed. We start with 300 and increase the minimum threshold size as needed.

perl remove_small_contigs.pl 300 scaffolds.fasta > species_300bp_spades.fasta

Using blastn create a best blast hit for each contig. Use tabular format (6) for blast output.

blastn -query species_300bp_spades.fasta -db /your_db_path -out species_300bp_to_ntdb_E5 -max_target_seqs 1 -evalue 1e-5 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids" -num_threads 12

Your best blastn hit could be spurious or as equally likely as the second or fourth best blastn hit, but in aggregate, we try to separate the data. For bacteria, this actually works quite well, for Eukaryotes the databases are sparser and therefore we give a best blastn hit less meaning. A top hit to ANY fungus is usually a definite for our targets, a top hit to animals (especially inverts) could still be a fungal target.

Multiple hits will be assigned to each contig, filter the results so only a "best" blast hit is assigned. Do this with best_blast_hit.pl.

perl best_blast_hit.pl species_300bp_to_ntdb_E5_blast.out > best_species_300bp_to_ntdb_E5_blast.out

Assign taxonomy based on the best blast hit with new_ncbi_tax_report.py.

python new_ncbi_tax_report_test.py best_species_300bp_to_ntdb_E5_blast.out -o best_species_300bp_to_ntdb_E5_taxreport

Using the taxonomy assignment, bin contigs manually. Reformat to the "annotation" format taken by ESOM and save it as a plain text file: species_300bp_metagenome.annotation

Here is an example of the top few lines of the annotation file, species_300bp_metagenome.annotation:

contig  annotation
NODE_1147_length_2667_cov_16.2489	10
NODE_707_length_4433_cov_14.0962	10
NODE_329_length_11939_cov_66.1998	1
NODE_326_length_12118_cov_62.7537	1
NODE_2533_length_1171_cov_47.2948	1
NODE_1903_length_1542_cov_12.076	10
NODE_1518_length_1955_cov_17.2579	10
NODE_989_length_3135_cov_14.3055	10
NODE_1005_length_3078_cov_17.2253	10
NODE_988_length_3137_cov_10.7748	10
NODE_1225_length_2482_cov_16.2073	10
NODE_867_length_3613_cov_15.5486	10
NODE_1675_length_1764_cov_10.7923	10
NODE_458_length_7907_cov_59.2615	3

Note, the "key" to these annotation (e.g.taxonomy for 1,10,3) is kept in the spreadsheet where this work was done - not in the esom input file.

Example of the key for annotation:

0 Unassigned
1 Ascomycota
2 Other fungi
3 Plant
4 Animal
5 Bacteria 
6 Virus
7 Amoeba
10 Other eukaryote

Generate input files for ESOM with print_tetramer_freqs.pl.

perl print_tetramer_freqs_esom.pl -s species_300bp_spades.fasta -m 300 -w 3000 -k 4 -a species_300bp_metagenome.annotation

#s = your assembly
#m = minimum contig size to be analyzed (make this higher for faster/ less CPU intensive visualization)
#w = window size to calculate kmer frequency (again, make this higher for faster viz)
#k = kmer size, default is 4, longer means slower!
#a = annotation file (tab delimited plain text) you generated (either from Blast taxonomy report as above, or other info e.g. coverage, RNASeq data)

Output, 4 files: .lrn, .names, .cls, .annotation files within your directory.

You can run the next step locally with ESOM, however, Somoclu is advantageous when you have a large dataset as this can be time and resource-consuming when running locally.

Train the ESOM


somoclu -e 20 -l 0.5 -L 0.1 -m toroid -r 50 -x 450 -y 500 -v 2 species_300bp.fasta.lrn species_300bp_spades.fasta

#e = number of epochs
#l = starting learning rate 
#L = finishing learning rate
#m = map type
#r = start radius
#x = number of columns in map 
#y = number of rows in map 
#v = verbosity level (gives progress/status in output file)

Somoclu will number the the .lrn file starting at 1 and the .bm file starting at 0. If you use them as is in ESOM, you will get a message about mismatches in observations when you run the "Getclassfasta" script on your generated .cls file. Renumber the .bm file starting with 1 before importing into ESOM.

Renumbering is easy with awk

awk 'NR>2 {$1=NR-2}{print}' species_300bp.fasta.bm > renum.species_300bp.fasta.bm

Transfer files to your local computer. Load your .names, .umx, .wts, and fixed .bm file into your Databionic ESOM Tools GUI. If you encounter issues with this program, see extra tips here.
Vizualize output locally and identify target genome

View -> UMatrix background, tiled display. Use Zoom, Color, Bestmatch size to get your desired view.

A. Check "Draw Best Matches" in View pane.

B. File -> load .cls

C. Under "Classes" tab, select classes to be displayed on the map.

D. Once you have identified your target genome, select the "Data" tab. Then, left click around the target area of the map.

E. When you have enclosed all of the target area of the map, right click to select that target area.

This will only work correctly if you have selected the "Data" tab prior to beginning.

F. File, Selection, save as .cls. Name your selected .cls file to save to your folder.

Obtain target contigs using Getclassfasta.pl

perl Getclassfasta.pl -fasta species_300bp_spades.fasta -names species_300bp_spades.fasta.names -num 1 -loyal 75 -cls species_300bp_spades.fasta.cls

This will generate a .fasta (your target genome assembly), and a .conf giving confidence values of specific contigs

For more in-depth steps and example files, see Quandt Mycology's Running_ESOM page.

Gene Prediction with Funannotate

Sort by contigs by size and rename headers

funannotate sort -i after_esom_assembly.fasta -o sorted_assembly.fasta -b profar

Mask repeats

funannotate mask -i sorted_assembly.fasta -o masked_sorted_assembly.fas -s fungi

Run gene prediction pipeline.

funannotate predict -i masked_sorted_assembly.fas --name IDENTIFIER_ --species "Species name" --protein_evidence /path/to/protein/evidence -o species_funannotate_out --cpus $SLURM_NTASKS

This workflow uses nucleotide data (CDS files) as input.

Phylogenomics Workflow

Concatenate your original cds files and convert to a one line fasta

cat *.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' > allcds_dna.fa

Make a new directory and move your concatenated file

mkdir cds_genes && mv allcds_dna.fa cds_genes

Translate sequences to aa with EMBOSS

for f in *.fasta
do
~/EMBOSS-6.6.0/emboss/transeq -trim -sequence ${f} -out ${f%.fasta}.preaa.fasta
done

EMBOSS stubbornly adds as "_1" to each fasta header. I haven't found a way to turn this off. It will cause issues downstream if it's not removed.

for f in *.aln;
do
sed -i 's/T1_1/T1/g' $f; 
done

Run protein ortho,

Run ProteinOrtho

proteinortho -clean -project=project_name -cpus=$SLURM_NTASKS *.aa.fasta

Get single copy, shared orthologous genes. This examples has 114 input genomes.

grep $'^114\t114' > single_copy_114.tsv

Alternative: Check for a blank line up top. Use this command to remove if necessary with the sed command before grep.

sed -n '1p' myproject.proteinortho.tsv | grep $'^114\t114' > single_copy_114.tsv

Grab proteins from files with src_proteinortho_grab_proteins.pl. It's a good idea to take the header from the proteinortho output and include it at the top of your new tsv file. This can speed up the process.

perl ./src_proteinortho_grab_proteins.pl -exact -tofiles new_all_114.tsv input_protein_file_1.fas input_protein_file_2.fas input_protein_file_3.fas

Move the new single copy files to cds_genes then cd to that directory

mv singlecopy114.tsv.OrthoGroup* cds_genes && cd $_

Align the single copy files with MAFFT

for f in *fasta
do
mafft --thread 12 --amino --maxiterate 1000 ${f} > ${f%.fasta}.aln
done

Might as well move the old stuff out of the way

mkdir protsinglecopy && mv *.fasta $_

Using your allcds_dna.fa, back translate to aa using RevTrans. This uses the match name to back translate. RevTrans is easy with a conda installation, but there's also a webserver.

for f in *aln
do 
revtrans.py -match name allcds_dna.fa ${f} > ${f%.aln}.revtrans.fasta
done

Run trimAl with gappyout parameter

for f in *.revtrans.fasta
do 
trimal -in ${f} -out ${f%.revtrans}.trim
done

Note: If you encounter any problems from here, it’s because of the uneven sequence count making it “not an alignment”. All sequences must have the same number of bases to be considered an alignment. You can check a single file using this command:

bioawk -c fastx '{ print $name, length($seq) }' inputfilename

Creating an ML tree

Concatenate your fasta files. Be sure to check your headers here before moving forward

catfasta2phyml.pl -f *trim > sequence.fasta

Run RAxML-NG

raxml-ng --all sequence.fasta --model GTR+G --bs-trees 300

Creating a gene trees and running ASTRAL

This process uses the files in trimmed/ to prepare individual gene trees to then estimate a species tree with ASTRAL

Run concatenated alignments of genes through RAxML to prepare individual gene trees

for f in *.trim
do
echo raxml-ng --all --msa ${f} --model GTR+G --bs-trees 100
done

Concatenate the resulting gene trees and clean up our folder a bit

mkdir genetrees && cd $_
mv ../*.support .
cat *.support > Astral_in.tre

Optional: Filter gene trees with low support branches using Newick utilities. Example filters anything below 30%. See ASTRAL paper for more information.

nw_ed Astral_in.tre 'i & b<=30' o > Astral_in30.tre

Run ASTRAL. Output can be visualized as a species coalescent tree.

java -jar astral.5.7.8.jar -i Astral_in30.tre -o Astral_out30.tre

Performing and visualizing condordance analysis

***Your ASTRAL output and .support files will be needed for this

Astral chooses an arbitrary taxon to root your tree to. It’s recommended that you manually reroot to your outgroup instead.

Example outgroup file:

Saccharomyces cerevisiae

Reroot your tree

reroot_trees.py Astral_30out.tre outgroup.list > rerooted_Astral30.tre

Run the Phyparts concordance analysis

java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -v -d ../support -m rerooted_Astral30.tre -o out

### -a = the type of analysis
### -v = verbose
### -m = target "species tree" from ASTRAL
### -d = directory with gene trees

Visualize the results with phypartspiecharts. The number is the orthologous clusters/gene trees.

phypartspiecharts.py reroot_astral.tre out 84

tinamelie / geoglossomycetes-genomics-workflow Goto Github PK

geoglossomycetes-genomics-workflow's Introduction

Sections:

Assembly

Extracting target genome from metagenome with an ESOM

Gene Prediction with Funannotate

This workflow uses nucleotide data (CDS files) as input.

Phylogenomics Workflow

Creating an ML tree

Creating a gene trees and running ASTRAL

Performing and visualizing condordance analysis

geoglossomycetes-genomics-workflow's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent