Maximum Likelihood Estimation for Viral Populations
VIPRA and MLEHaplo README file
Pre-requisites:
- multi-dsk: k-mer counting software ( Extension of dsk [http://minia.genouest.org/dsk/] version 1.5655)
- Counts k-mers for multiple values of k simultaneously. The software doesn't combines the counts of forward and reverse complement k-mers, as is performed in traditional k-mer counting softwares
- perl with modules Bio::Perl, Getopt::Long, Graph.
- BioPerl is available at http://bioperl.org/
A dockerfile containing all the dependencies is now available. Find more documenation on Docker here.
To build a docker image, you would need Docker installed in your HPC or cloud environment. Create an empty folder with just the Dockerfile in it and type
docker build --tag mlehaplo:1.0 .
This will create the docker container containing all relevant scripts and libraries for you to run MLEHaplo. You can now start a live container by typing:
docker run -it mlehaplo:1.0
The following is a list of steps/commands you'd have to follow to run MLEHaplo.
Command: multi-k fasta/fastq_file list_of_kvalues -d diskspace_limit -m memory_limit
Example:
./multi-dsk/multi-dsk Example/paired-reads.fasta Example/listofkmers.txt -m 8192 -d 10000
- fasta/fastq file: is the file name of the fasta/fastq file. If there are a list of files, use a text file containing locations of all the files, one file per line.
For example, if there are two files containing paired reads,
file1.fastq
andfile2.fastq
, create a filelist_of_files.txt
containing following:
file1.fastq
file2.fastq
- list_of_kvalues : contains a list of k values for which k-mer counting has to be performed. For example if k-mer counting is desired for k-values 55,45,35,25. Create a file
list_of_kmers.txt
containing these numbers in decreasing order:
55
45
35
25
-
diskspace_limit : defines the limit of temporary disk space (in MB) used while performing k-mer counting
-
memory_limit : defines the memory limit (in MB) for storage of temporary hashes while performing k-mer counting
-
Output of multi-dsk is a collection of files with extension
prefix.solid_kmers_binary."kvalue"
in compressed format, which contains counts of k-mers present in the fasta/fastq file.
Command:
parse_results prefix.solid_kmers_binary.kvalue_file > prefix_file.kvalue
Example:
./multi-dsk/parse_results Example/paired-reads.solid_kmers_binary.60 > paired-reads.60
- The file
fasta/fastq_file.kvalue
now contains the k-mer counts for the fasta/fastq file in the format "k-mer count" per line.
Generating the graph needs two files and a parameter. This will combine paired files into a single file.
fasta_file
containing all the reads.kmer_file
generated above.threshold
value for ignoring erroneous k-mers.
Command:
perl construct_graph.pl fasta_file kmer_file threshold graph_file "s"
Example:
perl construct_graph.pl Example/paired-reads.fasta paired-reads.60 0 paired-reads.60.graph "s"
- "s" parameter tells the perl script that there is a single fasta file of reads
- Output is the
graph_file
containing pairs of k-mers that form edges in the De Bruijn graph.
TODO: Add "From (k+1)-mer counts file"
Create the paired set using the paired reads. It takes as input the two paired files,
file1.fasta
and file2.fasta
, the k-mer counts file
file1.kvalue
, and a threshold for ignoring erroneous k-mers
Command:
perl construct_paired_without_bloom.pl -file1 file1.fasta -file2 file2.fasta -paired -kmerfile file1.kvalue -thresh number -wr output_paired_set_file
Example:
perl construct_paired_without_bloom.pl -fasta Example/paired-reads.fasta -kmerfile paired-reads.60 -thresh 0 -wr paired-reads.60.pk.txt
-
Choice of threshold : Dependent on sequencing coverage. Lower threshold includes more erroneous k-mers in the graph, while higher threshold decreases the number of true k-mers and size of the graph.
-
Output is a file that contains pairs of k-mers on a line and the number of times such pair is observed:
kmer1 kmer2 count
Running the VIPRA algorithm takes inputs generated above and a parameter for the average insert size, threshold parameter and a value for M (factor) which decides the number of paths to generate per vertex
Command:
perl dg_cover.pl -graph graph_file_from_step3 -kmer kmer_file_from_step2 -paired paired_set_from_step4 -fact M -thresh threshold_value -IS insert_size > vipra_output_file
Example:
perl dg_cover.pl -graph paired-reads.60.graph -kmer paired-reads.60 -paired paired-reads.60.pk.txt -fact 15 -thresh 0 -IS 400 > paired-reads.60.fact15.txt
-
graph_file_from_step3
- output_file from Step 3: Generate the De Bruijn graph -
kmer_file_from_step2
- output_file from Step 2: De-compress the output of multi-dsk -
paired_set_from_step4
- output_file from Step 4: Create the paired set file -
vipra_output_file
: Contains the paths generated from the graph with high paired end supports. -
prefix.comp.txt
: Contains a sets of paired vertices in the condensed graph that are compatible with each other based on the paired set. -
prefix.cond.graph
: Contains the condensed version of De Bruijn graph, with non-branching paths condensed to a single vertex. -
Temporary output files:
prefix.bubble.txt
: Contains a sets of paired bubbles in the condensed graph that are compatible with each other based on the paired set.prefix.depth
: Contains the depth first search traversal of the graph.prefix.nodedepth
: Temporary file for debugging of code.
Extracting fasta file from outputfile
Command:
perl process_dg.pl vipra_output_file > paths_fasta_file
Example:
perl process_dg.pl paired-reads.60.fact15.txt > paired-reads.60.fact15.fasta
- Output:
paths_fasta_file
. The paths generated by VIPRA
Extracting just the paths in terms of nodes in the graph
Command:
perl get_paths_dgcover.pl -f vipra_output_file -w paths_write_file
Example:
perl get_paths_dgcover.pl -f paired-reads.60.fact15.txt -w paired-reads.60.fact15.paths.txt
- Output:
paths_write_file
Running MLEHaplo takes as input intermediate files generated by VIPRA Step A: Running VIPRA and paths_write_file
generated in Step C: Generate paths file for maximum likelihood estimation
Command:
perl likelihood_singles_wrapper_parallel.pl -condgraph prefix.cond.graph -compset prefix.comp.txt -pathsfile paths_write_file -back -slow -gl approximate_genome_size > MLE_textfile_output
Non Parallel Version Command:
perl likelihood_singles_wrapper.pl -condgraph prefix.cond.graph -compset prefix.comp.txt -pathsfile paths_write_file -back -gl approximate_genome_size -slow > MLE_textfile_output
Example
perl likelihood_singles_wrapper.pl -condgraph paired-reads.60.cond.graph -compset paired-reads.60.comp.txt -pathsfile paired-reads.60.fact15.paths.txt -back -gl 1200 -slow > paired-reads.60.smxlik.txt
Required input files & Parameters
prefix.cond.graph
: Condensed Graph file generated by VIPRA .prefix.comp.txt
: Compatible Set generated by VIPRA.paths_write_file
: Paths file from ViPRA Step C.approximate_genome_size
: Approximate genome Length.
Final viral population generation using MLE_textfile_output
Command:
perl extract_MLE.pl -f paths_fasta_file -l MLE_textfile_output > MLE_Haplo_fasta_OUTPUT
Example
perl extract_MLE.pl -f paired-reads.60.fact15.fasta -l paired-reads.60.smxlik.txt > paired-reads.60.MLE.fasta
paths_fasta_file
. The paths generated by VIPRA Step B: Generate fasta file
centroid-based clustering by similarity. A centroid in a cluster is a contig.
Command:
vsearch --cluster_fast paths_fasta_file --id --centroids cluster_centroid_sequences --consout cluster_consensus_sequences
Example
vsearch --cluster_fast paired-reads.60.fact15.fasta --id 0.995 --centroids paired-reads.60.fact15.centroids.0.995.fa --consout paired-reads.60.fact15.consout.0.995.fa
Input: paths reconstructed by ViPRA, or paths after reduced by VSEARCH when taking paths reduced by VSEARCH as input.
Command:
perl dg_subpath.pl paths_write_file cluster_centroid_sequences new_paths_file
Example
perl dg_subpath.pl paired-reads.60.fact15.paths.txt paired-reads.60.fact15.centroids.0.995.fa paired-reads.60.fact15.0.995.paths.txt
use this new paths file for likelihood_singles_wrapper.pl