
CoverM


CoverM aims to be a configurable, easy to use and fast DNA read coverage and relative abundance calculator focused on metagenomics applications.

CoverM calculates coverage of genomes/MAGs (coverm genome, help) or of individual contigs (coverm contig, help). Coverage is calculated by read mapping, so input can be either BAM files sorted by reference, or raw reads plus reference genomes in various formats.

Installation

Install through the bioconda package

CoverM and its dependencies can be installed through the bioconda conda channel. After initial setup of conda and the bioconda channel, it can be installed with

conda install coverm

Pre-compiled binary

Statically compiled CoverM binaries are available on the releases page. This installation method requires non-Rust dependencies to be installed separately - see the dependencies section.

Compiling from source

CoverM can also be installed from source, using the cargo build system after installing Rust.

cargo install coverm

Development version

To run an unreleased version of CoverM, after installing Rust and any additional dependencies listed below:

git clone https://github.com/wwood/CoverM
cd CoverM
cargo run -- genome ...etc...

To run tests:

cargo build
cargo test

Dependencies

When installing from source or for development, additional programs must also be installed to use the full suite of options.

These can be installed using the conda YAML environment definition:

conda env create -n coverm -f coverm.yml

Or, these can be installed manually:

  • samtools v1.9
  • tee, which is installed by default on most Linux operating systems.
  • man, which is installed by default on most Linux operating systems.

and some mapping software:

For dereplication:

Shell completion

Completion scripts for various shells (e.g. bash) can be generated. For example, to install the bash completion script system-wide (this requires root privileges):

coverm shell-completion --output-file coverm --shell bash
mv coverm /etc/bash_completion.d/

It can also be installed into a user's home directory (root privileges not required):

coverm shell-completion --shell bash --output-file /dev/stdout >>~/.bash_completion

In both cases, the terminal will likely need to be restarted for the completion to take effect. To test, type coverm gen and press the TAB key; it should complete.

Usage

CoverM operates in several modes. Detailed usage information including examples is given at the links below, or alternatively by using the -h or --full-help flags for each mode:

  • genome - Calculate coverage of genomes
  • contig - Calculate coverage of contigs

There are several utility modes as well:

  • make - Generate BAM files through alignment
  • filter - Remove (or only keep) alignments with insufficient identity
  • cluster - Dereplicate and cluster genomes
  • shell-completion - Generate shell completion scripts

Calculation methods

The -m/--methods flag specifies which kind(s) of coverage are to be calculated.

To illustrate, imagine a set of 3 pairs of reads, where only 1 aligns to a single reference contig of length 1000bp:

read1_forward    ========>
read1_reverse                                  <====+====
contig    ...-----------------------------------------------------....
                 |        |         |         |         |
position        200      210       220       230       240

The different coverage measures would be:

Method | Value | Formula | Explanation
mean | 0.02235294 | (10+9)/(1000-2*75) | The two reads have 10 and 9 bases aligned exactly, averaged over 1000-2*75 bp (the length of the contig minus 75 bp from each end).
relative_abundance | 33.3% | 0.02235294/0.02235294*(2/6) | If the contig is considered a genome, its mean coverage is 0.02235294. There is a total of 0.02235294 mean coverage across all genomes, and 2 out of 6 reads (1 out of 3 pairs) map. This coverage calculation is only available in 'genome' mode.
trimmed_mean | 0 | mean_coverage(mid-ranked-positions) | After removing the 5% of bases with highest coverage and the 5% of bases with lowest coverage, all remaining positions have coverage 0.
covered_fraction | 0.02 | (10+10)/1000 | 20 bases are covered by any read, out of 1000 bp.
covered_bases | 20 | 10+10 | 20 bases are covered.
variance | 0.01961962 | var({1;20},{0;980}) | Variance is calculated as the sample variance.
length | 1000 | | The contig's length is 1000 bp.
count | 2 | | 2 reads are mapped.
reads_per_base | 0.002 | 2/1000 | 2 reads are mapped over 1000 bp.
metabat | contigLen 1000, totalAvgDepth 0.02235294, bam depth 0.02235294, variance 0.01961962 | | Reproduces the output of MetaBAT's 'jgi_summarize_bam_contig_depths' tool exactly.
coverage_histogram | 20 bases with coverage 1, 980 bases with coverage 0 | | The number of positions at each coverage level is tallied.
rpkm | 1000000 | 2 * 10^9 / 1000 / 2 | The calculation here assumes no other reads map to other contigs. See https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ for an explanation of RPKM and TPM.
tpm | 1000000 | rpkm/total_of_rpkm * 10^6 | The calculation here assumes no other reads map to other contigs. See RPKM above.
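To make the formulas above concrete, the toy example can be reproduced with a few lines of Python. This is an illustrative sketch, not CoverM's actual implementation; all variable names here are invented.

```python
# Toy example above: two reads with 10 and 9 exactly-aligned bases on a
# single 1000 bp contig, with the default --contig-end-exclusion of 75.
contig_len = 1000
end_exclusion = 75
aligned = [10, 9]     # exactly-aligned bases per read
covered = 20          # positions covered by at least one read
reads = 2             # reads mapped to this contig
total_reads = 2       # assume no reads map anywhere else

mean = sum(aligned) / (contig_len - 2 * end_exclusion)  # 19/850 = 0.02235294...
covered_fraction = covered / contig_len                 # 0.02
reads_per_base = reads / contig_len                     # 0.002

# Sample variance over all 1000 positions: 20 at depth 1, 980 at depth 0.
depth_mean = covered / contig_len
variance = (covered * (1 - depth_mean) ** 2
            + (contig_len - covered) * depth_mean ** 2) / (contig_len - 1)

rpkm = reads * 1e9 / (contig_len * total_reads)  # 1000000.0
tpm = rpkm / rpkm * 1e6                          # single contig, so also 1000000.0
```

Note that mean uses the end-excluded length (850 bp), while covered_fraction and variance are computed over the full 1000 bp.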

Calculation of genome-wise coverage (genome mode) is similar to contig-wise (contig mode) coverage, except that the unit of reporting is per genome rather than per contig. For calculation methods which exclude base positions based on their coverage, all positions from all contigs are considered together. For instance, if a 2000 bp contig with all positions at 1X coverage belongs to a genome that also contains a 2,000,000 bp contig with no reads mapped, then the trimmed_mean will be 0, as all positions in the 2000 bp contig are in the top 5% of positions sorted by coverage.
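The pooling described above can be sketched numerically (an illustration only, not CoverM code):

```python
# A genome containing a 2000 bp contig at 1x coverage and a 2,000,000 bp
# contig with no reads mapped; positions from both contigs are pooled.
positions = [1] * 2000 + [0] * 2_000_000
positions.sort()
n = len(positions)
trim = int(n * 0.05)                  # drop the lowest and highest 5% of positions
kept = positions[trim:n - trim]
trimmed_mean = sum(kept) / len(kept)  # 0.0: every 1x position is in the top 5%
```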

License

CoverM is made available under GPL3+. See LICENSE.txt for details. Copyright Ben Woodcroft.

Developed by Ben Woodcroft at the Queensland University of Technology Centre for Microbiome Research.

coverm's People

Contributors

aroneys, dependabot-preview[bot], dependabot[bot], lavafroth, rhysnewell, wwood


coverm's Issues

Sharded flag not respected when calculating contig coverage

Hi Ben,

I'm trying to run coverm contig with the --sharded flag, but it appears CoverM still expects the BAM files to be reference sorted:

> samtools sort -n example.1.fastq_R1.bam > name_R1.bam
> samtools sort -n example.1.fastq_R2.bam > name_R2.bam
> coverm contig --bam-files name*.bam --sharded --min-read-percent-identity 95 --min-read-aligned-percent 95 -m mean -o coverage.tsv --no-zeros -t 1
[2021-04-29T16:36:37Z INFO  coverm] CoverM version 0.6.1
[2021-04-29T16:36:37Z INFO  coverm] Using min-read-percent-identity 95%
[2021-04-29T16:36:37Z INFO  coverm] Using min-read-aligned-percent 95%
[2021-04-29T16:36:37Z INFO  coverm] Writing output to file: coverage.tsv
[2021-04-29T16:36:37Z INFO  coverm] Using min-covered-fraction 0%
[2021-04-29T16:36:37Z ERROR coverm::contig] BAM file appears to be unsorted. Input BAM files must be sorted by reference (i.e. by samtools sort)
thread 'main' panicked at 'BAM file appears to be unsorted. Input BAM files must be sorted by reference (i.e. by samtools sort)', /opt/conda/conda-bld/coverm_1616760421473/work/src/contig.rs:130:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

If I sort my BAM files by reference, everything works as expected even when using the --sharded flag, which doesn't seem correct.

Thanks for the great work.

Cheers,
Donovan

alignment length

Hi,

In coverm contig, if I want to filter the BAM files according to alignment length, should I set --min-read-aligned-length?

allowable characters in the genome fasta ID?

Hi!

I've been getting a consistent error with one batch of genome files when I run the "genome" command with a concatenated set of the genomes:

coverm genome \
    --reference ~/binned_dRepd/ \
    -s "-" \
    -m relative_abundance \
    --interleaved /home/GLBRCORG/trina.mcmahon/lake_data_general/Mendota_metaGs_renamed/*.fastq \
    --bam-file-cache-directory bam_cache \
    --min-read-aligned-percent 0.75 \
    --min-read-percent-identity 0.95 \
    --min-covered-fraction 0 \
    -x fasta -t 20 &> relative_abundance_output.txt &

The error is:

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 4294967295', src/genome.rs:811:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But if I run "genome" with the individual files, it works fine.

coverm genome \
    --genome-fasta-directory  ~/binned_dRepd/ \
    -m relative_abundance \
    --interleaved /home/GLBRCORG/trina.mcmahon/lake_data_general/Mendota_metaGs_renamed/*.fastq \
    --bam-file-cache-directory bam_cache \
    --min-read-aligned-percent 0.75 \
    --min-read-percent-identity 0.95 \
    --min-covered-fraction 0 \
    -x fa -t 20 &> relative_abundance_output.txt &

I am wondering if it is a formatting problem with my fasta ID (I've done some other debugging that makes me think this).
Are there any forbidden characters that we should not use to separate the genome name from the contig name? I suspect it's the "-" but want to confirm.

Example fasta ID:

ME_SAG_rebinned_metabat1_bin.1-2739367620_Contig_60

thanks,
trina

the rationale of genome coverage calculation

Hi,

I have several MAGs and would like to see their coverage across my samples. I am curious about the rationale of the genome coverage calculation in CoverM. Is it the mean of the contig coverages?

bwa-mem2 support to improve mapping speed

Dear Ben,
The bwa-mem algorithm has been improved, achieving a 2x speedup. See here: https://ieeexplore.ieee.org/abstract/document/8820962?casa_token=KrlJpG5fVt8AAAAA:NUlxBO2400z4M-sFMCbDn2tSXTZj_y0si_MQgNbDvPd3y223cpV-si6b8DDWCWhl-1iSI3Gh

minimap2 is still not perfect for short-read mapping (https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa), but for large-scale datasets its speed is really fast. The syntax for bwa-mem2 is exactly the same as bwa's, so all that needs to be added is recognizing bwa-mem2 in the path, together with bwa and minimap2. I have done tests with bwa-mem2: the output file and coverage are exactly the same as with bwa, but with a 2x speedup.

Thanks,

Jianshu

Gene abundance

Hi, is it possible to use CoverM to calculate the abundance of specific genes? Would it be affected by the gene length? Thanks!

OSX: Default minimap2 method error

Hi Ben,

When trying to produce BAM files with the coverm make command on both macOS and Ubuntu 18.04 using the default minimap2 method (compiled from the master branch, using the coverm executable in the debug directory), I get the following message:

coverm make -o mapping_bam -r /vd03/home/jianshu/computation_genomics/env_geno_lab5/contig.fa -t 32 --interleaved /vd03/home/jianshu/computation_genomics/env_geno_lab5/00_Reads_QCed/T4AerOil_sbsmpl5.fa

[2020-05-28T12:17:08Z INFO coverm] CoverM version 0.4.0
[2020-05-28T12:17:08Z INFO bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941
[2020-05-28T12:17:08Z INFO bird_tool_utils::external_command_checker] Found samtools version 1.9
[2020-05-28T12:17:08Z INFO coverm] Writing BAM files to already existing directory mapping_bam
[2020-05-28T12:17:08Z INFO coverm] Not pre-generating minimap2 index
[2020-05-28T12:17:08Z INFO coverm] Running mapping number 1 ..
[2020-05-28T12:17:08Z ERROR coverm::bam_generator] Error when running mapping process: ["bash -c \"set -e -o pipefail; minimap2 --split-prefix /tmp/.tmpbc4mD4 -a -x sr -t 96 '/vd03/home/jianshu/computation_genomics/env_geno_lab5/contig.fa' '/vd03/home/jianshu/computation_genomics/env_geno_lab5/00_Reads_QCed/T4AerOil_sbsmpl5.fa' 2>/tmp/.tmpM1U3x3 | remove_minimap2_duplicated_headers | samtools sort -T '/tmp/coverm-make-samtools-sortb1q2Ge' -l0 -@ 95 2>/tmp/.tmpXMBirl | samtools view -b -@ 95 -o 'mapping_bam/contig.fa.T4AerOil_sbsmpl5.fa.bam' 2>/tmp/.tmpVtVykf\""]
[2020-05-28T12:17:08Z ERROR coverm::bam_generator] The STDERR for the MINIMAP2_SR part was: [M::mm_idx_gen::0.130*0.86] collected minimizers
    [M::mm_idx_gen::0.161*13.67] sorted minimizers

[2020-05-28T12:17:08Z ERROR coverm::bam_generator] The STDERR for the samtools sort part was:
[2020-05-28T12:17:08Z ERROR coverm::bam_generator] The STDERR for the samtools view for cache part was:
[2020-05-28T12:17:08Z ERROR coverm::bam_generator] Cannot continue since mapping failed.

Any idea why?

Thanks,

Best,

Jianshu

relative_abundance

Dear developer,
I still don't understand the relative abundance calculation in CoverM. Is the result of CoverM the percentage of mapped reads for each genome? I just want to get the number of reads recruited per kilobase of genome per gigabase of metagenome (RPKG). Can I use CoverM to calculate this?
The following is my command:
coverm genome -d /backup/zhouyl/Genome -x fasta -b TS01sorted -t 20 --min-read-aligned-length 50 --min-read-percent-identity 0.99 >TS01sorted.coverage
The following is the result:

Genome TS01sorted Relative Abundance (%)
unmapped 75.77681
B100T1B8 0
B100T1L10 0.45533308
B100T3L11 0.024251277
B101T1B8 0.03663064
B101T1L10 0.047809314
B101T3L11 0.014094652
B102T1B8 0
B102T1L10 0.07564686
B102T3L11 0.03380026
B103T1B8 0
B103T1L10 0.078486376
B104T1B8 0
B104T3L11 0.0088422205

Hoping for your reply.
Thanks in advance~

export the results to output file

Hi,

Would it be possible to export the results to an output file? In the help page of coverm genome and coverm contig, I can't find how to specify the output file.

Running multiple methods with one command

Hi, thanks for the great tool. I couldn't tell from the documentation if you can run multiple methods in the same command, such as if I want to have the output method be both relative_abundance and coverage_histogram without having to run them as separate jobs. I understand I could probably use the coverage_histogram output and manually calculate the relative abundance, but would be nice to have those output results from one run. Apologies if this was already explained in the documentation and I missed it.
Thanks.

add output file option

Hi,
Unless I'm mistaken, there is no output file option to save the results to a file.
It would maybe be useful to add one, as currently the results go to STDOUT.

[Suggestion] Add TPM as a method for calculating coverage

Currently, CoverM lets the user output contig and genome coverage as RPKM. This metric is cool because it corrects for both library size and contig/genome length.

TPM (transcripts per million) is another metric with the same positive aspects, and it has the advantage that the values for all contigs/genomes must add up to 1,000,000. More on this here. Nowadays, most RNA-Seq quantification software uses TPM instead of RPKM.

The advantage of TPM becomes clear when comparing coverage between two libraries. If a contig/genome has a higher TPM in library A than in library B, you can infer that this contig/genome is more abundant in library A. If you observed the same with RPKM values, you wouldn't be able to reach the same conclusion, as the total RPKM varies between different libraries.
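The point can be illustrated with a small sketch (made-up counts and lengths, and simplified formulas, just to show the normalization property):

```python
# RPKM: reads per kilobase of reference per million mapped reads.
def rpkm(counts, lengths, total_reads):
    return [c * 1e9 / (l * total_reads) for c, l in zip(counts, lengths)]

# TPM: length-normalized rates rescaled so each library sums to 1,000,000.
def tpm(counts, lengths):
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

lib_a = tpm([100, 300], [1000, 2000])  # sums to 1e6
lib_b = tpm([10, 90], [1000, 2000])    # also sums to 1e6, so values are comparable
```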

Set galah/dereplication as a crate feature

Hey Ben,

I know that CoverM isn't supposed to be used as a library and that implementing this would require some work, so feel free to ignore/close this issue if you think it is not worth it.

I just updated pyCoverM to use the latest release (0.6.0) and I noticed that I can't compile the wheel on Windows because of Galah:

error[E0433]: failed to resolve: could not find `unix` in `os`
   --> C:\Users\runneradmin\.cargo\registry\src\github.com-1ecc6299db9ec823\galah-0.3.0\src\cluster_argument_parsing.rs:385:22
    |
385 |             std::os::unix::fs::symlink(link, std::path::Path::new(&current_stab)).expect(&format!(
    |                      ^^^^ could not find `unix` in `os`

Because dereplication is not something that most people would expect in a tool to estimate contig/genome coverage, maybe it could be moved to the [features] section of the Cargo.toml file and then Galah could be marked as an optional requirement. This way, CoverM could be compiled on Windows.

Trimmed Mean larger than Mean

Thanks for this very nice tool. I have a question regarding the output. In my results, one bin shows quite different values for Mean and Trimmed Mean (see below). In my understanding, Trimmed Mean values should be less than Mean values. Most other bins follow this rule, with only a few exceptions. Is that right? And of these two measures, which one more accurately represents the relative abundance? Thanks!

Genome | Mean | Relative Abundance (%) | Trimmed Mean
Bin1 | 106.6692 | 2.686203 | 2030.007

--bam-files incompatible with --min-read-*-pair

Hi,

Thanks for maintaining this great tool!

I encountered the following error:

$ coverm contig -b test.bam -o test.tsv -m mean -t 4 --no-zeros --min-read-percent-identity-pair 75 --min-read-aligned-percent-pair 60 --exclude-supplementary --min-covered-fraction 1e-10

error: The following required arguments were not provided:
    --coupled <coupled>...
    --interleaved <interleaved>...
    --proper-pairs-only
    -1 <read1>...
    -2 <read2>...
    --reference <reference>...
    --single <single>...

USAGE:
    coverm contig --bam-files <bam-files>... --contig-end-exclusion <contig-end-exclusion> --coupled <coupled>... --exclude-supplementary --interleaved <interleaved>... --mapper <mapper> --methods <methods>... --min-covered-fraction <min-covered-fraction> --min-read-aligned-percent-pair <min-read-aligned-percent-pair> --min-read-percent-identity-pair <min-read-percent-identity-pair> --no-zeros --output-file <output-file> --output-format <output-format> --proper-pairs-only -1 <read1>... -2 <read2>... --reference <reference>... --single <single>... --threads <threads> --trim-max <trim-max> --trim-min <trim-min>

For more information try --help

Removing the options --min-read-percent-identity-pair and --min-read-aligned-percent-pair worked. It would be ideal to have a clearer error message (it could be as simple as "--min-read-percent-identity-pair is not compatible with --bam-files"), as is the case when using e.g. -m metabat with --min-covered-fraction:

$ coverm contig -b raw_EB003-DS014-NT.bam -o raw_EB003-DS014-NT.coverm -m metabat -t 4 --no-zeros --exclude-supplementary --min-covered-fraction 1e-10

[2021-04-27T14:06:50Z INFO  coverm] CoverM version 0.6.1
[2021-04-27T14:06:50Z INFO  coverm] Setting single read percent identity threshold at 0.97 for MetaBAT adjusted coverage, and not filtering out supplementary, secondary and improper pair alignments
[2021-04-27T14:06:50Z INFO  coverm] Writing output to file: raw_EB003-DS014-NT.coverm
[2021-04-27T14:06:50Z INFO  coverm] Using min-covered-fraction 0.00000001%
[2021-04-27T14:06:50Z ERROR coverm] The 'length' coverage estimator cannot be used when --min-covered-fraction is > 0 as it does not calculate the covered fraction. You may wish to set the --min-covered-fraction to 0 and/or run this estimator separately.

Thanks

output

Hi, is there a way to output a tab delimited txt file or similar (coverm genome)? I was not able to reconstruct the screen output unambiguously. Thanks a lot!

Calculation of relative abundance

Hello,

I was wondering how the relative abundance is calculated. I saw in your test code the results below:
" "Sample Genome Relative Abundance (%) Mean
7seqs.reads_for_seq1_and_seq2 unmapped 0 NA
7seqs.reads_for_seq1_and_seq2 genome1 0 0
7seqs.reads_for_seq1_and_seq2 genome2 53.16792 1.4117647
7seqs.reads_for_seq1_and_seq2 genome3 0 0
7seqs.reads_for_seq1_and_seq2 genome4 0 0
7seqs.reads_for_seq1_and_seq2 genome5 46.832077 1.2435294
7seqs.reads_for_seq1_and_seq2 genome6 0 0"

It looks to me that the 53% is just 1.41/(1.41+1.24). But in the wiki, the calculation considers the percentage of total reads that were mapped. Could you provide a more detailed explanation of the calculation? Thank you very much!
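For reference, the numbers in that test output are consistent with relative abundance being a genome's share of total mean coverage, scaled by the fraction of reads that mapped. A sketch of that reading (not CoverM's actual code; relative_abundance is a name invented here):

```python
def relative_abundance(genome_means, reads_mapped, reads_total):
    # A genome's share of total mean coverage, scaled by the mapped-read fraction.
    mapped_fraction = reads_mapped / reads_total
    total_mean = sum(genome_means)
    return [m / total_mean * mapped_fraction * 100 for m in genome_means]

# The two non-zero genomes above, assuming all reads mapped (unmapped = 0):
relative_abundance([1.4117647, 1.2435294], 10, 10)
# -> approximately [53.17, 46.83], matching the test output
```

When some reads are unmapped, the mapped fraction is below 1 and the per-genome percentages shrink accordingly, which reconciles the two descriptions.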

TAD (truncated average depth) and Trimmed mean depth

Dear Ben,
I am now comparing different methods for calculating genome coverage.

Our group proposed a method to calculate the abundance of each MAG binned from assembled contigs called TAD80:

The abundance of each genome species (the representative of a cluster after dereplication) was estimated using the MAG of highest genome quality as the representative. For each metagenomic dataset, the sequencing depth was estimated per position (Bowtie (Langmead and Salzberg, 2012) for mapping short reads to MAGs; bedtools (Quinlan and Hall, 2010) for calculating coverage from the mapped BAM file) and truncated to the central 80% (BedGraph.tad.rb (Rodriguez-R and Konstantinidis, 2016)), a metric hereafter termed TAD (truncated average sequencing depth). Abundance was then estimated as TAD80 normalized by the genome equivalents of the metagenomic dataset. Three steps are needed to calculate TAD80:

  1. Map reads to MAG using mapping tools (bwa or bowtie2) and get the sorted bam file
  2. Calculate coverage for each position:
    bedtools genomecov -ibam MAG_sorted.bam >> MAG.bedtools.cov.txt
  3. Calculate TAD80 using the script BedGraph.tad.rb (https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/BedGraph.tad.rb):
    BedGraph.tad.rb -i lab5_MAG.001.bedtools.cov.txt -r 0.8

For the CoverM genome trimmed mean method, if I understand it correctly:

You do a similar thing (choosing --trim-min 0.1 and --trim-max 0.9) compared to TAD80, but you also remove the first and last 75 bp to avoid bad mapping (edge effects):

coverm genome -d ./try_MAGs_1 -x fasta -b ./mapping_bam/MAG.001.bam --min-covered-fraction 0.001 -m trimmed_mean --trim-min 0.1 --trim-max 0.9 --contig-end-exclusion 75

The directory try_MAGs_1 contains only MAG.001.fasta, MAG.001.bam is generated by mapping reads to MAG.001.fasta

My question is: is TAD80 basically the same thing as CoverM trimmed mean (--trim-min 0.1 and --trim-max 0.9)? (My understanding is: it is.) They might not be exactly the same, but the general idea and logic of the workflow for calculating average coverage for a MAG is the same, right (except for the removed 75 bp at both ends of each contig)?
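For readers comparing the two approaches: both TAD80 and trimmed_mean with --trim-min 0.1/--trim-max 0.9 take the mean of the central 80% of per-position depths. A minimal sketch of that shared idea (illustrative only, not either tool's code):

```python
def tad(depths, fraction=0.8):
    """Truncated average depth: mean of the central `fraction` of
    per-position depths (TAD80 keeps the middle 80%)."""
    depths = sorted(depths)
    trim = int(round(len(depths) * (1 - fraction) / 2))
    kept = depths[trim:len(depths) - trim] or depths
    return sum(kept) / len(kept)

# 10 low-, 80 mid-, 10 high-coverage positions: the extremes are trimmed away.
tad([0] * 10 + [5] * 80 + [100] * 10)  # -> 5.0
```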

I know you are on vacation. Please feel free to answer the question whenever you have time.

Thank you very much,

Best,

Jianshu

failed to create temporary file

Hi, I encountered this problem using coverm 0.2.0-alpha7:

[M::mem_process_seqs] Processed 669062 reads in 511.997 CPU sec, 93.862 real sec
[2021-01-08T18:33:04Z ERROR coverm::bam_generator] The STDERR for the samtools sort part was: samtools sort: failed to create temporary file "/tmp/coverm_fifo.wnwaV0ajKHzH/coverm-make-samtools-sorthx37j5.0005.bam": No space left on device
samtools sort: failed to create temporary file "/tmp/coverm_fifo.wnwaV0ajKHzH/coverm-make-samtools-sorthx37j5.0008.bam": No space left on device
[2021-01-08T18:33:04Z ERROR coverm::bam_generator] The STDERR for the samtools view for cache part was: [main_samview] fail to read the header from "-".
thread 'main' panicked at 'Cannot continue since mapping failed.', src/bam_generator.rs:146:13
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
I guess there is no space left on my /tmp directory, so can I specify the temporary directory by myself to solve this problem?
Thanks~

Relative abundance estimates double when using --genome-definitions and provided bam file

Hello Ben,

When I run coverm roughly following default settings I end up with half the abundances of coverm when I provide the bam file and --genome-definitions flag:

Default-ish setting:

coverm genome --reference MAGs_sp_concat.fa -s "~" -m relative_abundance -c /illumina_data/*.fastq --bam-file-cache-directory bams --min-read-aligned-percent 0.75 --min-read-percent-identity 0.95 --min-covered-fraction 0 -x fa -t 60  &> log.txt

vs providing a bam file with the --genome-definitions flag (my input for this was a (samtools sorted) bam file produced from minimap, and minimap had been provided with an interleaved illumina reads file):

coverm genome --genome-definition genome_definition_AalEMAGs.txt -m count -b bam/AalE.bam --min-read-aligned-percent 0 --min-read-percent-identity 0 --min-covered-fraction 0 -x fa -t 60 &> log2.txt

So, my log2.txt file indicates a relative abundance x2 of the log.txt - and it is pretty much exactly correlated (I used trimmed reads (no Ns) for log2.txt and the raw reads for log.txt)

Do you know what my problem could be? The -m counts are roughly the same (considering the slightly different read data and mapping programs I mentioned), so I think it is to do with the relative abundance calculation and because I used cutadapt-produced interleaved read files as input to minimap2 - maybe the pairs are not flagged the same and so coverm views them as unpaired and does not do a /2 step?

My reasoning for using the provided bam files and genome definitions (MAG_id tab contig1 etc) is that I'd like to get the coverage/abundances of my MAGs from the exact same data that was used for the assembly and binning.

I hope that explanation makes sense! And that I am using coverm in a legitimate way here...

Thanks :)
Caitlin

Option to produce percent identity summary

Hello,

I have been using CoverM often for filtering my BAM files based on read percent identity. I am wondering if there is any possibility of adding an option to output a summary indicating the percent identity of every mapped read and the location on the contig/genome that read mapped. I have been developing a manual script to do this in pysam by parsing the CIGAR and MD Tags but considering CoverM already makes some of those calculations for the filtering function it would be great to be able to just output a summary in one step when I filter my BAM files using CoverM.

-Elaina

Naive question: reads mapping to multiple locations

Hi,

Thanks for the cool pipeline!
I wonder if there is a way to count (and/or remove) reads that map to multiple locations. I've seen the --exclude-supplementary option, but I'm not sure if it deals with these reads.
I wonder what the best approach for coverage estimation is: remove these reads, or find a way to correct for them.
Thanks!

print zeros when no contig is covered

ben@u:~/git/coverm/tests/data$ cargo run -- contig -m covered_fraction -b 2seqs.reads_for_seq1.bam --min-covered-fraction 0
    Finished dev [unoptimized + debuginfo] target(s) in 0.12s                                                                                                                                                      
     Running `/home/ben/git/coverm/target/debug/coverm contig -m covered_fraction -b 2seqs.reads_for_seq1.bam --min-covered-fraction 0`
Sample	Contig	CoveredFraction
 INFO 2018-11-19T12:56:35Z: coverm::contig: In sample '2seqs.reads_for_seq1', found 0 reads mapped out of 12 total (0.00%)

That should print seq1 and seq2 with 0 covered fraction. Currently all reads are filtered out by flag filter.

relative-abundance

Dear developer:
I successfully used CoverM to calculate the coverage of my genomes from metagenomic data, using the trimmed_mean method. But I have a question about my coverage data. There are many MAGs in my sample, so I would not like to plot all of the MAGs' abundances in one figure; instead I want to classify the MAGs at the phylum level. That is, can I add together the trimmed_mean coverage values of MAGs in the same phylum, and use the combined value to represent the abundance of that phylum?

Calculation of RPKM

Hello,

I am trying to calculate RPKM values from the results of CoverM. RPKM = (mapNumReads * 10^9) / (genomeLength * totalNumReads). The RPKM calculation doesn't refer to the length of each read, while CoverM does. It is not easy to eliminate this part from the results no matter which method I use, because my data come from different sequencing platforms and have inconsistent read lengths. So, I wonder which CoverM method to choose and how to use the result to further calculate RPKM values.

Best regards,
Jercey

unmapped reads of abundance output and mapped reads of `samtools flagstat` are quite different

Hi,
I used this command to calculate genome abundance: coverm genome -p bwa-mem --coupled demo.1.fq.gz demo.2.fq.gz --genome-fasta-directory ./dereplicated_genomes/ --bam-file-cache-directory ./ -o output.tsv. The final unmapped-read abundance is 85%, but samtools flagstat coverm-genome.demo.1.fq.gz.bam shows that 92% of reads mapped. I am really confused.

head -2 output.tsv
Genome demo.1.fq.gz Relative Abundance (%)
unmapped 84.77992


samtools flagstat coverm-genome.demo.1.fq.gz.bam
4818489 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
732129 + 0 supplementary
0 + 0 duplicates
4428318 + 0 mapped (91.90% : N/A)
4086360 + 0 paired in sequencing
2043180 + 0 read1
2043180 + 0 read2
2774868 + 0 properly paired (67.91% : N/A)
3657264 + 0 with itself and mate mapped
38925 + 0 singletons (0.95% : N/A)
821120 + 0 with mate mapped to a different chr
362154 + 0 with mate mapped to a different chr (mapQ>=5)

Failed to correctly find or parse BAM file when submitting into a PBS cluster system

Hi, I am now working on the coverage of my genomes, but I get an error when using coverm genome in a PBS system; the error is about the location of temporary files. Is there anything I can do, or is it a bug in the software?

REMOTEHOST: Undefined variable.
[2020-09-25T05:49:40Z INFO  coverm] CoverM version 0.5.0
[2020-09-25T05:49:40Z INFO  coverm] Using min-covered-fraction 10%
[2020-09-25T05:49:40Z INFO  bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941
[2020-09-25T05:49:41Z INFO  bird_tool_utils::external_command_checker] Found samtools version 1.10
[2020-09-25T05:49:41Z INFO  coverm] Profiling 59 genomes
[2020-09-25T05:49:41Z INFO  coverm] Generating concatenated reference FASTA file of 59 genomes ..
[2020-09-25T05:49:42Z INFO  coverm] Not pre-generating minimap2 index
[2020-09-25T06:02:21Z ERROR coverm::bam_generator] Failed to correctly find or parse BAM file at "/var/tmp/pbs.001.cluster/coverm_fifo.HvOJkiD3KX2q/foo.pipe": unable to open SAM/BAM/CRAM file at /var/tmp/pbs.001.cluster/coverm_fifo.HvOJkiD3KX2q/foo.pipe
[2020-09-25T06:02:21Z ERROR coverm::bam_generator] Error when running mapping process: ["bash -c \"set -e -o pipefail; minimap2 --split-prefix /var/tmp/pbs.001.cluster/.tmp2tMq5F -a -x sr  -t 1  \'/var/tmp/pbs.001.cluster/.tmpbW2t90\' \'./tmp/IG10_3_FDME202321603-1a_H35KCDSXY_L1_1.fq.gz\' \'./tmp/IG10_3_FDME202321603-1a_H35KCDSXY_L1_2.fq.gz\' 2>/var/tmp/pbs.001.cluster/.tmp5GWVtX  | remove_minimap2_duplicated_headers 2>/var/tmp/pbs.001.cluster/.tmpqMJUWv| samtools sort -T \'/var/tmp/pbs.001.cluster/coverm_fifo.HvOJkiD3KX2q/coverm-make-samtools-sortfS1d6F\' -l0 -@ 0 2>/var/tmp/pbs.001.cluster/.tmpLtMBoP > \"/var/tmp/pbs.001.cluster/coverm_fifo.HvOJkiD3KX2q/foo.pipe\"\""]
[2020-09-25T06:02:21Z ERROR coverm::bam_generator] The STDERR for the MINIMAP2_SR part was: [M::mm_idx_gen::6.979*1.00] collected minimizers
    [M::mm_idx_gen::9.925*1.00] sorted minimizers
    [M::main::9.944*1.00] loaded/built the index for 61309 target sequence(s)
    [M::mm_mapopt_update::9.944*1.00] mid_occ = 1000
    [M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 61309
    [M::mm_idx_stat::10.706*1.00] distinct minimizers: 28258679 (98.49% are singletons); average occurrences: 1.019; average spacing: 6.051
    [M::worker_pipeline::19.780*1.00] mapped 333334 sequences
    [... ~130 further "[M::worker_pipeline::...] mapped 333334 sequences" progress lines omitted ...]
    [M::worker_pipeline::756.105*0.98] mapped 333334 sequences

[2020-09-25T06:02:21Z ERROR coverm::bam_generator] The STDERR for the samtools sort part was:
[2020-09-25T06:02:21Z ERROR coverm::bam_generator] The STDERR for the remove_minimap2_duplicated_headers part was:
[2020-09-25T06:02:21Z ERROR coverm::bam_generator] Cannot continue since mapping failed.

Add support for URMAP

A very fast mapping tool, URMAP (https://drive5.com/urmap/), could speed up the mapping process and thus the calculation of coverage for large metagenomic datasets. I would say it is a very good option for enhancing CoverM's speed when large datasets are analyzed.

minimap2 doesn't croak when unequal numbers of reads are in fwd and rev files

~/git/coverm/tests/data$ coverm genome --single-genome -c bad_read.1.fa <(randomFasta.pl 100 5) bad_read.2.fa bad_read.2.fa  -r 7seqs.fna --min-covered-fraction 0

relevant minimap2 output

[2021-02-02T06:41:27Z DEBUG coverm::bam_generator] The STDERR for the MINIMAP2_SR part was: [M::main::0.002*1.27] loaded/built the index for 7 target sequence(s)
    [M::mm_mapopt_update::0.002*1.27] mid_occ = 1000
    [M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 7
    [M::mm_idx_stat::0.003*1.24] distinct minimizers: 9434 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.044
    [W::mm_bseq_read_frag2] query files have different number of records; extra records skipped.
    [M::worker_pipeline::0.003*1.18] mapped 10 sequences
    [W::mm_bseq_read_frag2] query files have different number of records; extra records skipped.
    [W::mm_bseq_read_frag2] query files have different number of records; extra records skipped.
    [W::mm_bseq_read_frag2] query files have different number of records; extra records skipped.
    [M::main] Version: 2.17-r941
    [M::main] CMD: minimap2 --split-prefix /tmp/.tmpnOkBsU -a -x sr -t 1 /tmp/coverm-mapping-index.FY2fg0iAmiL8/7seqs.fna bad_read.1.fa /dev/fd/63
    [M::main] Real time: 0.005 sec; CPU: 0.005 sec; Peak RSS: 0.006 GB

Only fix is to grep the stderr for 'different number of records' I guess. Gah.
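
Until then, a pre-flight check in plain shell can catch the mismatch before mapping. This is a minimal sketch, assuming uncompressed FASTQ input (each record is exactly four lines); the demo file names are made up:

```shell
# Create two tiny demo FASTQ files with deliberately unequal record counts
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > demo.1.fq
printf '@r1\nACGT\n+\nIIII\n' > demo.2.fq

# FASTQ records are four lines each
records() { echo $(( $(wc -l < "$1") / 4 )); }

fwd=$(records demo.1.fq)
rev=$(records demo.2.fq)
if [ "$fwd" -ne "$rev" ]; then
    echo "record count mismatch: $fwd forward vs $rev reverse"
fi
```

For FASTA inputs (as in the example above), counting `>` header lines instead would serve the same purpose.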

thread 'main' panicked at 'index out of bounds

Hello,

I am getting the following error when I run coverm genome with sorted BAM files and a reference FASTA as input.

[2020-02-02T18:33:54Z INFO coverm] CoverM version 0.3.2
[2020-02-02T18:33:54Z INFO coverm] Using min-covered-fraction 0%
[2020-02-02T18:33:54Z INFO coverm] Using min-read-percent-identity 95%
[2020-02-02T18:33:54Z INFO coverm] Using min-read-aligned-percent 95%
[2020-02-02T18:33:54Z INFO coverm] Using min-read-percent-identity-pair 95%
[2020-02-02T18:33:54Z INFO coverm] Using min-read-aligned-percent-pair 94%
[2020-02-02T18:33:54Z INFO coverm] Not using directory entry 'must/fasta_files.txt' as a genome FASTA file, as it does not end with the extension 'fna'
[2020-02-02T18:33:54Z INFO coverm] Reading contig names for 20 genomes ..
[2020-02-02T18:33:54Z INFO coverm::genome] Of 20 reference IDs, 20 were assigned to a genome and 0 were not
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /rustc/4560ea788cb760f0a34127156c78e2552949f734/src/libcore/slice/mod.rs:2723:14
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.

Do you know what could be causing this error? Thank you!

Failed to parse genome paths (even though the files are there)

I'm not sure if this rises to the level of an issue, since I'm just starting to use CoverM and am really unsure about the command structure (it's hard to tell which flags are needed and which aren't). Sorry if I'm cluttering the issues list!
I'm getting an error saying a genome is not a FASTA file because it does not end in .fna (even though it does). Granted, I have other files in the genome directory that are not .fna, but I was hoping those would be ignored.

Here is my command:
coverm genome -t 40 --single PATHTOMETAS -d PATHTOMAGS -x .fna -p minimap2-sr --min-read-percent-identity 0.9 --methods relative_abundance --bam-file-cache-directory bam_cache/ --discard-unmapped -v

[2020-12-08T20:55:57Z INFO coverm] CoverM version 0.5.0
[2020-12-08T20:55:57Z INFO coverm] Using min-covered-fraction 10%
[2020-12-08T20:55:57Z INFO coverm] Using min-read-percent-identity 90%
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020571-bin.11.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/16cyanoMAGs_dreped.tar.gz' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/2582580591.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020551-bin.8.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020573-bin.18.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/GEODES117-bin.31.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020501-bin.15.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/refList.txt' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/archive_edited_fasta_id' as a genome FASTA file
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020490-bin.6.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020558-bin.31.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020542-bin.8.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/16cyanoMAGs_original_fasta_id' as a genome FASTA file
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/2582580531.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020517-bin.16.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/16cyanoMAGs_edited_fasta_id' as a genome FASTA file
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/GEODES118-bin.197.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020547-bin.20.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020539-bin.6.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020509-bin.3.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/16cyanoMAGs-fastaIDedited.tar.gz' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:55:57Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/2582580577.fna' as a genome FASTA file, as it does not end with the extension '.fna'
thread 'main' panicked at 'Failed to parse genome paths: "Found 0 genomes from the genome-fasta-directory, cannot continue."', src/bin/coverm.rs:782:18
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
(coverM) [trina.mcmahon@scarcity-11 coverM_mapping]$ ps
PID TTY TIME CMD
3710272 pts/2 00:00:00 bash
3711189 pts/2 00:00:00 ps
(coverM) [trina.mcmahon@scarcity-11 coverM_mapping]$ coverm genome -t 40 --single /home/GLBRCORG/trina.mcmahon/Cyanos/data/metaGs/ -d /home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/ -x .fna -p minimap2-sr --min-read-percent-identity 0.9 --methods relative_abundance --bam-file-cache-directory bam_cache/ --discard-unmapped -v
[2020-12-08T20:57:20Z INFO coverm] CoverM version 0.5.0
[2020-12-08T20:57:20Z INFO coverm] Using min-covered-fraction 10%
[2020-12-08T20:57:20Z DEBUG coverm] Cached regular coverage taker with columns to normlise: [0] and rpkm_column: None
[2020-12-08T20:57:20Z INFO coverm] Using min-read-percent-identity 90%
[2020-12-08T20:57:20Z DEBUG coverm] Filter parameters set as FilterParameters { flag_filters: FlagFilter { include_improper_pairs: true, include_supplementary: true, include_secondary: false }, min_aligned_length_single: 0, min_percent_identity_single: 0.9, min_aligned_percent_single: 0.0, min_aligned_length_pair: 0, min_percent_identity_pair: 0.0, min_aligned_percent_pair: 0.0 }
[2020-12-08T20:57:20Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/3300020571-bin.11.fna' as a genome FASTA file, as it does not end with the extension '.fna'
[2020-12-08T20:57:20Z INFO bird_tool_utils::clap_utils] Not using directory entry '/home/GLBRCORG/trina.mcmahon/Cyanos/data/dereplicated_MAGs/16cyanoMAGs_dreped.tar.gz' as a genome FASTA file, as it does not end with the extension '.fna'
[... the same "Not using directory entry" messages as in the previous run, one per file, omitted ...]
thread 'main' panicked at 'Failed to parse genome paths: "Found 0 genomes from the genome-fasta-directory, cannot continue."', src/bin/coverm.rs:782:18
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
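
For what it's worth, a plausible cause, assuming the directory scanner matches file names against a dot followed by the supplied extension: passing `-x .fna` would make it look for names ending in `..fna`, so every `.fna` file gets rejected, and `-x fna` (no leading dot) would be the fix. A pure-shell sketch of that matching rule:

```shell
# Example file name taken from the log above; the dot-prepending behaviour
# of the extension filter is an assumption, not confirmed from the source.
f="3300020571-bin.11.fna"
matches_fna="no"; matches_dotfna="no"
case "$f" in *".fna")  matches_fna="yes"  ;; esac   # as with -x fna
case "$f" in *"..fna") matches_dotfna="yes" ;; esac # as with -x .fna
echo "-x fna matches: $matches_fna; -x .fna matches: $matches_dotfna"
```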

change tmp path

Some files are written to the /tmp directory, where we only have a very small volume. Processes crash easily when multiple tasks run simultaneously. It would be best if this path could be set explicitly via a parameter.
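
In the meantime, a hypothetical workaround, assuming CoverM and the tools it shells out to honour the standard TMPDIR environment variable: point it at a larger scratch volume before running. The path below is an example only.

```shell
# Redirect temporary files to a roomier scratch area (example path)
export TMPDIR="${SCRATCH:-$PWD}/coverm_tmp"
mkdir -p "$TMPDIR"
echo "temporary files will be placed under $TMPDIR"
# coverm genome ...   # run as usual; FIFOs and indices should now land in $TMPDIR
```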

Extract unmapped reads

Hi,

first, thanks for your tool, it's really useful. I used CoverM to determine MAG abundances in my samples with the options --min-read-aligned-percent 0.75 --min-read-percent-identity 0.95, and now I would like to extract the unmapped reads to explore what lies behind them. Is this possible?

Best,
jsgounot
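
One workaround sketch, assuming the BAMs are kept via --bam-file-cache-directory: pull out the unmapped pairs afterwards with samtools. The file paths below are placeholders, not CoverM output names.

```shell
bam="bam_cache/sample1.bam"   # placeholder path to a cached BAM
if ! command -v samtools >/dev/null 2>&1; then
    echo "samtools not found; install it first"
elif [ ! -f "$bam" ]; then
    echo "no cached BAM at $bam (placeholder path)"
else
    # -f 12: keep records where the read AND its mate are unmapped
    # (use -f 4 instead for single-end data)
    samtools view -b -f 12 "$bam" > unmapped.bam
    samtools fastq -1 unmapped.1.fq.gz -2 unmapped.2.fq.gz unmapped.bam
    echo "wrote unmapped read pairs"
fi
```

Note that if --discard-unmapped was used when caching, the unmapped reads will not be in the BAMs to begin with.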

coverm filter

Hi ,
I am using "coverm filter" to remove alignments with insufficient identity. The input BAM file is 143,374,586,594 bytes, but the output file is only 4,129,795 bytes. I then used "coverm contig --methods trimmed_mean" to calculate the mean coverage of each contig from this much smaller output file. I am not sure whether something went wrong. Could you give some suggestions? Thanks

Best regards

-m metabat fails with --no-zeros

$ cargo run -- contig -m metabat -b tests/data/issue_19_bams/*bam --no-zeros
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/coverm contig -m metabat -b tests/data/issue_19_bams/BE_BS_R1-BE_BS_R1.reduced.bam tests/data/issue_19_bams/BE_BS_R1-BE_RX_R1.reduced.bam tests/data/issue_19_bams/BE_BS_R1-BM_ER_R7.reduced.bam tests/data/issue_19_bams/BE_BS_R1-BM_RX_R7.reduced.bam --no-zeros`
[2020-02-07T01:35:46Z INFO  coverm] CoverM version 0.3.2
[2020-02-07T01:35:46Z INFO  coverm] Using min-covered-fraction 0%
[2020-02-07T01:35:46Z INFO  coverm::contig] In sample 'BE_BS_R1-BE_BS_R1.reduced', found 365 reads mapped out of 819 total (44.57%)
[2020-02-07T01:35:46Z INFO  coverm::contig] In sample 'BE_BS_R1-BE_RX_R1.reduced', found 480 reads mapped out of 748 total (64.17%)
[2020-02-07T01:35:46Z INFO  coverm::contig] In sample 'BE_BS_R1-BM_ER_R7.reduced', found 41 reads mapped out of 219 total (18.72%)
[2020-02-07T01:35:46Z INFO  coverm::contig] In sample 'BE_BS_R1-BM_RX_R7.reduced', found 130 reads mapped out of 543 total (23.94%)
contigName	contigLen	totalAvgDepth	BE_BS_R1-BE_BS_R1.reduced.bam	BE_BS_R1-BE_BS_R1.reduced.bam-var	BE_BS_R1-BE_RX_R1.reduced.bam	BE_BS_R1-BE_RX_R1.reduced.bam-var	BE_BS_R1-BM_ER_R7.reduced.bam	BE_BS_R1-BM_ER_R7.reduced.bam-var	BE_BS_R1-BM_RX_R7.reduced.bam	BE_BS_R1-BM_RX_R7.reduced.bam-var
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /rustc/5e1a799842ba6ed4a57e91f7ab9435947482f7d8/src/libcore/macros/mod.rs:15:40
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Error message: thread 'main' panicked at 'Unable to parse version for dashing (error 1)', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/bird_tool_utils-0.2.0/src/external_command_checker.rs:98:14

Hi,

I installed coverM through the bioconda and I am getting the following message:

coverm genome -c LIBARIES -d Bins/ -x fasta --dereplicate --dereplication-output-cluster-definition Dereplicated_representatives_tab --bam-file-cache-directory tmp/ --discard-unmapped -t 20

[2020-12-03T10:49:16Z INFO  coverm] CoverM version 0.5.0
[2020-12-03T10:49:16Z INFO  coverm] Using min-covered-fraction 10%
[2020-12-03T10:49:16Z INFO  bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941
[2020-12-03T10:49:16Z INFO  bird_tool_utils::external_command_checker] Found samtools version 1.9
[2020-12-03T10:49:16Z INFO  coverm] Found 119 genomes specified before dereplication
[2020-12-03T10:49:16Z INFO  bird_tool_utils::external_command_checker] Found fastANI version 1.32
[2020-12-03T10:49:16Z WARN  galah::cluster_argument_parsing] Since CheckM input is missing, genomes are not being ordered by quality. Instead the order of their input is being used
thread 'main' panicked at 'Unable to parse version for dashing (error 1)', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/bird_tool_utils-0.2.0/src/external_command_checker.rs:98:14
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1076
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1537
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:217
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:526
  11: rust_begin_unwind
             at src/libstd/panicking.rs:437
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::option::expect_failed
             at src/libcore/option.rs:1261
  14: bird_tool_utils::external_command_checker::default_version_check
  15: galah::cluster_argument_parsing::generate_galah_clusterer
  16: coverm::main
  17: std::rt::lang_start::{{closure}}
  18: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  19: std::panicking::try::do_call
             at src/libstd/panicking.rs:348
  20: std::panicking::try
             at src/libstd/panicking.rs:325
  21: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  22: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  23: main
  24: __libc_start_main
  25: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The problem does not appear when I do not perform dereplication. Could you help me solve this issue?

Request: Foreign function interface

Thanks for your work on this tool. I'm sure it will see a lot of usage in future pipelines and/or tools.

However, as far as I can see, the interface and documentation exist only for the command-line interface. There is no foreign function interface. This means third-party packages cannot use CoverM internally, but have to separately install CoverM and call it from the shell.

It would be much easier to use CoverM if some high-level functions were exposed and documented. Even if the edges are rough.

Does CoverM only map one read from paired-read files?

Hi,

Thanks for a great, easy-to-use and efficient tool. I ran CoverM like below, on a set of assemblies:

coverm contig -1 {input.r1} -2 {input.r2} --reference {input.fa} --output-file {output} -t {threads}

When I look in the log file, I find the following:

Sun Jan 31 22:15:02 CET 2021
[2021-01-31T21:15:02Z INFO  coverm] CoverM version 0.6.0
[2021-01-31T21:15:02Z INFO  coverm] Writing output to file: /mnt/md1200/epfl_sber/massimo/EUCI_MG/selection_inference/sbusi/clusters/coverage/ERR476713_covstats.txt
[2021-01-31T21:15:02Z INFO  coverm] Using min-covered-fraction 0%
[2021-01-31T21:15:02Z INFO  bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941
[2021-01-31T21:15:02Z INFO  bird_tool_utils::external_command_checker] Found samtools version 1.11
[2021-01-31T21:15:02Z INFO  coverm] Not pre-generating minimap2 index
[2021-01-31T21:56:24Z INFO  coverm::contig] In sample 'ERR476713_prokka.fna/ERR476713_pass_1.fastq.gz', found 50124350 reads mapped out of 66933174 total (74.89%)
Sun Jan 31 22:56:29 CET 2021

Two questions regarding the last line:

  1. Does CoverM only map one of the reads from the pair in this scenario?
  2. Am I better off using the --coupled option? Sorry, I wasn't clear as to when the coupled option should be used.

Thank you,
Susheel

Recruitment plot for CoverM

Dear Ben,
I am wondering whether we can also add a recruitment plot option to the coverm genome module. It is always important to check the coverage of each contig and the distribution of read mappings. See an example here (produced by an in-house Python and R script based on BAM files for each MAG extracted from a competitive read mapping, enveomics package): jianshu93/Competitive_mapping#1
Binning can be a mess in a lot of cases, and maybe we can offer support for checking whether binning looks good by adding a recruitment plot option. Are you also going to write a paper for CoverM? I see a lot of potential in it for checking and calculating coverage information for MAGs. I do not see very good plotting libraries for Rust, but perhaps you have seen some.
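The per-base depth underlying such a recruitment plot is cheap to compute from alignment intervals; a toy sketch with made-up data (not CoverM's or enveomics' implementation):

```python
def per_base_depth(contig_len, alignments):
    """Depth at each position of a contig, given (start, end) half-open
    alignment intervals, computed with a difference array."""
    diff = [0] * (contig_len + 1)
    for start, end in alignments:
        diff[start] += 1  # coverage rises where a read begins
        diff[end] -= 1    # and falls where it ends
    depth, running = [], 0
    for d in diff[:contig_len]:
        running += d
        depth.append(running)
    return depth

# Two overlapping toy reads on a 10 bp contig:
print(per_base_depth(10, [(0, 5), (3, 8)]))  # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
```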

What do you think?

Thanks,

Jianshu

Mapping errors on a cluster

Hi,
I am running coverm v0.6.1, samtools/1.11 and minimap2/2.17 with 16 CPUs and 96 GB RAM, and receive the error below with shotgun WGS data.

[2021-03-16T05:18:25Z INFO coverm] CoverM version 0.6.1
[2021-03-16T05:18:25Z INFO coverm] Using min-covered-fraction 10%
[2021-03-16T05:18:25Z INFO bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941
[2021-03-16T05:18:25Z INFO bird_tool_utils::external_command_checker] Found samtools version 1.11
[2021-03-16T05:18:25Z INFO coverm] Profiling 1316 genomes
[2021-03-16T05:18:25Z INFO coverm] Generating concatenated reference FASTA file of 1316 genomes ..
[2021-03-16T05:18:33Z INFO coverm] Not pre-generating minimap2 index
[2021-03-16T05:18:33Z ERROR coverm::bam_generator] Failed to correctly find or parse BAM file at "/tmp/pbs.464383.pbstraining/coverm_fifo.6HZvphWYmiRQ/foo.pipe": unable to open SAM/BAM/CRAM file at /tmp/pbs.464383.pbstraining/coverm_fifo.6HZvphWYmiRQ/foo.pipe
[2021-03-16T05:19:47Z ERROR coverm::bam_generator] Error when running mapping process. Exitstatus was ExitStatus(ExitStatus(256)). Command run was: ["bash -c "set -e -o pipefail; minimap2 --split-prefix /tmp/pbs.464383.pbstraining/.tmpUWtTHg -a -x sr -t 16 '/tmp/pbs.464383.pbstraining/.tmp90WfNW' '/project/SIH/Fastq_cleaned_target/C10_M2_0B_R1.clean.target.fastq.gz' '/project/SIH/Fastq_cleaned_target/C10_M2_0B_R2.clean.target.fastq.gz' 2>/tmp/pbs.464383.pbstraining/.tmpkT3bA2 | remove_minimap2_duplicated_headers 2>/tmp/pbs.464383.pbstraining/.tmppj36A1| samtools sort -T '/tmp/pbs.464383.pbstraining/coverm_fifo.6HZvphWYmiRQ/coverm-make-samtools-sortMWaOYX' -l0 -@ 15 2>/tmp/pbs.464383.pbstraining/.tmpvoDUpc > "/tmp/pbs.464383.pbstraining/coverm_fifo.6HZvphWYmiRQ/foo.pipe"""]
[2021-03-16T05:19:47Z ERROR coverm::bam_generator] The STDERR for the MINIMAP2_SR part was: [M::mm_idx_gen::67.038*1.69] collected minimizers
[M::mm_idx_gen::71.500*2.47] sorted minimizers

[2021-03-16T05:19:47Z ERROR coverm::bam_generator] The STDERR for the samtools sort part was: samtools sort: failed to read header from "-"

[2021-03-16T05:19:47Z ERROR coverm::bam_generator] The STDERR for the remove_minimap2_duplicated_headers part was: bash: remove_minimap2_duplicated_headers: command not found

[2021-03-16T05:19:47Z ERROR coverm::bam_generator] Cannot continue since mapping failed.
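The decisive message is the last STDERR block: the helper script remove_minimap2_duplicated_headers (shipped alongside CoverM) is not on PATH on the compute node, so the pipe to samtools sort receives nothing. A preflight check along these lines (tool names taken from the log above) would surface this before mapping starts:

```python
import shutil

def missing_tools(tools):
    """Return the external tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

required = ["minimap2", "samtools", "remove_minimap2_duplicated_headers"]
for tool in missing_tools(required):
    print(f"WARNING: '{tool}' not found on PATH - is the coverm environment active on this node?")
```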

Warn user if it inputs an unsorted BAM file

CoverM gives wrong results if the BAM files are not sorted and it doesn't warn the user that the input should be sorted.

jgi_summarize_bam_contig_depths raises the following error message if the input is not sorted:

ERROR: the bam file 'test_sequences.bam' is not sorted!
Please execute 'samtools sort' on unsorted input bam files and try again!
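Per the SAM specification, the sort order is declared in the SO: field of the @HD header line, so a cheap check is possible before computing coverage. A minimal sketch operating on the text header (as printed by samtools view -H), not CoverM's actual code:

```python
def is_coordinate_sorted(header_text):
    """Return True only if the SAM header's @HD line declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            return "SO:coordinate" in line.split("\t")
    return False  # no @HD line: sort order unknown, so warn the user

print(is_coordinate_sorted("@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:c1\tLN:3152"))  # True
print(is_coordinate_sorted("@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:c1\tLN:3152"))    # False
```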

CoverM and jgi_summarize_bam_contig_depths give different outputs

Hi Ben

I just replaced jgi_summarize_bam_contig_depths with CoverM in my binning pipeline (which uses both MetaBat2 and MaxBin2) and I've noticed that my final bins were different from the ones I got before.

I prepared reduced BAMs, with just three contigs, and measured coverage with both jgi_summarize_bam_contig_depths and CoverM:

contigName method contigLen totalAvgDepth BE_BS_R1-BE_BS_R1.reduced.bam BE_BS_R1-BE_BS_R1.reduced.bam-var BE_BS_R1-BE_RX_R1.reduced.bam BE_BS_R1-BE_RX_R1.reduced.bam-var BE_BS_R1-BM_ER_R7.reduced.bam BE_BS_R1-BM_ER_R7.reduced.bam-var BE_BS_R1-BM_RX_R7.reduced.bam BE_BS_R1-BM_RX_R7.reduced.bam-var
BE_BS_R1_k147_2106787 CoverM 3152 5.5702863 5.578614 5.697416 14.260826 23.711023 0.92271817 2.084662 1.5189873 2.341692
BE_BS_R1_k147_2106787 jgi_summarize_bam_contig_depths 3152 22.2811 5.57861 5.69755 14.2608 23.7111 0.922718 2.08466 1.51899 2.34171
BE_BS_R1_k147_3889603 CoverM 2203 3.069654 10.337555 50.398186 0.71310276 1.1237903 0.49245006 0.889441 0.73550904 0.8008681
BE_BS_R1_k147_3889603 jgi_summarize_bam_contig_depths 2203 12.2786 10.3376 50.3978 0.713103 1.1238 0.49245 0.889457 0.735509 0.800879
BE_BS_R1_k147_2120400 CoverM 3251 4.63673 4.990003 5.4105444 8.594324 14.234088 0.65011287 1.901733 4.31248 36.406517
BE_BS_R1_k147_2120400 jgi_summarize_bam_contig_depths 3251 18.5469 4.99 5.41058 8.59432 14.2341 0.650113 1.90183 4.31248 36.405

reduced_bam.zip

As you can see, there are two differences:

  1. The totalAvgDepth differs a lot between the two tools. CoverM got the values right, and I don't know how jgi_summarize_bam_contig_depths computed its results.
  2. CoverM outputs more decimal places than jgi_summarize_bam_contig_depths. I think that's why the binning performed with CoverM gave me different results.

I'm wondering if CoverM should use the same number of decimal places when using -m metabat output, just to keep consistency with jgi_summarize_bam_contig_depths.
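Two quick arithmetic checks against the table above (toy verification, not either tool's code): the jgi totalAvgDepth values happen to equal CoverM's multiplied by the number of BAM files (4 here), suggesting a sum rather than a mean, and the per-sample columns agree once CoverM's values are reformatted to the 6 significant digits jgi appears to print:

```python
# 1. jgi's "totalAvgDepth" looks like a sum over the 4 samples, not a mean:
for coverm_avg, jgi_total in [(5.5702863, 22.2811), (3.069654, 12.2786), (4.63673, 18.5469)]:
    assert f"{coverm_avg * 4:.6g}" == str(jgi_total)

# 2. Per-sample depths match at 6 significant digits (jgi's apparent precision):
for coverm_value, jgi_printed in [(5.578614, "5.57861"), (0.92271817, "0.922718"), (10.337555, "10.3376")]:
    assert f"{coverm_value:.6g}" == jgi_printed

print("table values consistent at 6 significant digits")
```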

Thank you for the great tool!

ERROR coverm::bam_generator

Hi,

I am currently using coverm and I have the following error. I am using a conda environment.
Could you please help me to solve this problem?

Thanks a lot!

coverm genome --threads 24 \
    --genome-fasta-directory dereplicated_genomes \
    --genome-fasta-extension fa \
    --coupled ${SampleName}_1.fastq.gz ${SampleName}_2.fastq.gz \
    abundances-${SampleName}.txt

[2020-09-10T08:54:37Z INFO coverm] Using min-covered-fraction 10%
[2020-09-10T08:54:37Z INFO coverm::external_command_checker] Found minimap2 version 2.17-r941
[2020-09-10T08:54:37Z INFO coverm::external_command_checker] Found samtools version 1.9
[2020-09-10T08:54:37Z INFO coverm] Generating concatenated reference FASTA file of 1296 genomes ..
[2020-09-10T08:54:47Z INFO coverm] Not pre-generating minimap2 index
[2020-09-10T09:04:53Z ERROR coverm::bam_generator] Failed to correctly find or parse BAM file at "/tmp/coverm_fifo.rgGGpR3gjykp/foo.pipe": unable to open SAM/BAM/CRAM file at /tmp/coverm_fifo.rgGGpR3gjykp/foo.pipe
[2020-09-10T09:04:53Z ERROR coverm::bam_generator] Error when running mapping process: ["bash -c "set -e -o pipefail; minimap2 --split-prefix /tmp/.tmpnhiasQ -a -x sr -t 24 '/tmp/.tmpzO3nxq' 'ERR3357550_1.fastq.gz' 'ERR3357550_2.fastq.gz' 2>/tmp/.tmpnEy3yu | remove_minimap2_duplicated_headers| samtools sort -T '/tmp/coverm_fifo.rgGGpR3gjykp/coverm-make-samtools-sortcHted5' -l0 -@ 23 2>/tmp/.tmpofcBac > "/tmp/coverm_fifo.rgGGpR3gjykp/foo.pipe"""]
[2020-09-10T09:04:53Z ERROR coverm::bam_generator] The STDERR for the MINIMAP2_SR part was: [M::mm_idx_gen::82.366*1.79] collected minimizers
[M::mm_idx_gen::89.285*3.09] sorted minimizers
[M::main::89.361*3.09] loaded/built the index for 281394 target sequence(s)
[M::mm_mapopt_update::89.361*3.09] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 281394
[M::mm_idx_stat::101.601*2.83] distinct minimizers: 498419246 (91.88% are singletons); average occurrences: 1.123; average spacing: 6.010
[M::worker_pipeline::107.834*3.36] mapped 505514 sequences
[M::worker_pipeline::109.677*3.59] mapped 505606 sequences
[M::worker_pipeline::111.570*3.83] mapped 505222 sequences
[M::worker_pipeline::113.712*4.17] mapped 504902 sequences
[M::worker_pipeline::115.087*4.32] mapped 505222 sequences
[M::worker_pipeline::116.950*4.59] mapped 505260 sequences
[M::worker_pipeline::118.572*4.76] mapped 505188 sequences
[M::worker_pipeline::120.220*4.96] mapped 505270 sequences
[M::worker_pipeline::121.925*5.18] mapped 505106 sequences
[M::worker_pipeline::123.434*5.32] mapped 505280 sequences
[M::worker_pipeline::125.082*5.48] mapped 505712 sequences
[M::worker_pipeline::127.186*5.72] mapped 505718 sequences
[M::worker_pipeline::128.916*5.90] mapped 505724 sequences
[M::worker_pipeline::130.532*6.07] mapped 505458 sequences
[M::worker_pipeline::132.568*6.29] mapped 506024 sequences
[M::worker_pipeline::134.254*6.46] mapped 505836 sequences
[M::worker_pipeline::135.796*6.61] mapped 506122 sequences
[M::worker_pipeline::137.256*6.70] mapped 505978 sequences
[M::worker_pipeline::138.843*6.85] mapped 506408 sequences
[M::worker_pipeline::140.377*6.95] mapped 506108 sequences
[M::worker_pipeline::142.162*7.11] mapped 507162 sequences
[M::worker_pipeline::144.127*7.28] mapped 506566 sequences
[M::worker_pipeline::145.615*7.37] mapped 506030 sequences
[M::worker_pipeline::147.643*7.54] mapped 506836 sequences
[M::worker_pipeline::149.444*7.68] mapped 506404 sequences
[M::worker_pipeline::151.177*7.83] mapped 506370 sequences
[M::worker_pipeline::152.508*7.93] mapped 507040 sequences
[M::worker_pipeline::153.892*8.03] mapped 507056 sequences
[M::worker_pipeline::155.366*8.10] mapped 507014 sequences
[M::worker_pipeline::157.321*8.25] mapped 506124 sequences
[M::worker_pipeline::159.178*8.39] mapped 505852 sequences
[M::worker_pipeline::161.052*8.54] mapped 505382 sequences
[M::worker_pipeline::162.381*8.62] mapped 505752 sequences
[M::worker_pipeline::163.996*8.72] mapped 505552 sequences
[M::worker_pipeline::165.873*8.86] mapped 505936 sequences
[M::worker_pipeline::167.137*8.94] mapped 505660 sequences
[M::worker_pipeline::168.783*9.05] mapped 505660 sequences
[M::worker_pipeline::170.088*9.12] mapped 505862 sequences
[M::worker_pipeline::171.830*9.23] mapped 505972 sequences
[M::worker_pipeline::173.415*9.32] mapped 505870 sequences
[M::worker_pipeline::175.267*9.44] mapped 506196 sequences
[M::worker_pipeline::176.666*9.51] mapped 505790 sequences
[M::worker_pipeline::178.487*9.60] mapped 506158 sequences
[M::worker_pipeline::180.210*9.68] mapped 506236 sequences
[M::worker_pipeline::181.995*9.79] mapped 506478 sequences
[M::worker_pipeline::183.550*9.86] mapped 506164 sequences
[M::worker_pipeline::185.111*9.91] mapped 506524 sequences
[M::worker_pipeline::187.197*10.02] mapped 506418 sequences
[M::worker_pipeline::189.316*10.13] mapped 507062 sequences
[M::worker_pipeline::192.166*10.21] mapped 506520 sequences
[M::worker_pipeline::197.958*10.42] mapped 508062 sequences
[M::worker_pipeline::200.087*10.47] mapped 506748 sequences
[M::worker_pipeline::202.457*10.57] mapped 506850 sequences
[M::worker_pipeline::204.451*10.63] mapped 506448 sequences
[M::worker_pipeline::206.684*10.72] mapped 507308 sequences
[M::worker_pipeline::208.960*10.79] mapped 506604 sequences
[M::worker_pipeline::211.129*10.87] mapped 506274 sequences
[M::worker_pipeline::213.588*10.97] mapped 505386 sequences
[M::worker_pipeline::216.396*11.07] mapped 504562 sequences
[M::worker_pipeline::218.540*11.12] mapped 505062 sequences
[M::worker_pipeline::222.208*11.15] mapped 505106 sequences
[M::worker_pipeline::224.930*11.21] mapped 505056 sequen
[2020-09-10T09:04:53Z ERROR coverm::bam_generator] The STDERR for the samtools sort part was:
[2020-09-10T09:04:53Z ERROR coverm::bam_generator] Cannot continue since mapping failed.

Failed to install CoverM

Hello!

I haven't been successful at building or installing CoverM. I have tried using Rust on Windows targeting the GNU ABI, and on an Ubuntu installation in the Windows Subsystem for Linux. Both times, when building with 'cargo run', I had issues compiling rust-htslib, with the error:

error: failed to run custom build command for `rust-htslib v0.19.1`
process didn't exit successfully: `/home/timothy/git/CoverM/target/debug/build/rust-htslib-fa9ca6cca9341e9a/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace. 

(The backtrace is not helpful.) I thought it might be due to htslib not supporting Windows very well, but when the Linux subsystem also didn't work, I thought I'd let you know in case something else was going on.

I also tried installing with cargo install coverm to see if that would work, but it doesn't seem to be available. I get this error:

$ cargo install coverm
    Updating registry `https://github.com/rust-lang/crates.io-index`
error: could not find `coverm` in registry `https://github.com/rust-lang/crates.io-index`

Cheers,
Tim
