shenwei356 / kmcp Goto Github PK

View Code? Open in Web Editor NEW

171.0 6.0 13.0 84.03 MB

Accurate metagenomic profiling && Fast large-scale sequence/genome searching

Home Page: https://bioinf.shenwei.me/kmcp

License: MIT License

Go 96.08% Shell 0.07% R 1.83% Roff 2.03%

kmer sketch searching metagenomics taxonomic-classification taxonomic-profiling scaled-minhash syncmers cobs sketching

kmcp's Introduction

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping

Citation

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping.
Wei Shen, Hongyan Xiang, Tianquan Huang, Hui Tang, Mingli Peng, Dachuan Cai, Peng Hu, Hong Ren.
Bioinformatics, btac845, https://doi.org/10.1093/bioinformatics/btac845

Documents
What can we do?
Features
Installation
Commands
Quickstart
KMCP vs COBS
Support
License
Acknowledgments

Documents

https://bioinf.shenwei.me/kmcp

What can we do?

1. Accurate metagenomic profiling

KMCP utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized COBS index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods.

The read mapping process in KMCP is referred to as pseudo-mapping, which is similar to but different from the lightweight algorithm in Sailfish (Patro et al., 2014), pseudoalignment in Kallisto (Bray et al., 2016), quasi-mapping in RapMap (Srivastava et al., 2016), and lightweight mapping in Salmon (Patro et al., 2017). All of these methods seek to elide the computation of base-to-base alignment using distinct strategies (Srivastava et al., 2016). In KMCP, each reference genome is pre-split into chunks of equal size, and the k-mers of a query, as a whole, are compared to each genome chunk to find all possible ones sharing a predefined proportion of k-mers with the query. Like quasi-mapping in RapMap, KMCP tracks the target and position for each query. However, the read position in KMCP is approximate and in a predefined resolution (the number of genome chunks).

Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than some other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth.

Genome collections with custom taxonomy, e.g., GTDB uses its own taxonomy and MGV uses ICTV taxonomy, are also supported by generating NCBI-style taxdump files with taxonkit create-taxdump. You can even merge the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).

2. Fast sequence search against large scales of genomic datasets

KMCP can be used for fast sequence search against large scales of genomic datasets as BIGSI and COBS do. We reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm, bringing a smaller index size and much faster searching speed (2x for genome search and 10x for short reads) faster than COBS (check the tutorial and benchmark). Also check the algorithm and data structure differences between KMCP and COBS.

3. Fast genome similarity estimation

KMCP can also be used for fast similarity estimation of assemblies/genomes against known reference genomes.

Genome sketching is a method of utilizing small and approximate summaries of genomic data for fast searching and comparison. Mash and Sourmash provide fast genome distance estimation using MinHash (Mash) or FracMinHash (Sourmash). KMCP supports multiple k-mer sketches (Minimizer, FracMinHash (previously named Scaled MinHash), and Closed Syncmers) for genome similarity estimation. And KMCP is 5x-7x faster than Mash/Sourmash (check the tutorial and benchmark).

Features

Easy to install
- Statically linked executable binaries for multiple platforms (Linux/Windows/macOS, AMD64/ARM64).
- No dependencies, no configurations.
- conda install -c bioconda kmcp
Easy to use
- Supporting shell autocompletion.
- Detailed usage, database, tutorials, and FAQs.
Building database is easy and fast
- ~25 min for 47894 genomes from GTDB-r202 on a sever with 40 CPU threads and solid disk drive.
Fast searching speed (for sequence/genome search)
- The index structure is modified from COBS, while KMCP is 2x-10x faster in sequence searching.
- Automatically scales to exploit all available CPU cores.
- Searching time is linearly related to the number of reference genomes (chunks).
Scalable searching. Searching results against multiple databases can be fast merged. This brings many benefits:
- There's no need to re-built the database with newly added reference genomes.
- The searching step can be parallelized with a computer cluster in which each computation node searches against a small database.
- Computers with limited main memory can also utilize an extensive collection of reference genomes by building and searching against small databases..
Accurate taxonomic profiling
- Some k-mer based taxonomic profilers suffer from high false positive rates, while KMCP adopts multiple strategies to improve specificity and keeps high sensitivity at the same time.
- In addition to archaea and bacteria, KMCP performed well on viruses/phages.
- KMCP also provides confident infectious pathogen detection.
- Preset six modes for multiple scenarios.
- Supports CAMI and MetaPhlAn profiling format.
Flexible support of taxonomy database
- Taxonomy data, in the format of NCBI taxdump files, are only needed in the profiling step. Therefore, it is easy to utilize an updated version of taxonomy data.
- GTDB, ICTV and custom taxonomy database are supported by creating taxdump files with taxonkit create-taxdump.
- You can even merge the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).
- Profiling without taxonomy data is also supported by setting --level strain in kmcp profile.

Installation

Download executable binaries, or install using conda:

conda install -c bioconda kmcp

SIMD extensions including AVX512, AVX2, SSE2 are sequentially detected and used in two packages for better searching performance.

pand, for accelerating searching on databases constructed with multiple hash functions.
pospop, for batch counting matched k-mers in bloom filters.

ARM architecture is supported, but kmcp search would be slower.

Commands

Subcommand	Function
compute	Generate k-mers (sketch) from FASTA/Q sequences
index	Construct a database from k-mer files
search	Search sequences against a database
merge	Merge search results from multiple databases
profile	Generate the taxonomic profile from search results
utils split-genomes	Split genomes into chunks
utils unik-info	Print information of .unik files
utils index-info	Print information of index files
utils index-density	Plot the element density of bloom filters for an index file
utils ref-info	Print information of reference chunks in a database
utils cov2simi	Convert k-mer coverage to sequence similarity
utils query-fpr	Compute the false positive rate of a query
utils filter	Filter search results and find species/assembly-specific queries
utils merge-regions	Merge species/assembly-specific regions

Quickstart

# compute k-mers
kmcp compute -k 21 --split-number 10 --split-overlap 150 \
    --in-dir genomes/ --out-dir genomes-k21-n10

# index k-mers
kmcp index --false-positive-rate 0.1 --num-hash 1 \
    --in-dir genomes-k21-n10/ --out-dir genomes.kmcp

# delete temporary files
# rm -rf genomes-k21-n10/

# search    
kmcp search --db-dir genomes.kmcp/ test.fa.gz --out-file [email protected]

# merge search results against multiple databases
kmcp merge -o search.kmcp.tsv.gz search.kmcp@*.kmcp.tsv.gz

# profile and binning
kmcp profile search.kmcp.tsv.gz \
    --taxid-map        taxid.map \
    --taxdump          taxdump/ \
    --out-file         search.tsv.gz.k.profile \
    --metaphlan-report search.tsv.gz.m.profile \
    --cami-report      search.tsv.gz.c.profile \
    --binning-result   search.tsv.gz.binning.gz

KMCP vs COBS

We reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm, bringing a smaller index size and much faster searching speed (2x for genome search and 10x for short reads) faster than COBS.

The differences between KMCP and COBS

Category	Item	COBS	KMCP	Comment
Algorithm	K-mer hashing	xxhash	ntHash1	xxHash is a general-purpose hashing function while ntHash is a recursive hash function for DNA/RNA
	Bloom filter hashing	xxhash	Using k-mer hash values	Avoid hash computation
	Multiple-hash functions	xxhash with different seeds	Generating multiple values from a single one	Avoid hash computation
	Single-hash function	Same as multiple-hash functions	Separated workflow	Reducing loops
	AND step	Serial bitwise AND	Vectorised bitwise AND	Bitwise AND for >1 hash functions
	PLUS step	Serial bit-unpacking	Vectorised positional popcount with pospop	Counting from bit-packed data
Index structure	Size of blocks	/	Using extra thresholds to split the last block with the most k-mers	Uneven genome size distribution would make bloom filters of the last block extremely huge
	Index files	Concatenated	Independent
	Index loading	mmap, loading complete index into RAM	mmap, loading complete index into RAM, seek	Index loading modes
Input/output	Input files	FASTA/Q, McCortex, text	FASTA/Q
	Output	Target and matched k-mers	Target, matched k-mers, query FPR, etc.

Support

Please open an issue to report bugs, propose new functions, or ask for help.

License

MIT License

Acknowledgments

Zhi-Luo Deng (Helmholtz Centre for Infection Research, Germany) gave a lot of valuable advice on metagenomic profiling and benchmarking.
Robert Clausecker (Zuse Institute Berlin, Germany) wrote the high-performance vectorized positional popcount package (pospop) during my development of KMCP, which greatly accelerated the bit-matrix searching.

kmcp's People

Contributors

Stargazers

Watchers

Forkers

dawnmy alienzj qazwsx1995 jin0008 jsonprogram schaudge vszhang1976 liaochenlanruo anzhi525 tongzhou2017 galsang-git luernerliu

kmcp's Issues

Profiling output table interpretation

Dear Shenwei,

Thank you very much for your very nice tool, we are trying to understand how to interpret the output table in KMCP format.

If the output table contains more than one ref per species based on which parameter should we choose the best hit?
According to your manual the percentage column refers to Relative abundance of the reference however, we are not sure how this value is calculated. Could you give us more details about this metric?

thank you very much,

best,

Loïc

CowTransfer link doesn't work

Hi,

It seems that all CowTransfer links are not working. Links all lead to empty pages.

0.8.3: Fails on i386: 1 << 32 - 1 (untyped int constant 4294967295) overflows int

# github.com/shenwei356/kmcp/kmcp/cmd
kmcp/cmd/compute.go:287:14: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
kmcp/cmd/compute.go:294:19: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
kmcp/cmd/compute.go:300:17: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
*** Error code 2

Should 1<<32-1 be uint64(1)<<32-1?

FreeBSD 13.1

Optimizing KMCP with HumGut

Hello Shenwei,
Thank you for making this new metagenomic tool! I'm interested in benchmarking its performance, and for that I want to perform classification on a large (10,000+) number of samples. There are a number of minor things I've come across but I have an end to end sample running now. My main objective for opening this issue is to discuss potential optimizations to what I'm doing to reduce the total time for running this.

Currently, end to end this single sample took 9 hours (08:58) with 32 cores and 23 GB of RAM (100GB was max)

Database used was the HumGut database, split over 10 indices
Working directory was in RAM (/dev/shm), started by copying the samples to that (done to avoid being impacted by NAS speeds)
Times
- Step 1: kmcp search, through parallel with 12 threads per job, took 2 hours per search
- Step 2: kmcp merge, took 15 mins with 64 threads
- Step 3: kmcp profile, took 4h52 mins with 64 threads
  - I could add the --no-amb-corr flag to reduce by 1h53 mins

From this, I could potentially reduce the time to ~6 hours by allocating more cores and increasing the number of searches running in parallel and adding the --no-amb-corr flag.
Are there more ways in which I am missing optimizations?

I've created a gist with the actual SLURM submission, can also upload the logs of the jobs if that's helpful.

QUESTION: Taxdump

Dear Dr. Shen,
I have a question about taxdump in KMCP documentation.
The documentation mentioned:

# taxid mapping files, multiple files supported.
taxid_map=gtdb.kmcp/taxid.map,refseq-viral.kmcp/taxid.map,refseq-fungi.kmcp/taxid.map

# or concatenate them into a big taxid.map
#    cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map
# taxid_map=taxid.map

# taxdump directory
taxdump=taxdump

sfile=$file.kmcp.tsv.gz

kmcp profile \
    --taxid-map      $taxid_map \
    --taxdump         $taxdump/ \
    --level             species \
    --min-query-cov        0.55 \
    --min-chunks-reads       50 \
    --min-chunks-fraction   0.8 \
    --max-chunks-depth-stdev  2 \
    --min-uniq-reads         20 \
    --min-hic-ureads          5 \
    --min-hic-ureads-qcov  0.75 \
    --min-hic-ureads-prop   0.1 \
    $sfile                      \
    --out-prefix       $sfile.kmcp.profile \
    --metaphlan-report $sfile.metaphlan.profile \
    --cami-report      $sfile.cami.profile \
    --sample-id        "0" \
    --binning-result   $sfile.binning.gz

Is the taxdump here downloaded from the NCBI taxonomy database? I don't think so, but each database is a separate directory when using taxonkit to generate a taxdump, so does the taxdump here also need to be specified like taxid map ?

Thanks!

Merge error (number of fields < query index field)

When merging the results of a number of kmcp search runs, I get the following output:

> kmcp merge SRR6468718.kmcp_humgut@*.tsv.gz --out-file SRR6468718.kmcp_humgut.tsv.gz
11:28:13.212 [INFO] checking input files ...
11:28:13.213 [INFO]   9 input files given
11:28:13.213 [INFO] merging ...
11:44:06.311 [ERRO] number of fields (13) < query index field (15)

It does produce the out-file, attached the header:

❯ zcat SRR6468718.kmcp_humgut.tsv.gz| head
#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
SRR6468718.1	101	81	4.5272e-15	1	GCF_000228045.1_SMUT1-NEX_25-98_genomic	9	10	2182376	21	81	1.0000	0.0004	0.0004	0
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME154070	5	10	2006006	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME053624	2	10	1545775	21	81	1.0000	0.0005	0.0005	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME124299	2	10	1797471	21	81	1.0000	0.0005	0.0005	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME216177	0	10	1912264	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME053394	2	10	1925801	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME124618	3	10	1946290	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME273974	0	10	1955198	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME127971	7	10	1964574	21	81	1.0000	0.0004	0.0004	1

TODO: support viral genomes collections, e.g, MGV, GPD

Genomes in MGV and GPD were assembled from shotgun metagenomic data (MAG). Though they are clustered into species, there are no official TaxIds available to show their relationship.

A new command similar to gtdb_to_taxdump is needed. Let TaxonKit do it!

KMCP's MetaPhlAn output doesn't follow the MetaPhlAn file format

The MetaPhlAn output generated by KMCP is not the same as the one generated by MetaPhlAn. In the KMCP output, the taxid column only contains the taxid of the lowest taxonomic rank (e.g. 1224), while the one generated by MetaPhlAn contains the full lineage, separated by | (e.g. 2|1224).

~~This makes the KMCP output incompatible with TAXPASTA.~~ Actually, the TAXPASTA error is due to rank not summing up to 100% (due to lineages genomes skipping some ranks).

Report statistics of matched, unmatched reads

This was previously proposed: #20 .

After performing some data analysis where novel viruses/bacteria exist in some samples, I decided to implement this. It would be helpful to hint at whether the reference genomes cover all microorganisms in the sample.

Here's the plan:

kmcp search: appending the numbers to search results, starting with #.
- input reads
- matched reads
kmcp merge: recount (can not directly plus) the numbers and outputting
kmcp profile: reading the numbers and output in log

Compatibility:

Counts in search results of older versions will be empty.

Create an index without create the database

Hello Dr Shen,

Can I create an index without create the database ? Because the folder takes memory.

If my interest is to indexing sequences and run some query, is it possible to use the tool without the db creation step ?

Best,

ETA missing when building KMCP index

When building the KMCP index for Humgut (by following the instructions here, the ETA is stuck at 0s, even after several blocks have been completed.

kmcp index -j 32 -I humgut-k21-n10 -O humgut.kmcp -n 1 -f 0.3
13:31:35.908 [INFO] kmcp v0.9.3
13:31:35.909 [INFO]   https://github.com/shenwei356/kmcp
13:31:35.909 [INFO]
13:31:35.909 [INFO] loading .unik file infos from file: humgut-k21-n10/_info.txt
13:31:36.518 [INFO]   306910 cached file infos loaded
13:31:36.585 [INFO]
13:31:36.585 [INFO] -------------------- [main parameters] --------------------
13:31:36.585 [INFO]   number of hashes: 1
13:31:36.585 [INFO]   false positive rate: 0.300000
13:31:36.585 [INFO]   k-mer size(s): 21
13:31:36.585 [INFO]   split seqequence size: 0, overlap: 20
13:31:36.585 [INFO]   block-sizeX-kmers-t: 10.00 M
13:31:36.585 [INFO]   block-sizeX        : 256
13:31:36.585 [INFO]   block-size8-kmers-t: 20.00 M
13:31:36.585 [INFO]   block-size1-kmers-t: 200.00 M
13:31:36.585 [INFO] -------------------- [main parameters] --------------------
13:31:36.585 [INFO]
13:31:36.586 [INFO] building index ...
13:31:36.753 [INFO]
13:31:36.753 [INFO]   block size: 9592
13:31:36.753 [INFO]   number of index files: 32 (may be more)
13:31:36.753 [INFO]
13:31:36.753 [block #001] 1199 / 1199  100 %
13:31:36.753 [block #002] 1199 / 1199  100 %
13:31:36.754 [block #003] 1199 / 1199  100 %
13:32:30.922 [block #004] 1199 / 1199  100 %
13:32:34.941 [block #005] 1199 / 1199  100 %
13:33:33.902 [block #006] 1199 / 1199  100 %
13:33:40.757 [block #007] 1199 / 1199  100 %
13:34:45.743 [block #008] 1199 / 1199  100 %
13:34:54.006 [block #009] 1199 / 1199  100 %
13:35:59.125 [block #010] 1199 / 1199  100 %
13:36:08.695 [block #011] 1060 / 1199 [==========================>---]  88 %
13:37:15.240 [block #012]  847 / 1199 [====================>---------]  71 %
[saved index files]     10 / 32 [==========>-----------------------] ETA: 0s

compiling/installing from source

Hello Wei,

Any guidance to install/compiling from source? I have an aarch64 (or ARM) CPU cluster and need to compile from source. None of the binaries works.

Thanks,

Jianshu

coverage is greater than 100

Test kmcp with mock community datasets. The result is like this.

A minor issue on console messages

Thanks for developing this tool.

When I inputted an empty file, the console outputs:
...
11:27:03.868 [WARN] no invalid sequences in file: input.fq.gz
...

The text should be "no valid sequences" instead of "no invalid sequences"?

Masking prophages in bacterial genomes before building database as Phanta does

TODO: save the search result into a serializing binary file for fast downstream parsing

The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.

kmcp search: a flag -b/--binary-outpu would be added to choose the output format optionally.
A new command kmcp view should be added to convert the binary to plain text format.
kmcp merge needs to be compatible with both plain and binary formats.
kmcp profile needs to be compatible with both plain and binary formats.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
read_1	150	130	7.4626e-15	1	GCF_000007805.1	2	10	6397126	21	130	1.0000	0.0002	0.0002	0
read_2	150	130	7.4626e-15	1	GCF_000007805.1	8	10	6397126	21	130	1.0000	0.0002	0.0002	1
read_3	150	130	7.4626e-15	1	GCF_000003835.1	8	10	12115052	21	130	1.0000	0.0001	0.0001	2
read_4	150	130	7.4626e-15	1	GCF_000003835.1	3	10	12115052	21	130	1.0000	0.0001	0.0001	3

Metaphlan format profile report file seems to lack NCBI taxonomy IDs.

Thank you for a great tool!

I'd like to use Metaphlan format profile report file obtained from "kmcp profile" as an input of Pavian, which draws Sankey plots.
( https://github.com/fbreitwieser/pavian )
( https://fbreitwieser.shinyapps.io/pavian/ )

But, when I tried, the Metaphlan format output file of "kmcp profile" is not recognized by Pavian shiny app, which claims that it can recognize Kraken, Centrifuge and MetaPhlAn report files.

When I compared my output file to the original MetaPhlAn3 output format (https://github.com/biobakery/biobakery/wiki/metaphlan3#output-files), it seems that kmcp output file does not have "NCBI_tax_id" column, which seems to be required by Pavian.

Are there any ways to solve this problem?

Thanks in advance.

long read metagenomic profiling

Dear Wei Shen,
I really like your tool and your tutorials. I just have a question regarding long read metagenomic profiling. Is there a specific parameter combination you would recommend to use to taxonomic profiling? It seems like I'm missing some organisms from the Zymo Mock Community even when using profiling mode m=0.
Thanks
Jens

Automatically disable ambiguous-matches correction for complex communities with thousands of species

It's optional for now, but it would be better to be switched on automatically to reduce profiling time.

$ kmcp profile  -h
Generate taxonomic profile from search results

Methods:
  3. We also use the two-stage taxonomy assignment algorithm in MegaPath
     to reduce the false positive of ambiguous matches.
     You can also disable this step by the flag --no-amb-corr.
     If the stage 1/4 produces thousands of candidates, you can also use
     the flag --no-amb-corr to reduce analysis time, which has very little
     effect on the results.

Flags:
      --no-amb-corr                       ► Do not correct ambiguous reads. Use this flag to reduce
                                          analysis time if the stage 1/4 produces thousands of candidates.

Read-level output format

Hi,

this looks pretty interesting - would it be possible to implement an additional output format
that provides read-level assignments instead of the summary tax. profiles? I.e. maybe something
similar to the default Kraken/Kraken2 output format described at

https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format

, with either taxids (for NCBI) or maybe complete lineages for other taxonomies (GTDB etc)?

Building database with MAGs

Can I expect the tool to work if I use incomplete draft genomes or MAGs as inputs? What if I use collections of CDS sequences rather than assemblies?

QUESTION: KMCP merge

Dear Dr Shen,

Thank you for developing so great tool !
I have a question, could you please help me to figure out it ?
You mentioned in your paper that KMCP can automatically merge profiles generated by searching reads on multi small database, does it work at different types of databases ? For example, can it merge bacteria, archaea, fungi, virus and plasmid profile ?

Thanks!

Error when installing KMCP

Hi Dr Chen,
You helped me and advised me to try KMCP (biostars forum).
Sorry to disturb you again
When I try to install it via conda, it throw an error:

What happened?
The initial connection between Cloudflare's network and the origin web server timed out. As a result, the web page can not be displayed.
If you're the owner of this website:
Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources.
https://support.cloudflare.com/hc/en-us/articles/200171906-Error-522

Do you think we could fix it?

Thank you Dr

kmcp search crashed

I used the latest kmcp (downloaded from GitHub) and GTDB database (downloaded from WeTransfer) to align FASTQ files.
The command line was kmcp search --load-whole-db --threads 32 --db-dir /home2/xzhan9/data/reference/kmcp/gtdb.kmcp -1 data/SRR12397805_1.fastq.gz -2 data/SRR12397805_2.fastq.gz --out-file kmcp/SRR12397805.out --log kmcp/SRR12397805.log
kmcp crashed on two machines (both have >128G memory).

The input files, SRR12397805_1.fastq.gz and SRR12397805_2.fastq.gz, were downloaded from NCBI SRA.

The error messages were: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7b1256].

This crash bug happened randomly, as sometimes kmcp search can work perfectly fine.

More relevant outputs:

$ seqkit stats data/SRR12397805_1.fastq.gz data/SRR12397805_2.fastq.gz
file                         format  type   num_seqs      sum_len  min_len  avg_len  max_len
data/SRR12397805_1.fastq.gz  FASTQ   DNA   1,676,891  247,187,869       15    147.4      151
data/SRR12397805_2.fastq.gz  FASTQ   DNA   1,676,891  247,022,404       15    147.3      151

$ kmcp version
kmcp v0.8.2

$ kmcp search --load-whole-db --threads 32 --db-dir /home2/xzhan9/data/reference/kmcp/gtdb.kmcp -1 data/SRR12397805_1.fastq.gz -2 data/SRR12397805_2.fastq.gz --out-file kmcp/SRR12397805.out --log kmcp/SRR12397805.log
metaphlan-report kmcp/SRR12397805.out --cami-report kmcp/SRR12397805.cami.out --sample-id SRR12397805 --binning-result kmcp/SRR12397805.bin22:50:44.811 [INFO] kmcp v0.8.2
22:50:44.813 [INFO]   https://github.com/shenwei356/kmcp
22:50:44.813 [INFO]
22:50:44.813 [INFO] checking input files ...
22:50:44.865 [INFO] loading database into main memory ...

22:50:46.105 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block001.uniki
22:50:54.871 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block002.uniki
22:50:56.256 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block004.uniki
22:50:56.544 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block003.uniki
22:50:59.652 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block005.uniki
22:51:00.957 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block027.uniki
22:51:01.091 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block007.uniki
22:51:01.169 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block006.uniki
22:51:01.292 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block008.uniki
22:51:03.181 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block009.uniki
22:51:03.652 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block011.uniki
22:51:04.320 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block010.uniki
22:51:04.769 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block013.uniki
22:51:05.238 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block012.uniki
22:51:06.211 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block014.uniki
22:51:06.398 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block015.uniki
22:51:07.185 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block016.uniki
22:51:07.262 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block018.uniki
22:51:07.546 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block019.uniki
22:51:07.582 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block017.uniki
22:51:08.221 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block020.uniki
22:51:08.590 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block026.uniki
22:51:08.699 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block021.uniki
22:51:09.063 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block030.uniki
22:51:09.755 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block031.uniki
22:51:10.131 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block022.uniki
22:51:12.791 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block024.uniki
22:51:13.106 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block025.uniki
22:51:14.194 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block023.uniki
22:51:15.207 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block028.uniki
22:51:15.384 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block032.uniki
22:51:16.456 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block029.uniki
22:51:16.457 [INFO] database loaded: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp
22:51:16.476 [INFO]
22:51:16.476 [INFO] -------------------- [main parameters] --------------------
22:51:16.476 [INFO]   minimum    query length: 30
22:51:16.476 [INFO]   minimum  matched k-mers: 10
22:51:16.476 [INFO]   minimum  query coverage: 0.550000
22:51:16.476 [INFO]   minimum target coverage: 0.000000
22:51:16.497 [INFO]   minimum target coverage: 0.000000
22:51:16.497 [INFO] -------------------- [main parameters] --------------------
22:51:16.497 [INFO]
22:51:16.497 [INFO] searching ...
22:51:16.513 [INFO] reading from paired-end files: data/SRR12397805_1.fastq.gz, data/SRR12397805_2.fastq.gz
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7b1256]

goroutine 112426 [running]:
github.com/shenwei356/kmcp/kmcp/cmd.NewUnikIndexDB.func3.1(0xce8bd57400)
	/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/kmcp/kmcp/cmd/util-db-search.go:783 +0x276
created by github.com/shenwei356/kmcp/kmcp/cmd.NewUnikIndexDB.func3
	/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/kmcp/kmcp/cmd/util-db-search.go:1020 +0x2d7

Kmcp profile empty

Hello,

Thanks for developing kmcp and for providing detailed documentation. I am running into empty output file after running kmcp profile command.

The commands I am using are:

kmcp compute -k 21 -n 10 -l 150 -O tmp-k21-n10-l150 -I gtdb-genomes

kmcp index -f 0.3 -n 1 -j 32 -I tmp-k21-n10-l150/ -O gtdb.kmcp

kmcp search -d gtdb.kmcp -o [email protected] \
           -1 ERX5474932_ERR5766176_1.fastq.gz -2 ERX5474932_ERR5766176_2.fastq.gz

kmcp profile -X taxdump_custom/ -T seqid2taxid.map -m 3 \
           [email protected] -o ERX5474932_ERR5766176.k.profile

And the output of this command is:

16:01:52.145 [INFO] using a lot of threads does not always accelerate processing, 4-threads is fast enough
16:01:52.145 [INFO] kmcp v0.9.3
16:01:52.145 [INFO]   https://github.com/shenwei356/kmcp
16:01:52.145 [INFO] 
16:01:52.145 [INFO] checking input files ...
16:01:52.145 [INFO]   1 input file(s) given
16:01:52.145 [INFO] loading TaxId mapping file ...
16:01:52.145 [INFO]   2 pairs of TaxId mapping values from 1 file(s) loaded
16:01:52.145 [INFO] loading Taxonomy from: taxdump_custom/
16:01:52.145 [INFO]   1 nodes in 1 ranks loaded
16:01:52.145 [INFO]   0 merged nodes loaded
16:01:52.145 [INFO]   0 deleted nodes loaded
16:01:52.145 [INFO]   1 names loaded
16:01:52.145 [INFO] 
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] match filtration: 
16:01:52.145 [INFO]   maximum false positive rate: 0.010000
16:01:52.145 [INFO]   minimum query coverage: 0.550000
16:01:52.145 [INFO]   keep matches with the top N scores: N=0
16:01:52.145 [INFO]   only keep the full matches: false
16:01:52.145 [INFO]   only keep main matches: false, maximum score gap: 0.400000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] deciding the existence of a reference:
16:01:52.145 [INFO]   preset profiling mode: 3
16:01:52.145 [INFO]   minimum number of reads per reference chunk: 50
16:01:52.145 [INFO]   minimum number of uniquely matched reads: 20
16:01:52.145 [INFO]   minimum proportion of matched reference chunks: 0.800000
16:01:52.145 [INFO]   maximum standard deviation of relative depths of all chunks: 2.000000
16:01:52.145 [INFO] 
16:01:52.145 [INFO]   minimum number of high-confidence uniquely matched reads: 5
16:01:52.145 [INFO]   minimum query coverage of high-confidence uniquely matched reads: 0.750000
16:01:52.145 [INFO]   minimum proportion of high-confidence uniquely matched reads: 0.100000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] taxonomy data:
16:01:52.145 [INFO]   taxdump directory: taxdump_custom/
16:01:52.145 [INFO]   mapping reference IDs to TaxIds: [seqid2taxid.map]
16:01:52.145 [INFO] 
16:01:52.145 [INFO] reporting:
16:01:52.145 [INFO]   default format   : ERX5474932_ERR5766176.k.profile
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] 
16:01:52.145 [INFO] stage 1/4: counting matches and unique matches for filtering out low-confidence references
16:01:52.145 [INFO]   parsing file: [email protected]
16:01:52.147 [INFO]   number of references in search result: 1
16:01:52.147 [INFO]   number of estimated references: 1
16:01:52.147 [INFO]   elapsed time: 2.129055ms
16:01:52.147 [INFO] 
16:01:52.147 [INFO] stage 2/4: counting ambiguous matches for correcting matches
16:01:52.147 [INFO]   parsing file: [email protected]
16:01:52.149 [INFO]   elapsed time: 1.777231ms
16:01:52.149 [INFO] 
16:01:52.149 [INFO] stage 3/4: recounting matches and unique matches
16:01:52.149 [INFO]   parsing file: [email protected]
16:01:52.152 [INFO]   number of estimated references: 0
16:01:52.152 [INFO]   elapsed time: 2.885897ms
16:01:52.152 [INFO] 
16:01:52.152 [INFO] stage 4/4: estimating abundance using EM algorithm
16:01:52.152 [INFO]   initialization step
16:01:52.152 [INFO]     parsing file: [email protected]
16:01:52.157 [INFO]     number of estimated references: 0
16:01:52.157 [INFO]     elapsed time: 4.52852ms
16:01:52.157 [INFO]   number of estimated references: 0
16:01:52.157 [INFO]   elapsed time: 4.575221ms
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/632060) reads matched
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/81) matched reads belong to the 0 references in the profile
16:01:52.158 [INFO] 
16:01:52.158 [INFO] elapsed time: 13.189091ms
16:01:52.158 [INFO]

I am also attaching the input and output files in order to be easier to replicate the issue.

Could you please help me with what I might do wrong here?

ERX5474932_ERR5766176.k.profile.zip
seqid2taxid.map.zip
gtdb-genomes.zip
gtdb.kmcp.zip
[email protected]
taxdump_custom.zip

kmcp search is very slow for metatranscriptome data

The search results showed a huge number of reads from 16 rRNA genes have thousands of matches, writing results slowed down the search. So these reads should be filtered out before the search using tools like https://github.com/hzi-bifo/RiboDetector.

Dealing with novel/non-sequenced species

Hi shenwei356,

Thank you for this very promising tool! Based on your instructions, I was able to build an up-to-date database and have been testing KMCP on some previously characterised metagenomes. I found the results very accurate, even on the level of species! However, there is one aspect of the tool that I am uncertain about. It happens sometimes that a metagenome contains novel species or species that have not been sequenced before and are therefore not in the database.
Is it possible to have KMCP account for that possibility?

For example, if there is a novel Lactobacillus species in a sample metagenome along with a known Lactobacillus species like Lactobacillus amylovorus, is it possible for KMCP to assign a certain percentage to Lactobacillus amylovorus but leave the rest assigned only to the genus level? I guess it wouldn't work with relative abundances as currently calculated, but with read percentages, but I think this possibility would be helpful in screening for potential new species.

Kind regards,
Marko

suitable for CDS and/or contig taxonomic assignment?

Hey there, @shenwei356 :)

Thanks again for not only additional wonderful software, but excellent documentation as usual!

I'm looking for a better way to taxonomically classify predicted coding sequences and contigs from metagenomic assemblies (i currently use CAT with NCBI's nr).

I really want a combination of GTDB for bacteria/archaea, and then also be able to combine euks from NCBI, so your infrastructure enabling that sort of thing is really appealing to me 🙏

I see in issue #27 you note that KMCP is not suitable for long reads. Is your thinking similar for assembled contigs too?

And would you expect to have the same thoughts about taxonomically classifying predicted coding sequences (which might average around 800-1000 bases)?

If only one of those would be possible, I can imagine it might be reasonable to use it to infer the other. E.g., if contigs are do-able, then assigning all CDSs whatever their source contig tax was. And if CDSs are do-able, employing some consensus approach to assign to the contig the tax of its CDSs.

Sorry if i'm missing if you've covered this elsewhere already, and thanks for any of your thoughts!

How to profile results only identify 1 species per reference

Dear Shenwei,

Thank you very much for providing a good tool, I am trying to adjust the output table as desired according to KMCP format.

Currently I use search with all the databases you provide and the output table contains more than one ref per species, so I want the results to only return one ref per species based on the number of reads or based on the score, is that possible?
I want the output table to contain results according to each desired rank, for example, is it possible to output only Genus ranks or only Species ranks?

Thank you so much,

Best,

Phuc

[website] broken links

Hi @shenwei356,
not sure where to report so I opened an issue here: in https://bioinf.shenwei.me/kmcp/tutorial/profiling/ the database links in the main body (like https://bioinf.shenwei.me/database/#prebuilt-databases) are not working, while the left bar works fine :)

TODO: improve search results parsing by using a simplified version of strconv.ParseFloat

kmcp search is stucked

Hi. Thank you for your nice tool.

I am currently searching metagenome against viral genome database that you provide in the link : https://bioinf.shenwei.me/kmcp/database/

I installed this tool using conda.
But I don't know why the search step takes too long.
Host OS is centos 7.10 ( 3.10.0-1160.el7.x86_64 )

Thank you in advance.

reporting proportion of unmatched reads

Many thanks for this open source tool! Vey well documented

When running the profiling: For example:
kmcp profile [email protected] --taxid-map taxid.map --taxdump taxdump/ --out-prefix search.tsv.gz.k.profile --metaphlan-report search.tsv.gz.m.profile --cami-report search.tsv.gz.c.profile --binning-result search.tsv.gz.binning.gz

It is possible to report the proportion of unmatched read similar to what kraken2 does ?

Best regards,

Peter

How to specify multiple kmer values

In the docs for compute it says

  -k, --kmer ints                 ► K-mer size(s). (default [21])

I tried -k 15 21 but that didn't work. How should multiple values be delimited?

TODO: allow changing parts of the parameters for preset profiling modes

KMCP database building tutorial

Dear Wei Shen,
Thanks for this great tool! I just have a question regarding the viral database building tutorial here: https://bioinf.shenwei.me/kmcp/database/#refseq-viral-or-fungi
It's recommended to split viral genomes into 5 chunks, but the flag in the tutorial is set to --split-number 10. Is this maybe a typo, or is this set elsewhere?

Thanks,
Lea

[Suggestion] Use score calibration when identifying proviruses and plasmids

Just a suggestion for the database build step. Since the sample size is pretty big, it's worth using the --enable-score-calibration in geNomad. This parameter substantially improves the classification accuracy and should improve the reliability of the detection (see Supplementary Figure 5D here).

Detecting closest reference in custom DB

I've got some dengue NGS data and I'm wondering if I can use kmcp to find the best matching reference. I've built a database from ~4800 reference sequnces using the following:

kmcp compute -I ~/.dengue-ngs/ref/ -k 31 -D 10 -B partial -O temp
kmcp index -j 32 -I temp/ -O db -n 1 -f 0.3

Doing an assembly and then using kmcp search contig.fa -d kmcpdb/ works quite nicely and identifies the same reference sequence I get when I use blast.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
k141_133	10523	1975	0.0000e+00	231	MH823208.1	0	1	10723	31	1503	0.7610	0.7444	0.6034	0

However sometimes it is difficult to do a genome assembly and in this case I would like to search directly using the reads. Is this possible? I've tried with kmcp search -d kmcpdb reads.fq.gz --query-whole-file -o result.tsv.gz, however this does not seem to return any hits. Any guidance would be appreciated.

Availability of old gtdb databases?

Hi @shenwei356,

I am performing some benchmarks and I was wondering if the gtdb r207 database was still available for download?

Best,
Florian

Add a tutorial of detecting specific pathogen in sequencing data

Sample data:

SARS-CoV-2 reference genomes: NC_045512.2. Multiple strains are supported.
Sample data containing SARS-CoV-2: http://opengene.org/fastv/testdata.fq.gz

Creating a KMCP database:

# split reference genomes into 10 chunks with 150-bp overlaps
kmcp compute -k 21 -n 10 -l 150 -I refs/ -O refs-n10-l150

# index with a small FPR for small genomes
kmcp index -f 0.001 -I refs-n10-l150/ -O refs.kmcp

Searching reads against the KMCP database:

kmcp search -d refs.kmcp/ testdata.fq.gz -o testdata.fq.gz.kmcp.tsv.gz

23:19:42.530 [INFO] processed queries: 676694, speed: 32.606 million queries per minute
23:19:42.530 [INFO] 8.0837% (54702/676694) queries matched

Profiling:

# --level strain is used when no taxonomy is given.
# some preset profiling modes are available.
kmcp profile --level strain testdata.fq.gz.kmcp.tsv.gz \
    | tee profile.tsv

csvtk cut -t -f ref,percentage,coverage,score,chunksFrac,reads profile.tsv \
    | csvtk pretty -t
ref           percentage   coverage     score    chunksFrac   reads
-----------   ----------   ----------   ------   ----------   -----
NC_045512.2   100.000000   275.461793   100.00   1.00         54702

coverage is the vertical coverage or depth, score is a similarity score, and chunksFrac is the horizontal coverage of the genome.