Giter VIP home page Giter VIP logo

kmcp's Introduction

Kmer-based Metagenomic Classification and Profiling

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping

Citation

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping.
Wei Shen, Hongyan Xiang, Tianquan Huang, Hui Tang, Mingli Peng, Dachuan Cai, Peng Hu, Hong Ren.
Bioinformatics, btac845, https://doi.org/10.1093/bioinformatics/btac845

Table of contents

Documents

https://bioinf.shenwei.me/kmcp

What can we do?

1. Accurate metagenomic profiling

KMCP utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized COBS index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods.

The read mapping process in KMCP is referred to as pseudo-mapping, which is similar to but different from the lightweight algorithm in Sailfish (Patro et al., 2014), pseudoalignment in Kallisto (Bray et al., 2016), quasi-mapping in RapMap (Srivastava et al., 2016), and lightweight mapping in Salmon (Patro et al., 2017). All of these methods seek to elide the computation of base-to-base alignment using distinct strategies (Srivastava et al., 2016). In KMCP, each reference genome is pre-split into chunks of equal size, and the k-mers of a query, as a whole, are compared to each genome chunk to find all possible ones sharing a predefined proportion of k-mers with the query. Like quasi-mapping in RapMap, KMCP tracks the target and position for each query. However, the read position in KMCP is approximate and in a predefined resolution (the number of genome chunks).

Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than some other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth.

Genome collections with custom taxonomy, e.g., GTDB uses its own taxonomy and MGV uses ICTV taxonomy, are also supported by generating NCBI-style taxdump files with taxonkit create-taxdump. You can even merge the GTDB taxonomy (for prokaryotic genomes from GTDB) and NCBI taxonomy (for genomes from NCBI).

2. Fast sequence search against large scales of genomic datasets

KMCP can be used for fast sequence search against large scales of genomic datasets as BIGSI and COBS do. We reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm, bringing a smaller index size and much faster searching speed (2x for genome search and 10x for short reads) faster than COBS (check the tutorial and benchmark). Also check the algorithm and data structure differences between KMCP and COBS.

3. Fast genome similarity estimation

KMCP can also be used for fast similarity estimation of assemblies/genomes against known reference genomes.

Genome sketching is a method of utilizing small and approximate summaries of genomic data for fast searching and comparison. Mash and Sourmash provide fast genome distance estimation using MinHash (Mash) or FracMinHash (Sourmash). KMCP supports multiple k-mer sketches (Minimizer, FracMinHash (previously named Scaled MinHash), and Closed Syncmers) for genome similarity estimation. And KMCP is 5x-7x faster than Mash/Sourmash (check the tutorial and benchmark).

Features

  • Easy to install
  • Easy to use
  • Building database is easy and fast
  • Fast searching speed (for sequence/genome search)
    • The index structure is modified from COBS, while KMCP is 2x-10x faster in sequence searching.
    • Automatically scales to exploit all available CPU cores.
    • Searching time is linearly related to the number of reference genomes (chunks).
  • Scalable searching. Searching results against multiple databases can be fast merged. This brings many benefits:
    • There's no need to re-built the database with newly added reference genomes.
    • The searching step can be parallelized with a computer cluster in which each computation node searches against a small database.
    • Computers with limited main memory can also utilize an extensive collection of reference genomes by building and searching against small databases..
  • Accurate taxonomic profiling
  • Flexible support of taxonomy database


Installation

Latest Version Github Releases Cross-platform Anaconda Cloud

Download executable binaries, or install using conda:

conda install -c bioconda kmcp

SIMD extensions including AVX512, AVX2, SSE2 are sequentially detected and used in two packages for better searching performance.

  • pand, for accelerating searching on databases constructed with multiple hash functions.
  • pospop, for batch counting matched k-mers in bloom filters.

ARM architecture is supported, but kmcp search would be slower.

Commands

Subcommand Function
compute Generate k-mers (sketch) from FASTA/Q sequences
index Construct a database from k-mer files
search Search sequences against a database
merge Merge search results from multiple databases
profile Generate the taxonomic profile from search results
utils split-genomes Split genomes into chunks
utils unik-info Print information of .unik files
utils index-info Print information of index files
utils index-density Plot the element density of bloom filters for an index file
utils ref-info Print information of reference chunks in a database
utils cov2simi Convert k-mer coverage to sequence similarity
utils query-fpr Compute the false positive rate of a query
utils filter Filter search results and find species/assembly-specific queries
utils merge-regions Merge species/assembly-specific regions

Quickstart

# compute k-mers
kmcp compute -k 21 --split-number 10 --split-overlap 150 \
    --in-dir genomes/ --out-dir genomes-k21-n10

# index k-mers
kmcp index --false-positive-rate 0.1 --num-hash 1 \
    --in-dir genomes-k21-n10/ --out-dir genomes.kmcp

# delete temporary files
# rm -rf genomes-k21-n10/

# search    
kmcp search --db-dir genomes.kmcp/ test.fa.gz --out-file [email protected]

# merge search results against multiple databases
kmcp merge -o search.kmcp.tsv.gz search.kmcp@*.kmcp.tsv.gz

# profile and binning
kmcp profile search.kmcp.tsv.gz \
    --taxid-map        taxid.map \
    --taxdump          taxdump/ \
    --out-file         search.tsv.gz.k.profile \
    --metaphlan-report search.tsv.gz.m.profile \
    --cami-report      search.tsv.gz.c.profile \
    --binning-result   search.tsv.gz.binning.gz

Next:

KMCP vs COBS

We reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm, bringing a smaller index size and much faster searching speed (2x for genome search and 10x for short reads) faster than COBS.

The differences between KMCP and COBS

Category Item COBS KMCP Comment
Algorithm K-mer hashing xxhash ntHash1 xxHash is a general-purpose hashing function while ntHash is a recursive hash function for DNA/RNA
Bloom filter hashing xxhash Using k-mer hash values Avoid hash computation
Multiple-hash functions xxhash with different seeds Generating multiple values from a single one Avoid hash computation
Single-hash function Same as multiple-hash functions Separated workflow Reducing loops
AND step Serial bitwise AND Vectorised bitwise AND Bitwise AND for >1 hash functions
PLUS step Serial bit-unpacking Vectorised positional popcount with pospop Counting from bit-packed data
Index structure Size of blocks / Using extra thresholds to split the last block with the most k-mers Uneven genome size distribution would make bloom filters of the last block extremely huge
Index files Concatenated Independent
Index loading mmap, loading complete index into RAM mmap, loading complete index into RAM, seek Index loading modes
Input/output Input files FASTA/Q, McCortex, text FASTA/Q
Output Target and matched k-mers Target, matched k-mers, query FPR, etc.

Support

Please open an issue to report bugs, propose new functions, or ask for help.

License

MIT License

Acknowledgments

  • Zhi-Luo Deng (Helmholtz Centre for Infection Research, Germany) gave a lot of valuable advice on metagenomic profiling and benchmarking.
  • Robert Clausecker (Zuse Institute Berlin, Germany) wrote the high-performance vectorized positional popcount package (pospop) during my development of KMCP, which greatly accelerated the bit-matrix searching.

kmcp's People

Contributors

shenwei356 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

kmcp's Issues

Profiling output table interpretation

Dear Shenwei,

Thank you very much for your very nice tool, we are trying to understand how to interpret the output table in KMCP format.

  • If the output table contains more than one ref per species based on which parameter should we choose the best hit?

  • According to your manual the percentage column refers to Relative abundance of the reference however, we are not sure how this value is calculated. Could you give us more details about this metric?

thank you very much,

best,

Loรฏc

0.8.3: Fails on i386: 1 << 32 - 1 (untyped int constant 4294967295) overflows int

# github.com/shenwei356/kmcp/kmcp/cmd
kmcp/cmd/compute.go:287:14: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
kmcp/cmd/compute.go:294:19: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
kmcp/cmd/compute.go:300:17: 1 << 32 - 1 (untyped int constant 4294967295) overflows int
*** Error code 2

Should 1<<32-1 be uint64(1)<<32-1?

FreeBSD 13.1

Optimizing KMCP with HumGut

Hello Shenwei,
Thank you for making this new metagenomic tool! I'm interested in benchmarking its performance, and for that I want to perform classification on a large (10,000+) number of samples. There are a number of minor things I've come across but I have an end to end sample running now. My main objective for opening this issue is to discuss potential optimizations to what I'm doing to reduce the total time for running this.

Currently, end to end this single sample took 9 hours (08:58) with 32 cores and 23 GB of RAM (100GB was max)

  • Database used was the HumGut database, split over 10 indices
  • Working directory was in RAM (/dev/shm), started by copying the samples to that (done to avoid being impacted by NAS speeds)
  • Times
    • Step 1: kmcp search, through parallel with 12 threads per job, took 2 hours per search
    • Step 2: kmcp merge, took 15 mins with 64 threads
    • Step 3: kmcp profile, took 4h52 mins with 64 threads
      • I could add the --no-amb-corr flag to reduce by 1h53 mins

From this, I could potentially reduce the time to ~6 hours by allocating more cores and increasing the number of searches running in parallel and adding the --no-amb-corr flag.
Are there more ways in which I am missing optimizations?

I've created a gist with the actual SLURM submission, can also upload the logs of the jobs if that's helpful.

QUESTION: Taxdump

Dear Dr. Shen,
I have a question about taxdump in KMCP documentation.
The documentation mentioned:

# taxid mapping files, multiple files supported.
taxid_map=gtdb.kmcp/taxid.map,refseq-viral.kmcp/taxid.map,refseq-fungi.kmcp/taxid.map

# or concatenate them into a big taxid.map
#    cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map
# taxid_map=taxid.map

# taxdump directory
taxdump=taxdump

sfile=$file.kmcp.tsv.gz

kmcp profile \
    --taxid-map      $taxid_map \
    --taxdump         $taxdump/ \
    --level             species \
    --min-query-cov        0.55 \
    --min-chunks-reads       50 \
    --min-chunks-fraction   0.8 \
    --max-chunks-depth-stdev  2 \
    --min-uniq-reads         20 \
    --min-hic-ureads          5 \
    --min-hic-ureads-qcov  0.75 \
    --min-hic-ureads-prop   0.1 \
    $sfile                      \
    --out-prefix       $sfile.kmcp.profile \
    --metaphlan-report $sfile.metaphlan.profile \
    --cami-report      $sfile.cami.profile \
    --sample-id        "0" \
    --binning-result   $sfile.binning.gz

Is the taxdump here downloaded from the NCBI taxonomy database? I don't think so, but each database is a separate directory when using taxonkit to generate a taxdump, so does the taxdump here also need to be specified like taxid map ?

Thanks!

Merge error (number of fields < query index field)

When merging the results of a number of kmcp search runs, I get the following output:

> kmcp merge SRR6468718.kmcp_humgut@*.tsv.gz --out-file SRR6468718.kmcp_humgut.tsv.gz
11:28:13.212 [INFO] checking input files ...
11:28:13.213 [INFO]   9 input files given
11:28:13.213 [INFO] merging ...
11:44:06.311 [ERRO] number of fields (13) < query index field (15)

It does produce the out-file, attached the header:

โฏ zcat SRR6468718.kmcp_humgut.tsv.gz| head
#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
SRR6468718.1	101	81	4.5272e-15	1	GCF_000228045.1_SMUT1-NEX_25-98_genomic	9	10	2182376	21	81	1.0000	0.0004	0.0004	0
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME154070	5	10	2006006	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME053624	2	10	1545775	21	81	1.0000	0.0005	0.0005	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME124299	2	10	1797471	21	81	1.0000	0.0005	0.0005	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME216177	0	10	1912264	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME053394	2	10	1925801	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME124618	3	10	1946290	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME273974	0	10	1955198	21	81	1.0000	0.0004	0.0004	1
SRR6468718.2	101	81	4.5272e-15	385	GUT_GENOME127971	7	10	1964574	21	81	1.0000	0.0004	0.0004	1

KMCP's MetaPhlAn output doesn't follow the MetaPhlAn file format

The MetaPhlAn output generated by KMCP is not the same as the one generated by MetaPhlAn. In the KMCP output, the taxid column only contains the taxid of the lowest taxonomic rank (e.g. 1224), while the one generated by MetaPhlAn contains the full lineage, separated by | (e.g. 2|1224).

This makes the KMCP output incompatible with TAXPASTA. Actually, the TAXPASTA error is due to rank not summing up to 100% (due to lineages genomes skipping some ranks).

Report statistics of matched, unmatched reads

This was previously proposed: #20 .

After performing some data analysis where novel viruses/bacteria exist in some samples, I decided to implement this. It would be helpful to hint at whether the reference genomes cover all microorganisms in the sample.

Here's the plan:

  1. kmcp search: appending the numbers to search results, starting with #.
    • input reads
    • matched reads
  2. kmcp merge: recount (can not directly plus) the numbers and outputting
  3. kmcp profile: reading the numbers and output in log

Compatibility:

  1. Counts in search results of older versions will be empty.

Create an index without create the database

Hello Dr Shen,

Can I create an index without create the database ? Because the folder takes memory.

If my interest is to indexing sequences and run some query, is it possible to use the tool without the db creation step ?

Best,

ETA missing when building KMCP index

When building the KMCP index for Humgut (by following the instructions here, the ETA is stuck at 0s, even after several blocks have been completed.

kmcp index -j 32 -I humgut-k21-n10 -O humgut.kmcp -n 1 -f 0.3
13:31:35.908 [INFO] kmcp v0.9.3
13:31:35.909 [INFO]   https://github.com/shenwei356/kmcp
13:31:35.909 [INFO]
13:31:35.909 [INFO] loading .unik file infos from file: humgut-k21-n10/_info.txt
13:31:36.518 [INFO]   306910 cached file infos loaded
13:31:36.585 [INFO]
13:31:36.585 [INFO] -------------------- [main parameters] --------------------
13:31:36.585 [INFO]   number of hashes: 1
13:31:36.585 [INFO]   false positive rate: 0.300000
13:31:36.585 [INFO]   k-mer size(s): 21
13:31:36.585 [INFO]   split seqequence size: 0, overlap: 20
13:31:36.585 [INFO]   block-sizeX-kmers-t: 10.00 M
13:31:36.585 [INFO]   block-sizeX        : 256
13:31:36.585 [INFO]   block-size8-kmers-t: 20.00 M
13:31:36.585 [INFO]   block-size1-kmers-t: 200.00 M
13:31:36.585 [INFO] -------------------- [main parameters] --------------------
13:31:36.585 [INFO]
13:31:36.586 [INFO] building index ...
13:31:36.753 [INFO]
13:31:36.753 [INFO]   block size: 9592
13:31:36.753 [INFO]   number of index files: 32 (may be more)
13:31:36.753 [INFO]
13:31:36.753 [block #001] 1199 / 1199  100 %
13:31:36.753 [block #002] 1199 / 1199  100 %
13:31:36.754 [block #003] 1199 / 1199  100 %
13:32:30.922 [block #004] 1199 / 1199  100 %
13:32:34.941 [block #005] 1199 / 1199  100 %
13:33:33.902 [block #006] 1199 / 1199  100 %
13:33:40.757 [block #007] 1199 / 1199  100 %
13:34:45.743 [block #008] 1199 / 1199  100 %
13:34:54.006 [block #009] 1199 / 1199  100 %
13:35:59.125 [block #010] 1199 / 1199  100 %
13:36:08.695 [block #011] 1060 / 1199 [==========================>---]  88 %
13:37:15.240 [block #012]  847 / 1199 [====================>---------]  71 %
[saved index files]     10 / 32 [==========>-----------------------] ETA: 0s

compiling/installing from source

Hello Wei,

Any guidance to install/compiling from source? I have an aarch64 (or ARM) CPU cluster and need to compile from source. None of the binaries works.

Thanks,

Jianshu

A minor issue on console messages

Thanks for developing this tool.

When I inputted an empty file, the console outputs:
...
11:27:03.868 [WARN] no invalid sequences in file: input.fq.gz
...

The text should be "no valid sequences" instead of "no invalid sequences"?

TODO: save the search result into a serializing binary file for fast downstream parsing

The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.

  1. kmcp search: a flag -b/--binary-outpu would be added to choose the output format optionally.
  2. A new command kmcp view should be added to convert the binary to plain text format.
  3. kmcp merge needs to be compatible with both plain and binary formats.
  4. kmcp profile needs to be compatible with both plain and binary formats.
#query qLen qKmers FPR hits target chunkIdx chunks tLen kSize mKmers qCov tCov jacc queryIdx
read_1 150 130 7.4626e-15 1 GCF_000007805.1 2 10 6397126 21 130 1.0000 0.0002 0.0002 0
read_2 150 130 7.4626e-15 1 GCF_000007805.1 8 10 6397126 21 130 1.0000 0.0002 0.0002 1
read_3 150 130 7.4626e-15 1 GCF_000003835.1 8 10 12115052 21 130 1.0000 0.0001 0.0001 2
read_4 150 130 7.4626e-15 1 GCF_000003835.1 3 10 12115052 21 130 1.0000 0.0001 0.0001 3

Metaphlan format profile report file seems to lack NCBI taxonomy IDs.

Thank you for a great tool!

I'd like to use Metaphlan format profile report file obtained from "kmcp profile" as an input of Pavian, which draws Sankey plots.
( https://github.com/fbreitwieser/pavian )
( https://fbreitwieser.shinyapps.io/pavian/ )

But, when I tried, the Metaphlan format output file of "kmcp profile" is not recognized by Pavian shiny app, which claims that it can recognize Kraken, Centrifuge and MetaPhlAn report files.

When I compared my output file to the original MetaPhlAn3 output format (https://github.com/biobakery/biobakery/wiki/metaphlan3#output-files), it seems that kmcp output file does not have "NCBI_tax_id" column, which seems to be required by Pavian.

Are there any ways to solve this problem?

Thanks in advance.

long read metagenomic profiling

Dear Wei Shen,
I really like your tool and your tutorials. I just have a question regarding long read metagenomic profiling. Is there a specific parameter combination you would recommend to use to taxonomic profiling? It seems like I'm missing some organisms from the Zymo Mock Community even when using profiling mode m=0.
Thanks
Jens

Automatically disable ambiguous-matches correction for complex communities with thousands of species

It's optional for now, but it would be better to be switched on automatically to reduce profiling time.

$ kmcp profile  -h
Generate taxonomic profile from search results

Methods:
  3. We also use the two-stage taxonomy assignment algorithm in MegaPath
     to reduce the false positive of ambiguous matches.
     You can also disable this step by the flag --no-amb-corr.
     If the stage 1/4 produces thousands of candidates, you can also use
     the flag --no-amb-corr to reduce analysis time, which has very little
     effect on the results.

Flags:
      --no-amb-corr                       โ–บ Do not correct ambiguous reads. Use this flag to reduce
                                          analysis time if the stage 1/4 produces thousands of candidates.

Building database with MAGs

Can I expect the tool to work if I use incomplete draft genomes or MAGs as inputs? What if I use collections of CDS sequences rather than assemblies?

QUESTION: KMCP merge

Dear Dr Shen,

Thank you for developing so great tool !
I have a question, could you please help me to figure out it ?
You mentioned in your paper that KMCP can automatically merge profiles generated by searching reads on multi small database, does it work at different types of databases ? For example, can it merge bacteria, archaea, fungi, virus and plasmid profile ?

Thanks!

Error when installing KMCP

Hi Dr Chen,
You helped me and advised me to try KMCP (biostars forum).
Sorry to disturb you again
When I try to install it via conda, it throw an error:

What happened?
The initial connection between Cloudflare's network and the origin web server timed out. As a result, the web page can not be displayed.
If you're the owner of this website:
Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources.
https://support.cloudflare.com/hc/en-us/articles/200171906-Error-522

Do you think we could fix it?

Thank you Dr

kmcp search crashed

I used the latest kmcp (downloaded from GitHub) and GTDB database (downloaded from WeTransfer) to align FASTQ files.
The command line was kmcp search --load-whole-db --threads 32 --db-dir /home2/xzhan9/data/reference/kmcp/gtdb.kmcp -1 data/SRR12397805_1.fastq.gz -2 data/SRR12397805_2.fastq.gz --out-file kmcp/SRR12397805.out --log kmcp/SRR12397805.log
kmcp crashed on two machines (both have >128G memory).

The input files, SRR12397805_1.fastq.gz and SRR12397805_2.fastq.gz, were downloaded from NCBI SRA.

The error messages were: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7b1256].

This crash bug happened randomly, as sometimes kmcp search can work perfectly fine.

More relevant outputs:

$ seqkit stats data/SRR12397805_1.fastq.gz data/SRR12397805_2.fastq.gz
file                         format  type   num_seqs      sum_len  min_len  avg_len  max_len
data/SRR12397805_1.fastq.gz  FASTQ   DNA   1,676,891  247,187,869       15    147.4      151
data/SRR12397805_2.fastq.gz  FASTQ   DNA   1,676,891  247,022,404       15    147.3      151

$ kmcp version
kmcp v0.8.2

$ kmcp search --load-whole-db --threads 32 --db-dir /home2/xzhan9/data/reference/kmcp/gtdb.kmcp -1 data/SRR12397805_1.fastq.gz -2 data/SRR12397805_2.fastq.gz --out-file kmcp/SRR12397805.out --log kmcp/SRR12397805.log
metaphlan-report kmcp/SRR12397805.out --cami-report kmcp/SRR12397805.cami.out --sample-id SRR12397805 --binning-result kmcp/SRR12397805.bin22:50:44.811 [INFO] kmcp v0.8.2
22:50:44.813 [INFO]   https://github.com/shenwei356/kmcp
22:50:44.813 [INFO]
22:50:44.813 [INFO] checking input files ...
22:50:44.865 [INFO] loading database into main memory ...

22:50:46.105 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block001.uniki
22:50:54.871 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block002.uniki
22:50:56.256 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block004.uniki
22:50:56.544 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block003.uniki
22:50:59.652 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block005.uniki
22:51:00.957 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block027.uniki
22:51:01.091 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block007.uniki
22:51:01.169 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block006.uniki
22:51:01.292 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block008.uniki
22:51:03.181 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block009.uniki
22:51:03.652 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block011.uniki
22:51:04.320 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block010.uniki
22:51:04.769 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block013.uniki
22:51:05.238 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block012.uniki
22:51:06.211 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block014.uniki
22:51:06.398 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block015.uniki
22:51:07.185 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block016.uniki
22:51:07.262 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block018.uniki
22:51:07.546 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block019.uniki
22:51:07.582 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block017.uniki
22:51:08.221 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block020.uniki
22:51:08.590 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block026.uniki
22:51:08.699 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block021.uniki
22:51:09.063 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block030.uniki
22:51:09.755 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block031.uniki
22:51:10.131 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block022.uniki
22:51:12.791 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block024.uniki
22:51:13.106 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block025.uniki
22:51:14.194 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block023.uniki
22:51:15.207 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block028.uniki
22:51:15.384 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block032.uniki
22:51:16.456 [INFO]   loaded index file: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp/R001/_block029.uniki
22:51:16.457 [INFO] database loaded: /home2/xzhan9/data/reference/kmcp/gtdb.kmcp
22:51:16.476 [INFO]
22:51:16.476 [INFO] -------------------- [main parameters] --------------------
22:51:16.476 [INFO]   minimum    query length: 30
22:51:16.476 [INFO]   minimum  matched k-mers: 10
22:51:16.476 [INFO]   minimum  query coverage: 0.550000
22:51:16.476 [INFO]   minimum target coverage: 0.000000
22:51:16.497 [INFO]   minimum target coverage: 0.000000
22:51:16.497 [INFO] -------------------- [main parameters] --------------------
22:51:16.497 [INFO]
22:51:16.497 [INFO] searching ...
22:51:16.513 [INFO] reading from paired-end files: data/SRR12397805_1.fastq.gz, data/SRR12397805_2.fastq.gz
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x7b1256]

goroutine 112426 [running]:
github.com/shenwei356/kmcp/kmcp/cmd.NewUnikIndexDB.func3.1(0xce8bd57400)
	/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/kmcp/kmcp/cmd/util-db-search.go:783 +0x276
created by github.com/shenwei356/kmcp/kmcp/cmd.NewUnikIndexDB.func3
	/home/shenwei/shenwei/scripts/go/src/github.com/shenwei356/kmcp/kmcp/cmd/util-db-search.go:1020 +0x2d7

Kmcp profile empty

Hello,

Thanks for developing kmcp and for providing detailed documentation. I am running into empty output file after running kmcp profile command.

The commands I am using are:

kmcp compute -k 21 -n 10 -l 150 -O tmp-k21-n10-l150 -I gtdb-genomes

kmcp index -f 0.3 -n 1 -j 32 -I tmp-k21-n10-l150/ -O gtdb.kmcp

kmcp search -d gtdb.kmcp -o [email protected] \
           -1 ERX5474932_ERR5766176_1.fastq.gz -2 ERX5474932_ERR5766176_2.fastq.gz
kmcp profile -X taxdump_custom/ -T seqid2taxid.map -m 3 \
           [email protected] -o ERX5474932_ERR5766176.k.profile

And the output of this command is:

16:01:52.145 [INFO] using a lot of threads does not always accelerate processing, 4-threads is fast enough
16:01:52.145 [INFO] kmcp v0.9.3
16:01:52.145 [INFO]   https://github.com/shenwei356/kmcp
16:01:52.145 [INFO] 
16:01:52.145 [INFO] checking input files ...
16:01:52.145 [INFO]   1 input file(s) given
16:01:52.145 [INFO] loading TaxId mapping file ...
16:01:52.145 [INFO]   2 pairs of TaxId mapping values from 1 file(s) loaded
16:01:52.145 [INFO] loading Taxonomy from: taxdump_custom/
16:01:52.145 [INFO]   1 nodes in 1 ranks loaded
16:01:52.145 [INFO]   0 merged nodes loaded
16:01:52.145 [INFO]   0 deleted nodes loaded
16:01:52.145 [INFO]   1 names loaded
16:01:52.145 [INFO] 
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] match filtration: 
16:01:52.145 [INFO]   maximum false positive rate: 0.010000
16:01:52.145 [INFO]   minimum query coverage: 0.550000
16:01:52.145 [INFO]   keep matches with the top N scores: N=0
16:01:52.145 [INFO]   only keep the full matches: false
16:01:52.145 [INFO]   only keep main matches: false, maximum score gap: 0.400000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] deciding the existence of a reference:
16:01:52.145 [INFO]   preset profiling mode: 3
16:01:52.145 [INFO]   minimum number of reads per reference chunk: 50
16:01:52.145 [INFO]   minimum number of uniquely matched reads: 20
16:01:52.145 [INFO]   minimum proportion of matched reference chunks: 0.800000
16:01:52.145 [INFO]   maximum standard deviation of relative depths of all chunks: 2.000000
16:01:52.145 [INFO] 
16:01:52.145 [INFO]   minimum number of high-confidence uniquely matched reads: 5
16:01:52.145 [INFO]   minimum query coverage of high-confidence uniquely matched reads: 0.750000
16:01:52.145 [INFO]   minimum proportion of high-confidence uniquely matched reads: 0.100000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] taxonomy data:
16:01:52.145 [INFO]   taxdump directory: taxdump_custom/
16:01:52.145 [INFO]   mapping reference IDs to TaxIds: [seqid2taxid.map]
16:01:52.145 [INFO] 
16:01:52.145 [INFO] reporting:
16:01:52.145 [INFO]   default format   : ERX5474932_ERR5766176.k.profile
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] 
16:01:52.145 [INFO] stage 1/4: counting matches and unique matches for filtering out low-confidence references
16:01:52.145 [INFO]   parsing file: [email protected]
16:01:52.147 [INFO]   number of references in search result: 1
16:01:52.147 [INFO]   number of estimated references: 1
16:01:52.147 [INFO]   elapsed time: 2.129055ms
16:01:52.147 [INFO] 
16:01:52.147 [INFO] stage 2/4: counting ambiguous matches for correcting matches
16:01:52.147 [INFO]   parsing file: [email protected]
16:01:52.149 [INFO]   elapsed time: 1.777231ms
16:01:52.149 [INFO] 
16:01:52.149 [INFO] stage 3/4: recounting matches and unique matches
16:01:52.149 [INFO]   parsing file: [email protected]
16:01:52.152 [INFO]   number of estimated references: 0
16:01:52.152 [INFO]   elapsed time: 2.885897ms
16:01:52.152 [INFO] 
16:01:52.152 [INFO] stage 4/4: estimating abundance using EM algorithm
16:01:52.152 [INFO]   initialization step
16:01:52.152 [INFO]     parsing file: [email protected]
16:01:52.157 [INFO]     number of estimated references: 0
16:01:52.157 [INFO]     elapsed time: 4.52852ms
16:01:52.157 [INFO]   number of estimated references: 0
16:01:52.157 [INFO]   elapsed time: 4.575221ms
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/632060) reads matched
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/81) matched reads belong to the 0 references in the profile
16:01:52.158 [INFO] 
16:01:52.158 [INFO] elapsed time: 13.189091ms
16:01:52.158 [INFO] 

I am also attaching the input and output files in order to be easier to replicate the issue.

Could you please help me with what I might do wrong here?

ERX5474932_ERR5766176.k.profile.zip
seqid2taxid.map.zip
gtdb-genomes.zip
gtdb.kmcp.zip
[email protected]
taxdump_custom.zip

Dealing with novel/non-sequenced species

Hi shenwei356,

Thank you for this very promising tool! Based on your instructions, I was able to build an up-to-date database and have been testing KMCP on some previously characterised metagenomes. I found the results very accurate, even on the level of species! However, there is one aspect of the tool that I am uncertain about. It happens sometimes that a metagenome contains novel species or species that have not been sequenced before and are therefore not in the database.
Is it possible to have KMCP account for that possibility?

For example, if there is a novel Lactobacillus species in a sample metagenome along with a known Lactobacillus species like Lactobacillus amylovorus, is it possible for KMCP to assign a certain percentage to Lactobacillus amylovorus but leave the rest assigned only to the genus level? I guess it wouldn't work with relative abundances as currently calculated, but with read percentages, but I think this possibility would be helpful in screening for potential new species.

Kind regards,
Marko

suitable for CDS and/or contig taxonomic assignment?

Hey there, @shenwei356 :)

Thanks again for not only additional wonderful software, but excellent documentation as usual!

I'm looking for a better way to taxonomically classify predicted coding sequences and contigs from metagenomic assemblies (i currently use CAT with NCBI's nr).

I really want a combination of GTDB for bacteria/archaea, and then also be able to combine euks from NCBI, so your infrastructure enabling that sort of thing is really appealing to me ๐Ÿ™

I see in issue #27 you note that KMCP is not suitable for long reads. Is your thinking similar for assembled contigs too?

And would you expect to have the same thoughts about taxonomically classifying predicted coding sequences (which might average around 800-1000 bases)?

If only one of those would be possible, I can imagine it might be reasonable to use it to infer the other. E.g., if contigs are do-able, then assigning all CDSs whatever their source contig tax was. And if CDSs are do-able, employing some consensus approach to assign to the contig the tax of its CDSs.

Sorry if i'm missing if you've covered this elsewhere already, and thanks for any of your thoughts!

How to profile results only identify 1 species per reference

Dear Shenwei,

Thank you very much for providing a good tool, I am trying to adjust the output table as desired according to KMCP format.

  • Currently I use search with all the databases you provide and the output table contains more than one ref per species, so I want the results to only return one ref per species based on the number of reads or based on the score, is that possible?
  • I want the output table to contain results according to each desired rank, for example, is it possible to output only Genus ranks or only Species ranks?

Thank you so much,

Best,

Phuc

kmcp search is stucked

Hi. Thank you for your nice tool.

I am currently searching metagenome against viral genome database that you provide in the link : https://bioinf.shenwei.me/kmcp/database/

I installed this tool using conda.
But I don't know why the search step takes too long.
Host OS is centos 7.10 ( 3.10.0-1160.el7.x86_64 )
image

Thank you in advance.

reporting proportion of unmatched reads

Many thanks for this open source tool! Vey well documented

When running the profiling: For example:
kmcp profile [email protected] --taxid-map taxid.map --taxdump taxdump/ --out-prefix search.tsv.gz.k.profile --metaphlan-report search.tsv.gz.m.profile --cami-report search.tsv.gz.c.profile --binning-result search.tsv.gz.binning.gz

It is possible to report the proportion of unmatched read similar to what kraken2 does ?

Best regards,

Peter

How to specify multiple kmer values

In the docs for compute it says

  -k, --kmer ints                 โ–บ K-mer size(s). (default [21])

I tried -k 15 21 but that didn't work. How should multiple values be delimited?

Detecting closest reference in custom DB

I've got some dengue NGS data and I'm wondering if I can use kmcp to find the best matching reference. I've built a database from ~4800 reference sequnces using the following:

kmcp compute -I ~/.dengue-ngs/ref/ -k 31 -D 10 -B partial -O temp
kmcp index -j 32 -I temp/ -O db -n 1 -f 0.3

Doing an assembly and then using kmcp search contig.fa -d kmcpdb/ works quite nicely and identifies the same reference sequence I get when I use blast.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
k141_133	10523	1975	0.0000e+00	231	MH823208.1	0	1	10723	31	1503	0.7610	0.7444	0.6034	0

However sometimes it is difficult to do a genome assembly and in this case I would like to search directly using the reads. Is this possible? I've tried with kmcp search -d kmcpdb reads.fq.gz --query-whole-file -o result.tsv.gz, however this does not seem to return any hits. Any guidance would be appreciated.

Add a tutorial of detecting specific pathogen in sequencing data

Sample data:

Creating a KMCP database:

# split reference genomes into 10 chunks with 150-bp overlaps
kmcp compute -k 21 -n 10 -l 150 -I refs/ -O refs-n10-l150

# index with a small FPR for small genomes
kmcp index -f 0.001 -I refs-n10-l150/ -O refs.kmcp

Searching reads against the KMCP database:

kmcp search -d refs.kmcp/ testdata.fq.gz -o testdata.fq.gz.kmcp.tsv.gz

23:19:42.530 [INFO] processed queries: 676694, speed: 32.606 million queries per minute
23:19:42.530 [INFO] 8.0837% (54702/676694) queries matched

Profiling:

# --level strain is used when no taxonomy is given.
# some preset profiling modes are available.
kmcp profile --level strain testdata.fq.gz.kmcp.tsv.gz \
    | tee profile.tsv

csvtk cut -t -f ref,percentage,coverage,score,chunksFrac,reads profile.tsv \
    | csvtk pretty -t
ref           percentage   coverage     score    chunksFrac   reads
-----------   ----------   ----------   ------   ----------   -----
NC_045512.2   100.000000   275.461793   100.00   1.00         54702

coverage is the vertical coverage or depth, score is a similarity score, and chunksFrac is the horizontal coverage of the genome.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.