Comments (9)
Sorry, one of my dependencies, pand, does not support ARM right now.
from kmcp.
Use this first:
But note that the arm64 version has a slower searching speed for databases created with more than one hash function.
kmcp index -h
-n, --num-hash int ► Number of hash functions in bloom filters. (default 1)
Method 3: Compile from source
- Install Go

wget https://go.dev/dl/go1.17.11.linux-amd64.tar.gz
# (on an ARM host, pick the matching linux-arm64 tarball from https://go.dev/dl/ instead)
tar -zxf go1.17.11.linux-amd64.tar.gz -C $HOME/

# or
#   echo "export PATH=$PATH:$HOME/go/bin" >> ~/.bashrc
#   source ~/.bashrc
export PATH=$PATH:$HOME/go/bin

- Compile KMCP

# ------------- the latest stable version -------------
go get -v -u github.com/shenwei356/kmcp/kmcp

# The executable binary file is located in:
#   ~/go/bin/kmcp
# You can also move it to anywhere in the $PATH
mkdir -p $HOME/bin
cp ~/go/bin/kmcp $HOME/bin/

# --------------- the development version --------------
git clone https://github.com/shenwei356/kmcp
cd kmcp/kmcp/
go build

# The executable binary file is located in:
#   ./kmcp
# You can also move it to anywhere in the $PATH
mkdir -p $HOME/bin
cp ./kmcp $HOME/bin/
Many thanks! It works, and this is very helpful.
Jianshu
Hello Wei,
I am following the same steps as on the usage page to build the database, but with k=16 for GTDB r207 and f=0.1, and I got a huge database file (30+ GB) compared to your r202 one, which is very small (1.5 GB). I did not expect such a huge difference:
kmcp compute -I all -O gtdb-r207-k16-n10 -k 16 -n 10 -l 100 -B plasmid --log gtdb-r207-k16-n10.log -j 24 --force
time kmcp index -j 24 -I gtdb-r207-k16-n10 -O gtdb.r207.minhash.kmcp -n 1 -f 0.1 --log gtdb.r207.minhash.kmcp.log
Is that because of the smaller false positive rate or the smaller k?
Thanks,
Jianshu
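To put the false positive rate part of this question in perspective, here is a rough sketch of how the size of a standard Bloom filter scales with the target FPR. It uses the textbook relation fpr = (1 - exp(-h*n/m))^h and is not KMCP's actual index layout; the function name `bloom_bits`, the item count, and the comparison value f=0.3 are assumptions chosen only for illustration.

```python
import math


def bloom_bits(n_items: int, fpr: float, n_hashes: int = 1) -> float:
    """Bits needed by a textbook Bloom filter for n_items at a target FPR.

    Derived by solving fpr = (1 - exp(-h * n / m)) ** h for m.
    This is a generic model, not KMCP's real index format.
    """
    if not 0.0 < fpr < 1.0:
        raise ValueError("fpr must be in (0, 1)")
    if n_items < 0 or n_hashes < 1:
        raise ValueError("n_items must be >= 0 and n_hashes >= 1")
    return -n_hashes * n_items / math.log(1.0 - fpr ** (1.0 / n_hashes))


# Hypothetical number of k-mers in one reference; only the ratio matters here.
n = 1_000_000
ratio = bloom_bits(n, 0.1) / bloom_bits(n, 0.3)
print(f"f=0.1 needs {ratio:.2f}x the bits of f=0.3 (one hash function)")
```

Under this model, lowering the FPR from 0.3 to 0.1 with a single hash function costs only about 3.4 times the space, so the FPR alone would not account for a 20-fold difference in database size.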
I guess you were following the database building steps for metagenomic profiling. For genome similarity estimation, you need to compute the sketches.
For example, the 1.5 GB gtdb.minhash.kmcp.tar.gz was created with FracMinHash/Scaled MinHash (scale = 1000), so the database is very small. Besides, the reference genomes should not be split (-n, --split-number).
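A minimal sketch of why a scaled sketch shrinks the database so much: FracMinHash keeps only the k-mers whose hash falls below max_hash / scale, so roughly 1 in `scale` distinct k-mers survives. The hash function (BLAKE2b), the helper names, and the random test sequence are assumptions for illustration and do not reproduce KMCP's implementation; canonical (reverse-complement aware) k-mers are omitted for brevity.

```python
import hashlib
import random


def kmers(seq: str, k: int):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer: str) -> int:
    """64-bit hash of a k-mer; BLAKE2b is an arbitrary choice here."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, scale: int) -> set:
    """Keep only hashes below 2**64 / scale, i.e. about 1/scale of all k-mers."""
    threshold = 2 ** 64 // scale
    return {h for h in map(hash64, kmers(seq, k)) if h < threshold}


random.seed(0)
genome = "".join(random.choices("ACGT", k=200_000))  # made-up sequence
full = frac_minhash(genome, k=21, scale=1)
sketch = frac_minhash(genome, k=21, scale=100)
print(len(full), len(sketch))
```

With scale = 1000, as used for the published database, the same idea keeps about one thousandth of the distinct k-mers.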
Hello Wei,
Thanks for the suggestion; I have solved it. An interesting finding: FracMinHash or COBS is very sensitive to differences in genome size. I am attaching a genome for which I know the best hits in GTDB r207, ranked by average nucleotide identity (ANI, calculated with fastANI: https://github.com/ParBLiSS/FastANI) after comparing this genome with all the genomes in GTDB (very expensive: it takes days on more than 100 compute nodes with 24 threads each). When ANI is above 95% the results are consistent, but not below 95% (fastANI is accurate down to 75% ANI). This is important because in a lot of cases your query genome may not have a best hit above 95% ANI in the database but, say, one around 80%, and you still need to find that best hit.
Any idea how to benchmark FracMinHash- or syncmer-based distance against ANI (how well they are correlated)? By the way, ANI is the standard method for comparing two genomes; even Mash was benchmarked against ANI.
Thanks,
USFT4C.26.fasta.gz
Jianshu
Search result of USFT4C.26.fasta.gz in GTDB r207 with KMCP.
$ kmcp search -d ~/ws/data/gtdb/gtdb207/gtdb-r207.minhash.kmcp/ USFT4C.26.fasta.gz -g -s jacc -t 0.3 \
| csvtk pretty -t -C $
#query qLen qKmers FPR hits target chunkIdx chunks tLen kSize mKmers qCov tCov jacc queryIdx
----------- ------- ------ ---------- ---- --------------- -------- ------ ------- ----- ------ ------ ------ ------ --------
USFT4C.26_1 4018002 8057 0.0000e+00 1 GCA_016791115.1 0 1 3969740 31 5972 0.7412 0.7540 0.5969 0
$ kmcp search -d ~/ws/data/gtdb/gtdb207/gtdb-r207.syncmer.kmcp/ USFT4C.26.fasta.gz -g -s jacc -t 0.3 \
| csvtk pretty -t -C $
#query qLen qKmers FPR hits target chunkIdx chunks tLen kSize mKmers qCov tCov jacc queryIdx
----------- ------- ------ ---------- ---- --------------- -------- ------ ------- ----- ------ ------ ------ ------ --------
USFT4C.26_1 4018002 11737 0.0000e+00 1 GCA_016791115.1 0 1 3969740 31 8628 0.7351 0.7437 0.5865 0
Isn't GCA_016791115.1 (98.8979%) the best hit with fastANI?
$ grep GCA_016791115.1 ~/ws/data/gtdb/gtdb207/gtdb-r207.minhash.kmcp/name.map
GCA_016791115.1 JAEUNL010000064.1 Rhodocyclaceae bacterium isolate new MAG-172 k141_1331473, whole genome shotgun sequence
$ fastANI -q USFT4C.26.fasta.gz -r GCA_016791115.1.fna.gz -o USFT4C.26.fasta.gz.fastani.txt
$ cat USFT4C.26.fasta.gz.fastani.txt
USFT4C.26.fasta.gz GCA_016791115.1.fna.gz 98.8979 1057 1270
Any idea how to benchmark FracMinhash or Syncmer-based distance with ANI (how well they are correlated)?
Maybe sourmash and syncmer paper have some clues?
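One possible starting point for such a comparison is the Mash distance formula, D = -ln(2j / (1 + j)) / k, which converts a Jaccard index j into an approximate mutation rate, so that 1 - D can be set against ANI. The sketch below applies it to the jacc and kSize values from the search output above. Treating KMCP's jacc column on FracMinHash or syncmer sketches as interchangeable with a Mash Jaccard estimate is an assumption, and the function names are illustrative.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance: D = -ln(2j / (1 + j)) / k, for 0 < j <= 1."""
    if not 0.0 < jaccard <= 1.0:
        raise ValueError("jaccard must be in (0, 1]")
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


def approx_ani_percent(jaccard: float, k: int) -> float:
    """Approximate ANI (in percent) as 100 * (1 - Mash distance)."""
    return 100.0 * (1.0 - mash_distance(jaccard, k))


# jacc and kSize taken from the two search results above.
ani_minhash = approx_ani_percent(0.5969, 31)
ani_syncmer = approx_ani_percent(0.5865, 31)
print(f"minhash: {ani_minhash:.2f}%  syncmer: {ani_syncmer:.2f}%")
```

For this single hit both estimates come out near 99%, close to the 98.8979% reported by fastANI. A real benchmark would repeat this over many query/reference pairs spanning the 75% to 100% ANI range.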
Yes. I was talking about the second-best hit, the third, and so on, down to hits around 80% ANI. Those hits and their ranks should be the same as the fastANI best hits.
If you check the hits around 80% ANI, they are not the same at all as with Mash. Mash correlates very well with ANI, with a top-10% recall of nearly 100%.
Jianshu
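The rank agreement described here could be quantified with a top-k recall: the fraction of the k best hits by ANI that also appear among the k best hits by the sketch-based score. This is only a sketch of that idea; the function name and the reference IDs and scores in the toy data are made up for illustration and are not results from GTDB.

```python
def top_k_recall(truth: dict, test: dict, k: int) -> float:
    """Fraction of the top-k references by `truth` found in the top-k by `test`.

    Both arguments map a reference ID to a score where higher means more similar.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    top_truth = set(sorted(truth, key=truth.get, reverse=True)[:k])
    top_test = set(sorted(test, key=test.get, reverse=True)[:k])
    return len(top_truth & top_test) / len(top_truth)


# Made-up toy scores: ANI (%) from an exact method vs. a sketch-based Jaccard.
ani = {"refA": 98.9, "refB": 92.0, "refC": 85.0, "refD": 80.0, "refE": 76.0}
jacc = {"refA": 0.60, "refB": 0.20, "refC": 0.04, "refD": 0.05, "refE": 0.01}

recall_top2 = top_k_recall(ani, jacc, 2)
recall_top3 = top_k_recall(ani, jacc, 3)
print(recall_top2, recall_top3)
```

In the toy data the two rankings agree for the closest references and diverge for the more distant ones, which is the pattern reported above for hits around 80% ANI.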