
mash's People

Contributors

aphillippy, blankenberg, brianwalenz, kloetzl, michaelbarton, ondovb, skoren, tseemann


mash's Issues

mash dist -i throws std::bad_alloc on PacBio data

Hi,

I ran Mash v1.0.1 on sample PacBio data against the refseq.msh database:

mash dist -i refseq.msh BEI_staggered_2kb.fastq  > BEI_staggered_2kb_dashI_1.1.mash

and got the following error after 48s:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

(also saw this in v1.0)

Tag a 1.2 release

Can you tag a 1.2 release please?

The capnp bug you fixed was quite important for packaging.

dist fails for billions of pairwise outputs

This is due to all pairwise outputs for each query file vs. reference file being stored by the processing thread. Memory will have to be managed with smarter task division.
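
As a rough illustration of the kind of task division that would bound memory (the block size, compare(), and the ref/query lists here are hypothetical, not Mash internals), a Python sketch:

from itertools import islice

def blocked_pairs(refs, queries, block_size=10000):
    # Yield (ref, query) pairs in fixed-size blocks so results can be
    # written out per block instead of being accumulated for a whole file.
    pairs = ((r, q) for r in refs for q in queries)
    while True:
        block = list(islice(pairs, block_size))
        if not block:
            break
        yield block

def run(refs, queries, compare, out):
    for block in blocked_pairs(refs, queries):
        # Compute and flush this block before touching the next one,
        # keeping at most block_size results in memory at a time.
        for ref, query in block:
            out.write(compare(ref, query))
        out.flush()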

Option to produce PHYLIP distance matrix format

The mash dist -t option produces a TSV distance matrix.

It is not too much effort to produce a valid PHYLIP distance matrix format, preferably in lower-triangle (ltriangle) form: http://www.mothur.org/wiki/Phylip-formatted_distance_matrix

The main issue is that PHYLIP labels may be limited to 10 characters, which is a bit of a disaster for mash applications. Maybe integers 000000 - 999999 could be used, with a .map file created for use with nw_rename later?

Here is an example of more specs: http://www.uwyo.edu/dbmcd/molmark/practica/phylipinfo.doc

Guidelines for PHYLIP input files for the programs Neighbor and Fitch (tree-building from distance matrices)

1) File must be text-only (ASCII)!!  It must be in same directory with the program and must be called infile.  
2) 1st line has number of OTUs (taxa, pops.)
3) Next line has first OTU name (padded to at least 10 characters with spaces, if necessary)
4) Easiest is upper diagonal matrix (note that it doesn't have to be aligned).  Separators are spaces.

Example

4
LS1        0.083  0.25 0.458
LS2        0.167  0.392
LS3        0.392
LS4        

Notes: last line still pads out 10 spaces, but has no "distance" (implied zero from LS4 to itself).  
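
Until such an option exists, a conversion script is straightforward. Here is a minimal Python sketch, assuming the mash dist -t table is a square matrix with a header row and a leading label column (check against your actual output); it writes a lower-triangle PHYLIP file plus a .map file of integer labels for later renaming with nw_rename:

import csv, sys

def tsv_to_phylip(tsv_path, phy_path, map_path):
    with open(tsv_path) as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    names = [r[0] for r in rows[1:]]        # assumes header row + label column
    dists = [r[1:] for r in rows[1:]]
    with open(phy_path, "w") as phy, open(map_path, "w") as mp:
        phy.write(f"{len(names)}\n")
        for i, name in enumerate(names):
            label = f"T{i:06d}"             # integer labels fit the 10-char limit
            mp.write(f"{label}\t{name}\n")
            row = "  ".join(dists[i][:i])   # lower triangle, no diagonal
            phy.write(f"{label:<10}{row}\n")

if __name__ == "__main__":
    tsv_to_phylip(*sys.argv[1:4])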

Error in generating large sketches

I'm trying to generate a sketch with -s 100000 on about 7000 RefSeq genomes (using 21-mers). The program finishes fine (with an output of about 2.1 GB) but when I run mash info refseq.msh I get the following error. I was able to successfully build a sketch with 50000 hashes.

libc++abi.dylib: terminating with uncaught exception of type kj::ExceptionImpl: kj/io.c++:305: failed: miniposix::read(fd, pos, max - pos): Invalid argument; fd = 3
stack: 0x10a648e73 0x10a641425 0x10a624229 0x10a61d777 0x10a61ccae 0x10a612eca 0x10a60ab4c 0x10a617b0a 0x10a61bcc8 0x10a604794 0x3
Abort trap: 6

Please tag a release

I was hoping you could tag a release in github, even if it's a "pre-release" with a 0.1 version for example.

I will be writing a Brew package for Mash as we are starting to use it now, and releases make it a lot easier.

Question about "only the lexicographically smaller of the forward and reverse complement of a k-mer"

Hi,

I have a question about the following (quote from http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x):

For nucleotide sequence, Mash uses canonical k-mers by default to allow strand-neutral comparisons. In this case, only the lexicographically smaller of the forward and reverse complement representations of a k-mer is hashed.

Why isn't the smallest of the two hash values obtained from the forward and reverse complement representations of a k-mer selected instead ?

The selection on a lexicographic basis amounts to a filter / sampling that I am guessing is there to avoid having the same k-mer included twice in a sketch (which might not be the most likely of events), but it would feel more natural to base the choice between the two representations (forward and reverse complement) on their hash values, and that might allow a more natural integration of strandedness into the use of sketches.

[note: I had discussions with @ctb, @betatim, @luizirber while being unable to reproduce some of the results from sourmash - @betatim pointed me to the lexicographic-order-based selection I had missed]
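
To make the two selection rules concrete, here is a toy Python comparison (Python's built-in hash() is only a stand-in for Mash's MurmurHash3):

def revcomp(kmer):
    comp = str.maketrans("ACGT", "TGCA")
    return kmer.translate(comp)[::-1]

def canonical_lex(kmer):
    # Mash-style: keep (and hash) the lexicographically smaller strand.
    return min(kmer, revcomp(kmer))

def canonical_by_hash(kmer, h=hash):
    # The alternative raised here: take the smaller of the two hash values.
    return min(h(kmer), h(revcomp(kmer)))

print(canonical_lex("ACGTT"))       # -> AACGT
print(canonical_by_hash("ACGTT"))   # a hash value, strand chosen by hashing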

configure: incorrect Cap'n Proto check

Using Mash 1.0.2 (https://github.com/marbl/Mash/archive/v1.0.2.tar.gz).

If Cap'n Proto is not installed in the standard (/usr/local) location but does appear on PATH, it is not correctly detected.

The following patch restores the expected behaviour.

--- configure.ac.ori    2015-12-11 12:21:30.182591195 +0100
+++ configure.ac    2015-12-11 12:30:33.938425796 +0100
@@ -2,9 +2,14 @@

 AC_ARG_WITH(capnp, [  --with-capnp=<path/to/capnp>     Cap'n Proto install dir (default: /usr/local/)])

-if test "$with_capnp" == ""
+if test x"$with_capnp" == "x"
 then
-   with_capnp=/usr/local/
+    # no Cap'n Proto install dir specified check PATH
+    AC_CHECK_PROG(capnpCheck, capnp, "yes", "no")
+else
+    # use provided path
+    with_capnp=$with_capnp
+    AC_CHECK_PROG(capnpCheck, capnp, "yes", "no", $with_capnp/bin)
 fi

 AC_ARG_WITH(gsl, [  --with-gsl=<path/to/gsl>     GNU Scientific Library install dir (default: /usr/local/)])
@@ -26,8 +31,6 @@
    AC_MSG_ERROR([Zlib not found.])
 fi

-AC_CHECK_PROG(capnpCheck, capnp, "yes", "no", $with_capnp/bin)
-
 if test "$capnpCheck" != "yes"
 then
    AC_MSG_ERROR([Cap'n Proto compiler (capnp) not found.])

Mash workflow from sequence to hash question

Hello,

I was reading through the Mash paper (Fig. 1) and the documents to try and understand the whole workflow.

1) k-mers are calculated for each sequence in a file.
So for k=31 I get a bunch of k-mers for each sequence. Are all the k-mers for a single sequence hashed, or is there some pooling/clustering of all like k-mers before they are hashed?

2) End up with a file of hashed k-mers.

3) The hashed k-mers for two sets can be compared.
When comparing hashed k-mers in different sets, is the Poisson model of mutations used to find similar k-mers? Or is this used in 1) to find similar k-mers within a file?
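
For what it's worth, here is a toy end-to-end illustration of this workflow (plain Python, not Mash's actual code): every k-mer is hashed individually, the s smallest hashes form the sketch, and two sketches are compared to estimate the Jaccard index.

def sketch(seq, k=21, s=1000, h=hash):
    # Hash every k-mer of the sequence and keep the s smallest hashes.
    hashes = {h(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def jaccard_estimate(sk_a, sk_b, s=1000):
    # Compare the two sketches over the s smallest hashes of their union.
    a, b = set(sk_a), set(sk_b)
    union = sorted(a | b)[:s]
    shared = sum(1 for x in union if x in a and x in b)
    return shared / len(union)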

Thanks for helping with my understanding here.

ara

Compare to refseq70 output - what are the columns?

Hello,

When I compare my file to the refseq70.msh file and then do the sort | head, I see several columns:
refseq-NC-562-PRJNA178640-.-.-pColK_K235-Escherichia_coli.fna  A1.10.1000.fast  0.23011  2.52707e-27  4/1000

The first column is what it is hitting in refseq70
The second column is my file
The third column is: ? (I am guessing some percent)
The fourth is the p-value
The fifth, is that the number of hashes that match?
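
For reference, mash dist reports five tab-separated fields per line: reference-ID, query-ID, Mash distance, p-value, and shared-hashes (e.g. 4/1000). A small Python helper to parse them (the dictionary keys are my own):

import csv

def parse_mash_dist(path):
    with open(path) as fh:
        for ref, query, dist, p, shared in csv.reader(fh, delimiter="\t"):
            num, denom = shared.split("/")
            yield {
                "reference": ref,
                "query": query,
                "distance": float(dist),
                "p_value": float(p),
                "shared_hashes": (int(num), int(denom)),
            }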

Thanks!
ara

Parallelism for mash dist

The '-p' option for "mash dist" does not appear to be working. Only one core is used even if I set -p to an integer greater than 1. OS: Redhat 6 Linux server; mash version 1.0.2 (downloaded executable for 64-bit linux).

Example:
mash dist -d 0.05 -p 8 database.msh query_seqs.msh > distances.tab

Question about k-mers with ambiguous nucleotides

Are k-mers with ambiguous nucleotides (e.g. N) included in the sketch or are they thrown out?

I would imagine the best strategy is to have Mash filter these k-mers out. I suppose it could be handled by input processing - breaking FASTA sequences into multiple sequences at every ambiguous nucleotide - but this does not seem ideal.
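
A minimal sketch of that filtering strategy (plain Python, purely illustrative): drop any k-mer containing a character outside ACGT instead of splitting the sequence.

def valid_kmers(seq, k=21, alphabet=frozenset("ACGT")):
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:     # skip k-mers containing N etc.
            yield kmer

print(list(valid_kmers("ACGTNACGTACG", k=4)))   # k-mers spanning the N are skipped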

Thanks.

finding similar sequences

I am trying to find sequences similar to a given query sequence. I am assuming all sequences are of the same length, and I have to find the similar ones in a set of a million sequences. How could I use Mash to achieve this?

Sketching paired-end read data

Hi

I note no mention of paired-end data in the docs; sketching with multiple files creates a multi-file sketch, but if I want a single sketch per paired-end sample, how is this done?

I assume I should be able to use sub-shells and/or zcat and pipe to stdin; however, when I did this on our cluster I got core dumps (admittedly I seem to be having some problems with zcat on our cluster...).

At the very least it may be a good idea to update your docs to show how to sketch paired-end read data.
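
One workaround that avoids piping entirely (paths are placeholders, and this assumes you can afford a temporary concatenated file): merge both reads of the pair into a single FASTQ and sketch that, giving one sketch per sample.

import gzip, shutil, subprocess, tempfile

def sketch_pair(r1_gz, r2_gz, out_prefix):
    # Decompress and concatenate both read files into one temporary FASTQ.
    with tempfile.NamedTemporaryFile(suffix=".fastq", delete=False) as tmp:
        for path in (r1_gz, r2_gz):
            with gzip.open(path, "rb") as fh:
                shutil.copyfileobj(fh, tmp)
        tmp_path = tmp.name
    # Sketch the combined file; the result represents the whole sample.
    subprocess.run(["mash", "sketch", "-o", out_prefix, tmp_path], check=True)

sketch_pair("sample_1.fastq.gz", "sample_2.fastq.gz", "sample")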

Cheers
Mick

C API and Python bindings

First of all congratulations for Mash.

After reading the paper about mash I decided to try it, and I am really impressed with the good results I have obtained.

Right now, sketching and comparing samples with Mash is fast and easy. However, it can only be used as a command-line tool. Although CLIs are an excellent solution for many cases, there are others that could take advantage of Mash as a library.

I would like to know what you think about having an external API (in C) that exposes the main functionality and primitives of the library. Having this C API would be nice because then I could write bindings for other languages. The first language that comes to mind in bioinformatics is Python, so it would be really nice to have a Python library for it.

I already forked the project and started to expose Mash through a C API. This is going to take me some time, so in the meantime I used the subprocess module in Python to wrap the command line, returning the results as Python dictionaries. The branches are python and mashpy respectively.
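
Roughly the kind of thin wrapper described above (a sketch, not the actual branch code): shell out to the mash binary and return mash dist rows as Python dictionaries.

import subprocess

def mash_dist(ref_msh, query_msh, mash_bin="mash"):
    proc = subprocess.run([mash_bin, "dist", ref_msh, query_msh],
                          check=True, capture_output=True, text=True)
    rows = []
    for line in proc.stdout.splitlines():
        ref, query, dist, p, shared = line.split("\t")
        rows.append({"reference": ref, "query": query,
                     "distance": float(dist), "p_value": float(p),
                     "shared_hashes": shared})
    return rows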

Do you think it is a good idea? Would you be interested in having this on Mash?

terminate called after throwing an instance of 'kj::ExceptionImpl'

Built a sketch of a 90GB fasta file using k=21. Running 'mash info' on the resulting '.msh' file returns an error:

ls -l imgdb.fa
-rw-r--r-- 1 copeland copeland 90G Fri Oct 23 11:45:23 2015 imgdb.fa

mash sketch -i -k 21 imgdb.fa

mash info imgdb.fa.msh
terminate called after throwing an instance of 'kj::ExceptionImpl'
what(): capnp/layout.c++:1966: failed: expected ref->kind() == WirePointer::LIST; Message contains non-list pointer where list pointer was expected.
stack: 0x41e071 0x40e3b1 0x425d02 0x4181a9 0x4046bd 0x7f63c6546c8d 0x404445
Aborted

Works fine with k=16, however.

Cannot download refseq.msh

The link in the docs to the pre-sketched RefSeq archive https://github.com/marbl/Mash/raw/master/data/refseq.msh returns only the LFS info:

version https://git-lfs.github.com/spec/v1
oid sha256:3292c28efedf8a2144494b99bdb784d94b1d737d6f942f29a17daf73b7d61d3e
size 97678488

I expected to click on the link and be prompted to download a 97MB file. Perhaps something to do with an exceeded quota? I went the long way round and installed the git lfs extension to fetch from within a cloned Mash repo, but this also fails with:

$ git lfs fetch
Fetching master
Git LFS: (0 of 4 files) 0 B / 107.44 MB                                                  
This repository is over its data quota. Purchase more data packs to restore access.
Docs: https://help.github.com/articles/purchasing-additional-storage-and-bandwidth-for-an-organization/

Maximum copy number/high-pass filtering?

I've been using Mash with great success with plant read data, but I'm running into one weirdness. Some of my cultivars cluster in a way they shouldn't and I believe this is due to chloroplast differences.

Mash lets you specify a minimum copy number to get rid of sequencing error (-r), but it doesn't let you specify a maximum copy number. With plastids I'd remove everything with more than 50 or 70 copies; I can do that first using, for example, BBMap's bbnorm, but it would be easier to do it directly in Mash.

I'm unsure whether plastid copy numbers influence the -c cutoff so I'd rather filter first. Are there any plans to implement a maximum copy filter?
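
A conceptual pre-filter along these lines (plain Python, not a Mash feature): count k-mer copies in the read set and keep only k-mers seen at most max_copies times, so high-copy plastid k-mers never reach the sketch.

from collections import Counter

def high_pass_filter(reads, k=21, max_copies=50):
    # Count every k-mer across the reads, then discard the high-copy ones.
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, n in counts.items() if n <= max_copies}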

Static build plus binary as mentioned in INSTALL.txt

Hi,

Two issues in one.

a) INSTALL.txt mentions that distribution is normally as a binary for Linux but there's no link to the binary and it's not mentioned on http://mash.readthedocs.org/en/latest/ - is this because the software is pre-publication or is there a link I'm missing?

b) I'm compiling using GCC4.8 on a system on which the standard GCC is 4.6 so trying to do so statically (Makefile contains "CPPFLAGS += -static-libstdc++ -L/software/gcc-4.8.1/lib64/"). Not sure whether the problem is that libstdc++.a at our end is dynamically built or whether something is amiss above (the path to gcc-4.8.1 is correct)? We're working round this using a bash script to set LD_LIBRARY_PATH and run mash.

Thanks,

Martin

Use basenames for input file identifiers in output

Mash output (and sketches) currently represent the input file names as given, with paths:

/path/to/input.fasta

It would probably be better to strip them to basenames:

input.fasta

We would just have to watch out for potential collisions and figure out what to do about those. Some options for resolving:

  • error out
  • rename
  • leave paths
    • strip off longest common path
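
A sketch of how the basename idea plus the fallback options above could fit together (illustrative Python): strip to basenames and, on a collision, extend each clashing name with just enough of its path suffix to make it unique, otherwise keep the full path.

import os
from collections import Counter

def display_names(paths):
    base = [os.path.basename(p) for p in paths]
    clashes = {b for b, n in Counter(base).items() if n > 1}
    names = []
    for p, b in zip(paths, base):
        if b not in clashes:
            names.append(b)
            continue
        parts = p.split(os.sep)
        # Extend the suffix until it no longer collides ("strip off
        # longest common path" option).
        for depth in range(2, len(parts) + 1):
            cand = os.sep.join(parts[-depth:])
            if sum(q.endswith(cand) for q in paths) == 1:
                names.append(cand)
                break
        else:
            names.append(p)   # give up and leave the full path
    return names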

Thread support for "mash sketch" ?

I was wondering if you will be adding --threads to mash sketch to simplify the manual step of sketching in chunks then pasting together?
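
A possible interim workaround with the existing commands (file names and worker count are illustrative): sketch each input file in its own process, then combine the per-file sketches with mash paste.

import subprocess
from concurrent.futures import ProcessPoolExecutor

def sketch_one(fasta):
    subprocess.run(["mash", "sketch", "-o", fasta, fasta], check=True)
    return fasta + ".msh"

def sketch_many(fastas, out_prefix="combined", workers=8):
    # Run the single-file sketches in parallel processes...
    with ProcessPoolExecutor(max_workers=workers) as pool:
        sketches = list(pool.map(sketch_one, fastas))
    # ...then paste them into one sketch file.
    subprocess.run(["mash", "paste", out_prefix, *sketches], check=True)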

Can you make a 1.03 release?

There have been lots of bugfixes etc. since 1.02 - any chance of an official 1.03 release?

Would help us package maintainers.

Change default murmurhash seed and/or expose as option?

While I appreciate the Douglas Adams tribute, I wanted to raise the possibility of (1) changing the default murmurhash seed value to 0 (default in many bindings) and/or (2) exposing it as a command line flag.

The former would be useful for compatibility (both cross lang/lib, and with existing k-mer hashing code that uses murmurhash3). The latter may also be useful in facilitating easier simulation of many pseudorandom minmer subsets from the same file, and for interop with code that doesn't use a seed of 42.

Obviously (1) is a highly breaking change and therefore good to discuss and pin down prior to a preprint coming out. Thoughts on the marbl side?
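
To make the compatibility point concrete, using the third-party mmh3 Python binding (assumed installed; an illustration, not Mash's internal hashing code): the same k-mer hashes to completely different values under seed 42 and seed 0.

import mmh3

kmer = "ACGTACGTACGTACGTACGTA"
print(mmh3.hash64(kmer, seed=42)[0])   # the seed discussed here as Mash's default
print(mmh3.hash64(kmer, seed=0)[0])    # the default in many other bindings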

RefSeq database mash question

Hello,

I discovered Mash through the khmer website and author. I had a question about how the RefSeq database .msh file was built. Was each genome run through mash sketch and the results then compiled together to form the database? I'd like to extend the database and want to make sure I follow best practices for this.

Thanks,
ara

Creating sketch database can fail with multiple processors

I am creating a sketch database from ~1500 genomes. These are all in individual FASTA files. When using multiple processors I often encounter the error:
ERROR: Did not find fasta records in "GCF_000755225.1_genomic.fna"

I have examined this file and it is a valid FASTA file with a single sequence. Note that using different numbers of processors results in a different input file causing the error. Moreover, the sketch database builds properly if I use a single processor.

I am running Mash as follows:
mash sketch -p 23 -o ../gtdb *.fna

Getting hashes out of .msh

Hello,

I was following the issue at #27.
I was wondering if there is a way to get the hashes from the .msh files. I am bouncing around between Mash and sourmash.

Thanks,
ara

Need option between mash info & mash info -H

mash info spits out lines for every sequence in the mash (including bonus blank lines between each).

mash info -H just prints the header.

What I'd like is the concise behaviour of mash info -H but with the extra info of how many sketches there are in the mash, like the default mash info tells you, e.g. Sketches (694).

Recommended options for protein database?

Maybe this is totally off, but I wanted to test this with protein sequences, i.e. I sketched uniprot_sprot.fasta (with -i) [roughly 200MB] and then a subset of uniprot_fasta [roughly 6MB] with the same options. I then called dist on them, but Mash never finishes...
I know it's not optimized for local comparisons and that the sequences are not equal length... in any case, what kind of -s and -k would you recommend?

integer overflow?

Hi guys,

Ran mash on a collection of reference 16S rRNA sequences (3.6 Gb) and got this warning while sketching:

WARNING: For the k-mer size used (21), the random match probability (0.000726951) is above 
the specified warning threshold (0.01) for the sequence "..rdp/current_Bacteria_unaligned.fa" of 
size 18446744072614073515. Distances to this sequence may be underestimated as a result. 
To meet the threshold of 0.01, a k-mer size of at least 20 is required.

Seems like the message should not be displayed and/or an integer overflow happened when calculating the size of the reference.

My suspicion is that maybe there is an assumption that the reference FASTA will always contain a single sequence?
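
A quick check is consistent with the overflow hypothesis: the reported size is 2**64 minus roughly 1.1e9, i.e. what you get when a negative length ends up stored in an unsigned 64-bit integer.

size = 18446744072614073515
print(2**64 - size)              # 1095478101, a plausible "real" magnitude
print((0 - 1095478101) % 2**64)  # reproduces the reported number exactly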

Subtle distance calculation bug

In some cases, Mash reports an incorrect numerator (and downstream distance and p values) due to a subtle bug here: https://github.com/marbl/Mash/blob/master/src/mash/CommandDistance.cpp#L417-L435

Specifically, the line:

while ( denom < sketchSize && i < hashesSortedRef.size() && j < hashesSortedQry.size() )

... causes the distance function to halt when the denominator reaches sketchSize, but this can happen prematurely, before hashesSortedRef is fully iterated through. I believe the simple patch is to fix the denominator at hashesSortedRef.size() and remove it from the while condition; otherwise the last few elements of the reference/query can be skipped in the comparison.
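
A pure-Python restatement of the comparison being discussed (not the C++ patch itself): take the smallest sketch_size distinct hashes of the union and count how many of them occur in both sketches, so no trailing reference or query hashes get skipped.

def shared_over_union(hashes_ref, hashes_qry, sketch_size):
    ref, qry = set(hashes_ref), set(hashes_qry)
    union = sorted(ref | qry)[:sketch_size]   # bottom-s of the merged sketch
    shared = sum(1 for h in union if h in ref and h in qry)
    return shared, len(union)

print(shared_over_union([1, 2, 3, 5], [2, 3, 4, 6], 4))   # -> (2, 4)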

Here are 2 .msh test case files that give 9993/10000 as a result. The number of minmers in the complete union is 9995/10000:

PR coming shortly.

Paired-data more distant than different samples

Hi

Following on from #32 where I (accidentally) sketched both reads of a pair, I get these curious results:

Mash sketch/dist sample against itself:

%> mash dist SRR1262629.mash.msh SRR1262629.mash.msh
SRR1262629_1.t.fastq.gz SRR1262629_1.t.fastq.gz 0 0 1000/1000
SRR1262629_2.t.fastq.gz SRR1262629_1.t.fastq.gz 0.0234513 0 440/1000
SRR1262629_1.t.fastq.gz SRR1262629_2.t.fastq.gz 0.0234513 0 440/1000
SRR1262629_2.t.fastq.gz SRR1262629_2.t.fastq.gz 0 0 1000/1000

Mash sketch/dist two samples against each other:

%> mash dist SRR1262629.mash.msh SRR1262625.mash.msh
SRR1262629_1.t.fastq.gz SRR1262625_1.t.fastq.gz 0.0166117 0 545/1000
SRR1262629_2.t.fastq.gz SRR1262625_1.t.fastq.gz 0.0189925 0 505/1000
SRR1262629_1.t.fastq.gz SRR1262625_2.t.fastq.gz 0.0174175 0 531/1000
SRR1262629_2.t.fastq.gz SRR1262625_2.t.fastq.gz 0.0212923 0 470/1000

So when looking at read 1 and read 2 from the same sample (SRR1262629) they only share 440/1000 hashes.

When comparing different samples (SRR1262629 vs SRR1262625) they actually share more hashes! (470 - 545)

Is this expected??

Cheers
Mick

mash dist sending stderr to stdout?

When I run this command

mash dist -p 32 ref.fa query1.fa query2.fa query3.fa

all of the output goes to stdout: the sketching progress output for each contig in the query files is intermixed with the tab-separated distance table.

Does this seem correct?

Github flow

Now that more than one person is going to contribute to the project and more than one change is going to be made, it would be great to have a clear GitHub flow.

The standard in many projects is to keep two branches: master and development. The master branch corresponds to the releases, while development moves ahead with the latest stable features in the project. Each time a piece of functionality has to be added, a new branch is created from development and merged back when the feature is working.

I have no permissions, so @ondovb should create it.

Does this seem right to you?

Option for arbitrary alphabets

Reminder to expose an option for protein (or even arbitrary?) alphabets. This option should also turn off canonical k-mers and adjust the math based on alphabet size.
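
A sketch of that adjustment, assuming the random-match probability model from the Mash paper, p = 1 / (|Σ|^k / n + 1) (worth double-checking against the paper): with protein's larger alphabet, a much smaller k reaches the same probability.

import math

def random_match_p(k, genome_size, alphabet_size):
    return 1.0 / (alphabet_size ** k / genome_size + 1)

def min_k(genome_size, alphabet_size, p=0.01):
    # Smallest k keeping the random match probability at or below p.
    return math.ceil(math.log(genome_size * (1 - p) / p, alphabet_size))

print(min_k(5e6, 4))    # DNA: k = 15 for a 5 Mbp genome at p = 0.01
print(min_k(5e6, 20))   # protein: k = 7 for the same size and threshold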

Option for "mash sketch 1 2 3" to make 3 sketches

There are 3 mash sketch use-cases but only 2 are currently possible.

  1. mash sketch a b c will create a single sketch of all 3 (a.msh by default)
  2. mash sketch -i a b c will create a sketch for each sequence in the 3 files, named after the sequence IDs
  3. The one I can't do is the proposed -e (each) for mash sketch -e a b c to produce 3 sketches: a.msh, b.msh, c.msh

I would love to have option 3, where a can be a multi-sequence file.
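
Until an -e style option exists, use-case 3 can be scripted with one mash sketch call per input file (a trivial sketch, not a proposed implementation):

import subprocess, sys

for fasta in sys.argv[1:]:   # e.g. a b c
    # Each call writes its own <input>.msh next to the input file.
    subprocess.run(["mash", "sketch", "-o", fasta, fasta], check=True)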

Problems compiling with capnp

Hi

I have a local installation of capnp which I specify in the configure step:

./configure --prefix=/exports/cmvm/datastore/eb/groups/watson_grp/software/Mash --with-capnp=/exports/cmvm/datastore/eb/groups/watson_grp/software/capnproto/

But when I "make" I get an error:

/exports/cmvm/datastore/eb/groups/watson_grp/software/capnproto//bin/capnp compile -oc++ MinHash.capnp
c++: no such plugin (executable should be 'capnpc-c++')
c++: plugin failed: exit code 1

The executable it refers to does actually exist at capnproto/bin/capnpc-c++

I'm using Cap'n Proto version 0.5.3

Cheers
Mick

Failure to build with GCC 7

The released Mash version 1.1.1 fails to build with GCC 7:

$ ./configure --with-capnp=/usr
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking zlib.h usability... yes
checking zlib.h presence... yes
checking for zlib.h... yes
checking for capnp... yes
checking capnp/common.h usability... yes
checking capnp/common.h presence... yes
checking for capnp/common.h... yes
checking gsl/gsl_cdf.h usability... yes
checking gsl/gsl_cdf.h presence... yes
checking for gsl/gsl_cdf.h... yes
configure: creating ./config.status
config.status: creating Makefile
$ CXX=g++-7 make -j3
cd src/mash/capnp;export PATH=/usr/bin/:/usr/local/opt/ccache/libexec:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/vagrant/bin:/home/vagrant/uni/gt/bin:/home/vagrant/uni/research/bin:/home/vagrant/uni/lehre/repo/bin;capnp compile -I /usr/include -oc++ MinHash.capnp
cd src/mash/capnp;export PATH=/usr/bin/:/usr/local/opt/ccache/libexec:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/vagrant/bin:/home/vagrant/uni/gt/bin:/home/vagrant/uni/research/bin:/home/vagrant/uni/lehre/repo/bin;capnp compile -I /usr/include -oc++ MinHash.capnp
cc -include src/mash/memcpyLink.h -c -o src/mash/memcpyWrap.o src/mash/memcpyWrap.c
g++-7 -c -O3 -std=c++11 -Isrc -I/usr/include -I /usr/local//include -include src/mash/memcpyLink.h -Wl,--wrap=memcpy  -o src/mash/capnp/MinHash.capnp.o src/mash/capnp/MinHash.capnp.c++
In file included from /usr/include/capnp/generated-header-support.h:31:0,
                 from src/mash/capnp/MinHash.capnp.h:7,
                 from src/mash/capnp/MinHash.capnp.c++:4:
/usr/include/capnp/layout.h:129:65: error: could not convert template argument ‘b’ from ‘bool’ to ‘capnp::Kind’
 template <typename T, bool b> struct ElementSizeForType<List<T, b>> {
                                                                 ^
/usr/include/capnp/layout.h:129:66: error: template argument 1 is invalid
 template <typename T, bool b> struct ElementSizeForType<List<T, b>> {
                                                                  ^~
g++-7 -c -O3 -std=c++11 -Isrc -I/usr/include -I /usr/local//include -include src/mash/memcpyLink.h -Wl,--wrap=memcpy  -o src/mash/Command.o src/mash/Command.cpp
g++-7 -c -O3 -std=c++11 -Isrc -I/usr/include -I /usr/local//include -include src/mash/memcpyLink.h -Wl,--wrap=memcpy  -o src/mash/CommandBounds.o src/mash/CommandBounds.cpp
Makefile:50: recipe for target 'src/mash/capnp/MinHash.capnp.o' failed
make: *** [src/mash/capnp/MinHash.capnp.o] Error 1
make: *** Waiting for unfinished jobs....

This is on Debian stretch, x86_64.

Building mash with CMake

As we discussed in #49, one of the first things to do was to use CMake to build Mash. This would let us build the project on other platforms (such as Windows) "for free", while also helping to integrate things like the Python API build process.

You can find the implementation in the cmake branch of my fork (https://github.com/jomsdev/Mash/tree/cmake). I will explain all the changes in my pull request, but we need a development branch first (#51).

Briefly, the changes have been:

  • CMake integration
  • .gitignore updated (removing all the Makefile stuff, we do not build the project in the folder anymore)
  • INSTALL.txt modified explaining the new building process
  • src and include folders created

Compare to refseq70 - Column with matched hashes

Hello,

I was wondering if it's possible to get which hashes match a given RefSeq sketch. I know the output gives you the number of hashes hit; I was hoping to know which hashes those were.

Thanks!
ara

Exporting mash hashes for interoperability

Hi all,

first some background:

it seems like sourmash is going to be a thing; I'm building it into a metagenomics data exploration tool, and it's already integrated into https://github.com/dib-lab/khmer/ in some interesting and useful ways. Before it becomes too much of a thing, I'm interested in harmonizing with what you've done with mash, both out of gratitude and because it'd be kind of stupid to have multiple different MinHash implementations out there - interoperability would be really handy!

So, on the topic of interop, I poked around under the hood of mash, and am happy to report that I can swizzle sourmash over to use your exact hash function and seed; I will do so forthwith.

It seems like it would be relatively simple for me to write a parser for your .msh files, but that would depend on capnproto, I think. It seems like it would be better for this to be part of mash. So, what do you think about a 'dump' command for sketches? This would be an explicit "data transfer" format that we could use to move sketches between MinHash software implementations. I'd guess that something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.
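
A strawman of that minimal transfer format in Python (the field names, the hash-function label, and the seed value are only suggestions):

import csv, hashlib

def dump_sketches(sketches, path):
    # sketches: iterable of (identifier, k, hashes) tuples, one per sketch.
    with open(path, "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["hash_function", "seed", "k", "identifier", "md5", "hashes"])
        for identifier, k, hashes in sketches:
            blob = ",".join(str(h) for h in sorted(hashes))
            md5 = hashlib.md5(blob.encode()).hexdigest()
            w.writerow(["MurmurHash3_x64_128", 42, k, identifier, md5, blob])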

If this is not antithetical to the very principles on which mash was founded, then great! Let me know! And I'm happy to whip up a prototype and submit a pull request - I was thinking of adding a new command, 'mash dump'. Alternate ideas very welcome.

cc @luizirber

Crash on running example command

Howdy! Mash looks like a lot of fun and I'd like to try it out, but after getting it built, I see a crash when running the example command:

./mash dist genome1.fna genome2.fna

Here's the crash message:

hxr@leda:~/work/Mash$ ./mash dist genome1.fna genome2.fna
Sketching genome1.fna (provide sketch file made with "mash sketch" to skip)...*** stack smashing detected ***: ./mash terminated
Aborted (core dumped)

And opening the core dump up in gdb:

[New LWP 13107]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./mash dist genome1.fna genome2.fna'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f1065a46267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

can post core dump/etc if needed :)

Other stuff:

$ uname -a
Linux leda 3.19.0-30-generic #34-Ubuntu SMP Fri Oct 2 22:08:41 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 4.9.2-10ubuntu13) 4.9.2
