bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.

License: GNU General Public License v3.0

Introduction

DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. The key features are:

Pairwise alignment of proteins and translated DNA at 100x-10,000x the speed of BLAST.
Protein clustering of up to tens of billions of proteins.
Frameshift alignments for long read analysis.
Low resource requirements and suitable for running on standard desktops or laptops.
Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

Documentation

The online documentation is located at the GitHub Wiki.

Support

Diamond is actively supported and developed software. Please use the issue tracker for malfunctions and the GitHub discussions for questions, comments, feature requests, etc.

About

Since 2019, DIAMOND has been developed by Benjamin Buchfink at the Drost lab, Max Planck Institute for Biology Tübingen. From 2018-2019, its development was supported by the German Federal Ministry for Economic Affairs and Energy through an EXIST grant. From 2016-2018, it was developed by Benjamin Buchfink as an independent researcher. From 2013-2015, the initial version was developed by Benjamin Buchfink at the Huson lab, University of Tübingen, Germany.

When using the tool in published research, please cite:

Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x

For sequence clustering:

Buchfink B, Ashkenazy H, Reuter K, Kennedy JA, Drost HG, "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust", bioRxiv 2023.01.24.525373. doi:10.1101/2023.01.24.525373

Original publication to cite DIAMOND until v0.9.25:

Buchfink B, Xie C, Huson DH, "Fast and sensitive protein alignment using DIAMOND", Nature Methods 12, 59-60 (2015). doi:10.1038/nmeth.3176

diamond's Issues

Add a --version flag

I eventually discovered that "diamond -v" will print a version, but it wasn't obvious, as "-v" is "--verbose".

Can you add a "--version" flag?

% diamond --version
diamond 0.7.9

This will help pipelines to keep track of what version they used etc.
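
Until such a flag exists, a pipeline could record the version by parsing the banner that DIAMOND prints. The sketch below assumes the banner looks like the ones quoted in these reports (for example "diamond v0.7.9.58"); parse_diamond_version is a hypothetical helper and not part of DIAMOND.

```python
import re
from typing import Optional

# Assumption: the banner contains "diamond v<dotted version>" as in the logs
# quoted in these issues; the optional "v" also covers the requested format.
_VERSION_RE = re.compile(r"diamond v?(\d+(?:\.\d+)+)")


def parse_diamond_version(banner: str) -> Optional[str]:
    """Return the dotted version found in a DIAMOND banner, or None."""
    match = _VERSION_RE.search(banner)
    return match.group(1) if match else None
```

The banner text itself would have to be captured by the calling pipeline, for example from the output of the "diamond -v" call mentioned above.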

diamond v0.8.14.76 - Segmentation fault: 11

I'm running the command
diamond blastx -d nr.dmnd -q query.fna -a matches.daa -k 1
but it fails with the error: Segmentation fault: 11

my dataset
db.dmnd
db.fasta

my log
diamond v0.8.14.76 | by Benjamin Buchfink [email protected]
Check http://github.com/bbuchfink/diamond for updates.

CPU threads: 16

Scoring parameters: (Matrix=blosum62 Lambda=0.267 K=0.041 Penalties=11/1)

Target sequences to report alignments for: 1

Temporary directory: diamond_blast
Opening the database... [4.1e-05s]
Opening the input file... [0.004157s]
Opening the output file... [4.9e-05s]
Loading query sequences... [1.11753s]
Running complexity filter... [1.84137s]
Building query histograms... [0.215376s]
Allocating buffers... [0.00045s]
Loading reference sequences... [0.440936s]
Building reference histograms... [0.76931s]
Allocating buffers... [0.000469s]
Initializing temporary storage... [0.006248s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index... [0.49554s]
Building query index... [0.155904s]
Building seed filter... [0.030046s]
Searching alignments... Segmentation fault: 11

A few questions about the comparison between Diamond and BLASTX

Hello,

Is it possible that a match is found only by Diamond but not by BLASTX when the same parameters are used?

And, are the e-values and bitscores from Diamond the same as those from BLASTX when the same database is used?

Thanks.

output format, what to do when xml is needed

Hello,

While diamond blast is wonderfully fast some applications need blast results in xml format. I haven't been able to locate a tool or method to convert the diamond daa or tabular output to xml, could you include an option to convert the daa files to xml or point me in the direction of a script/tool which can do the job?

Thanks,
Diane

On Linux, diamond is only for x86_64?

Hi,

I just wanted to check, I presume that diamond is not meant for mips, arm or i686? They seem to not be building on these architectures:

arm http://hydra.gnu.org/build/701534/nixlog/1
mips http://hydra.gnu.org/build/700108/nixlog/1
i686 http://hydra.gnu.org/build/700494/nixlog/1

I understand if supporting these platforms is not a priority, I just wanted to double check before I disable the building of them in guix. x86_64 works great though!

Thanks,
ben

FreeBSD package

Hello,

I've created a FreeBSD package, so FreeBSD users can install diamond with a simple pkg install diamond. Packages are only available for 9amd64 and 10amd64 (I haven't had luck compiling on i386).

Maybe you could mention this in your README along with the other operating systems?

Regards.

Invalid character (-/45) in sequence

This happens on a freshly downloaded nr.gz file from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. The formatting command was diamond makedb --in nr.gz -d nrDec2015 --threads 8

It is entirely possible that nr contains non-standard sequences, but the problem is that diamond makedb errored out a few hours after beginning the indexing, and it is not clear how to "reformat" nr to get rid of this sequence. Would it be possible to:

have diamond makedb simply ignore malformed sequences instead of aborting the job
provide more information about the malformed sequence, so that the input can be corrected

PS. I'll try to use the -v option
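
In the meantime, the offending record can be located before running makedb with a small pre-check. This is a minimal sketch under stated assumptions: the ALLOWED alphabet below is a guess at a reasonable protein alphabet (DIAMOND's own accepted set may differ), and find_invalid_residues is a hypothetical helper.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: letters treated as valid here. DIAMOND's accepted alphabet may differ.
ALLOWED = frozenset("ACDEFGHIKLMNPQRSTVWYBJOUXZ*")


def find_invalid_residues(
    lines: Iterable[str], allowed: frozenset = ALLOWED
) -> Iterator[Tuple[int, str, str, int]]:
    """Yield (line_number, header, character, code_point) for each bad character."""
    header = ""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            header = line[1:]  # remember which record we are in
            continue
        for char in line.upper():
            if char not in allowed:
                yield line_number, header, char, ord(char)
```

For a compressed file, the lines can be streamed with gzip.open(path, "rt") from the standard library, so the database does not need to be unpacked first.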

Error: Invalid character (@/64) in sequence

Hi Diamond team,

I am getting an "invalid character in sequence" error while running diamond blastx. As input, I gave it a fasta file with 76730390 records. I am also using these options: --evalue 1.0 --threads 1 --max-target-seqs 20 --sensitive

Does the error message mean there is an invalid character in the fasta sequence, or is it asking for a fastq sequence (where the sequence ID should start with an @ symbol)? I did try and grep for an "@" symbol in the input data and couldn't find one. Does the "64" refer to the ASCII value of "@"? Or is it a position in the input sequence? (my sequences are 100 bp long). I am running diamond v0.7.8.57 if that helps.

Thanks for any advice!
~Lina
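
The two reports in this list show "-/45" and "@/64". In both cases the number matches the ASCII code of the character, which suggests (without confirmation from the source) that it is a character code rather than a position. A minimal check, with describe_character and guess_format as hypothetical helpers:

```python
def describe_character(char: str) -> str:
    """Format a character as '<char>/<code point>', e.g. '@/64'."""
    return f"{char}/{ord(char)}"


def guess_format(first_line: str) -> str:
    """Guess the file format from the first character of the first line."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    return "unknown"
```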

Feature request: blast "-outfmt 11 " output from diamond view

Could diamond view have the BLAST archive format as another option to view the results?
I was thinking about using the BLAST archive format (ASN.1) as an input to blast_formatter to easily convert the output.

Command terminated abnormally error

Hello,

I'm running Diamond on Mac OS X 10.6 Snow Leopard. I was able to successfully align 1 sequence (https://gist.github.com/pcantalupo/5899d47212e73ec02f35) against the NR database, but when I run 100 sequences (https://gist.github.com/pcantalupo/77c59a00a423783c34b3), diamond exits with error code 1 and outputs "Command terminated abnormally". This happens approx. 20 minutes after starting diamond. Here is my command line:

diamond blastx -d ~/local/usr/local/ncbi/blast/db/nr -q seqs100.fa -a diamond.seqs100 -t diamond_tmp

My system has 8 hyperthreaded cores and 32 GB ram. The NR.dmnd file size is 22.1 GB. I created the dmnd file with NR.faa from NR downloaded in late 2013.

Thank you for your help,

Paul

Align multiple files

Is there a way to specify multiple query files and multiple .daa output files in a fashion similar to malt?
So far I get "Error: Invalid parameter count for option <-q or -a>"
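
As a workaround, a wrapper can issue one invocation per query file. The sketch below only builds the argument lists, using the -d, -q and -a options that appear in the commands quoted in these reports; build_blastx_commands and the output naming scheme are illustrative assumptions.

```python
from pathlib import Path
from typing import List


def build_blastx_commands(
    database: str, query_files: List[str], output_dir: str
) -> List[List[str]]:
    """Build one `diamond blastx` argument list per query file."""
    commands = []
    for query in query_files:
        # Assumption: name each output after its query file (without extension).
        output = Path(output_dir) / Path(query).stem
        commands.append(
            ["diamond", "blastx", "-d", database, "-q", query, "-a", str(output)]
        )
    return commands
```

Each list can then be passed to subprocess.run(command, check=True). Note that every invocation opens the database again, so this is a convenience rather than a performance optimisation.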

Diamond view reading gzipped files?

Would it be possible to modify diamond view so that it can read gzipped diamond archive files or so that it can read from stdin?
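
Until then, one possible workaround is to decompress the archive to a temporary file and point diamond view at that file. A minimal sketch using only the Python standard library; gunzip_to_tempfile is a hypothetical helper and the caller is responsible for deleting the temporary file.

```python
import gzip
import shutil
import tempfile


def gunzip_to_tempfile(gz_path: str, suffix: str = ".daa") -> str:
    """Decompress gz_path into a new temporary file and return its path."""
    with gzip.open(gz_path, "rb") as src, tempfile.NamedTemporaryFile(
        suffix=suffix, delete=False
    ) as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
        return dst.name
```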

case of matrix parameter

Somewhere between v0.7.10 and v0.8.16, the accepted values for --matrix changed from uppercase to lowercase - I had to look through the source to figure out why I was now getting an "Invalid scoring matrix" error. It would be helpful to either update the documentation to reflect the case change (e.g. in the "Scoring matrices" table and the default parameter section) or to accept either case in the parameter parsing.

blastx error writing file

I find that with a query of more than ~30 reads I get the following error:

Error: 20File_write_exception: Error writing file out.daa (696/5)

With fewer than ~30 reads, no error occurs. Furthermore, I say ~30 because sometimes it's more and sometimes it's less (I tried adding one read at a time until I got the error) depending on what the reads actually are, even if they're all the same length: sometimes inclusion of the 25th read will induce the error, sometimes the 35th. I've ruled out that a specific read is causing the problem, i.e. if 25 reads give an error, using just the 25th read plus a dozen other random ones doesn't give the error.

No amount of available memory or number of threads has any effect, nor does using or not using --tmpdir.

Below is an example of the verbose output:

diamond blastx -d uniref50_DEMO.dmnd -q in.fa -a out -p 1 -t tmp -v
diamond v0.7.9.58

Threads = 1

Scoring matrix = blosum62
Lambda = 0.267
K = 0.041
Gap open penalty = 11
Gap extension penalty = 1
Seg masking = 1
SSSE3 enabled.
Opening the database... [0.1s]
Reference = uniref50_DEMO.dmnd
Sequences = 327
Letters = 124409
Block size = 2000000000
Opening the input file... [0.0s]
Opening the output file... [0.0s]
Loading query sequences... [0.0s]
Sequences = 300, letters = 14900
Running complexity filter... [0.0s]
Building query histograms... [0.0s]
Allocating buffers... [0.0s]
Loading reference sequences... [0.2s]
Allocating buffers... [0.0s]
Initializing temporary storage... [0.0s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 0.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 1.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 2.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 0, index chunk 3.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 1, index chunk 0.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 1, index chunk 1.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 1, index chunk 2.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 1, index chunk 3.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 2, index chunk 0.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.6s]
Processing query chunk 0, reference chunk 0, shape 2, index chunk 1.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 2, index chunk 2.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.6s]
Processing query chunk 0, reference chunk 0, shape 2, index chunk 3.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 3, index chunk 0.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 3, index chunk 1.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 3, index chunk 2.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.5s]
Processing query chunk 0, reference chunk 0, shape 3, index chunk 3.
Building reference index... [0.0s]
Building query index... [0.0s]
Searching alignments... [1.6s]
Closing temporary storage... [0.0s]
Deallocating buffers... [0.0s]
Computing alignments... [0.0s]
Error: 20File_write_exception: Error writing file out.daa (696/5)

The daa file is created, but it is empty.

database naming issue .dmnd

I was trying to make a simple wrapper either via the terminal and/or via galaxy.
Knowing that galaxy names its files as dataset_xxx.dat I was trying to perform a blast against such a name.dat

However, diamond automatically appends the .dmnd extension, which the file does not have, and this obviously results in an error opening the file.

Error: function Input_stream::Input_stream(const string &, bool) line 75. Error opening file dataset_xxx.dat.dmnd

Is there a way to tell diamond what the file name is without diamond appending the dmnd extension? Otherwise I need to copy the database file every time a blast is performed...
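
A symbolic link avoids the copy: the link carries the .dmnd extension and points at the original file. A minimal sketch, assuming a filesystem that supports symlinks; link_with_dmnd_extension is a hypothetical helper.

```python
import os


def link_with_dmnd_extension(database_path: str) -> str:
    """Create '<database_path>.dmnd' as a symlink to the file and return the link path."""
    link_path = database_path + ".dmnd"
    if not os.path.lexists(link_path):
        # Link to the absolute path so the link works from any working directory.
        os.symlink(os.path.abspath(database_path), link_path)
    return link_path
```

Since the report above indicates that diamond appends the extension itself, the original path (for example dataset_xxx.dat) would then be passed as the database name.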

Multiple sequences

$ diamond blastp --query multiple.fasta ...

Can multiple.fasta contain inputs like this:


>seq1
LEARDLYCERDERTLFRGLSFTVEAGEWVQVTGGNGAGKTTLLRLLTGLARPDGGEVYWQ
GEPLRRVRDSFHRSLLWIGHQPGIKTRLTARENLHFFHPGDGARLPEALAQAGLAGFEDV
PVAQLSAGQQRRVALARLWLTRAALWVLDEPFTAIDVNGVARLTRRMAAHTAQGGMVILT
THQPLPGAADTVRRLALTGGGAGL

>seq2
MWRVFCLELRVAFRHGADIAGPLWFFLMVITLFPLSVGPQPQLLARIAPGIIQVAALLAS
LLALERLFRDDLQDGSLEQLMLLPVPLPAVVLAKVLAHWAVTGLPLMMLSPLVALLLGMD
VYGWKIMALTLLLGTPALGFLAAPGVGLTAGLRRGGVLLGILVLPLSVPVLIFATAAMDA
ASMHLPVDGYLAVLGALLAGSATLSPFATAAALRISTQ

>seq3
MWKTLHQLAAPPRLYQICGRLVPWLAAAGIIVLATGWVRGFGFAPADYQQGESYRIMYLH
VPAAIWSMGIYAAMAVAAFTGLVWQMKMASLAVAAMAPVGAVYTFIALVTGAAWGKPMWG
TWWVWDARLTSELVLLFLYAGVIALWHAFDDRKMAGRAAGILVLVGVVNLPVIHYSVEWW
NTLHQGSTRMQLSIDPAMRSPLRWAIAGFLLLFMTLALMRMRNLILLMEKRRPWVSELIL
KRGHR

@bbuchfink
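
The layout shown is an ordinary multi-record FASTA file: each ">" line starts a new record and the following lines are concatenated into one sequence. The sketch below shows how such a file is conventionally read; it is a generic reader and not DIAMOND's parser, and it skips blank lines, which DIAMOND may or may not tolerate.

```python
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank separator lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```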

Can not build Diamond 0.7.3

Hi,

Just wanted to give your software a try, but I cannot build it on my computer (I just followed your recipe, installing boost first and then building diamond). Could you help me figure out what is wrong?

Thx.

-- Error Returned--
g++ -DPACKAGE_NAME="diamond" -DPACKAGE_TARNAME="diamond" -DPACKAGE_VERSION="0.7.3" -DPACKAGE_STRING="diamond\ 0.7.3" -DPACKAGE_BUGREPORT="[email protected]" -DPACKAGE_URL="" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DHAVE_BOOST=1 -DHAVE_BOOST_SYSTEM_ERROR_CODE_HPP=1 -DHAVE_BOOST_SYSTEM_ERROR_CODE_HPP=1 -DHAVE_BOOST_THREAD_HPP=1 -DHAVE_BOOST_PROGRAM_OPTIONS_HPP=1 -DHAVE_BOOST_SYSTEM_ERROR_CODE_HPP=1 -DHAVE_BOOST_CHRONO_HPP=1 -DHAVE_BOOST_SYSTEM_ERROR_CODE_HPP=1 -DHAVE_BOOST_TIMER_TIMER_HPP=1 -DHAVE_BOOST_IOSTREAMS_DEVICE_FILE_DESCRIPTOR_HPP=1 -DPACKAGE="diamond" -DVERSION="0.7.3" -I. -DNDEBUG -Iboost/include -pthread -g -O2 -MT diamond-main.o -MD -MP -MF .deps/diamond-main.Tpo -c -o diamond-main.o test -f 'main.cpp' || echo './'main.cpp
run/../output/../output/daa_record.h:108: warning: friend declaration ‘Binary_buffer::Iterator& operator>>(Binary_buffer::Iterator&, DAA_match_record<_val>&)’ declares a non-template function
run/../output/../output/daa_record.h:108: warning: (if this is not what you intended, make sure the function template has already been declared and add <> after the function name here) -Wno-non-template-friend disables this warning
run/../data/../util/complexity_filter.h: In instantiation of ‘const Complexity_filter<Value_type<Letter_nucl> > Complexity_filter<Value_type<Letter_nucl> >::instance’:
run/../data/../util/complexity_filter.h:34: instantiated from ‘static const Complexity_filter<_val>& Complexity_filter<_val>::get() [with _val = Value_type<Letter_nucl>]’
run/master_thread.h:228: instantiated from ‘void master_thread(Database_file&, boost::timer::cpu_timer&, boost::timer::cpu_timer&) [with _val = Value_type<Letter_nucl>, _locr = long unsigned int]’
run/master_thread.h:286: instantiated from ‘void master_thread() [with _val = Value_type<Letter_nucl>]’
main.cpp:184: instantiated from here
run/../data/../util/complexity_filter.h:88: error: uninitialized const ‘Complexity_filter<Value_type<Letter_nucl> >::instance’
make: *** [diamond-main.o] Error 1

--- Error Returned ---

--- C/C++ version of compilers ---
scruveil@etna17$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --disable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)
--- C/C++ version of compilers ---

--- System Info ---
$ lsb_release -a
LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Release: 5.8
Codename: Tikanga
--- System Info ---

Clarify license

Hello @bbuchfink,

I've been working to get bioinformatics tools available as anaconda packages, including Diamond over at biocore/conda-recipes#16

I wanted to check in with you. Are you comfortable with this?

If you are cool with this, can you help me understand your license?
Your COPYING file looks like BSD, while a few of your files use the Affero GPL.

Thanks for your time!
Colin

OS X Build Warning: in-class initializer for static data member of type 'const double' is a GNU extension [-Wgnu-static-float-init]

OSX 10.10.3

/Applications/CLion.app/Contents/bin/cmake/bin/cmake --build /Users/lee/Library/Caches/clion10/cmake/generated/7781eb69/7781eb69/Debug --target all -- -j 4
[ 88%] Built target blast_core
Scanning dependencies of target diamond
[ 92%] Building CXX object CMakeFiles/diamond.dir/src/main.cpp.o
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/main.cpp:26:
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/data/reference.h:28:
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/data/sorted_list.h:25:
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/data/seed_histogram.h:26:
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/data/../basic/shape_config.h:24:
In file included from /Users/lee/Dropbox/RandD/Repositories/diamond/src/basic/shape.h:28:
/Users/lee/Dropbox/RandD/Repositories/diamond/src/basic/score_matrix.h:167:22: warning: in-class initializer for static data member of type 'const double' is a GNU extension [-Wgnu-static-float-init]
        static const double     LN_2 = 0.69314718055994530941723212145818;

Unexpectedly high memory usage

Hello,

I am using the blastp mode of DIAMOND v0.8.3.65. The database was made with default parameters (namely, --block-size of 2). As such, I expected the maximum memory usage to be about 12GB as per the documentation. The --tmpdir was set to /tmp and all other parameters were left at default values. The query file is very large and this appears to add to the memory usage substantially, with peak usage being over 80GB. Is it possible to limit memory usage for large query files, or is this the expected behaviour?

Regards,
Donovan

Different number of reads in input fasta and diamond results

Hi bbuchfink,

I seem to be missing some reads from the diamond results. For example, my input fasta contains 44,518 sequences, but Diamond only outputs 41,352 reads in fast mode and 41,426 reads in sensitive mode. I can recover some of the reads by setting the --seg option to no. Are there other reasons that Diamond may exclude some of the reads from the output?

Thank you very much for your help!
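
To see exactly which reads are absent, the query IDs in the FASTA input can be compared with those in the tabular output. This is a minimal sketch under two assumptions: the query ID is the FASTA header up to the first whitespace, and it appears in the first tab-separated column of the tabular output. All three function names are hypothetical.

```python
from typing import Iterable, List, Set


def fasta_ids(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs (header text up to the first whitespace)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def tabular_query_ids(tabular_lines: Iterable[str]) -> Set[str]:
    """Collect query IDs from the first column of tabular output."""
    ids = set()
    for line in tabular_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t")[0].strip())
    return ids


def queries_without_hits(
    fasta_lines: Iterable[str], tabular_lines: Iterable[str]
) -> List[str]:
    """Return the sorted IDs present in the input but absent from the results."""
    return sorted(fasta_ids(fasta_lines) - tabular_query_ids(tabular_lines))
```

Reads that simply have no alignment above the reporting thresholds would also show up in this list, so an absent ID does not by itself indicate a bug.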

Unaccounted Query Sequences

Hi Benjamin

I am trying to use DIAMOND (v0.7.12.61) on a FASTQ file containing approx. 20 million 250bp reads. The verbose output once DIAMOND starts shows NR as containing 89 mil. seqs and query seqs. (which I suppose are my input seqs.) as 117 mil. (approx. 5x times more than the actual sequences in my input file). Is this normal?

I have also been trying to replicate the performance testing stats that you report in your paper. I am currently running my dataset on a server with 24 cores (exclusive use) and 256G RAM. I set my block size to 10 and index chunks to 4 to max out the hardware, but it still seems to be taking quite long (more than a day) for a single file.
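
One possible reading of the sequence count, not confirmed anywhere in this report, is that a translated search considers six reading frames per read (three offsets on each strand), in which case about 20 million reads would correspond to about 120 million translated queries. The sketch below only illustrates where the factor of six would come from; six_frames is a hypothetical helper and returns the untranslated nucleotide frames.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def six_frames(dna: str) -> List[str]:
    """Return the six reading frames: offsets 0-2 on both strands."""
    reverse = reverse_complement(dna)
    return [strand[offset:] for strand in (dna, reverse) for offset in range(3)]
```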

Suggestions for Improving Diamond Output

Hi,

Today I've been comparing the data from blastall in regular versus tabular output. I note that the diamond output is practically identical to that in the blastall tabular output.

An issue I'm having is that the tabular output of blastall and diamond does not contain such a rich set of data as blastall regular output. In particular, the descriptions in the first column all refer to the query sequence and are therefore all identical. We all have access to the fasta header of our input sequence so this isn't really providing any helpful information. What would be helpful is to show the fasta header of the hit sequence in the first column of the tabular output instead.

This post I found highlights some issues with blast tabular output including the one I mention, in greater depth: http://blastedbio.blogspot.ca/2012/05/blast-tabular-missing-descriptions.html

It would be wonderful if this feature was provided as a command line option, or even more wonderful if the width of the first column had the ability to be much wider so that useful information could be extracted from this field. :)

Here's hoping and many thanks for providing such a great bioinformatics tool for us to use in our lab!
Gemma

Small file takes longer?

Based on the paper and the manual, I was expecting small files to have substantial overhead running against nr, but what I didn't expect was that they would take longer than a larger file. For example, I can run a 25k sequence set against nr in 45 mins, which is fantastic, but then a set of 29 sequences is taking at least 1.5 hours (still running...).
Is this the expected behavior? I know I can just use blast for a small set like this, but I am programming an annotation service, so having to switch to a different program for some inputs is not convenient.

Ambiguous Nucleotides

Hey,
I was a bit puzzled when DIAMOND kept crashing for me on a new sequence data set, because it worked fine all the time before. Digging into the reason it seems that the problem is the translation table when using the BLASTX setting.

Your translation lookup table (in src/basic/translate.h) is only 5x5, so I assume that besides ATGC DIAMOND will work fine with Ns as well, but the other ambiguous positions (such as R, W, Y, S, etc.) will lead to crashes.

While it's possible to replace those characters by Ns it would be nicer if DIAMOND could take care of it right away. Either by replacing those itself or by even making use of the information and extending the translation table.

Cheers,
Bastian
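
The replacement workaround mentioned above can be sketched as a small preprocessing step. This assumes that mapping every IUPAC ambiguity code to N is acceptable for the analysis; mask_ambiguous and mask_fasta_lines are hypothetical helpers, and header lines are left untouched.

```python
from typing import Iterable, Iterator

# IUPAC nucleotide ambiguity codes other than N.
_IUPAC_AMBIGUOUS = "RYSWKMBDHV"
_MASK_TABLE = str.maketrans(
    _IUPAC_AMBIGUOUS + _IUPAC_AMBIGUOUS.lower(),
    "N" * len(_IUPAC_AMBIGUOUS) + "n" * len(_IUPAC_AMBIGUOUS),
)


def mask_ambiguous(sequence: str) -> str:
    """Replace IUPAC ambiguity codes with N, preserving case."""
    return sequence.translate(_MASK_TABLE)


def mask_fasta_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mask sequence lines of a FASTA file while leaving headers unchanged."""
    for line in lines:
        yield line if line.startswith(">") else mask_ambiguous(line)
```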

cmake build from source method not working

Hi,

I'm attempting to get diamond integrated with guix, and was trying to get the cmake install method to work, but to no avail

diamond-0.7.9/build$ cmake ..
CMake Error: The source directory "/tmp/nix-build-diamond-0.7.9.drv-0/diamond-0.7.9" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.

Is it possible some files were not added to the git repo maybe?

Thanks. ben

Problems compiling version 0.8.7 on MacOS X

Using MacOS X 10.11.5, cmake version 3.5.2 from homebrew, I get the following error after running make install (all prior steps run without errors):

Ians-MacBook-Pro:bin ianmarshall$ make install
Scanning dependencies of target diamond
[  2%] Building CXX object CMakeFiles/diamond.dir/src/run/main.cpp.o
In file included from /Users/ianmarshall/diamond-0.8.7/src/run/main.cpp:25:
In file included from /Users/ianmarshall/diamond-0.8.7/src/run/../run/master_thread.h:30:
In file included from /Users/ianmarshall/diamond-0.8.7/src/run/../output/join_blocks.h:27:
In file included from /Users/ianmarshall/diamond-0.8.7/src/run/../output/output_file.h:26:
In file included from /Users/ianmarshall/diamond-0.8.7/src/run/../output/output_buffer.h:25:
/Users/ianmarshall/diamond-0.8.7/src/run/../output/output_format.h:222:26: error: default initialization of an object of const type
      'const XML_format' without a user-provided default constructor
        static const XML_format xml;
                                ^
1 error generated.
make[2]: *** [CMakeFiles/diamond.dir/src/run/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/diamond.dir/all] Error 2
make: *** [all] Error 2

cmake output is as follows:

-- The C compiler identification is AppleClang 7.3.0.7030031
-- The CXX compiler identification is AppleClang 7.3.0.7030031
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test COMPILER_SUPPORTS_MARCHNATIVE
-- Performing Test COMPILER_SUPPORTS_MARCHNATIVE - Success
-- Found ZLIB: /usr/lib/libz.dylib (found version "1.2.5") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - found
-- Found Threads: TRUE  
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/ianmarshall/diamond-0.8.7/bin

Do you have any thoughts about how I could get this working? Thanks!

Question about max-target-seqs option

Question, does max-target-seqs=N mean that the overall best N matches are shown, or only the first N that pass the e-value threshold? Hopefully the former, which would be a real improvement over the similar option of blastx.
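
To make the two interpretations concrete, the sketch below implements both on a list of hits. It describes the semantics being asked about and makes no claim about which one DIAMOND or blastx implements; the Hit layout and both function names are assumptions for illustration.

```python
from typing import List, Tuple

Hit = Tuple[str, float, float]  # (subject id, e-value, bit score)


def best_n(hits: List[Hit], n: int, max_evalue: float) -> List[Hit]:
    """Overall best N: filter by e-value, then rank by bit score."""
    passing = [hit for hit in hits if hit[1] <= max_evalue]
    return sorted(passing, key=lambda hit: hit[2], reverse=True)[:n]


def first_n(hits: List[Hit], n: int, max_evalue: float) -> List[Hit]:
    """First N encountered that pass the e-value threshold."""
    return [hit for hit in hits if hit[1] <= max_evalue][:n]
```

The two functions return the same hits only when the input happens to be ordered by score already.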

Error diamond view

I tried converting my resulting .daa file to .m8 with the following command:
diamond view -a sample_79.daa -o sample_79.m8
and got this error message:
Error: function Output_stream::Output_stream(const string&, bool, Output_stream::Flags) line 201. Error opening file sample_79.m8.

can you point me to what could be wrong? my original blast (which was a DNA query to an amino acid sequence database) was carried out with the following command:
/diamond blastx -d prot.db -q assembly_79.fasta -a sample_79 -p 10 -t /tmp

a bug of blastx mode in v0.7.12?

As shown in this example, searching the nucleotide sequence against its translation finds no hit:

$ cat > a.fna
>gb|U00096.2|:190-255 Escherichia coli str. K-12 substr. MG1655, complete genome
ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA
# This is the translated seq of a.fna
$ cat > a.faa
>gi|1786182|gb|AAC73112.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
MKRISTTITTTITITTGNGAG

$ ~/foo/diamond  makedb --in a.faa --db a

$ ~/foo/diamond blastx --db a --tmpdir /tmp/ --query a.fna --daa /tmp/a.daa

$ ~/foo/diamond view -a /tmp/a.daa -o test.fna.tab --outfmt sam

$ cat test.fna.tab 
@HD     VN:1.5  SO:query
@PG     PN:DIAMOND
@mm     BlastX
@CO     BlastX-like alignments
@CO     Reporting AS: bitScore, ZR: rawScore, ZE: expected, ZI: percent identity, ZL: reference length, ZF: frame, ZS: query start DNA coordinate

$ ~/foo/diamond -v
diamond v0.7.12.61
#Threads = 24
Scoring matrix = blosum62
Lambda = 0.267
K = 0.041
Gap open penalty = 11
Gap extension penalty = 1
Seg masking = 0
SSSE3 enabled.
Insufficient arguments. Use diamond -h for help.

pthread missing from CMakeLists.txt

In CMakeLists.txt (line 21):

- target_link_libraries(diamond blast_core ${Boost_LIBRARIES})
+ target_link_libraries(diamond blast_core ${Boost_LIBRARIES} -lpthread)

Linking to pthread seems to be missing. I'm not sure if this is exactly what you want to add because I imagine this may have ramifications on Windows. However, I also don't know if diamond works on Windows so just let me know if you want a quick patch for that fix?

DIAMOND v0.8.5 ftell error

Hi,
I got the "Error executing ftell on stream filename.daa" message on Windows Server 2012 Standard. Diamond generated only an incomplete output file (~4Gb).

Command: diamond blastx -d nr -q filename.fa -a filenamedmnd -c 1 --sensitive

Fortunately Linux version completed without any issues.

Thank you: Blaize

"Insufficient arguments" error should return nonzero exit status

It would help pipelines stop on error.
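
Until the exit status is reliable, a pipeline could also inspect the captured output. A minimal sketch, assuming that failures are reported with lines starting with "Error:" or "Insufficient arguments" as in the logs quoted in these reports; diamond_run_failed is a hypothetical helper.

```python
def diamond_run_failed(returncode: int, output: str) -> bool:
    """Return True if the exit status or the captured output indicates a failure."""
    if returncode != 0:
        return True
    for line in output.splitlines():
        stripped = line.strip()
        # Assumption: these prefixes mark failures, as seen in the quoted logs.
        if stripped.startswith("Error:") or stripped.startswith("Insufficient arguments"):
            return True
    return False
```

The inputs would typically come from subprocess.run(command, capture_output=True, text=True), combining stdout and stderr before the check.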

Diamond threading behaves differently on similar systems...

We have an MPI cluster with nodes with 16 cores and the following processors:
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
and another MPI cluster with nodes with 24 cores and the following processors:
model name : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

We run the same slurm job on both clusters but on the latter cluster, diamond runs 24 threads on 24 cores which performs as expected. On the former cluster, diamond runs 16 threads on 1 core.

Is there a command line option or environment variable that is causing the former cluster to multithread on a single core rather than spreading the threads across all the cores?
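
One thing worth checking, offered as an assumption rather than a diagnosis, is whether the scheduler restricts the job to a single core through CPU affinity. The sketch below reports how many CPUs the current process is allowed to use; os.sched_getaffinity is only available on some platforms (including Linux), so the helper falls back to os.cpu_count elsewhere. usable_cpu_count is a hypothetical name.

```python
import os


def usable_cpu_count() -> int:
    """Return the number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Honors affinity masks set by a job scheduler or taskset.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```

Running this inside the same slurm job on both clusters would show whether the allowed CPU set differs between them.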

Error while compiling from source diamond-0.8.5

Hello,

I have been trying to install diamond v0.8.5 on an account without root permissions.
This is what I have done:

$ wget http://github.com/bbuchfink/diamond/archive/v0.8.5.tar.gz
$ tar xzf v0.8.5.tar.gz
$ cd diamond-0.8.5
$ mkdir bin
$ cd bin
$ cmake .. -DCMAKE_INSTALL_PREFIX=/home/tokarev/programs/diamond
$ make install

This is what I get:

[tokarev@hhmi bin]$ make install
[  3%] Building CXX object CMakeFiles/diamond.dir/src/run/main.cpp.o
In file included from /home/tokarev/tmp/install_files/diamond/src/run/../data/seed_histogram.h:26,
                 from /home/tokarev/tmp/install_files/diamond/src/run/../data/sorted_list.h:25,
                 from /home/tokarev/tmp/install_files/diamond/src/run/../data/reference.h:29,
                 from /home/tokarev/tmp/install_files/diamond/src/run/main.cpp:25:
/home/tokarev/tmp/install_files/diamond/src/run/../data/sequence_set.h: In constructor 'Sequence_set::Sequence_set(Input_stream&)':
/home/tokarev/tmp/install_files/diamond/src/run/../data/sequence_set.h:41: error: class 'Sequence_set' does not have any field named 'String_set'
make[2]: *** [CMakeFiles/diamond.dir/src/run/main.cpp.o] Error 1
make[1]: *** [CMakeFiles/diamond.dir/all] Error 2
make: *** [all] Error 2

I am trying to install HUMAnN2 using Anaconda running on Python 2.7. HUMAnN2 requires diamond, and I get this message while installing it.

Thank you!
Vasily

Input file is a DAA file (possibly related to issue - blastx error writing file)

Diamond runs without error with a small number of query sequences (fewer than ~30); see the issue "blastx error writing file." However, the resulting DAA files are not readable by view. This is the error:

Error: Input file is not a DAA file.

This is the verbose output:

./diamond view -a out.daa -o out.tsv -v
diamond v0.7.9.58

Threads = 16

Scoring matrix = blosum62
Lambda = 0.267
K = 0.041
Gap open penalty = 11
Gap extension penalty = 1
Seg masking = 0
SSSE3 enabled.
Error: Input file is not a DAA file.

I could provide an example of one of these files if you suggest a method to do so.

how to set --band

Do you have any advice on how (or whether) to increase sensitivity by raising the --band parameter? Also, what are the possible values of the parameter, given that the default is not a number but "auto"?

command line redirected input

I tried to use diamond on some FASTQ files, but I did not want to create a temporary FASTA file, so I redirected the input:
diamond blastx -db nr -a out.daa -q <(zcat READS.fastq.gz | fastq_to_fasta)
Error: Invalid input file format

Is there no way for the user to avoid making a temporary fasta file?
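
A possible explanation, stated as an assumption and not confirmed by the report: `<(...)` hands the program a pipe rather than a regular file, and a pipe cannot be rewound, so a reader that inspects the beginning of the input and then seeks back would fail. The POSIX sketch below (the helper name is hypothetical) tests whether a file descriptor supports seeking.

```cpp
#include <sys/types.h>  // off_t
#include <unistd.h>     // lseek

// True if the descriptor supports repositioning.
// Regular files do; pipes and FIFOs make lseek() fail with ESPIPE.
bool is_seekable(int fd) {
    return lseek(fd, 0, SEEK_CUR) != static_cast<off_t>(-1);
}
```

A program that needs to accept piped input can use such a check to fall back to a single forward pass over the data.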

Options --matrix blosum45 and lower give "invalid scoring parameters"

Works with >= blosum62, but not < blosum62:

% diamond blastp -q NC_017032.faa -d sprot.dmnd -v -a out -k 1 -c 1 --matrix blosum45
diamond v0.7.9.58
Error: Invalid scoring parameters

Segmentation fault: 11

I used diamond to align transcriptome contigs to a 25 GB database of proteins. One query (with 120,000 contigs) aligned well and very fast, but the next one, with 80,000 contigs, got Segmentation fault: 11. I concatenated the first and second queries and it worked great and very fast!
Afterwards I downloaded another set of transcriptomic contigs (30,000) and again got Segmentation fault: 11. What is going on? :) I like this great software very much, so please help to fix it.

diamond blastx -p 18 -k 1 -e 1e-20 -d uniprot_trembl.fasta -q Anadara.fa -a matches_anadara_trembl -t /home/cDNA/TEST_assembly
Segmentation fault: 11

Enable "salltitles" option

https://github.com/bbuchfink/diamond/blob/master/src/main.cpp#L98

It looks like the code is there but not implemented yet, I tested to see if it was a hidden feature but no luck. Having this feature would be amazing.

Error with the Windows binary version

When I run the command "diamond.exe blastx -d nr -q reads.fna -a matches", the program aborts unexpectedly. I am using the binary version available for Windows [ https://github.com/bbuchfink/diamond/releases/ ] on Windows 10.

Below is the log from the run.

diamond v0.8.8.70 | by Benjamin Buchfink [email protected]
Check http://github.com/bbuchfink/diamond for updates.

CPU threads: 4

Scoring parameters: (Matrix=blosum62 Lambda=0.267 K=0.041 Penalties=11/1)

Target sequences to report alignments for: 25

Temporary directory:
Opening the database... [0.000133605s]
Opening the input file...

How to deliver the results to Blast2Go?

Version 0.8.1 is a great improvement over the 0.7* versions, with some useful new functions.
It feels like 0.8.1 is a little slower than 0.7*. Is this true?
Diamond has no XML output format, so the question of how to deliver the results to Blast2GO remains open.

Default for query genetic code

Hello,

I'd like to know the default value for query genetic code in Diamond.

Further, can I change the default genetic code on the command line?

Thanks.

compilation issue introduced in 0.8.10

Here is a build log showing the problem:
http://pkg.awarnach.mathstat.dal.ca/data/10amd64-default/2016-07-06_02h47m52s/logs/errors/diamond-0.8.10.log.

The problem occurs on both FreeBSD 9.3 and 10.3 amd64.

There were no such problems in 0.8.9.

XML output

<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id></Hit_id>
  <Hit_def>199_1718 | model_type_id: 40292 | pass_evalue: 1e-150</Hit_def> 
  <Hit_accession></Hit_accession>
  <Hit_len>378</Hit_len>
  <Hit_hsps>

To set Hit_def you use --salltitles

But which option do we use to populate the <Hit_id></Hit_id> and <Hit_accession></Hit_accession> ?

@bbuchfink

error with compiling v8.0.7 on Mac OSX

I get the following error when I try to compile the newest version of diamond:
git/diamond/src/run/../output/output_format.h:222:26: error: default initialization of an object of const type 'const XML_format' without a user-provided default constructor
static const XML_format xml;
^
1 error generated.

When I check out 89ec7d5, it does compile.
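
For context, Clang reports this error when a `const` object of class type is default-initialized and the class has no user-provided default constructor. The sketch below uses a hypothetical stand-in class, not diamond's `XML_format`, to show one way to satisfy the rule.

```cpp
// Hypothetical stand-in for an output-format class.
struct Example_format {
    // User-provided default constructor. An in-class `= default` would not count
    // as user-provided and would leave the error in place.
    Example_format() {}

    int id() const { return 1; }
};

// Default-initialized const object: accepted because the constructor is user-provided.
static const Example_format example_format;
```

An alternative under C++11 is to keep the class unchanged and value-initialize the object with empty braces at its definition.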

Error while compiling from source diamond-0.8.14

[ 85%] Building CXX object CMakeFiles/diamond.dir/src/output/output_format.cpp.o
/Users/vagner/Downloads/diamond-0.8.14/src/output/output_format.cpp:124:34: error:
use of overloaded operator '<<' is ambiguous (with operand types
'Text_buffer' and 'uint64_t' (aka 'unsigned long long'))
<< " <Statistics_db-num>" << ref_header.sequences <...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~
/Users/vagner/Downloads/diamond-0.8.14/src/output/../basic/../util/text_buffer.h:131:15: note:
candidate function
Text_buffer& operator<<(char c)
^
/Users/vagner/Downloads/diamond-0.8.14/src/output/../basic/../util/text_buffer.h:138:15: note:
candidate function
Text_buffer& operator<<(uint32_t x)
^
/Users/vagner/Downloads/diamond-0.8.14/src/output/../basic/../util/text_buffer.h:146:15: note:
candidate function
Text_buffer& operator<<(int x)
^
/Users/vagner/Downloads/diamond-0.8.14/src/output/../basic/../util/text_buffer.h:154:15: note:
candidate function
Text_buffer& operator<<(size_t x)
^
/Users/vagner/Downloads/diamond-0.8.14/src/output/../basic/../util/text_buffer.h:161:15: note:
candidate function
Text_buffer& operator<<(double x)
^
1 error generated.
make[2]: *** [CMakeFiles/diamond.dir/src/output/output_format.cpp.o] Error 1
make[1]: *** [CMakeFiles/diamond.dir/all] Error 2
make: *** [all] Error 2
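
The log indicates why the call is ambiguous: on macOS `uint64_t` is `unsigned long long` while `size_t` is `unsigned long`, so none of the listed overloads matches exactly and several conversions rank equally. One possible remedy, sketched with a hypothetical `Buffer` class rather than diamond's `Text_buffer`, is a single template that accepts every integer type.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical text buffer used only for illustration.
class Buffer {
public:
    Buffer& operator<<(char c) { data_ += c; return *this; }
    Buffer& operator<<(const char* s) { data_ += s; return *this; }
    Buffer& operator<<(double x) { data_ += std::to_string(x); return *this; }

    // Matches any integer type except char exactly, so both
    // unsigned long and unsigned long long resolve without ambiguity.
    template <typename T,
              typename std::enable_if<std::is_integral<T>::value &&
                                      !std::is_same<T, char>::value, int>::type = 0>
    Buffer& operator<<(T x) { data_ += std::to_string(x); return *this; }

    const std::string& str() const { return data_; }

private:
    std::string data_;
};
```

A narrower fix is to cast at the call site to a type that has an exact overload, at the cost of repeating the cast wherever a 64-bit value is written.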

compilation problem

We had this problem on Ubuntu 12.04.5 LTS when trying to compile it on Travis CI. Does anyone have any clue on the cause of failure? We also tried the cmake approach, which also failed on both Ubuntu and OSX...

Thank you for the tagged release

Here is the homebrew packaging thread FYI: https://github.com/Homebrew/homebrew-science/issues/1936
