martinsos / opal Goto Github PK

View Code? Open in Web Editor NEW

35.0 8.0 8.0 19.44 MB

SIMD C/C++ library for massive optimal sequence alignment (local/SW, infix, overlap, global)

License: MIT License

Makefile 0.54% C++ 97.74% Shell 1.72%

sequence-alignment bioinformatics smith-waterman c-plus-plus avx alignment database-search

opal's Introduction

Opal

This work has been supported in part by Croatian Science Fundation under the project UIP-11-2013-7353.

Opal (ex Swimd) is SIMD C/C++ library for massive optimal sequence alignment. Opal is implemented mainly by Rognes's "Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation". Main difference is that Opal offers support for AVX2 and 4 alignment modes instead of just Smith-Waterman.

Requirements

SSE4.1 or higher. If AVX2 is available, Opal will consume two times more sequences and will therefore work two times faster. By compiling code with makefile and running ./test, you can see if you have SSE4.1 or AVX2 and also see if everything is working.

Alignment modes

Opal offers 4 different modes of alignment: 1 local and 3 global modes, explained below.

SW: Local alignment (Smith-Waterman). Useful for finding similar regions in not necessarily similar sequences, and also for finding shorter sequence in longer sequence (text searching).
NW: Global alignment (Needleman-Wunsch). Useful for detecting if two sequences of similar lengths are similar.
HW: Semi-global alignment. Insertions before query start and insertions after query end are not penalized. Useful for finding query in target.
OV: Semi-global alignment. Insertions before start of either query or target and insertions after end of either query or target are not penalized. Useful when sequences are partially overlaping or one of sequences is contained in another.

Usage

To use Opal you just have to include opal.h in your code and compile your code together with opal.cpp using appropriate compiler flag for SSE, for example -msse4.1 (or just use -march which will detect your arhitecture and will also use appropriate SSE flag).
Opal is written for C++11 standard, so you should make sure that you compile it according to that. For gcc, add flag -std=c++11.

...
#include "opal.h"
...

...
// Basic parameters
int alphabetLength = 4;
int gapOpen = 3;
int gapExt = 1;
int scoreMatrix[16] = {
    2, -1, -3, 0,
    -1, 4, -5, -1,
    -3, -5, 1, -10,
    0, -1, -10, 4
};

// Query
int queryLength = 10;
unsigned char query[10] = {0,1,3,2,1,0,3,0,1,1};

// Database
int dbLength = 4;
unsigned char dbSeq1[14] = {1,3,2,3,0,0,1,0,2,2,1,2,3,2};
unsigned char dbSeq2[12] = {2,1,1,3,2,0,0,2,2,0,2,1};
unsigned char dbSeq3[13] = {0,0,2,1,0,3,1,1,2,3,2,1,0};
unsigned char dbSeq4[9] = {2,3,3,3,1,1,2,2,0};
unsigned char* db[4] = { dbSeq1, dbSeq2, dbSeq3, dbSeq4 };
int dbSeqsLengths[4] = {14, 12, 13, 9};

// Results for each sequence in database
OpalSearchResult results[4];
for (int i = 0; i < 4; i++) {
    opalInitSearchResult(results[i]);
}

// Do calculation!
int resultCode = opalSearchDatabase(query, queryLength, db, dbLength, dbSeqsLengths,
                                    gapOpen, gapExt, scoreMatrix, alphabetLength, results,
                                    OPAL_MODE_SW, OPAL_OVERFLOW_BUCKETS);

// Print scores
printf("%d %d %d %d\n", results[0].score, results[1].score, results[2].score, results[3].score);
...

For more examples of usage take a look at test.cpp and opal_aligner.cpp. For detailed documentation check out opal.h.

Opal aligner

In order to compile and use simple aligner that uses Opal run makefile in src:

cd src
make

Type ./opal_aligner for help.

Examples of usage:

./opal_aligner ../test_data/query/O74807.fasta ../test_data/db/uniprot_sprot15.fasta
./opal_aligner -p ../test_data/query/O74807.fasta ../test_data/db/uniprot_sprot15.fasta
./opal_aligner -s ../test_data/query/P18080.fasta ../test_data/db/uniprot_sprot12071.fasta
./opal_aligner -s -a NW ../test_data/query/P18080.fasta ../test_data/db/uniprot_sprot12071.fasta

Test data

In test_data/ there are two directories, query/ and db/. In query/ are fasta files with one sequence each, while in db/ are fasta files with many sequences each. All fasta files were obtained from www.uniprot.org. File in db/ with name uniprot_sprotN.fasta was generated by taking first N sequences from uniprot_sprot.fasta, which can be found at www.uniprot.org/downloads -> UniProtKB/Swiss-Prot.

opal's People

Contributors

Stargazers

Watchers

Forkers

tolot27 xubo245 agangzz zzbd-org pombredanne pirtt ivujevic althonos

opal's Issues

calculate hits for all query sequences of a multiple fasta file

Currently, only the first query sequence is searched against all subject/database sequences.

Improve alignment speed

I should improve alignment speed.
One thing I can do for sure is make alignment algorithm banded.
I should also investigate if there are any other ways to make algorithm faster, and I should do some comparison with other aligners.

Add E-value

Information about E-value: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect .

Implementation of E-value in SW#: http://sourceforge.net/p/swsharp/code/ci/master/tree/swsharp/src/evalue.cu .

Add SHW mode

Could I add SHW mode? Ivan Sovic said it would be useful.

Make function input more flexible

Right now user has to provide dbSeqsLengths and results as arrays of consecutive ints/objects.
That means that if user organized his memory differently (for example keeps sequence length and result together with sequence in a wrapper object), he will have to allocate it again to fit ouor function and then copy it back which is a lot of pain.
Instead of taking arrays of objects, we should take arrays of pointers to objects -> that way user only has to set pointers which point to his objects, no matter how he organized them in memory.

Add flag to choose if end location is needed (to make search faster)

When I added functionality for finding end location of alignment (#13), search got slower for about 20%, maybe even more, although I did my best not to slow it down. One possible fix for this, if it can be fixed, would be to introduce a flag that will enable user to choose if he needs end location of alignment or not. I hope that search will be as quick as before when this flag is negative, although I am not sure this will actually help (because we will probably have to put if in core loop, and that may slow things down even more -> or it may help as it will have constant value, I am not sure how smart is compiler or whoever).

When gap open is smaller then gap extend, stuff breaks

Check this out and fix it, on match_bonus branch.
I tried with gap open 0 and extend 1, with gap open 2 and gap extend 3, and I got errors (seg fault or assertions failing), I should check this out and fix it.

Add flag for turning off overflow check

Sometimes user may be confident that score will be in some range. In that case, I could allow him to specify for opal that overflow should not be checked. That could give some speed boost!

Automatize comparison with other aligners and performance tests

Difference between swimd and edlib

Hi Martinos,

I'm interested in potentially making use of one of your libraries, and, in my searching, I've come across the repositories for both swimd and edlib. These seem like they have very similar (or at least overlapping) functionality. Could you elaborate a little bit on what the differences are and which library you'd recommend for different use cases?

Thanks!
Rob

Segmentation fault for certain gap open / gap extension configuration

Hi,

Below example that is causing segfault:

./opal_aligner -o 1 -e 1 -x 2 -a NW -p test_query.fasta test_db.fasta -f test.mat

test_query.fasta:

>test query
AABB

test_db.fasta:

>test target
AABBC

test.mat:

 A  B  C  D
 1 -1 -1 -1
-1  1 -1 -1
-1 -1  1 -1
-1 -1 -1  1

Looks to me like the underflow is occurring in matrix E, which is initialized incorrectly. It might have to do with issue #28, however extended to a wider set of scenarios than those described there.

Consider using swimd to solve 1:1 for large targets

Idea: I could use swimd for 1:1 solving, when target is large. If it is large enough, I could split it to pieces, solve them in parallel, and then combine them. I could use something similar to idea that is used for Hichberg. Possible problem is complexity of combining solutions together: if that part is too slow, the whole idea may not be worth it. Another thing to think about is how to exactly combine the solutions: I should examine Hichberg and start from there, there may also be some other articles that already used similar approach, I should find them and read them.

Alignment of data for SSE/AVX does not seem to matter

Documentation says that data which will be loaded/stored by SSE must be memory aligned to 16/32 bytes, or loading/storing will be much slower and possibly cause exceptions.

I do that like this:

int myArray[4] __attribute__((aligned(16)));

However, if I remove alignment, or align it to some different, wrong value, nothing changes in speed!
I am not sure if loads/stores actually become slower but it does not affect overall speed of execution of opal, or am I actually doing something else wrong -> maybe I am aligning wrong all the time? I should investigate this, to make sure I am doing this correctly.

scores are often incorrect

By comparison with swipe I observed that the scores computed by swimd are often incorrect. Since I did a comparison between swipe and ssearch sometimes ago, I trust the scores of swipe.

Compare for instance the first sequence of ./test_data/db/uniprot_sprot15.fasta with the fourth sequence of the same file with blosum50 and gap open of 13 and gap extend of 2. Swipe returns a raw score of 41, whereas swimd returns 43.

./swimd_aligner -m Blosum50 -o 13 -e 2 ../test_data/db/uniprot_sprot1.fasta ../test_data/db/uniprot_sprot1_4.fasta

./swipe -M BLOSUM50 -G 13 -E 2 -i ../swimd/test_data/db/uniprot_sprot1.fasta -d ../swimd/test_data/db/uniprot_sprot1_4.fasta -m 7 -v 15 -b 15 -e 10000 | grep score

BTW: I've compared the scoring matrices and they are identical.

no member named 'abs' in namespace 'std';

I got this error while trying to execute make

g++ -std=c++11 -march=native -Wall -O3 -c opal.cpp
opal.cpp:1169:45: error: no member named 'abs' in namespace 'std'; did you mean simply 'abs'?
if (M * std::min(Q, T) - gapPenalty(std::abs(Q - T), Go, Ge) >= k) {
^~~~~~~~
abs
/usr/include/stdlib.h:129:6: note: 'abs' declared here
int abs(int) __pure2;
^
1 error generated.
make: *** [opal.o] Error 1

deleted

Update README!

Example in readme is not updated

Process multiple cells in one iteration, add padding to database sequences, optimize for cache

This method is explained in Rognes's article, under "Processing four consecutive cells along the database sequences". They claim that by processing multiple cells they have better usage of SSE registers and caching.
Another thing is adding padding to database sequences so they always have the length that is mutliple of 4. That way I can check every 4 columns if some sequence ended instead of doing it every column, which could make things faster.

Limit memory usage of alignment

Right not alignment eats up quadratic amount of memory. For big sequences this will be a lot of memory, which makes swimd unusable for finding alignment of very large sequences.
I should do something similar to Hichbergs algorithm, in order to limit memory usage. It will make alignment slower for some factor, but that is a trade-off we can not avoid.
It would be nice if user could specify this memory limit.

Write better tests

Write better tests! Something more similar to what I did in EDLIB.

make command line parameters compatible to blast/swipe

Why introducing another set of command line parameters and not using those of Blast which are also used by swipe?

There is also a tiny bug regarding to the gap open penalty. The help says that -g should be used but only -o works.

report GCUPS for opal_aligner ?

I have used SWIPE in the past and am familiar with its output. Specifically, I'm interested in opal_aligner reporting the GCUPS for alignments, either individually or an average for the entire query. Is this possible for the opal_aligner?

Implement AVX2 version

When AVX2 is present on machine, we should use it -> it will give us double speed.

Add callback function that is called after sequence is calculated

This could be interesting: user can do some actions on this, maybe print some output or something.

Create example of usage

Aligner served as example before, but now it got pretty complicated.
I should create very simple example program, that will serve as useful point to get started learning about library

Improve speed of basic search

Before, in v1.0 branch, opal was 1.5 times faster than now! I have to speed up the parts that I can and isolate all functionalities that are not needed in basic search and are making it slower.

Implement calculation of alignment

Right now we get just score, not alignment, so we should implement it. Place to start: SWIPE can do alignment, but does not have it documented in article. Examine SWIPE code to see how they are doing alignment.

Implement finding location in target and query where best score was found

This is first step in implementing finding of alignment (#1). I want this to affect speed of calculation as little as possible, which is the biggest challenge.

Split swimd.cpp into more files

Swimd.cpp file is huge: 1300 lines. It would be nice if I could split it into multiple files. There are some problems with this: it may make usage of library more complicated.

Slowdown after new feature

As reported by @tolot27, after adding different overflow modes with commit 1766765, there is a slowdown of about 10% - 20%.
I did a quick investigation but could find no obvious reason for this. I should do some profiling and serious testing, as such slowdown is not acceptable.