martinsos / edlib Goto Github PK

View Code? Open in Web Editor NEW

489.0 22.0 161.0 4.74 MB

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.

Home Page: http://martinsos.github.io/edlib

License: MIT License

C++ 67.64% C 6.77% CMake 3.10% Shell 6.86% Makefile 1.49% Python 4.27% Dockerfile 0.50% Meson 1.66% Cython 7.72%

sequence-alignment edit-distance levehnstein-distance library c-plus-plus alignment-path python bioinformatics

edlib's Introduction

Edlib ·

A lightweight and super fast C/C++ library for sequence alignment using edit distance.

Calculating edit distance of two strings is as simple as:

edlibAlign("hello", 5, "world!", 6, edlibDefaultAlignConfig()).editDistance;

Edlib is also available for Python PyPI version (Click here for Python README), with code residing at bindings/python.

Developers have created bindings to edlib in other languages as well:

Edlib.jl, a Julia package created and supported by Christopher Rowley (@cjdoris)
edlibR, an R package created and supported by Evan Biederstedt (@evanbiederstedt)

Features

Calculates edit distance (Levenshtein distance).
It can find optimal alignment path (instructions how to transform first sequence into the second sequence).
It can find just the start and/or end locations of alignment path - can be useful when speed is more important than having exact alignment path.
Supports multiple alignment methods: global(NW), prefix(SHW) and infix(HW), each of them useful for different scenarios.
You can extend character equality definition, enabling you to e.g. have wildcard characters, to have case insensitive alignment or to work with degenerate nucleotides.
It can easily handle small or very large sequences, even when finding alignment path, while consuming very little memory.
Super fast thanks to Myers's bit-vector algorithm.

Features
Building
Using Edlib in your project
Usage and examples
API documentation
Alignment methods
Aligner
Running tests
Time and space complexity
Test data
Development and contributing
Publication
Acknowledgements

Using Edlib in your project

You can use Edlib in you project by either directly copying header and source files from edlib/, or by linking Edlib library (see Building for instructions how to build Edlib libraries). In any case, only thing that you have to do in your source files is to include edlib.h.

To get you started quickly, let's take a look at a few ways to get simple Hello World project working.

Our Hello World project has just one source file, helloWorld.cpp file, and it looks like this:

#include <cstdio>
#include "edlib.h"

int main() {
    EdlibAlignResult result = edlibAlign("hello", 5, "world!", 6, edlibDefaultAlignConfig());
    if (result.status == EDLIB_STATUS_OK) {
        printf("edit_distance('hello', 'world!') = %d\n", result.editDistance);
    }
    edlibFreeAlignResult(result);
}

Running it should output edit_distance('hello', 'world!') = 5.

Approach #1: Directly copying edlib source and header files.

Here we directly copied edlib/ directory to our project, to get following project structure:

edlib/  -> copied from edlib/
  include/
    edlib.h
  src/
    edlib.cpp
helloWorld.cpp -> your program

Since helloWorld is a c++ program, we can compile it with just one line: c++ helloWorld.cpp edlib/src/edlib.cpp -o helloWorld -I edlib/include.

If hello world was a C program, we would compile it like this:

    c++ -c edlib/src/edlib.cpp -o edlib.o -I edlib/include
    cc -c helloWorld.c -o helloWorld.o -I edlib/include
    c++ helloWorld.o edlib.o -o helloWorld

Approach #2: Copying edlib header file and static library.

Instead of copying edlib source files, you could copy static library (check Building on how to create static library). We also need to copy edlib header files. We get following project structure:

edlib/  -> copied from edlib
  include/
    edlib.h
  edlib.a
helloWorld.cpp -> your program

Now you can compile it with c++ helloWorld.cpp -o helloWorld -I edlib/include -L edlib -ledlib.

Approach #3: Install edlib library on machine.

Alternatively, you could avoid copying any Edlib files and instead install libraries by running sudo make install (check Building for exact instructions depending on approach you used for building). Now, all you have to do to compile your project is c++ helloWorld.cpp -o helloWorld -ledlib. If you get error message like cannot open shared object file: No such file or directory, make sure that your linker includes path where edlib was installed.

Approach #4: Use edlib in your project via CMake.

Using git submodule

If you are using CMake for compilation, we suggest adding edlib as a git submodule with the command git submodule add https://github.com/martinsos/edlib vendor/edlib. Afterwards, modify your top level CMakeLists.txt file accordingly:

add_subdirectory(vendor/edlib EXCLUDE_FROM_ALL)
target_link_libraries(your_exe edlib) # or target_link_libraries(your_exe edlib)

The add_subdirectory command adds a folder to the build tree, meaning it will run CMakeLists.txt from the included folder as well. Flag EXCLUDE_FROM_ALL disables building (and instalment) of targets in the added folder which are not needed in your project. In the above example only the (static) library edlib will be build, while edlib-aligner, hello_world and the rest won't. In order to access the edlib API, add #include "edlib.h" in your source file (CMake will automatically update your include path).

For more example projects take a look at applications in apps/.

Using VCPKG

Edlib is available on VCPKG package manager. With VCPKG on your system, Edlib can be downloaded using the VCPKG install command vcpkg install edlib. Once the library has been downloaded, add the following instructions to your CMakeLists.txt file:

find_package(edlib CONFIG REQUIRED)
target_link_libraries(MyProject PRIVATE edlib::edlib)

then you should be able to include the library header in your project (#include "edlib.h)

Building

Meson

Primary way of building Edlib is via Meson build tool.

Requirements: make sure that you have meson installed on your system.

Execute

make

to build static library and binaries (apps and tests) and also run tests.
To build shared library and binaries, do make LIBRARY_TYPE=shared.

Library and binaries will be created in meson-build directory.
You can choose alternate build directory like this: make BUILD_DIR=some-other-dir.

Optionally, you can run

sudo make install

to install edlib library on your machine (on Linux, this will usually install it to usr/local/lib and usr/local/include).

Check Makefile if you want to run individual steps on your own (building, tests, ...).

NOTE: If you need more control, use meson command directly, Makefile is here only to help with common commands.

CMake

Edlib can alternatively be built with CMake.

Execute following command to build Edlib using CMAKE:

cd build && cmake -D CMAKE_BUILD_TYPE=Release .. && make

This will create binaries in bin/ directory and libraries (static and shared) in lib/ directory.

./bin/runTests

to run tests.

Optionally, you can run

sudo make install

to install edlib library on your machine.

Conda

Edlib can also be installed via Conda: : conda install edlib.

Usage and examples

Main function in edlib is edlibAlign. Given two sequences (and their lengths), it will find edit distance, alignment path or its end and start locations.

char* query = "ACCTCTG";
char* target = "ACTCTGAAA"
EdlibAlignResult result = edlibAlign(query, 7, target, 9, edlibDefaultAlignConfig());
if (result.status == EDLIB_STATUS_OK) {
    printf("%d", result.editDistance);
}
edlibFreeAlignResult(result);

NOTE: One character is expected to occupy one char/byte, meaning that characters spanning multiple chars/bytes are not supported. As long as your alphabet size is <= 256 you can manually map it to numbers/chars from 0 to 255 and solve this that way, but if its size is > 256 then you will not be able to use Edlib.

Configuring edlibAlign()

edlibAlign takes configuration object (it is a struct EdlibAlignConfig), which allows you to further customize how alignment will be done. You can choose alignment method, tell edlib what to calculate (just edit distance or also path and locations) and set upper limit for edit distance.

For example, if you want to use infix(HW) alignment method, want to find alignment path (and edit distance), are interested in result only if edit distance is not larger than 42 and do not want to extend character equality definition, you would call it like this:

edlibAlign(seq1, seq1Length, seq2, seq2Length,
           edlibNewAlignConfig(42, EDLIB_MODE_HW, EDLIB_TASK_PATH, NULL, 0));

Or, if you want to use suffix(SHW) alignment method, want to find only edit distance, do not have any limits on edit distance and want character '?' to match both itself and characters 'X' and 'Y', you would call it like this:

EdlibEqualityPair additionalEqualities[2] = {{'?', 'X'}, {'?', 'Y'}};
edlibAlign(seq1, seq1Length, seq2, seq2Length,
           edlibNewAlignConfig(-1, EDLIB_MODE_SHW, EDLIB_TASK_DISTANCE, additionalEqualities, 2));

We used edlibNewAlignConfig helper function to easily create config, however we could have also just created an instance of it and set its members accordingly.

Handling result of edlibAlign()

edlibAlign function returns a result object (EdlibAlignResult), which will contain results of alignment (corresponding to the task that you passed in config).

EdlibAlignResult result = edlibAlign(seq1, seq1Length, seq2, seq2Length,
                                     edlibNewAlignConfig(-1, EDLIB_MODE_HW, EDLIB_TASK_PATH, NULL, 0));
if (result.status == EDLIB_STATUS_OK) {
    printf("%d\n", result.editDistance);
    printf("%d\n", result.alignmentLength);
    printf("%d\n", result.endLocations[0]);
}
edlibFreeAlignResult(result);

It is important to remember to free the result object using edlibFreeAlignResult function, since Edlib allocates memory on heap for certain members. If you decide to do the cleaning manually and not use edlibFreeAlignResult, do not forget to manually free() required members.

Turning alignment to cigar

Cigar is a standard way to represent alignment path. Edlib has helper function that transforms alignment path into cigar.

char* cigar = edlibAlignmentToCigar(result.alignment, result.alignmentLength, EDLIB_CIGAR_STANDARD);
printf("%s", cigar);
free(cigar);

API documentation

For complete documentation of Edlib library API, visit http://martinsos.github.io/edlib (should be updated to the latest release).

To generate the latest API documentation yourself from the source, you need to have doxygen installed. Position yourself in the root directory and run doxygen, this will generate docs/ directory. Then open docs/html/index.html file with you favorite browser.

Alternatively, you can directly check edlib.h.

Alignment methods

Edlib supports 3 alignment methods:

global (NW) - This is the standard method, when we say "edit distance" this is the method that is assumed. It tells us the smallest number of operations needed to transform first sequence into second sequence. This method is appropriate when you want to find out how similar is first sequence to second sequence.
prefix (SHW) - Similar to global method, but with a small twist - gap at query end is not penalized. What that means is that deleting elements from the end of second sequence is "free"! For example, if we had AACT and AACTGGC, edit distance would be 0, because removing GGC from the end of second sequence is "free" and does not count into total edit distance. This method is appropriate when you want to find out how well first sequence fits at the beginning of second sequence.
infix (HW): Similar as prefix method, but with one more twist - gaps at query end and start are not penalized. What that means is that deleting elements from the start and end of second sequence is "free"! For example, if we had ACT and CGACTGAC, edit distance would be 0, because removing CG from the start and GAC from the end of second sequence is "free" and does not count into total edit distance. This method is appropriate when you want to find out how well first sequence fits at any part of second sequence. For example, if your second sequence was a long text and your first sequence was a sentence from that text, but slightly scrambled, you could use this method to discover how scrambled it is and where it fits in that text. In bioinformatics, this method is appropriate for aligning read to a sequence.

Aligner

Edlib comes with a standalone aligner cli app, which can be found at apps/aligner/.

Aligner reads sequences from fasta files, and it can display alignment path in graphical manner or as a cigar. It also measures calculation time, so it can be useful for testing speed and comparing Edlib with other tools.

Check Building to see how to build binaries (including edlib-aligner). Run ./build/bin/edlib-aligner with no params for help and detailed instructions.

Example of usage: ./build/bin/edlib-aligner -p apps/aligner/test_data/query.fasta apps/aligner/test_data/target.fasta

NOTE: Aligner currently does not work on Windows, because it uses getopt to parse command line arguments, which is not supported on Windows.

Running tests

Check Building to see how to build binaries (including binary runTests). To run tests, just run ./runTests. This will run random tests for each alignment method, and also some specific unit tests.

Time and space complexity

Edlib is based on Myers's bit-vector algorithm and extends from it. It calculates a dynamic programming matrix of dimensions Q x T, where Q is the length of the first sequence (query), and T is the length of the second sequence (target). It uses Ukkonen's banded algorithm to reduce the space of search, and there is also parallelization from Myers's algorithm, however time complexity is still quadratic. Edlib uses Hirschberg's algorithm to find alignment path, therefore space complexity is linear.

Time complexity: O(T * Q).

Space complexity: O(T + Q).

It is worth noting that Edlib works best for large, similar sequences, since such sequences get the highest speedup from banded approach and bit-vector parallelization.

Test data

In test_data/ directory there are different genome sequences, ranging from 10 kbp to 5 Mbp in length. They are ranging in length and similarity, so they can be useful for testing and measuring speed in different scenarios.

Development and contributing

Feel free to send pull requests and raise issues.

When developing, you may want to use -D CMAKE_BUILD_TYPE=Debug flag when calling cmake in order to get debugging flags passed to compiler. This should also happen if you just run cmake .. with no flags, but I think I have noticed it does not always works as expected (probably has something to do with cmake cache). To check which flags is compiler using, run make with VERBOSE=1: make VERBOSE=1.

Publication

Martin Šošić, Mile Šikić; Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance. Bioinformatics 2017 btw753. doi: 10.1093/bioinformatics/btw753

Acknowledgements

Mile Šikić (@msikic) - Mentoring and guidance through whole project.

Ivan Sović (@isovic) - Help with testing and prioritizing features, valuable comments on the manuscript.

FAQ

What do terms NW, HW and SHW stand for?

NW stands for Needleman-Wunsch, HW for Hybrid Wunsch, and SHW for Semi Hybrid Wunsch. While NW is a common abbreviation, HW and SHW abbreviations were made up at the very start of this project to describe additional modes of alignment. Later we started using terms "global", "infix" and "prefix" more, as they describe the modes better, but terms NW, HW and SHW are still very present in the project.

edlib's People

Contributors

Stargazers

Watchers

Forkers

foxish gevg mldbai isovic gitter-badger kabuhr xubo245 agangzz stepanic ksahlin xiaohesisi fabricgenomics phillipf rob-p ivujevic little-aeinstein bryanlng shawnvdavis karanuppal2790 agsant01 doevelopper randrick srikanthmaturu jgoppert tdnet12434 shenjia1 yabinfan kurtbrose mahmoud pablorp80 elouizbadr zihua joulankadalem bonesoul davpal srbehera11 laurentfournier animecyc 4creators jiasir803 spinorx drmorr0 skyformat99 zgsxwsdxg mrea1 richardchenarcelik jbrown136 biocodings donikv mdr0id nh13 smazumder05 chenzhiw fw1121 solksjaer varalakshmika jayeshmahajan23 52hzzz atuladn bdice hose1021 paulopieczarka jbaiter richierunner pombredanne pantlavanya andurill anandksrao stephanesdr deryy ufoym nybblr sandynism y9c jasonamark tomsergeweber lijb09 ggkingdom idan94 mobinasri sduchesneau wangdi2014 pengjia6 sabiqali raaulc tianfuzeng clayne encosianima wollmers cjw85 pb-cdunn spaceim curioustauseef evhaus chrisford-ovo fraumauz stevenweaver cunningnobleman molugan dbs7120

edlib's Issues

Improve Ukkonen boundaries

In some modes boundaries are not optimal. Improve them!

Enable stating that sequence should be consumed in reverse

Enable user to state that query and/or target should be used in reverse. This is good because user does not have to waste time and resources on reversing the sequence. What I do not like about this is that I am not sure how useful is this if calculation is spending much more time than reversal, which I believe to be the case.

aligner: Consider adding characters to show the locations of matches/mismatches

Commonly spaces are used for mismatches and pipes | for matches. Also consider changing underscore _ to hyphen -, as the latter is more typical.

For example:

T: AGATATGCTGCCGC---GGACAGCGTTATCTCTAACTAACAGTCACTATC (0 - 46)
   |||| |||||||||   ||||||||||| |||||| |||||||| |||| 
Q: AGATGTGCTGCCGCCTTGGACAGCGTTACCTCTAA-TAACAGTCCCTATG (0 - 48)

Add option to chose number of positions that will be returned

This is important, because if this is number is small then algorithm is a little bit faster sometimes.

Create aligner

Create aligner based on Swimd Aligner

Please, add link of the paper to README.md

Could you please add the link of the paper to readme file? People don't know what you actually implemented and it is hard to google every time :)
http://www.gersteinlab.org/courses/452/09-spring/pdf/Myers.pdf

Orient blocks parallel with target (horizontally)

Currently blocks are oriented vertically, parallel with query. Since target is placed on top border of table and target is usually much bigger then query, we may benefit if blocks would be parallel with target! Example: if query is of length 64 (same as block size) and target is much bigger, then we can expect some narrow band. If blocks are parallel with query, we will have to calculate each cell of matrix, while if blocks are parallel with target we will calculate only small part of matrix!
I think this is important only when band is small and target is much bigger then query. Otherwise it will not bring much speedup.
How hard is it to implement? I have two ideas:

transform the problem. If I can do this, that would be great, not much work.
use horizontal blocks instead of vertical. This would be major refactoring of code, it sounds hard and complicated.

Do profiling

Do some profiling to find out which part of code are consuming most cpu time, and try to make those parts faster.

Make it easier to prepare sequences for usage

Currently sequences have to be transformed into numbers, that go from 0 to N-1 if alphabet has length N. This is a boring thing to implement and nobody should have to go through this trouble.

I should either implement some helper function to transform char sequence to unsigned char sequence of numbers from 0 to N-1, or I should put that transformation inside main function so nobody even knows about it.

What I could also do is create a simpler function, that will detect the length of alphabet and also do this transformation.

Finally, in case that my function accepts numbers as it was accepting them so far, I should check if they are in range 0 to N-1 and report an error if that is not the case!

Remove bool from interface

I introduced bool arguments by mistake to edlibCalcEditDistance, I should replace them with integers, so it can be easily used in C.

Remove padding from target in NW

Remove padding as it slows down calculation for small targets.

Reduce memory consumption for NW alignment

I can reduce memory consumption of NW alignment by storing only bands, not the whole matrix! Will this slow down the calculation somewhat?

Check that allocation of memory is successful

I should always check that allocation of memory was successful, by checking if maybe NULL was returned!

Check for memory leaks

Check if there are some memory leaks, especially in aligner. I am pretty sure aligner has some memory leaks, in case when it ends earlier because it is missing some housekeeping.

Improve cigar format

It is not a custom to start alignment path with insertions in cigar: I should replace them with soft clipping CIGAR operation (S) which can come at the start or end of read.
Also, I shoud add support for standard cigar format.

Return starting position of alignment (but no alignment path)

@isovic suggested this as useful feature, so I should consider adding it!
For NW and SHW starting position is 0 so that is easy. However, for HW we do not know starting position. Best way to get is to run SHW backwards from ending point of HW, which is what I do now anyway when finding alignment path.
I should expose some flag for this, and then modify code for finding alignment to stop when starting position is found.

Segfault because of unsanitazed function parameters

Just to mention, this happens in an older version (not sure if the bug is present in the most recent one).
This is the GDB's output from running my program:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff2f5399700 (LWP 9563)]
0x0000000000410f8d in obtainAlignment (maxNumBlocks=3, queryLength=178, targetLength=-13, W=14, bestScore=-1,
position=-1, alignData=0x0, alignment=0x7ff2f5397f60, alignmentLength=0x7ff2f5397f5c)
at src/alignment/myers.cpp:765

At the beginning of the obtainAlignment function, there is an allocation without checking the values of input parameters.
Also, it would be good that after every malloc/new there is a check if the pointer is NULL (but please continue handling things in C-like manner, throwing exceptions is just awful :-) ).

Add more examples in README

Add more examples of code in README. Also add more examples under apps.

Better explain how fast is EDLIB for specific cases, especially when finding alignment path

Try using Myers to obtain most common protein for some protein

It is actually just doing database search with NW. Try the same thing with SSW and compare results. If they both return the same protein and Myers is faster, then that could be important

Replace #defines in edlib.h with enumerations

Enumerations are better.

Different comparison with SSW

Right now I am comparing Myers with SSW while using dynamic adjusting of k in Myers. Is that ok? Can k be specified in SSW? If so, then I should try comparing with fixed k. If not, I could still try comparing with fixed k for just Myers maybe.

Memory leak

There is a memory leak (near line 160) in myersCalcEditDistance: I do not free positionsSHW!

Investigate if SSE/AVX could bring speedup

Could SSE/AVX increase speed? Investigate it, and if it would increase speed enough, implement it.

Finding alignment for HW throws segmentation fault

Reported by @isovic.
Command line used: src/aligner -f -c -a HW test_data2/test_read.fasta test_data2/test_reference.fasta .
Test data is available at https://drive.google.com/folderview?id=0B3-AVv8sCms8cWxoZVBNN1NkOTg&usp=sharing

Validate function input

I should validate that function input satisfies some basic conditions, for example that array lengths are positive and so on.

Obtain alignment path for certain best position

@isovic suggested addition of following feature: User can choose for which of best positions to get alignment path (now alignment path is returned only for first position).
I am not sure what is the best way to do this? Should I enable this before searching, or after finding positions? How can I enable it before when I do not know how many positions will be returned? Should I just find alignment paths for all positions? I should talk with @isovic more to see what is exact use case for which he thinks this will be useful.

Add helper function for reading fasta files to library

Finding alignment

Mogao bih dodati nalazenje alignmenta, po slicnoj ideji kao kod SSW. Za NW bi trebalo ponovno sve izracunat pa to bas i nema smisla, ali za HW i SHW bi trebalo izracunati samo dio matrice tako da bi se to moglo upotrijebiti. Pogotovo za HW ustvari. Dakle ideja je da se od pozicije gdje je kraj alignmenta krene prema nazad racunati, i tada se ocito racuna samo dio matrice.

Document API

Looking at .h file is not user friendly, I should create some nice documentation for API. It is probably best to use Doxygen for it.

Compare with some other aligning method (maybe SSW?). Test on nucleotide data.

Unify all output parameters into one parameter

Currently there are many output parameters in main function of edlib, like startLocations, endLocations, alignment, and so on. In order to make usage of function easier, we should pass only one object which will then be filled with output. This deprecates #34.

Enable passing NULL for outputs that are not needed

Enable user to pass null pointer for outputs that he does not need.

Combine Myers with Landau Vishkin

Since Landau Vishkin is so fast when query and target are very similar, maybe it would be good idea to first start calculation with Landau Vishkin for a few small k, then if it does not produce result I can switch to Myers. Reason for this is that Myers is actually not very efficient for small k because it will calculate whole block anyway, although band is maybe very small.

Optimize adjusting of last block in NW

Right now, when adjusting band (last block) for NW, I say that value of first cell in block is larger or equal than Score - W + 1. However, I could do it even better, I could say it is larger or equal than max(Score' - 1, Score - W + 1)! Should I do this, will it speed things up? I think I should try it.

Usage of parameter k in aligner

Currently k is always set to -1. Allow aligner to use k if we only want the query with best score, or if we want queries with N best scores. Also allow user to manually set k.

Speed up calculation of Peq

Although calculation of Peq normally takes unimportant amount of time compared to DP calculation, in some cases like when using NW for very similar proteins, it takes about half of execution time! It would be interesting to speed it up in all or at least such cases.

Runtime error

edlib.cpp:1062: int obtainAlignment(const unsigned char_, const unsigned char_, int, const unsigned char_, const unsigned char_, int, int, int, unsigned char*, int): Assertion `score_ == bestScore' failed.

Add support in aligner for printing multiple positions

Separate code into aligner and library

Instead of src/, create aligner/ and library/, where each of them will have its src/.

Test speed of alignment

Compare speed of search with alignment and without alignment. Do it for all modes and for different lengths of query. Longer the query, slower should search with alignment be. If query is very big, I expect search with alignment to be much slower

Edlib could be faster if starting with bigger k for unsimilar sequences in HW mode

Tests showed that for HW, when score is big enough, it may be more beneficial to start with larger k! For example, when similarity (1 - score / read_length) is < 60% we can get better results by just running edlib with k = read_length then using k = -1.

So how can we use this to speed up edlib? If we could have some way of very quickly and roughly estimating the similarity of two sequences up front, we could make a decision: "they seem to be pretty unsimilar, so lets use k=read_length instead of k=-1".

Add progress callback

Sometimes calculation takes a lot of time and we may want to be informed about the progress/status.
To accomplish that I can add a progress callback that will be called every some columns to report about progress so far.

Compare with myers code from 1998

I already run some tests, it seems I have similar speed like myers from 1998. He is somewhat faster(15%) for small k, but for larger k I am faster.
I should test intensively my code agains myers(1998) and other implementations that come with it.

Name of aligner is too generic

The name of the executable aligner is quite generic and could possibly conflict with a similarly named executable from another package. Consider renaming the executable to edlib-aligner.

Compare with Ukkonen implementation from SNAP

Compare speed with Ukkonen implementation that is used in SNAP. If my code is faster, then I can say it could be used in SNAP to speed up calculation.

aligner: Add SAM output format

Consider adding SAM output format to the aligner program

Update comparison with Landau Vishkin

I have new results of comparison, much more meaningful, so I should update the old ones with new ones!
I should also update other comparison results (for SSW)

Sometimes alignments start/end with insertions

For data at https://drive.google.com/folderview?id=0B3-AVv8sCms8TjB3S0Zfb3RrLTA&usp=sharing and command src/aligner -a HW -f -c test_data5/read.fasta test_data5/sample_reference.fasta alignment starts with insertions, investigate this!

Improve block calculation

I removed ifs from calculation of block, which is core of whole algorithm. That gave some speed, about 30% - 40% speedup. What I should do further: investigate if operations in block calculation can be further simplified! That would bring more speedup

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.