passionlab / bella Goto Github PK

BELLA: a Computationally-Efficient and Highly-Accurate Long-Read to Long-Read Aligner and Overlapper

License: Other

Python 0.90% C++ 90.90% Shell 0.07% Makefile 0.08% C 7.67% CMake 0.01% Fortran 0.02% Cuda 0.37%

overlapper genome-alignment pacbio-aligner longread-aligner long-read read-aligners matrix-multiplication spgemm

bella's Introduction

BELLA - Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

BELLA is a computationally efficient and highly accurate long-read to long-read aligner and overlapper. BELLA uses a k-mer based approach to detect overlaps between noisy, long reads. We demonstrated the feasibility of the k-mer based approach through a mathematical model based on Markov chains. BELLA provides a novel algorithm for pruning k-mers that are unlikely to be useful in overlap detection and whose presence would only incur unnecessary computational costs. Our reliable k-mers detection algorithm explicitly maximizes the probability of retaining k-mers that belong to unique regions of the genome. BELLA achieves fast overlapping without sketching using sparse matrix-matrix multiplication (SpGEMM), implemented utilizing high-performance software and libraries developed for this sparse matrix subroutine. Any novel sparse matrix format and multiplication algorithm would be applicable to overlap detection and enable continued performance improvements. We coupled BELLA's overlap detection with our newly developed vectorized seed-and-extend banded-alignment algorithm. The choice of the optimal k-mer seed occurs through our binning mechanism, where k-mer positions within a read pair are used to estimate the length of the overlap and to "bin" k-mers to form a consensus.We developed and implemented a new method to separate true alignments from false positives depending on the alignment score. This method demonstrates that the probability of false positives decreases exponentially as the overlap length between sequences increases.

Content

Getting Started
- Dependencies
- Compile
- Run
- Error Rate
- Memory Usage
Output Format
Performance Evaluation
Demo
I get 0 outputs, what is likely going wrong?
Citation

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Dependencies

COMPILER: the software requires gcc-6 or higher with OpenMP to be compiled.
CUDA to compile and use GPU-accelerated pairwise alignment. You do not need CUDA to use CPU-based pairwise alignment. Our stand-alone GPU-based pairwise alignment, named LOGAN, can be found here.
Python3 and simplesam are required to generare the ground truth data. You can install simplesam via pip:

pip install simplesam

Compile

Clone the repository and enter it:

git clone https://github.com/giuliaguidi/bella
cd bella

Build using makefile:

ln -s makefile-nersc Makefile
make bella (CPU-only) OR make bella-gpu (CPU/GPU)

Run

To run with default setting:

./bella -f <list-of-fastq> -o <output-name>

BELLA requires a text file containing the path to the input fastq file(s) as the argument for the -f option. Example: input-example.txt

To show the usage:

./bella -h

Optional flag description:

  -f, --fastq arg            List of Fastq(s) (required)
  -o, --output arg           Output Filename (required)
  -k, --kmer arg             K-mer Length (default: 17)
  -x, --xdrop arg            SeqAn X-Drop (default: 7)
  -e, --error arg            Error Rate (default: 0.15)
      --estimate             Estimate Error Rate from Data
      --skip-alignment       Overlap Only
  -m, --memory arg           Total RAM of the System in MB (default: 8000)
      --score-deviation arg  Deviation from the Mean Alignment Score [0,1]
                             (default: 0.1)
  -b, --bin-size arg         Bin Size for Binning Algorithm (default: 500)
      --paf                  Output in PAF format
  -g, --gpus arg             GPUs Available (default: 1)
      --split-count arg      K-mer Counting Split Count (default: 1)
      --hopc                 Use HOPC representation
  -w, --window arg           Window Size for Minimizer Selection (default: 0)
  -s, --syncmer              Enable Syncmer Selection
  -u, --upper-freq arg       K-mer Frequency Upper Bound (default: 8)
  -l, --lower-freq arg       K-mer Frequency Lower Bound (default: 2)
  -h, --help                 Usage

The error rate is used to compute the adaptive alignment threshold. If using PacBio CCS/HiFi please set --error 0.005.

Memory Usage

The parallelism during the overlap detection phase depends on the available number of threads and on the available RAM [Default: 8000MB].

Use -DOSX or -DLINUX at compile time to estimate available RAM from your machine. If your machine has more RAM than the default one, using -DOSX or -DLINUX would make the ovelap detection phase faster.

Output Format

BELLA outputs alignments in a format similar to BLASR's M4 format. Example output (tab-delimited):

[A ID] [B ID] [# shared k-mers] [alignment score] [overlap length] [n=B fwd, c=B rc] [A start] [A end] [A length] [B start] [B end] [B length]

The positions are zero-based and are based on the forward strand, whatever which strand the sequence is mapped. If -p option is used, BELLA outputs alignments in PAF format. Example output (tab-delimited):

[A ID] [A length] [A start] [A end] ["+" = B fwd, "-" = B rc] [B ID] [B length] [B start] [B end] [alignment score] [overlap length] [mapping quality]

Performance Evaluation

The repository contains also the code to get the recall/precision of BELLA and other long-read aligners (Minimap, Minimap2, DALIGNER, MHAP and BLASR).

Ground truth generation for real data set: SAMparser.py allows to transform the Minimap2 .sam output file in a simpler format usable as input to the evaluation code when using real data set.

minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio subreads
samtools view -h -Sq 10 -F 4 aln.sam > mapped_q10.sam	# remove reads with quality values smaller than 10
samtools view -h mapped_q10.sam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -S -h > unique_mapped_q10.sam	# remove reads mapped to multiple locations
python3 SAMparser.py <bwamem/minimap2-output>

Ground truth generation for synthetic data set: mafconvert.py allows to transform the .MAF file from PBSIM (Pacbio read simulator) in a simpler format usable as input to the evaluation code when using synthetic data set.

python scripr/mafconvert.py axt <maf-file> > <ground-truth.txt>

To run the evaluation program:

cd benchmark
make result

./result -G <grouth-truth-file> [-B <bella-output>] [-m <minimap/minimap2-output>] [-D <daligner-output>] [-L <blasr-output>] [-H <mhap-output>] [-M <mecat-output>] [-i <mecat-idx2read-file>]

If the output of BELLA is in PAF format, you should run it using minimap2 -m flag.

To show the usage:

./result -h

NOTE: add -z flag if simulated data is used.

Demo

You can download an E. coli 30X dataset here to test BELLA. For this dataset, you can use the following single mapped ground truth to run the evaluation code: ecsample_singlemapped_q10.txt. A detailed description of the procedure we use to generate the ground truth for real data can be found in our preprint.

You can run the evaluation code located in /bench folder as:

./result -G ecsample_singlemapped_q10.txt -B <bella-output>

I get 0 outputs, what is likely going wrong?

Error rate estimation might have gone wrong. If the error estimated is greater than 1, the adaptive alignment threshold would be so high that no alignments would pass the threshold. Please check if your fastq file has proper quality values. If not, please define an error rate using command line options.

Citation

To cite our work or to know more about our methods, please refer to:

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper. Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç. bioRxiv 464420; doi: https://doi.org/10.1101/464420.

Authors

Contributors

Copyright Notice

Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper (BELLA), Copyright (c) 2018, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) Giulia Guidi and Marco Santambrogio. All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at [email protected].

NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit other to do so.

Acknowledgments

Funding provided in part by DOE ASCR through the Exascale Computing Project, and computing provided by NERSC. Thanks to Rob Egan and Steven Hofmeyr for valuable discussions. Thanks to NECST Laboratory and Ed Younis for key collaborations.

bella's People

Contributors

Stargazers

Watchers

Forkers

qizhou1512 kodingkoning albertozeni marcorabozzi awz camaclean daschober classicvalues

bella's Issues

Too much memory requested

Hi,

I've been trying to use bella b150701 to assemble a small portion of the human genome. Whenever I try to use it though I run into the issue that bella apparently sometimes wants insane amounts of memory.

./bella -k 13 -i input.fastq -o test-bella -d 3

sometimes gives

@id1 : 133599645 MB
ACGT .... : 133599645 MB
: 133599645 MB

When I let this run, the memory fills up until I have to restart my computer.
Only sometimes i get somethig more feasible (with the exact same command, mind you) :

@id1 : 3 MB
ACGT .... : 3 MB
: 3 MB

I've tried this with different depth settings and with different k-mer lengths, but the problem seems to always occur.

Problem with running evaluation test

Hello,

I'm attempting to use BELLA but have encountered a problem in the alignment step.
The installation is one CentOS 3.10, using gcc 6.3.0, with virtualenv pip install of simplesam. When I try the test files using ./bella -i test.txt -o bella_output -d 30 the produced output file 'bella_output.out' is empty.

Looking over the print out I find this 'Average length of successful alignment -nan bp'. Everything else seems fine.


paeruginosa30x_0001_5reads.fastq: 0 MB
K-mer counting: BELLA
Output filename: bella_output.out
K-mer length: 17
X-drop: 7
Depth: 30X
Compute alignment: true
Seeding: two-kmer

Running with up to 4 threads
Reading FASTQ file paeruginosa30x_0001_5reads.fastq
Initial parsing, error estimation, and k-mer loading took: 0.0749692s

Cardinality estimate is 27251
Table size is: 186907 bits, 0.0222811 MB
Optimal number of hash functions is: 5
First pass of k-mer counting took: 0.0391129s
Second pass of k-mer counting took: 0.0153589s

Entries within reliable range: 231
Error rate estimate is -nan
Reliable lower bound: 2
Reliable upper bound: 30
Deviation from expected alignment score: 0.2
Constant of adaptive threshold: -nan

Running with up to 4 threads
Reading FASTQ file paeruginosa30x_0001_5reads.fastq
Fastq(s) parsing fastq took: 0.0317728s
Total number of reads: 5

Old number of nonzeros before merging: 474
New number of nonzeros after merging: 474
Old number of nonzeros before merging: 474
New number of nonzeros after merging: 474
Sparse matrix construction took: 0.00283812s

Available RAM is assumed to be: 8000 MB
FLOPS is 255
nnz(output): 10 | free memory: 8.38861e+09 | required memory: 288
Stages: 1 | max nnz per stage: 291271111

Columns [0 - 5] overlap time: 0.00734909s
Creating or appending to output file with 0 MB

Columns [0 - 5] alignment time: 0.0329758s | alignment rate: 937294 bases/s | average read length: 5556.6 | read pairs aligned this stage: 10
Average length of successful alignment -nan bps
Average length of failed alignment 5788.6 bps

Outputted 0 lines in 0.0115177s
Total running time: 0.346414s```

I cant type the letter b

I was doing some testing on my computer and I couldn't type the letter 'b' in the control centre

lower/upper bound

SELF_TODO: implement lower/upper bound as input parameter.

error when installing the sof

gcc -O3 -fopenmp -c -o bound.o kmercode/bound.cpp
In file included from /usr/include/c++/4.9/array:35:0,
from kmercode/bound.cpp:11:
/usr/include/c++/4.9/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support for the
^
make: *** [bound.o] Error 1

how can I solve the errors? thanks!

Freeing invalid pointer when using non-zero window size for minimizer selection

Hi Giulia,

I've run into a problem when I try to run Bella using minimizers. Specifically, when I run the command

"bella -f input.txt -o output --paf -k 31 -w 17 -e 0.005 -l 2 -u 100",

I get the following output log:

INFO: src/../include/kmercount.hpp(108) InputFile = metagenome-anonymous.fastq
INFO: src/../include/kmercount.hpp(109) InputSize = 67.693069 MB
INFO: src/main.cpp(200) OutputFile = output.out
INFO: src/main.cpp(203) kmerSize = 31
INFO: src/main.cpp(206) GPUs = DISABLED
INFO: src/main.cpp(209) UserDefinedMemory = 8000.000000 MB
INFO: src/main.cpp(212) OutputPAF = 1
INFO: src/main.cpp(215) BinSize = 500
INFO: src/main.cpp(218) DeltaChernoff = 0.100000
INFO: src/main.cpp(221) RunPairwiseAlignment = 1
INFO: src/main.cpp(231) HOPC = DISABLED
INFO: src/main.cpp(235) xDrop = 7
INFO: src/main.cpp(238) KmerSplitCount = 1
INFO: src/main.cpp(243) useMinimizer = ENABLED
INFO: src/main.cpp(245) minimizerWindow = 17
INFO: src/main.cpp(276) numThreads = 64
INFO: src/../include/kmercount.hpp(711) ReadingFASTQ = metagenome-anonymous.fastq
*** Error in `bella': free(): invalid pointer: 0x00002aaacc076480 ***
Aborted

When I run without the "-w 17" option, it runs perfectly.

Thanks,
Gabe

build error bella-gpu

Hi,
when building bella-gpu (make bella-gpu) ; I've got an error

nvcc -arch=sm_70 -O3 -maxrregcount=32 -std=c++14 -Xcompiler -fopenmp -w -Iinclude/common/GTgraph/sprng2.0-lite/include -IloganGPU -Iseqan -o bella hash_funcs.o Kmer.o Buffer.o fq_reader.o optlist.o src/main.cu -L/home/boelle/Documents/bella/libbloom/build -lbloom -lpthread -lbz2 -lz -D__NVCC__

seqan/seqan/score/score_matrix_dyn.h:110:489: error: template argument 1 is invalid
110 | enum class AminoAcidScoreMatrixID : std::underlying_type_t<decltype(Find<impl::score::MatrixTags, ScoreSpecBlosum30>::VALUE)>

Seems like some template library must be missing - Do you have an idea ?
building without GPU is ok (make bella ); building LOGAN runs fine

platform : Ubuntu 20.04
gcc : 9.4.0
nvcc : 11.6 (quadro P1000 - architecture=sm_62)

Thanks a lot.

build error bella-gpu, issue with large input

I'm getting the same error as issue #32 with a copy of the repository cloned today, in the same system background as described in issue #32 - Ubuntu 20.04, nvcc 11.6, gcc 9.4. The comment in that issue that LOGAN may have issues with "large-ish" input is a concern, because I'd like to use bella-gpu on a file containing about 600 gigabases of long-read sequences.
Two questions: (1) Is there a fix for the compile failure other than to install different versions of nvcc and gcc?
(2) If I'm successful in compiling, will bella-gpu process 600 gigabases of input?
Thanks!

What does MECAT's indices mean

Hi,

I was working with the evaluation module of the Bella. I was going to test the output from MECAT. So, there I need to provide two output files: one is Mecat's output file and the other one id Mecat's indices files. I am not sure what does that " Meact's indices" file mean? Where where to find it?

Segfault in matrix transpose

Issue I neglected earlier

Data set:

/project/projectdirs/mp309/bella-spgemm/ecoli_hifi_29x.fastq

Parameters:

-k 31 -l 20 -u 23

Fix:

The issue appears to be in include/common/transpose.h, ~ line 35.

for (IT i=0; i <= n; i++)
{
    cscColPtr[i+1] = atomicColPtr[i] + cscColPtr[i];
}

cscColPtr

is of length n+1 I believe (n rows, n+1 stores nnz).

However this writes at location n+1 (i.e. n+2nd element), causing out of bounds.

I believe the correct code would be

for (IT i=0; i < n; i++)

cscColPtr[i+1] already handles the +1 size correctly for the last index (n).

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.