
genomeworks's Introduction

GenomeWorks

Overview

GenomeWorks is a GPU-accelerated library for biological sequence analysis. This section provides a brief overview of the different components of GenomeWorks. For more detailed API documentation please refer to the documentation.

Clone GenomeWorks

Latest released version

This clones the repo at the master branch, which contains the latest released version plus hot-fixes.

git clone --recursive -b master https://github.com/clara-parabricks/GenomeWorks.git

Latest development version

This clones the repo at the default branch, which is set to the latest development branch. This branch changes frequently as features and bug fixes are pushed.

git clone --recursive https://github.com/clara-parabricks/GenomeWorks.git
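Both clone commands rely on `--recursive` to fetch the 3rdparty submodules. The snippet below is an illustrative post-clone sanity check (the `GenomeWorks` directory name is assumed from the commands above):

```shell
# Illustrative post-clone check. An empty or '-'-prefixed
# `git submodule status` listing means the submodules were not
# initialized, i.e. --recursive was omitted from the clone.
cd GenomeWorks 2>/dev/null || echo "GenomeWorks directory not found"
git submodule status 2>/dev/null | head -n 5
echo "submodule check complete"
```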

System Requirements

Minimum requirements -

  1. Ubuntu 16.04 or Ubuntu 18.04
  2. CUDA 10.0+ (official instructions for installing CUDA are available here)
  3. GPU generation Pascal and later (compute capability >= 6.0)
  4. gcc/g++ 5.4.0+ / 7.x.x
  5. Python 3.6.7+
  6. CMake (>= 3.10.2)
  7. autoconf (required to output SAM/BAM files)
  8. automake (required to output SAM/BAM files)
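The minimums above can be eyeballed with a quick shell loop (illustrative only; the printed version strings still need to be compared against the list by hand):

```shell
# Print the first version line of each required tool, or flag it as missing.
for tool in gcc g++ cmake python3 autoconf automake; do
  if command -v "$tool" >/dev/null 2>&1; then
    printf '%-10s %s\n' "$tool" "$("$tool" --version 2>/dev/null | head -n1)"
  else
    printf '%-10s NOT FOUND\n' "$tool"
  fi
done
```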

GenomeWorks Setup

Build and Install

To build and install GenomeWorks -

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -Dgw_cuda_gen_all_arch=OFF
make -j install

NOTE : The gw_cuda_gen_all_arch=OFF option pre-generates optimized code only for the GPU(s) on your system. To build a binary that pre-generates optimized code for all common GPU architectures, please remove the option or set it to ON.

NOTE : (OPTIONAL) To enable outputting overlaps in SAM/BAM format, pass the gw_build_htslib=ON option.

Package generation

Package generation puts the libraries, headers and binaries built by the make command above into a .deb/.rpm for portability and easy installation. The package generation itself doesn't guarantee any cross-platform compatibility.

It is recommended that a separate build and packaging be performed for each distribution and CUDA version that needs to be supported.

The type of package (deb vs rpm) is determined automatically based on the platform the code is being run on. To generate a package for the SDK -

make package
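The detection itself is handled by the project's CMake/CPack configuration; conceptually it boils down to inspecting the distribution, along the lines of this hand-rolled sketch (not the project's actual logic):

```shell
# Guess the native package format from /etc/os-release (sketch only).
ID=unknown; ID_LIKE=""
[ -r /etc/os-release ] && . /etc/os-release
case "$ID $ID_LIKE" in
  *debian*|*ubuntu*) echo "deb" ;;
  *rhel*|*fedora*|*centos*|*suse*) echo "rpm" ;;
  *) echo "unknown" ;;
esac
```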

genomeworks Python API

The Python API for the GenomeWorks SDK is available through the genomeworks Python package. More details on how to use and develop genomeworks can be found in the README under the pygenomeworks folder.

Development Support

Enable Unit Tests

To enable unit tests, add -Dgw_enable_tests=ON to the cmake command in the build step.

This builds GTest based unit tests for all applicable modules, and installs them under ${CMAKE_INSTALL_PREFIX}/tests. These tests are standalone binaries and can be executed directly. e.g.

cd $INSTALL_DIR
./tests/cudapoatests

Enable Benchmarks

To enable benchmarks, add -Dgw_enable_benchmarks=ON to the cmake command in the build step.

This builds Google Benchmark based microbenchmarks for applicable modules. The built benchmarks are installed under ${CMAKE_INSTALL_PREFIX}/benchmarks/<module> and can be run directly.

e.g.

$INSTALL_DIR/benchmarks/cudapoa/multibatch

A description of each of the benchmarks is present in a README under the module's benchmark folder.

Enable Doc Generation

To enable documentation generation for GenomeWorks, please install Doxygen on your system. Once Doxygen has been installed, run the following to build the documentation.

make docs

Docs are also generated as part of the default all target when Doxygen is available on the system.

To disable documentation generation add -Dgw_generate_docs=OFF to the cmake command in the build step.

Code Formatting

GenomeWorks makes use of clang-format to format its source and header files. To make use of auto-formatting, clang-format must be installed from the LLVM package (for the latest builds, refer to http://releases.llvm.org/download.html).

Once clang-format has been installed, make sure the binary is in your path.
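A one-liner to confirm the binary is visible before invoking the format targets (illustrative):

```shell
# Report where clang-format resolves from, or that it is missing.
if command -v clang-format >/dev/null 2>&1; then
  echo "clang-format found: $(command -v clang-format)"
else
  echo "clang-format not found in PATH; install it from the LLVM releases page"
fi
```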

To add a folder to the auto-formatting list, use the macro gw_enable_auto_formatting(FOLDER). This adds all C++ source/header files in that folder to the formatting list.

To auto-format, run the following in your build directory.

make format

To check whether files are correctly formatted, run the following in your build directory.

make check-format

Running CI Tests Locally

Please note: your git repository will be mounted into the container, and any untracked files will be removed from it. Before executing the CI locally, stash untracked files or add them to the index.

Requirements:

  1. docker (https://docs.docker.com/install/linux/docker-ce/ubuntu/)
  2. nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
  3. nvidia-container-runtime (https://github.com/NVIDIA/nvidia-container-runtime)

Run the following command to execute the CI build steps inside a container locally:

bash ci/local/build.sh -r <GenomeWorks repo path>

The ci/local/build.sh script was adapted from rapidsai/cudf.

The default docker image is clara-genomics-base:cuda10.0-ubuntu16.04-gcc5-py3.7. Other images from the gpuci/clara-genomics-base repository can be used instead via the -i argument:

bash ci/local/build.sh -r <GenomeWorks repo path> -i gpuci/clara-genomics-base:cuda10.0-ubuntu18.04-gcc7-py3.6

genomeworks's People

Contributors

ahehn-nv, akkamesh, alexomics, atadkase, cjw85, edawson, epislim, gputester, gsneha26, iiseymour, lanzju76, larsbishop, mike-wendt, mimaric, nvericx, nvvishanthi, ohadmo, pb-dseifert, r-mafi, rached-ab, rilango, vellamike


genomeworks's Issues

improve alignment error handling in kernel

In the CUDA aligner, valid inputs can sometimes lead to errors in processing, e.g. when the Hirschberg processing stack is full. We should have an error-handling mechanism that flags alignments that could not be processed correctly so they can be reported back to the caller.

Specific case - cudaaligner/src/hirschberg_myers_gpu.cu has a printf("ERROR: Stack full") case.

Unable to clone

Hi,
The clone command fails with access right error.

$ git clone --recursive git@github.com:clara-genomics/ClaraGenomicsAnalysis.git
Cloning into 'ClaraGenomicsAnalysis'...
The authenticity of host 'github.com (140.82.114.3)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,140.82.114.3' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Any thought?

genome_simulator slow on large genomes

After genome_simulator prints out

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [05:41<00:00,  1.76s/it]
Simulating reads:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [06:54<00:00,  3.46s/it]

If running with a large genome (e.g. 100 MB @ 30x) there is a very long period where a single CPU is at 100% utilisation. This could probably be sped up through multiprocessing and/or other means.

[cudamapper] Number of overlaps generated has dependency on `index_size`

A regression appears to have been introduced whereby whatever the index_size variable is set to affects the number of overlaps computed. This results in very small differences between the number of overlaps detected before and after read-level chunking. Example:

   899904 res_new.out
   899973 res_old.out

This is likely to be an off-by-one error at some point in the read-level chunking

Graph representation in python and serialization methods

The python bindings are very helpful to perform alignments and consensus calls, but there currently isn't a good way to work with the resulting graph structures in python. The structures are available in C++, but there are some nuances to them. It would be nice if there were some examples (with documentation) of working with the resulting graphs in C++ and some bindings (or a new interface) to work with them in python as well.

It would also be helpful if there were a method that can be called after performing an alignment or consensus call that would serialize the graph (in DOT or some similar format) so it can be easily inspected / visualized after creation.

cuda aligner API to take in max available memory and max ref/query sizes

  1. Each aligner batch takes in the max memory and max ref/query sizes and determines how many alignments can be performed in the batch.
  2. Provide an API to check the max number of alignments possible.
  3. The actual max number of alignments may be larger depending on the inputs processed so far; the actual max can be determined by continually adding alignments and checking the return value of the add-alignment API call.

Add build dependencies to Conda for GPUCI builds

Some recent GPUCI failures (e.g. this one) have revealed that we are sensitive to which packages are installed on the GPUCI VM/docker instances we are running.

Conda should be used as much as possible to allow our tests to run on a clean CI instance. This includes at a minimum:

  1. CMake
  2. Flake

clang-format: fix member initializer formatting

clang-format formats the member initializer of a constructor as a single long line regardless of the length of this line.
For long lines clang-format should introduce line breaks in some sensible way.

Shared objects not being detected by python when importing `claragenomics.bindings`

When installing pyclaragenomics in a venv, the following error occurs when running samples/tests:

Traceback (most recent call last):
  File "./sample_cudapoa", line 18, in <module>
    from claragenomics.bindings import cudapoa
ImportError: liblogging.so: cannot open shared object file: No such file or directory

It seems that if ClaraGenomicsAnalysis/pyclaragenomics/cga_build/install/lib/ is added to LD_LIBRARY_PATH, this problem is resolved.

This error does not seem to occur when running in a Conda environment, but does in a venv (as reported by @mimaric ).
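Until the RPATH handling is fixed, the workaround above can be applied per-shell. The snippet below assumes the checkout lives under $HOME (the lib path is the one from this report; adjust it to your checkout):

```shell
# Reported workaround: prepend the pyclaragenomics install lib dir to the
# dynamic loader search path (path taken from the report above; assumed
# to live under $HOME for illustration).
export LD_LIBRARY_PATH="$HOME/ClaraGenomicsAnalysis/pyclaragenomics/cga_build/install/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```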

Improve README for different various parts of the SDK

Currently the READMEs are not set up in an easy-to-use manner. The following needs to be done -

  • A proper README for each section (main, benchmarks, samples, APIs, tests)
  • Link all READMEs from the main one to provide connected information from a single location

About singlebatch

I went to build/install/benchmarks/cudapoa/ and when I run singlebatch, I get the following output

$ ./singlebatch
2019-07-29 13:14:25
Running ./singlebatch
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x8)
  L1 Instruction 64K (x8)
  L2 Unified 512K (x8)
  L3 Unified 8192K (x2)
Load Average: 0.33, 0.15, 0.16
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_SingleBatchTest/1          1594 ms         1593 ms            1
BM_SingleBatchTest/4          2429 ms         2427 ms            1
BM_SingleBatchTest/16         2687 ms         2685 ms            1
BM_SingleBatchTest/64         3231 ms         3228 ms            1
BM_SingleBatchTest/256        8233 ms         8225 ms            1
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU Error:: out of memory /home/mahmood/cactus/cl/ClaraGenomicsAnalysis/cudapoa/src/allocate_block.cpp 46
Aborted (core dumped)

I would like to test a specific batch size and not variable sizes. It seems that singlebatch is a binary file. Any idea for that?

Setting the cuda toolkit location with PyClaraGenomics

It would be useful to be able to override CUDA_TOOLKIT_ROOT_DIR when building the Python bindings. The patch below passes the environment variable CUDA_TOOLKIT_ROOT_DIR to CMake if it is set; let me know if you want a PR for this.

--- a/pyclaragenomics/setup.py
+++ b/pyclaragenomics/setup.py
@@ -35,6 +35,7 @@ class CMakeWrapper():
         self.cmake_root_dir = os.path.abspath(cmake_root_dir)
         self.cmake_install_dir = os.path.join(self.build_path, "install")
         self.cmake_extra_args = cmake_extra_args
+        self.cuda_toolkit_root_dir = os.environ.get("CUDA_TOOLKIT_ROOT_DIR")
 
     def run_cmake_cmd(self):
         cmake_args = ['-DCMAKE_INSTALL_PREFIX=' + self.cmake_install_dir,
@@ -42,6 +43,9 @@ class CMakeWrapper():
                       '-DCMAKE_INSTALL_RPATH=' + os.path.join(self.cmake_install_dir, "lib")]
         cmake_args += [self.cmake_extra_args]
 
+        if self.cuda_toolkit_root_dir:
+            cmake_args += ["-DCUDA_TOOLKIT_ROOT_DIR=%s" % self.cuda_toolkit_root_dir]
+
         if not os.path.exists(self.build_path):
             os.makedirs(self.build_path)

SketchElementImpl::ReadidPositionDirection became new SketchElement

SketchElement/Minimizer objects are not used anymore. IndexGPU internally relies on SketchElementImpl::ReadidPositionDirection. Its output consists of the content of SketchElementImpl::ReadidPositionDirection split into three separate arrays.

Look into ways to:

  1. Change the interface of Index so that SketchElementImpl::ReadidPositionDirection does not have to be split into three arrays
  2. Refactor the code to reflect the current state of not using SketchElement objects

Move enums back to enum classes in C++ CGA

Because of cython limitations, C++ enum classes had to be converted to enums for compatibility. However, there seem to be some workarounds in cython land to make up for that limitation. Worth investigating those WARs to avoid violating good C++ coding guidelines

Path Missing CMakeLists.txt

I tried to install ClaraGenomics on Ubuntu 18.04 using the command cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install, but I'm getting this error:

The source directory /home/vaibhavcurl/ClaraGenomicsAnalysis-master/3rdparty/bioparser does not contain a CMakeLists.txt file.

The source directory /home/vaibhavcurl/ClaraGenomicsAnalysis-master/3rdparty/spdlog does not contain a CMakeLists.txt file.
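This error pattern usually indicates that the 3rdparty submodules were never fetched, i.e. the repository was downloaded as an archive or cloned without --recursive (see the clone instructions above). A likely fix, run from the checkout root, is sketched below (demonstrated in a throwaway repo so the snippet is self-contained; in practice run only the `git submodule` line in your checkout):

```shell
# Fetch missing submodules into an already-cloned checkout.
cd "$(mktemp -d)" && git init -q demo && cd demo
git submodule update --init --recursive   # no-op here; fetches 3rdparty/ in a real checkout
echo "submodules updated"
```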

Add CIGAR alignments to cudamapper using cudaaligner

  • Add optional cigar attribute to Overlap objects.
  • Add -a flag to cudamapper for the option of computing alignments
  • Once overlapping is complete, alignments can be completed in batches using cudaaligner.
  • If alignments are computed, they should be added to the PAF file (the relevant modification needs to be performed in the print_paf function).

Remove Ubuntu dependency

On Linux distributions which aren't Ubuntu or CentOS, building the source code fails with an 'unrecognized distro' fatal error. This error occurs in packaging, which is not relevant to building the rest of the code for use on a given machine, and so should not block it.

[cudapoa] Lost lines in MSA output

In the example pyclaragenomics/samples/sample_cudapoa, the maximum sequences per POA is specified as 100, though the outputs are only 99 long. Changing the maximum sequences to 50 results in outputs of length 49. It appears that it is the final input sequence that is lost.

Do final steps of index generation in IndexGPU on GPU

As specified in PR #134, the last part of building the index in IndexGPU (done in details::index_gpu::build_index()) is still done on the CPU and takes about half of the total execution time of IndexGPU generation.

Look for a way to move it to the GPU

Index should accept SketchElement implementation as template parameter

Currently Index works with pointers to SketchElement, meaning we have to use std::vector<std::unique_ptr<SketchElement>>, which is bad for performance and makes it hard to use that data on the GPU.
Change the implementation so that Index (or its constructor) accepts one implementation of SketchElement and then works with std::vector<SketchElementImpl>.

[cudapoa] CudaPoaBatch.get_msa() incorrectly reports success on failure with large inputs

The results of at least the Python binding can be unexpected when the maximum MSA width is surpassed (default 1024, from cudapoa_kernels.cuh). I’ve observed the status being reported as 0 while the results are slightly mangled.

For example, I’ve input 70 sequences of ~660 bases; the status is 0, and the lengths of the strings returned for the MSA are not equal (often the first being longer than the rest). Taking one or a few bases away from the inputs gives MSA lines uniformly of length 1023.

Solve performance regression caused by chunked Index Generator

#100 allows indexing of an arbitrarily large set of sequences but introduces a performance regression. This is caused by the fact that several sorted lists of SketchElements now need to be merged together. The merging is not being performed in an optimal way and could be improved with multithreading to run in ~log(N) time. There is also the possibility of performing it on the GPU.

revert CGA_CU_CHECK_ERR to abort on error

Revert the CGA_CU_CHECK_ERR functionality to abort on error for Release builds, and to assert(false) and then abort for Debug builds.

Potentially add cudaDeviceSynchronize in debug builds to catch errors when they occur

Make SDK functions "current device"-neutral

If we assume our users use CUDA also outside of our library, we should also ensure that our methods are "current device"-neutral, i.e. that we reset the device (cudaSetDevice) at the end of each method to the value it had when it entered the method.

`IndexGenerator` and `Matcher` unable to allocate memory on GPU when number of reads too large

When running overlaps with a FASTA/FASTQ file that is too large (e.g. >500 MB), the following error is encountered:

terminate called after throwing an instance of 'claragenomics::device_memory_allocation_exception'
  what():  Could not allocate device memory!

This happens because the on-device memory requirements of IndexGenerator and Matcher scale with the size of the input reads.

The solution is to implement a "chunked" version of IndexGenerator and Matcher.

Still about single batch

For my GPU analyses, I tried running single-batch with the nv profiler. It seems that there are only two kernels, where the dominant one is generatePOAKernel. The other is generateConsensusKernel, which is not important. So this benchmark is not going to solve the problem and is only good for generating the graph. Am I right? I am not an expert in this field and want to analyze some GPU things. I don't know if that graph generation is a big problem.

A single run of batch=256 takes:

Time = 8094 ms
CPU = 8092 ms
Iterations = 1

So, where is the GPU in the results?
