kseqpp's Introduction

kseq++

kseq++ is a C++11 re-implementation of kseq.h by Heng Li. The goal of the re-implementation is to provide a better API and resource management while preserving the flexibility and performance of kseq. Like the original kseq, this parser is based on a generic stream buffer and works with different file types. However, instead of using C macros, it uses C++ templates.

It inherits all features from kseq (quoting from kseq homepage):

  • Parse both FASTA and FASTQ format, and even a mixture of FASTA and FASTQ records in one file.
  • Seamlessly adapt to gzipped compressed file when used with zlib.
  • Support multi-line FASTQ.
  • Work on a stream with an internal stream buffer.

and additionally provides:

  • simpler and more readable API
  • RAII-style memory management
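
To make the inherited features more concrete, the following is a simplified sketch of what reading a mixture of multi-line FASTA and FASTQ records from a stream involves. It is not the kseq++ (or kseq.h) implementation: it reads from a std::istream rather than a generic stream buffer, and the names Record and read_record are hypothetical. The quality string is read until it is as long as the sequence, which is why a quality line starting with '@' is not mistaken for a header.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type for this sketch (kseq++ stores records in KSeq).
struct Record {
  std::string name, comment, seq, qual;
};

// Reads the next FASTA or FASTQ record from `in`.
// Returns false when no record is left or the record is truncated.
bool read_record(std::istream& in, Record& rec)
{
  std::string line;
  // Skip to the next header line, which starts with '>' or '@'.
  while (std::getline(in, line)) {
    if (!line.empty() && (line[0] == '>' || line[0] == '@')) break;
  }
  if (!in) return false;

  // Header: the name runs up to the first whitespace, the rest is the comment.
  rec = Record{};
  std::size_t sp = line.find_first_of(" \t");
  rec.name = line.substr(1, sp == std::string::npos ? sp : sp - 1);
  if (sp != std::string::npos) rec.comment = line.substr(sp + 1);

  // Sequence lines continue until the next header, a '+' separator, or EOF.
  const int eof = std::istream::traits_type::eof();
  int c;
  while ((c = in.peek()) != eof && c != '>' && c != '@' && c != '+') {
    std::getline(in, line);
    rec.seq += line;
  }
  if (c != '+') return true;  // FASTA record: no quality string

  std::getline(in, line);  // consume the '+' separator line
  // The quality string may span several lines as well.
  while (rec.qual.size() < rec.seq.size() && std::getline(in, line)) {
    rec.qual += line;
  }
  return rec.qual.size() == rec.seq.size();
}
```

Calling read_record in a loop on a std::istringstream or std::ifstream yields one record per call; records with an empty qual are FASTA records.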

The library also comes with a FASTA/Q writer. Like reading, it can write mixed multi-line FASTA and FASTQ records with gzip compression. The writer is multi-threaded and the actual write function call happens in another thread in order to hide the IO latency.
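
The idea of hiding IO latency behind a second thread can be sketched with two buffers: the caller fills one while a worker thread writes the other. This is an illustration under assumptions, not the library's writer: the AsyncWriter class, its put and flush methods, and the sink callback (standing in for a blocking write call) are all hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <utility>

// Hypothetical double-buffered writer; `sink` stands in for a blocking write call.
class AsyncWriter {
public:
  AsyncWriter(std::function<void(const std::string&)> sink, std::size_t capacity)
    : sink_(std::move(sink)), capacity_(capacity)
  {
    assert(capacity_ > 0);
  }

  AsyncWriter(const AsyncWriter&) = delete;
  AsyncWriter& operator=(const AsyncWriter&) = delete;

  // Write out whatever is left and wait for the worker before destruction.
  ~AsyncWriter() { flush(); }

  // Append data to the front buffer; hand it over once it is full.
  void put(const std::string& data)
  {
    front_ += data;
    if (front_.size() >= capacity_) hand_over();
  }

  void flush()
  {
    if (!front_.empty()) hand_over();
    if (worker_.joinable()) worker_.join();
  }

private:
  // Wait for the previous write, then write the filled buffer in a new thread
  // while the caller keeps filling the other buffer.
  void hand_over()
  {
    if (worker_.joinable()) worker_.join();
    back_.swap(front_);
    front_.clear();  // the old back buffer has already been written
    worker_ = std::thread([this] { sink_(back_); });
  }

  std::function<void(const std::string&)> sink_;
  std::size_t capacity_;
  std::string front_;  // being filled by the caller
  std::string back_;   // being written by the worker
  std::thread worker_;
};
```

Because each hand-over joins the previous worker first, writes reach the sink in order and at most one write is in flight at a time. Depending on the toolchain, building this may require the -pthread flag.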

The RAII-style class KStream is the core class which handles input and output streams. Each FASTA or FASTQ record will be stored in a KSeq object.

This library provides another layer of abstraction which hides most details and provides a very simple API on top of KStream: the SeqStreamIn and SeqStreamOut classes for reading and writing a sequence file, respectively, with exactly the same interface. It is highly recommended to use these classes unless you intend to use the low-level interface, e.g. to change the buffer size or use a custom stream type.

Looking for a quick start guide?

Jump to Examples or Installation.

KStream (kseq++.hpp)

KStream is a generic class template with the following template parameters, which are usually inferred by the compiler on construction (so there is no need to provide them manually):

  • TFile: type of the underlying stream/file (e.g. gzFile)
  • TFunc: type of the read/write function corresponding to TFile (e.g. int (*)(gzFile_s*, const void*, unsigned int) for an output stream with gzFile as underlying file type)
  • TSpec: stream opening mode (with values: mode::in or mode::out)

The template parameters are inferred by the compiler in C++17 when a KStream is instantiated by calling its constructor. The make_kstream function family also constructs KStreams, which is useful for inferring the template parameters with older standards, e.g. C++11 or C++14.

Constructing an instance requires at least three arguments: 1) the file object/pointer/descriptor (can be of any type), 2) its corresponding read/write function, and 3) the stream opening mode (see Examples).
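
The pattern described above (a class template parameterised on the file type and its read function, plus a factory function that lets a pre-C++17 compiler deduce both) can be illustrated with a toy stand-in. Everything in this sketch is hypothetical and much simpler than KStream: MemFile is an in-memory "file", mem_read mimics the shape of zlib's gzread, and MiniStream and make_ministream are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical in-memory "file" and a read function shaped like zlib's gzread.
struct MemFile {
  std::string data;
  std::size_t pos;
};

inline int mem_read(MemFile* f, void* buf, unsigned int len)
{
  std::size_t n = f->data.size() - f->pos;
  if (n > len) n = len;
  std::memcpy(buf, f->data.data() + f->pos, n);
  f->pos += n;
  return static_cast<int>(n);
}

// Toy generic input stream: any file handle plus its read function.
template <typename TFile, typename TFunc>
class MiniStream {
public:
  MiniStream(TFile f, TFunc func) : file_(f), read_(func), begin_(0), end_(0) {}

  // Returns the next byte, or -1 once the read function reports no more data.
  int getc()
  {
    if (begin_ == end_) {  // internal buffer exhausted: refill it
      int n = read_(file_, buf_, static_cast<unsigned int>(sizeof(buf_)));
      if (n <= 0) return -1;
      begin_ = 0;
      end_ = static_cast<std::size_t>(n);
    }
    return static_cast<unsigned char>(buf_[begin_++]);
  }

private:
  TFile file_;
  TFunc read_;
  char buf_[16];  // deliberately tiny internal buffer
  std::size_t begin_, end_;
};

// Factory that lets a C++11 compiler deduce TFile and TFunc from the arguments.
template <typename TFile, typename TFunc>
MiniStream<TFile, TFunc> make_ministream(TFile f, TFunc func)
{
  return MiniStream<TFile, TFunc>(f, func);
}
```

With this in place, `auto ms = make_ministream(&file, mem_read);` compiles under C++11 without spelling out either template argument; in C++17 the constructor call `MiniStream ms(&file, mem_read);` would deduce them directly.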

Higher-level API (seqio.hpp)

This header file defines the SeqStream class set, i.e. SeqStreamIn and SeqStreamOut. The SeqStream classes inherit from KStream and add simpler constructors with sensible defaults. They do not define any new methods or override inherited ones, so they can be treated the same way as KStream.

In order to avoid imposing unwanted external dependencies (e.g. zlib), the SeqStream class set is defined in a header file (seqio.hpp) separate from the core library.

Examples

Reading a sequence file

These examples read FASTQ/A records one by one from either a compressed or an uncompressed file.

Using SeqStreamIn:

#include <iostream>
#include <kseq++/seqio.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  KSeq record;
  SeqStreamIn iss("file.fq.gz");
  while (iss >> record) {
    std::cout << record.name << std::endl;
    if (!record.comment.empty()) std::cout << record.comment << std::endl;
    std::cout << record.seq << std::endl;
    if (!record.qual.empty()) std::cout << record.qual << std::endl;
  }
}

Low-level API

Using KStream

#include <iostream>
#include <zlib.h>
#include <kseq++/kseq++.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  if (argc < 2) return 1;  // expects the input file as the first argument
  KSeq record;
  gzFile fp = gzopen(argv[1], "r");
  if (fp == nullptr) return 1;  // the file could not be opened
  auto ks = make_kstream(fp, gzread, mode::in);
  // auto ks = KStream(fp, gzread, mode::in);  // C++17
  // auto ks = KStreamIn(fp, gzread);  // C++17
  while (ks >> record) {
    std::cout << record.name << std::endl;
    if (!record.comment.empty()) std::cout << record.comment << std::endl;
    std::cout << record.seq << std::endl;
    if (!record.qual.empty()) std::cout << record.qual << std::endl;
  }
  gzclose(fp);
}

Alternatively, records can be fetched in chunks and stored in a std::vector<KSeq>.

Using SeqStreamIn:

#include <iostream>
#include <kseq++/seqio.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  SeqStreamIn iss("file.fq");
  auto records = iss.read();
  // auto records = iss.read(100);  // read a chunk of 100 records
}

Low-level API

Using KStream

#include <iostream>
#include <zlib.h>
#include <kseq++/kseq++.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  if (argc < 2) return 1;  // expects the input file as the first argument
  gzFile fp = gzopen(argv[1], "r");
  if (fp == nullptr) return 1;  // the file could not be opened
  auto ks = make_ikstream(fp, gzread);
  auto records = ks.read();  // fetch all the records
  // auto records = ks.read(100);  // read a chunk of 100 records
  gzclose(fp);
}

Writing a sequence file

These examples write FASTA/Q records to an uncompressed file.

Using SeqStreamOut:

#include <iostream>
#include <kseq++/seqio.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  /* let `records` be a list of FASTA/Q records */
  SeqStreamOut oss("file.dat");
  for (KSeq const& r : records) oss << r;
}

Low-level API

Using KStream

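#include <iostream>
#include <fcntl.h>   // open
#include <unistd.h>  // write, close
#include <kseq++/kseq++.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  if (argc < 2) return 1;  // expects the output file as the first argument
  /* let `records` be a list of FASTA/Q records */
  int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);  // create the file if missing
  if (fd < 0) return 1;  // the file could not be opened
  auto ks = make_kstream(fd, write, mode::out);
  // auto ks = KStreamOut(fd, write);  // C++17
  // ...
  for (KSeq const& r : records) ks << r;
  ks << kend;
  close(fd);
}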

Another example for writing a series of FASTQ records to a gzipped file in FASTA format:

#include <iostream>
#include <kseq++/seqio.hpp>

using namespace klibpp;

int main(int argc, char* argv[])
{
  /* let `records` be a list of FASTQ records */
  SeqStreamOut oss("file.fa.gz", /* compression */ true, format::fasta);
  for (KSeq const& r : records) oss << r;
}

NOTE

The buffer will be flushed to the file when the KStream object goes out of scope. Otherwise, ks << kend must be called before closing the file to make sure that no data is lost.

There is no need to write kend to the stream if using SeqStreamOut.
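
The ordering issue behind this note can be reproduced with a toy buffered writer that flushes in its destructor. This is only an illustration of the RAII pattern: Sink (standing in for a file that rejects writes once closed) and BufferedOut are hypothetical types, not kseq++ classes.

```cpp
#include <cassert>
#include <string>

// Hypothetical sink that drops writes once closed, standing in for a file.
struct Sink {
  std::string content;
  bool open = true;
  void write(const std::string& s) { if (open) content += s; }
  void close() { open = false; }
};

// Hypothetical buffered writer that flushes in its destructor (RAII).
class BufferedOut {
public:
  explicit BufferedOut(Sink& sink) : sink_(sink) {}
  ~BufferedOut() { flush(); }
  BufferedOut(const BufferedOut&) = delete;
  BufferedOut& operator=(const BufferedOut&) = delete;

  // Data only reaches the sink on flush(), not on put().
  void put(const std::string& s) { buffer_ += s; }

  void flush()
  {
    sink_.write(buffer_);
    buffer_.clear();
  }

private:
  Sink& sink_;
  std::string buffer_;
};
```

If the sink is closed while the writer is still alive and holds unflushed data, the destructor's flush arrives too late and the data is lost. Flushing explicitly before closing, or letting the writer go out of scope first, avoids that.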


Wrapping seq/qual lines

While writing a record to a file, the sequence and quality scores can be wrapped at a certain length. The default wrapping length for FASTA format is 60 bp and can be customised with the KStream::set_wraplen method. For FASTQ format, i.e. when the format is explicitly set to format::fastq, the output sequence and quality strings are not wrapped by default.

Wrapping can be disabled or enabled with the KStream::set_nowrapping and KStream::set_wrapping methods, respectively. The latter resets the wrapping length to the default value (60 bp).
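
What wrapping does to a sequence or quality string can be shown with a small standalone helper. The wrap function below is a hypothetical illustration, not part of the library, and treating a length of 0 as "no wrapping" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: break `s` into lines of at most `wraplen` characters.
// A `wraplen` of 0 means "no wrapping" here (an assumption of this sketch).
std::string wrap(const std::string& s, std::size_t wraplen = 60)
{
  if (wraplen == 0 || s.size() <= wraplen) return s + '\n';
  std::string out;
  out.reserve(s.size() + s.size() / wraplen + 1);
  for (std::size_t i = 0; i < s.size(); i += wraplen) {
    out.append(s, i, wraplen);  // the last chunk may be shorter
    out += '\n';
  }
  return out;
}
```

With the default length, a 70-character sequence comes out as one line of 60 characters followed by one line of 10.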

Formatting

The default behaviour is to write a record in FASTQ format if it has quality information. Otherwise, i.e. when the quality string is empty, the record will be written in FASTA format. So, the output might be a mixture of FASTQ and FASTA records. However, the output format can be forced by using format::fasta and format::fastq modifiers. For example:

out << format::fasta << fastq_record;
out << another_record;  // all other calls after this will also be in FASTA format.

will write a FASTQ record in FASTA format. These modifiers affect all writes after them until another modifier is used. The format::mix modifier reverts the behaviour to default.


NOTE

Writing a FASTA record in FASTQ format throws an exception unless the record is empty (a record with empty sequence and quality string).
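
The format selection rules described in this section can be summarised in a few lines. The sketch below is a standalone illustration with hypothetical names (Record, Format, render); it ignores wrapping and comments, and the exception type is an assumption, since the text above only says that an exception is thrown.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical types for this sketch; the library uses KSeq and format modifiers.
struct Record {
  std::string name, seq, qual;
};

enum class Format { mix, fasta, fastq };

// Renders one record as text according to the selected output format.
std::string render(const Record& r, Format fmt)
{
  bool has_qual = !r.qual.empty();
  // Default (mix): FASTQ if the record has qualities, FASTA otherwise.
  bool as_fastq = (fmt == Format::fastq) || (fmt == Format::mix && has_qual);
  if (!as_fastq) return ">" + r.name + "\n" + r.seq + "\n";
  // A record without qualities cannot be written as FASTQ unless it is empty.
  if (!has_qual && !r.seq.empty()) {
    throw std::invalid_argument("record has no quality string");  // assumed type
  }
  return "@" + r.name + "\n" + r.seq + "\n+\n" + r.qual + "\n";
}
```

Note that under the default rule a record with an empty sequence and empty quality string has no quality information and is therefore rendered as FASTA; forcing Format::fastq is what produces the four-line form for such a record in this sketch.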


Installation

kseq++ is a header-only library and can be simply included in a project. Use the package provided in the Releases section and copy include/kseq++ to your project tree.

kseq++.hpp is the core header file; seqio.hpp is optional and only needs to be included when using the higher-level API (see above). The latter depends on zlib, which must be linked.

There are also other ways to install the library:

From source

Installing from source requires CMake >= 3.10:

git clone https://github.com/cartoonist/kseqpp
cd kseqpp
mkdir build && cd build
cmake .. # -DCMAKE_INSTALL_PREFIX=/path/to/custom/install/prefix (optional)
make install

From conda

It is also distributed on bioconda:

conda install -c bioconda kseqpp

Development

CMake integration

After installing the library, you can import it into your project using find_package. It imports the kseq++::kseq++ target, which can be passed to target_include_directories and target_link_libraries calls. This is a sample CMake file for building myprogram, which uses the library:

cmake_minimum_required(VERSION 3.10)
project(myprogram VERSION 0.0.1 LANGUAGES CXX)

find_package(kseq++ REQUIRED)

set(SOURCES "src/main.cpp")
add_executable(myprogram ${SOURCES})
target_include_directories(myprogram
  PRIVATE kseq++::kseq++)
target_link_libraries(myprogram
  PRIVATE kseq++::kseq++)

CMake options:

  • for building tests: -DBUILD_TESTING=on
  • for building benchmark: -DBUILD_BENCHMARKING=on

Benchmark

NOTE: The results below are based on older versions of kseq++ and kseq.h.

  • TODO Update benchmark

NOTE: It is fair to say that kseq++ comes with negligible overhead and is almost as fast as kseq.h (in 'read' mode), while offering an idiomatic C++ API and more convenient resource management. The original kseq.h does not support writing FASTA/Q files.

Datasets

For this benchmark, I re-used sequence files from SeqKit benchmark: seqkit-benchmark-data.tar.gz

file          format  type  num_seqs   sum_len        min_len  avg_len       max_len
dataset_A.fa  FASTA   DNA   67,748     2,807,643,808  56       41,442.5      5,976,145
dataset_B.fa  FASTA   DNA   194        3,099,750,718  970      15,978,096.5  248,956,422
dataset_C.fq  FASTQ   DNA   9,186,045  918,604,500    100      100           100

Platform

  • CPU: Intel® Xeon® CPU E3-1241 v3 @ 3.50GHz, 4 cores, 8 threads
  • RAM: DDR3 1600 MHz, 16352 MB
  • HDD: Seagate Desktop HDD 500GB, 16MB Cache, SATA-3
  • OS: Debian GNU/Linux 9.4 (stretch), Linux 4.9.91-1-amd64-smp
  • Compiler: GCC 6.3.0, compiled with optimisation level 3 (-O3)

Result (for kseq++ v0.1.4)

Reading all records

file          kseq++  kseq    SeqAn   kseq++/read*  SeqAn/readRecords**
dataset_A.fa  2.35 s  2.5 s   2.92 s  3.52 s        4.94 s
dataset_B.fa  2.66 s  2.8 s   3.34 s  3.74 s        9.82 s
dataset_C.fq  2.56 s  2.46 s  2.66 s  4.56 s        11.8 s

* storing all records in std::vector.

** storing all records in seqan2::StringSet< seqan2::CharString >.

Writing all records

file          kseq++/plain  kseq++/gzipped  SeqAn/plain
dataset_A.fa  2.3 s         866 s           2.29 s
dataset_B.fa  2.19 s        849 s           2.33 s
dataset_C.fq  1.94 s        365 s           2.24 s

kseqpp's People

Contributors

cartoonist, rob-p

kseqpp's Issues

Unable to install the library using provided CMakeLists.txt

Library installation instructions would be greatly appreciated!

Upon trying to install the library using cmake... The following results:

-- The CXX compiler identification is GNU 8.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/local/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7")
-- Found BZip2: /usr/lib64/libbz2.so (found version "1.0.6")
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - found
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
CMake Error at CMakeLists.txt:17 (add_library):
  add_library INTERFACE library requires no source arguments.


CMake Error at CMakeLists.txt:19 (target_sources):
  Cannot specify sources for target "kseq++" which is not built by this
  project.


CMake Error at CMakeLists.txt:21 (target_include_directories):
  Cannot specify include directories for target "kseq++" which is not built
  by this project.


CMake Error at CMakeLists.txt:28 (target_link_libraries):
  Cannot specify link libraries for target "kseq++" which is not built by
  this project.


-- Configuring incomplete, errors occurred!

feature request, make threads optional

this is a great minimal library, thanks!
it would be nice to have std::thread as an ifdef option.
these are my changes, not tested yet.

line 29

#ifdef KSEQPP_THREADS
  #include <thread>
#endif

line 100

#ifdef KSEQPP_THREADS
        std::thread worker;                             /**< @brief worker thread */
#endif

line 430

          inline void
        worker_join( )
        {
#ifdef KSEQPP_THREADS
          this->async_write( true );
          if ( this->worker.joinable() ) this->worker.join();
#else
          this->async_write( false );
#endif
        }

          inline void
        worker_start( )
        {
#ifdef KSEQPP_THREADS
          this->worker = std::thread( &KStream::writer, this );
#else
          this->writer();
#endif
        }

Broken output if seq/qual is empty

After doing some adapter trimming on a read I wound up with an empty sequence (and corresponding empty qualities). When I write that out instead of four lines (two of them blank) I get only two lines (name line and one blank line). And the name line starts with a greater-than sign (>) instead of an at-sign (@). Setting the wraplen to larger values (I tried 1000 and 10000) had no effect.

Code sample follows:

// g++ -Wall -Wextra -O3 -std=c++14 -I /path/to/kseqpp/src emptySeq.cpp -o emptySeq -lz -pthread

#include <iostream>
#include <seqio.h>

int main()
{
    klibpp::SeqStreamOut out("/dev/stdout");

    klibpp::KSeq s;
    s.name = "should start with an at-sign";
    s.seq = "";
    s.qual = "";
    out << s;

    return 0;
}

I also tried opening /dev/tty, /dev/stderr and a file on disk. Same output.

If I set the seq/qual to even a single character it behaves fine.

Using g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609 on Ubuntu 16.04.6 LTS.

Maybe using libdeflate for decompression is faster?

When decompressing a file that has been compressed with gzip, kseqpp relies on zlib for decompression. However, there is a library called libdeflate which offers faster decompression speed compared to zlib. Perhaps using libdeflate could result in improved performance?

ZLIB cannot be linked...

When compiling the code containing kseq++ it keeps complaining about the ZLIB:

undefined reference to `gzopen'
undefined reference to `gzread'

Although the ZLIB is being interfaced in the CMAKE configurations of kseq++, the following lines were necessary to be added to the project's CMakeLists.txt to force linking ZLIB:

find_package(ZLIB)
target_link_libraries(_target_ ZLIB::ZLIB)

Add usage and installation guide

Suggested in #2, there should be a section in the README file explaining how to install the library and alternatively how to bundle it in the source tree without installation (i.e. as a git submodule).

A problem with a newline when writing FASTQ

When writing a FASTQ file using this library, I find a newline is added into sequence line and quality line by default. My output result is like this:

@name
AGAAAGCTTCATGTTTCAATTGGCCAAAGAATAAGGTAGAGTACGGTAGAAAATAGAGTG
AGCGAACAAA
+
222222222222222222222222222222222222222222222222222222222222
2222222222

Is there any way to avoid the newline?

No default constructor for defined classes

Hi, I want to make a map of {int -> SeqStreamOut} to output reads to different files according to the int key, but the compiler says "no matching constructor for initialization of SeqStreamOut". Could anyone help me? Thank you.
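
A likely cause is that std::map::operator[] needs a default-constructible mapped type. One way around this, sketched below, is to construct the value in place with emplace. The Sink class is a hypothetical stand-in for any stream class without a default constructor; the sketch assumes nothing about SeqStreamOut beyond it being constructible from a file name, and the file naming scheme is made up.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for a stream class with no default constructor.
class Sink {
public:
  explicit Sink(const std::string& path) : path_(path) {}
  Sink(const Sink&) = delete;
  Sink& operator=(const Sink&) = delete;
  const std::string& path() const { return path_; }

private:
  std::string path_;
};

// Returns the sink for `key`, constructing it in place on first use.
Sink& sink_for(std::map<int, Sink>& sinks, int key)
{
  auto it = sinks.find(key);
  if (it == sinks.end()) {
    // piecewise_construct builds the value inside the map node, so the
    // mapped type needs neither a default nor a copy/move constructor.
    it = sinks.emplace(std::piecewise_construct,
                       std::forward_as_tuple(key),
                       std::forward_as_tuple("out_" + std::to_string(key) + ".fq"))
             .first;
  }
  return it->second;
}
```

An alternative with the same effect is a std::map<int, std::unique_ptr<Sink>>, which keeps operator[] usable at the cost of one heap allocation per stream.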
