Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and to process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives, but their number decreases exponentially with the query length and depends on the false positive rate of the index, which is fixed at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
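
To get a feeling for how quickly false positives vanish with longer queries, the following sketch computes the chance that a document containing none of the query k-mers still reaches the coverage threshold. It assumes that every query k-mer is reported as a false positive independently, with probability equal to the false positive rate of the index. This simplified binomial model and the function name `spurious_hit_probability` are illustrative assumptions, not part of COBS.

```python
from math import ceil, comb


def spurious_hit_probability(num_kmers: int, fpr: float, threshold: float) -> float:
    """Chance that an unrelated document reaches the coverage threshold.

    Simplified model (an assumption): each of the num_kmers query k-mers is a
    false positive independently with probability fpr.
    """
    if num_kmers < 1 or not 0.0 <= fpr <= 1.0 or not 0.0 <= threshold <= 1.0:
        raise ValueError("invalid parameters")
    # Smallest number of k-mer hits that satisfies the threshold; the small
    # tolerance guards against floating point noise in the product.
    needed = max(0, ceil(threshold * num_kmers - 1e-9))
    # Upper tail of the binomial distribution.
    return sum(
        comb(num_kmers, hits) * fpr**hits * (1.0 - fpr) ** (num_kmers - hits)
        for hits in range(needed, num_kmers + 1)
    )
```

Under this model, with a false positive rate of 0.3 and a threshold of 0.8, the probability for a query of 100 k-mers is far smaller than for a query of 10 k-mers.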

COBS has two interfaces:

a command line tool in C++ called cobs (see below)
a Python interface to the C++ library (see https://bingmann.github.io/cobs-python-docs/)

More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE). pages 285-303. Springer. October 2019. preprint arXiv:1905.09624.

If you use COBS in an academic context or publication, please cite our paper:

@InProceedings{bingmann2019cobs,
  author =       {Timo Bingmann and Phelim Bradley and Florian Gauger and Zamin Iqbal},
  title =        {{COBS}: a Compact Bit-Sliced Signature Index},
  booktitle =    {26th International Symposium on String Processing and Information Retrieval (SPIRE)},
  year =         2019,
  series =       {LNCS},
  pages =        {285--303},
  month =        oct,
  organization = {Springer},
  note =         {preprint arXiv:1905.09624},
}

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

Linux

To download and install COBS run:

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

and optionally run make test to check the build.

OS X compilation

Using `gcc`

Install gcc-11 or more recent: brew install gcc@11
Compile COBS: cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11 ..

Using `clang`:

Install boost-1.76: brew install boost@1.76
Compile COBS with boost: cmake ..

Troubleshooting

Several issues might arise from your specific configuration.

Problems with OpenMP on Mac OS X

If installing OpenMP does not work, add the -DNOOPENMP=1 argument to the cmake command.

Problems with Python bindings

Skip compilation of the Python bindings by adding the -DSKIP_PYTHON=1 argument to the cmake command.

Problems with finding boost

Define the BOOST_ROOT environment variable and then compile:

export BOOST_ROOT="/usr/local/opt/boost@1.76"  # use your boost root path; this is the path when boost was installed with brew on Mac OS X
cmake ..

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fna, *.ffn, *.faa, *.frn, *.fa.gz, *.fasta.gz, *.fna.gz, *.ffn.gz, *.faa.gz, *.frn.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). See below for details on how they are parsed.

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.

To check the document list to be indexed, run for example

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents run

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Or construct a compact COBS index from a list of documents by running

src/cobs compact-construct tests/data/fasta_files.list example.cobs_compact

The paths in the file list can be absolute or relative to the file list's path. Note that *.txt files are read as verbatim text files. You can force COBS to read a .txt file as a file list using --file-type list.
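
If you prefer to control exactly which files are indexed, a file list can be generated with a few lines of Python. This is only a sketch: it assumes that a *.list file contains one path per line, the helper name `write_file_list` is made up for illustration, and the extension tuple is a subset of the FASTA/FASTQ extensions named above that you may want to adjust.

```python
import os
from pathlib import Path

# Subset of the FASTA/FASTQ extensions listed above; adjust as needed.
EXTENSIONS = (
    ".fa", ".fasta", ".fna", ".fa.gz", ".fasta.gz", ".fna.gz",
    ".fq", ".fastq", ".fq.gz", ".fastq.gz",
)


def write_file_list(root, list_path):
    """Write all matching files below root into list_path.

    Assumption: one path per line. Paths are written relative to the
    directory that contains the list file.
    """
    root = Path(root).resolve()
    list_path = Path(list_path).resolve()
    matches = sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.name.endswith(EXTENSIONS)
    )
    lines = [os.path.relpath(path, list_path.parent) for path in matches]
    list_path.write_text("".join(line + "\n" for line in lines))
    return lines
```

The resulting file can then be passed to compact-construct in place of a directory.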

See --help for the full list of options.

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

or pass a FASTA file of queries with

src/cobs query -i example.cobs_compact -f query.fa

Multiple indices can be queried at once by adding more -i parameters.
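
When queries are launched from a script, the command line can be assembled programmatically. The sketch below only uses the -i and -f parameters shown above; the helper name `build_query_command` and the default binary path are assumptions for illustration.

```python
from typing import List, Optional, Sequence


def build_query_command(
    indices: Sequence[str],
    sequence: Optional[str] = None,
    fasta: Optional[str] = None,
    cobs_binary: str = "src/cobs",  # assumed location of the built binary
) -> List[str]:
    """Return the argument list for a `cobs query` call over several indices."""
    if not indices:
        raise ValueError("at least one index is required")
    if (sequence is None) == (fasta is None):
        raise ValueError("pass either a query sequence or a FASTA file")
    command = [cobs_binary, "query"]
    for index in indices:
        command += ["-i", index]  # one -i parameter per index
    command += ["-f", fasta] if fasta is not None else [sequence]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)` from the build directory.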

Python Interface

COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. The main results are shown in two diagrams: the first scales by the number of documents in the index, the second shows the results per document.

More Details

File Types and How They Are Parsed

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). Each file type is parsed slightly differently into q-grams or k-mers.

FASTA files are parsed as one document each. If a FASTA file contains multiple sequences or reads, they are combined into one document. Multiple sequences (separated by comments) are NOT simply concatenated; instead, the k-mers are extracted separately from each sequence. This means there are no erroneous k-mers spanning the end of one sequence and the beginning of the next. All newlines within a sequence are removed.
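
The per-sequence extraction can be illustrated with a small Python sketch. It is not the COBS parser, only a model of the behavior described above: lines starting with ">" separate sequences, newlines inside a sequence are dropped, and k-mers never span two sequences. The function name `fasta_kmers` is hypothetical.

```python
from typing import List


def fasta_kmers(text: str, k: int) -> List[str]:
    """Extract k-mers from FASTA text, one sequence at a time."""
    sequences: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A comment line ends the previous sequence.
            if current:
                sequences.append("".join(current))
            current = []
        elif line.strip():
            # Joining the lines removes newlines within a sequence.
            current.append(line.strip())
    if current:
        sequences.append("".join(current))

    kmers: List[str] = []
    for sequence in sequences:
        # Sliding window inside one sequence only, never across a boundary.
        kmers.extend(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return kmers
```

For two sequences ACGT and GGCC with k = 3, the k-mers GTG and TGG, which would only exist in the naive concatenation ACGTGGCC, are not produced.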

The k-mers from DNA sequences are automatically canonicalized (of a k-mer and its reverse complement, the lexicographically smaller is indexed). This process can be skipped by adding the flag --no-canonicalize. With canonicalization, only ACGT letters are indexed; every other letter is mapped to binary zeros and indexed with the other data. A warning is printed for each FASTA/FASTQ file containing a non-ACGT letter, but processing continues. With the flag --no-canonicalize, any letters or text can be indexed.
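
As an illustration, canonicalization in its usual sense picks the lexicographically smaller of a k-mer and its reverse complement, so that both strands of a DNA fragment map to the same term. The sketch below follows that common definition on plain ACGT strings; how COBS orders and encodes k-mers internally is not shown here, and the function name is an assumption.

```python
# Translation table for complementing the four DNA letters.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmer(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)
```

Both GATTACA and its reverse complement TGTAATC yield the same canonical k-mer, so a query matches regardless of the strand it was read from.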

FASTQ files are also parsed as one document each. The quality information is dropped and effectively everything is parsed identically to FASTA files.

Multi-FASTA or Multi-FASTQ files are parsed as many documents. Each sequence in the FASTA or FASTQ file is considered a separate document in the COBS index. Their names are appended with _### where ### is the index of the subdocument.

McCortex files (*.ctx) contain a list of k-mers and these k-mers are indexed individually. The graph information is ignored. Only k=31 is currently supported.

Text files (*.txt) are parsed as verbatim binary documents. All q-grams are extracted, including newlines and other whitespace.
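
For text documents, the extraction described above amounts to a sliding window over the raw bytes. A minimal sketch, with a hypothetical function name and operating on an in-memory byte string:

```python
from typing import List


def text_qgrams(data: bytes, q: int) -> List[bytes]:
    """Return every q-gram of data, keeping newlines and other whitespace."""
    return [data[i:i + q] for i in range(len(data) - q + 1)]
```

Note that a q-gram such as b"ab\n" contains the newline itself, so a query must include the same whitespace to match.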

False negatives - some sort of interaction with gzip / bgzip

Strange bug, and might be more of a user error

I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.

If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.

All works as expected with the second file and it finds all the kmers for each of the reference sequences.

broken.fa.gz
works.fa.gz

minimal script:

#!/usr/bin/env python

import gzip

import cobs_index as cobs
from Bio import SeqIO

K = 31                         # k-mer size

p = cobs.CompactIndexParameters()
p.term_size = K
p.clobber = True               # overwrite output and temporary files
p.false_positive_rate = 0.4    # higher false positive rate -> smaller index


def build_and_query(fasta_gz, index_path):
    """Index fasta_gz, then query the index with every sequence it contains."""
    cobs.compact_construct(fasta_gz, index_path, index_params=p)
    s = cobs.Search(index_path)

    with gzip.open(fasta_gz, "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # number of k-mers in this query sequence
            num_kmers = len(record) - K + 1
            print(record.id, num_kmers)
            # two best hits, with the fraction of query k-mers that were found
            for result in s.search(str(record.seq), num_results=2):
                print(result.score, result.doc_name, float(result.score) / num_kmers)


build_and_query("broken.fa.gz", "broken_index.cobs_compact")
build_and_query("works.fa.gz", "works_index.cobs_compact")
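
One hypothesis that could be checked for this report, without claiming it is the actual cause: bgzip writes a series of concatenated gzip members (BGZF blocks), whereas plain gzip normally writes a single member. A reader that stops after the first member would see only part of the data. The sketch below uses only the Python standard library to compare the decompressed size of the first member with that of the whole file; the function name is made up for illustration.

```python
import gzip
import zlib
from typing import Tuple


def first_member_and_total_sizes(compressed: bytes) -> Tuple[int, int]:
    """Return (bytes in the first gzip member, bytes in all members)."""
    # wbits=31 selects the gzip container; a decompression object stops at
    # the end of the first member and leaves the rest in unused_data.
    first = zlib.decompressobj(wbits=31)
    first_size = len(first.decompress(compressed))
    # gzip.decompress reads all concatenated members.
    total_size = len(gzip.decompress(compressed))
    return first_size, total_size
```

Calling it on the contents of broken.fa.gz and works.fa.gz (for example via `open(path, "rb").read()`) shows whether the two files differ in this respect. Differing numbers for one file only would be consistent with the hypothesis, though they would not prove it.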
