cobs's Introduction

Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain a number of false positives which decreases exponentially with the query length and the false positive rate of the index determined at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
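To make the idea concrete, here is a toy Python sketch of a bit-sliced Bloom filter index queried with a coverage threshold. It is an illustration under simplifying assumptions (one signature size shared by all documents, row positions derived from salted SHA-256 digests, everything held in memory) and not COBS's actual implementation. All names (ToyIndex, signature_size, num_hashes) are hypothetical.

```python
import hashlib


def kmers(text, k):
    """Return all overlapping k-mers (q-grams) of text."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


class ToyIndex:
    """Toy bit-sliced signature index: one Bloom filter per document,
    stored row-wise so that row r holds bit r of every document's filter."""

    def __init__(self, docs, k=5, signature_size=1024, num_hashes=3):
        self.k = k
        self.m = signature_size
        self.h = num_hashes
        self.names = list(docs)
        # rows[r] is an integer bitmask; bit d is set if document d has bit r set.
        self.rows = [0] * self.m
        for d, name in enumerate(self.names):
            for term in set(kmers(docs[name], k)):
                for r in self._positions(term):
                    self.rows[r] |= 1 << d

    def _positions(self, term):
        # Derive num_hashes row indices from salted SHA-256 digests (an assumption).
        positions = []
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def query(self, pattern, threshold=0.8):
        """Return (name, hits) for every document that appears to contain at
        least threshold * (number of query k-mers) of the query's k-mers."""
        terms = kmers(pattern, self.k)
        counts = [0] * len(self.names)
        for term in terms:
            # AND the rows of all hash positions: a set bit means "possibly present".
            mask = ~0
            for r in self._positions(term):
                mask &= self.rows[r]
            for d in range(len(self.names)):
                if (mask >> d) & 1:
                    counts[d] += 1
        needed = threshold * len(terms)
        return [(n, c) for n, c in zip(self.names, counts) if c >= needed]
```

A Bloom filter never misses a term that was inserted, so a document that really contains the pattern always passes the threshold. False positives require many k-mers of the query to collide at once, which is why they become rarer as the query gets longer.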

[Figure: COBS architecture]

COBS has two interfaces: a command line tool called cobs and a Python interface, both described below.

More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE). pages 285-303. Springer. October 2019. preprint arXiv:1905.09624.

If you use COBS in an academic context or publication, please cite our paper:

@InProceedings{bingmann2019cobs,
  author =       {Timo Bingmann and Phelim Bradley and Florian Gauger and Zamin Iqbal},
  title =        {{COBS}: a Compact Bit-Sliced Signature Index},
  booktitle =    {26th International Symposium on String Processing and Information Retrieval (SPIRE)},
  year =         2019,
  series =       {LNCS},
  pages =        {285--303},
  month =        oct,
  organization = {Springer},
  note =         {preprint arXiv:1905.09624},
}

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

Linux

To download and install COBS run:

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

and optionally run make test to check the build.

OS X compilation

Using gcc:

  1. Install gcc-11 or more recent: brew install gcc@11
  2. Compile COBS: cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11 ..

Using clang:

  1. Install boost-1.76: brew install boost@1.76
  2. Compile COBS with boost: cmake ..

Troubleshooting

Several issues might arise from your specific configuration.

Problems with OpenMP on Mac OS X

If installing OpenMP does not work, add the -DNOOPENMP=1 argument to the cmake command.

Problems with Python bindings

Skip compilation of the Python bindings by adding the -DSKIP_PYTHON=1 argument to the cmake command.

Problems with finding Boost

Define the BOOST_ROOT environment variable and then compile:

export BOOST_ROOT="/usr/local/opt/boost@1.76"  # use your Boost root path; this is the path when Boost is installed with brew on Mac OS X
cmake ..

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fna, *.ffn, *.faa, *.frn, *.fa.gz, *.fasta.gz, *.fna.gz, *.ffn.gz, *.faa.gz, *.frn.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). See below for details on how they are parsed.

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.

To check the document list to be indexed, run for example

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents run

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Or construct a compact COBS index from a list of documents by running

src/cobs compact-construct tests/data/fasta_files.list example.cobs_compact

The paths in the file list can be absolute or relative to the file list's path. Note that *.txt files are read as verbatim text files. You can force COBS to read a .txt file as a file list using --file-type list.
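A file list can also be generated with a few lines of Python. The sketch below collects FASTA files below a directory and writes their absolute paths to a list file. It assumes the list format is one path per line, which the text above does not spell out; the function name write_file_list and the output name files.list are hypothetical.

```python
from pathlib import Path

# Extensions taken from the list of supported FASTA files above.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna", ".ffn", ".faa", ".frn")


def write_file_list(root, list_path):
    """Write the absolute path of every FASTA file below root into
    list_path, one path per line (assumed format), and return the paths."""
    lines = []
    for p in sorted(Path(root).rglob("*")):
        # Ignore a trailing .gz when checking the extension.
        name = p.name[:-3] if p.name.endswith(".gz") else p.name
        if p.is_file() and name.endswith(FASTA_SUFFIXES):
            lines.append(str(p.resolve()))
    Path(list_path).write_text("".join(line + "\n" for line in lines))
    return lines
```

The resulting file would then be passed to compact-construct in place of the directory, as in the command above.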

Check --help for many options.

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

or a fasta file of queries with

src/cobs query -i example.cobs_compact -f query.fa

Multiple indices can be queried at once by adding more -i parameters.
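The scores printed by the query tool are k-mer counts. A small Python sketch can turn them into the fraction of query k-mers found per document. It assumes the output layout seen in the example outputs quoted in the issues further down (an optional header line starting with *, then one "name score" pair per line) and k = 31, matching the "31-mers" reported there; the function name coverage is hypothetical.

```python
def coverage(results_text, query_length, k=31):
    """Map document name -> fraction of query k-mers reported as present."""
    num_kmers = query_length - k + 1
    if num_kmers <= 0:
        raise ValueError("query is shorter than k")
    fractions = {}
    for line in results_text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and the assumed '*query' header lines
        name, score = line.rsplit(None, 1)
        fractions[name] = int(score) / num_kmers
    return fractions
```

For a query of length 55 there are 55 - 31 + 1 = 25 k-mers, so a score of 24 corresponds to a coverage of 0.96.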

Python Interface

COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. The main results are shown below: the first diagram scales by the number of documents in the index, and the second shows the results per document.

[Figures: COBS experiments, scaling by number of documents; COBS experiments, scaling per document]

More Details

File Types and How They Are Parsed

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). Each file type is parsed slightly differently into q-grams or k-mers.

FASTA files are parsed as one document each. If a FASTA file contains multiple sequences or reads then they are combined into one document. Multiple sequences (separated by comments) are NOT concatenated trivially, instead the k-mers are extracted separately from each sequence. This means there are no erroneous k-mers from the beginning or end of crossing sequences. All newlines within a sequence are removed.
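The per-sequence extraction can be sketched in Python as follows. This is a simplified illustration of the behaviour described above, not COBS's parser: it assumes that sequences are separated by header lines starting with ">" and ignores compression; the function name fasta_kmers is hypothetical.

```python
def fasta_kmers(fasta_text, k):
    """Return the set of k-mers of one FASTA document, extracted per sequence."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):  # a header line starts a new sequence
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line.strip())  # newlines inside a sequence are dropped
    if current:
        sequences.append("".join(current))

    result = set()
    for seq in sequences:
        # k-mers never span two sequences because each one is scanned on its own.
        for i in range(len(seq) - k + 1):
            result.add(seq[i:i + k])
    return result
```

With the two sequences ACGT and GGCC and k = 3, the result contains ACG, CGT, GGC and GCC, but not GTG or TGG, which would only arise from naive concatenation.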

The k-mers from DNA sequences are automatically canonicalized (the lexicographically smaller is indexed). By adding the flag --no-canonicalize this process can be skipped. With canonicalization only ACGT letters are indexed, every other letter is mapped to binary zeros and indexed with the other data. A warning per FASTA/FASTQ file containing a non-ACGT letter is printed, but processing continues. With the flag --no-canonicalize any letters or text can be indexed.
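A minimal Python sketch of canonicalization, assuming the usual definition in which a k-mer is compared with its reverse complement and the lexicographically smaller of the two is kept. The handling of non-ACGT letters is left out, and the function name canonicalize is hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonicalize(kmer):
    """Return the lexicographically smaller of kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


print(canonicalize("ACGTT"))  # AACGT
print(canonicalize("AACGT"))  # AACGT
```

Both strands of the same DNA fragment map to the same canonical k-mer, so a query matches regardless of which strand was sequenced.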

FASTQ files are also parsed as one document each. The quality information is dropped and effectively everything is parsed identically to FASTA files.

Multi-FASTA or Multi-FASTQ files are parsed as many documents. Each sequence in the FASTA or FASTQ file is considered a separate document in the COBS index. Their names are appended with _### where ### is the index of the subdocument.

McCortex files (*.ctx) contain a list of k-mers and these k-mers are indexed individually. The graph information is ignored. Only k=31 is currently supported.

Text files (*.txt) are parsed as verbatim binary documents. All q-grams are extracted, including newlines and other whitespace.
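For text files the extraction amounts to a sliding window over the raw bytes. A short Python sketch, with the hypothetical function name qgrams:

```python
def qgrams(data: bytes, q: int):
    """Return all overlapping q-grams of data; whitespace is kept verbatim."""
    return [data[i:i + q] for i in range(len(data) - q + 1)]


print(qgrams(b"ab\ncd", 3))  # [b'ab\n', b'b\nc', b'\ncd']
```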

cobs's People

Contributors

bingmann, devgg, giang-nghg, iqbal-lab, jnalanko, leoisl, simongog, zhicheng-liu

Forkers

wook2014 aromberg

cobs's Issues

COBS for 600K genomes

Hi,
I am in charge of finding the presence of specific genes in 600.000 Salmonella genomes.
I used COBS on a few genomes for training.
But I don't really understand the output...
I copied a subsequence (55 bp) from one of my genomes and ran COBS to see if it finds it.
In the output I got 24 (see below).
And when I choose a bigger subsequence, sometimes it doesn't find it at all.

Another issue: how can I see whether my query matches fully or only partially?

I ran these commands:
cobs compact-construct index.cobs_compact
cobs query -i index.cobs_compact

--- end of document list (5 entries) ---
documents: 5
minimum 31-mers: 2811023
maximum 31-mers: 2874904
average 31-mers: 2834688
total 31-mers: 14173442
DIE: Output file exists, will not overwrite without --clobber @ /opt/conda/conda-bld/cobs_1646087618998/work/cobs/construction/compact_index.cpp:213
terminate called without an active exception

SRR18349609 24
SRR18349610 24
SRR18349611 24

TIMER info=search hashes=9.929e-06 io=0.000567883 total=0.000577812

Query length 55

I'd really appreciate your help

Thank you!

Result output truncates reference files at first dot

Hello,

We built an index over RefSeq genomes. The downloaded filenames are named like this:

/path/GCF_000019125.1_ASM1912v1_genomic.fna.gz
/path/GCF_000019165.1_ASM1916v1_genomic.fna.gz
...

When searching the index, the result looks as follows:

*query1 XXX
GCF_000019125 XXX
GCF_000019165 XXX
...

Luckily for us, the names are still unique and we should be able to compare the output with some effort to reconstruct the full reference name.

This format is lossy if the names weren't unique before the first dot and might even lead to severe false negatives if not noticed by the user.

Best,
Svenja

Format of output

Could the percentage/proportion of query kmers present in each sample be reported rather than the number of kmers present?

Iteratively adding documents

Apologies for cross posting this bingmann#24 here. I couldn't figure out where would be best.

I have limited storage capacity and would like to build an index of a large number of SRA files. Is it possible to do this iteratively? By this I mean, I would download 10 SRA files, build the index, delete the 10 SRA files, download another 10 SRA files, add these to the existing compact COBS index, delete them, etc...

Alternatively, is there a way of merging multiple COBS indices?

Thanks in advance

False negatives - some sort of interaction with gzip / bgzip

Strange bug, and might be more of a user error

I've created "broken.fa.gz" by downloading 10 public genomes and compressing the file with bgzip.
I've created "works.fa.gz" by uncompressing this file and recompressing with gzip.

If I build a COBS index with the broken file then query the index with each of the 10 sequences, it finds the first 6, 95% of the 7th and ~60% of the remaining 3.

All works as expected with the second file and it finds all the kmers for each of the reference sequences.

broken.fa.gz
works.fa.gz

minimal script:

#!/usr/bin/env python

import cobs_index as cobs
from Bio import SeqIO
import gzip

p = cobs.CompactIndexParameters()
p.term_size = 31               # k-mer size
p.clobber = True               # overwrite output and temporary files
p.false_positive_rate = 0.4    # higher false positive rate -> smaller index

broken = "broken.fa.gz"
cobs.compact_construct(broken, "broken_index.cobs_compact", index_params=p)
s = cobs.Search("broken_index.cobs_compact")

with gzip.open(broken, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

works = "works.fa.gz"
cobs.compact_construct(works, "works_index.cobs_compact", index_params=p)
s = cobs.Search("works_index.cobs_compact")

with gzip.open(works, "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        l = len(record) - 30
        print(record.id, l)
        r = s.search(str(record.seq), num_results=2)
        for result in r:
            print(result.score, result.doc_name, float(result.score)/l)

Error compiling on Apple Silicon/ARM64

Hello,

while trying to install cobs on an Apple Silicon machine, I ran into the following error:

In file included from /Users/adrian/Desktop/cobs/cobs/query/classic_index/mmap_search_file.cpp:9:
In file included from /Users/adrian/Desktop/cobs/cobs/query/classic_index/mmap_search_file.hpp:12:
In file included from /Users/adrian/Desktop/cobs/cobs/query/classic_index/search_file.hpp:13:
In file included from /Users/adrian/Desktop/cobs/cobs/query/index_file.hpp:15:
/Library/Developer/CommandLineTools/usr/lib/clang/15.0.0/include/immintrin.h:14:2: error: "This header is only meant to be used on x86 and x64 architecture"
#error "This header is only meant to be used on x86 and x64 architecture"

Would really appreciate your help with this! It looks like this could be fixed by using something like SIMDe, however I haven't been able to figure it out just yet.

Query reads from fastq.gz files

Hi,
it seems queries are only supported when stored in fasta format.
Do you have any plan to enable query from reads in fastq.gz files?

Thanks,
-Giulio

Improve Mac OS X delivery - make COBS compilable with clang

Right now COBS on Mac has to be compiled by users. We can't provide a conda recipe because COBS requires gcc. We can't use containers as Singularity is not supported on Mac. The real solution for this is for COBS to compile with clang; then it is straightforward to provide a COBS Mac bioconda recipe.
