Sliding window

Quick run

sliding_window build bin_paths.txt --threads 8 --output index.out --size 80m

sliding_window search --index index.out --query example_data/64/reads/all.fastq --pattern 50 --output matches.out --overlap 10

A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

The aim of this repository is to develop a pre-filter for metagenomics read mapping that is based on an Interleaved Bloom Filter (IBF). The IBF is created from the (k,k)-minimiser content of the reference database; because the minimiser window has the same length as the k-mer, this is simply the set of all k-mers. For each query read, the filter excludes parts of the reference database: only reference sequences in which an approximate local match for the query sequence was found are retained. A local match is defined as:

  • length >= w
  • errors <= e

where w is the window length and e is the allowed number of errors. Each read is divided into multiple, possibly overlapping, windows. The (k,k)-minimiser content of each window is then queried in the IBF.
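The windowing step can be pictured with a small shell sketch. It assumes that consecutive windows start a fixed stride apart, equal to the window length minus the overlap, and that only full-length windows are produced. How sliding_window itself treats a trailing remainder shorter than one window is not stated here, and the function name `window_starts` is hypothetical.

```shell
# Hypothetical helper, not part of sliding_window: print the 0-based start
# position of every full window in a read.
#   $1 = read length, $2 = window length, $3 = overlap between windows
# The body runs in a subshell ( ... ) so its variables stay local.
window_starts() (
    read_len=$1
    window=$2
    overlap=$3
    stride=$(( window - overlap ))   # assumed step between window starts
    if [ "$stride" -le 0 ]; then
        echo "overlap must be smaller than the window length" >&2
        exit 1
    fi
    start=0
    while [ $(( start + window )) -le "$read_len" ]; do
        echo "$start"
        start=$(( start + stride ))
    done
)

# A 150 bp read, 50 bp windows, 10 bp overlap: windows start at 0, 40 and 80.
window_starts 150 50 10
```

In this sketch the last 20 bases of the 150 bp read are not covered by a full window; the actual tool may handle that tail differently.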

Download and Installation

Prerequisites

  • CMake >= 3.8
  • GCC 9, 10 or 11 (most recent minor version)
  • git

Refer to the SeqAn3 Setup Tutorial for more in-depth information.
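A quick way to check these prerequisites before building is sketched below. The helper `version_ge` is hypothetical and relies on `sort -V` (version sort, as provided by GNU coreutils); the script only reports problems and does not install anything.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# CMake: the first line of `cmake --version` reads "cmake version X.Y.Z".
if command -v cmake >/dev/null 2>&1; then
    cmake_version=$(cmake --version | head -n 1 | awk '{print $3}')
    version_ge "$cmake_version" 3.8 || echo "CMake $cmake_version is older than 3.8"
else
    echo "cmake not found"
fi

# GCC: compare the major version against the list of supported releases.
if command -v gcc >/dev/null 2>&1; then
    gcc_major=$(gcc -dumpversion | cut -d. -f1)
    case "$gcc_major" in
        9|10|11) ;;
        *) echo "GCC $gcc_major is outside the listed range (9, 10 or 11)" ;;
    esac
else
    echo "gcc not found"
fi

command -v git >/dev/null 2>&1 || echo "git not found"
```

If the script prints nothing, all three tools were found in the expected version range.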

Download current master branch

git clone --recurse-submodules https://github.com/eaasna/sliding-window

Building

cd sliding-window
mkdir -p build
cd build
cmake ..
make

The binary can be found in bin.

You may want to add the executable to your PATH:

export PATH=$(pwd)/bin:$PATH
sliding_window --version

Example Data and Usage

A toy data set is available for download:

wget https://ftp.imp.fu-berlin.de/pub/seiler/raptor/example_data.tar.gz
tar xfz example_data.tar.gz

After extraction, the example_data directory looks like this:

$ tree -L 2 example_data
example_data
├── 1024
│   ├── bins
│   └── reads
└── 64
    ├── bins
    └── reads

The bins directory contains one FASTA file per bin, and the reads directory contains one FASTQ file per bin with reads from the respective bin (with 2 errors). Additionally, mini.fastq (5 reads from each bin), all.fastq (concatenation of all FASTQ files) and all10.fastq (all.fastq repeated 10 times) are provided in the reads directory.

In the following, we will use the 64 data set. To build an index over all bins, we first prepare a file that contains one file path per line (a line corresponds to a bin) and use this file as input:

seq -f "example_data/64/bins/bin_%02g.fasta" 0 1 63 > bin_paths.txt
sliding_window build bin_paths.txt --threads 8 --output index.out --size 80m

You may be prompted to enable or disable automatic update notifications. For questions, please consult the SeqAn documentation.
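Since the build step reads one file path per line, it can be useful to confirm that every listed path exists before building. The following is a minimal sketch; the function name `missing_bins` is hypothetical and is not part of sliding_window.

```shell
# Hypothetical helper: print every path listed in the file $1 that is not a
# readable regular file, and fail if there is at least one such path.
# The body runs in a subshell ( ... ) so its variables stay local.
missing_bins() (
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip empty lines
        if [ ! -f "$path" ] || [ ! -r "$path" ]; then
            echo "$path"
            status=1
        fi
    done < "$1"
    exit "$status"
)

# Usage, e.g. before running `sliding_window build`:
#   missing_bins bin_paths.txt || echo "fix the paths listed above"
```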

With the index built, we can search for the reads in mini.fastq, allowing up to 2 errors:

sliding_window search --index index.out --query example_data/64/reads/mini.fastq --errors 2 --pattern 50 --output matches.out --overlap 10

Each line of the output consists of a read ID (in the toy example these are numbers) followed by the bins in which that read was found:

0       0,
1       0,
2       0,
3       0,
4       0,
16384   1,
...
1015812 62,
1032192 63,
1032193 63,
1032194 63,
1032195 63,
1032196 63,
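To summarise such a result file, the number of reads assigned to each bin can be counted with a short awk-based sketch. It assumes the format shown above (a read ID, whitespace, then a comma-terminated list of bin numbers) and skips lines without a bin list; the function name `reads_per_bin` is hypothetical.

```shell
# Hypothetical helper: count how many reads were assigned to each bin.
# Assumes each line is "<read ID> <whitespace> <bin>,<bin>,..." and ignores
# lines that carry no bin list.
reads_per_bin() {
    awk '
        NF >= 2 {
            n = split($2, bins, ",")
            for (i = 1; i <= n; i++)
                if (bins[i] != "") count[bins[i]]++
        }
        END { for (b in count) print b "\t" count[b] }
    ' "$1" | sort -n
}

# Usage, after the search above:
#   reads_per_bin matches.out
```

Each output line holds a bin number and the number of reads reported for it, sorted by bin.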

For a list of options, see the help pages:

sliding_window --help
sliding_window build --help
sliding_window search --help

Authorship and Copyright

The sliding window filter is based on Raptor. Raptor is being developed by Enrico Seiler, but also incorporates much work from other members of SeqAn.

Citation

In your academic work (including comparisons and pipelines), please cite:

  • Seiler, E. et al. (2020). Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences. bioRxiv 2020.10.08.330985. doi: https://doi.org/10.1101/2020.10.08.330985

Supplementary

The subdirectory util contains applications and scripts related to the paper.

License

Raptor is open source software. However, certain conditions apply when you (re-)distribute and/or modify Raptor; please see the license.

Contributors

eseiler, eaasna, mitradarja
