seqqs

Seqqs (SEQuence Quality Statistics, pronounced "seeks") is a C library for quickly gathering quality statistics from sequence files. It's mostly adapted from qrqc, except it is designed to be run in quality processing pipelines. It can also be compiled as a dynamic library and called from other programs.

Seqqs is meant to check nucleotide composition, k-mer abundance, length distribution, and base quality at many points in a quality control pipeline. Why might you want to do this? Quality control programs can misbehave: don't trust your tools or data (the "golden rule of bioinformatics"). In several cases, I've seen pathologically bad data quality lead to a program severely misbehaving too. If uncaught, this may lead to confounding during downstream analysis, as one sequencing sample of initially poor quality may be overtrimmed or have many reads removed (I've seen this in practice, along with the statistical consequences). It's much easier to put Seqqs in your pipeline and quickly check the results to ensure both your data and tools are working as they should be.

Requirements and Installation

Seqqs can be compiled using GCC or Clang; compilation during development used the latter. Seqqs relies on Heng Li's kseq.h and khash.h, which are bundled with the source.

Seqqs requires Zlib, which can be obtained at http://www.zlib.net/.

To install, just run make in the seqqs directory.

Usage

Documentation is internal; just compile and run ./seqqs. Here are some usage examples.

Without any options, seqqs works like so:

cat in.fq | seqqs -
# or:
seqqs in.fq

Note that - tells seqqs to read from standard input. Without any options, this will create qual.txt, nucl.txt, and len.txt.

seqqs is designed to be placed in pipelines and act as a quality gathering step without disrupting the flow (similar to Unix tee). To enable this, use -e (for emit):

cat in.fq | seqqs -e -
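The tee-like behavior can be illustrated with a minimal Python sketch. It is not seqqs's C implementation: the function name tee_fastq is hypothetical, and the sketch assumes plain four-line FASTQ records. It gathers a read length distribution while passing every record through unchanged:

```python
import io
from collections import Counter


def tee_fastq(instream, outstream):
    """Copy 4-line FASTQ records from instream to outstream unchanged.

    Returns a Counter mapping read length -> number of reads seen.
    Assumes each record is exactly four lines (no wrapped sequences).
    """
    lengths = Counter()
    while True:
        record = [instream.readline() for _ in range(4)]
        if not record[0]:  # empty string means end of input
            break
        lengths[len(record[1].rstrip("\n"))] += 1
        outstream.writelines(record)  # emit the record untouched
    return lengths


# Small in-memory demonstration with two made-up reads.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
out = io.StringIO()
lengths = tee_fastq(io.StringIO(fastq), out)
print(sorted(lengths.items()))  # [(4, 1), (6, 1)]
```

Because the records are written back exactly as read, a step like this can sit between any two tools in a pipeline without changing what the downstream tool sees.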

For complex quality pipelines, seqqs can also take a prefix argument (-p) to prevent overwriting output files. If we wanted to create a workflow that gathers quality statistics on the raw input, trims using Heng Li's seqtk trimfq command, and then gathers quality statistics on the trimmed output, we could use:

cat in.fq | seqqs -e -p raw-$(date +%F) - | seqtk trimfq - | \
  seqqs -e -p trimmed-$(date +%F) - > trimmed.fq

seqqs can also gather positional k-mers, which can help in discovering enrichment due to positional contaminants like untrimmed barcodes and adapters. As a quick aside: you should check for these! Many sequencing data sets are plagued by positional contaminants, especially as barcoding grows in popularity. The k-mer option is -k <n>, where n is the k-mer size:

cat in.fq | seqqs -k 6 -
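To make the idea of positional k-mers concrete, here is a small Python sketch of the counting itself. It is an illustration only: the function name positional_kmers is hypothetical, positions are 0-based by assumption, and seqqs's own output layout is not described here.

```python
from collections import Counter, defaultdict


def positional_kmers(reads, k):
    """Return {position: Counter of k-mers starting at that 0-based position}."""
    counts = defaultdict(Counter)
    for seq in reads:
        # Slide a window of width k along the read.
        for pos in range(len(seq) - k + 1):
            counts[pos][seq[pos:pos + k]] += 1
    return counts


# Three made-up reads that share the same first four bases,
# as an untrimmed barcode would.
reads = ["ACGTAA", "ACGTCC", "ACGTGG"]
counts = positional_kmers(reads, 4)
print(counts[0].most_common(1))  # [('ACGT', 3)]
```

A k-mer that dominates one position across most reads, as ACGT does at position 0 here, is the kind of signal that suggests a barcode or adapter was left in place.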

seqqs can also work with interleaved paired-end files. The statistics gathered are no different, but two output files (one for each set of reads in a pair) are created. These are named like the default files, except that they have _1.txt and _2.txt suffixes. seqqs will also warn if pairing looks incorrect. If -s (strict) is set, seqqs will exit with an error if interleaved pairs do not have the same name (ignoring /1 and /2 and excluding the comment).
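The name comparison described above can be sketched in Python. This is one reading of the rule, not seqqs's actual code: the helper names base_name and pair_names_match are hypothetical, and the comment is assumed to be everything after the first whitespace in the header line.

```python
def base_name(header):
    """Reduce a FASTQ header to the name used for pair matching.

    Drops the leading '@', the comment (assumed to be everything after
    the first whitespace), and a trailing /1 or /2.
    """
    parts = header.lstrip("@").split(None, 1)
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pair_names_match(header1, header2):
    """True if two interleaved records appear to belong to the same pair."""
    return base_name(header1) == base_name(header2)


print(pair_names_match("@read7/1 lane=3", "@read7/2 lane=3"))  # True
print(pair_names_match("@read7/1", "@read8/2"))                # False
```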

Using Output

All tables are tab-delimited with headers, and can be easily analyzed by a program of your choice. qrqc will soon have functions to gather this output and make plots from it.
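As one possible starting point, a tab-delimited table with a header row can be loaded using Python's standard csv module. The column names in the sample below (pos, A, C) are made up for illustration; check the header of your own output files for the real ones.

```python
import csv
import io


def read_table(handle):
    """Parse a tab-delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up stand-in for an output file; the real column names may differ.
example = "pos\tA\tC\n0\t10\t5\n1\t7\t8\n"
rows = read_table(io.StringIO(example))
print(rows[1]["C"])  # 8
```

For a real file, open it and pass the handle instead, for example with open("nucl.txt", newline="") as fh: rows = read_table(fh). Values come back as strings, so convert them with int() or float() before doing arithmetic.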

Todo

  • BAM support

Contributors

vsbuffalo, hans-vg
