auto_barcode's Introduction

Barcode Splitter, Trimmer, and Stat Generator

Usage Summary

The following can be accessed by running ./barcode_split_trim.pl --help:

USAGE
  barcode_split_trim.pl [options] -b BARCODE IN.FASTQ

DESCRIPTION
  Extracts fastq reads for specified barcode(s) from one or multiple FASTQ files.
  Use wildcards ('*') to match multiple input FASTQ files.

OPTIONS
  -h, --help                 Print this help message
  -v, --version              Print version number
  --id                       Sample or Experiment ID
  -b, --barcode   BARCODE    Specify barcode or file w/ list of barcodes to extract
  -l, --list                 Indicate BARCODE is a list of barcodes in a file
  --indexed                  Samples designated by index sequences
                              Alternate read FQ files and index FQ files
  -m, --mismatches           Minimum number of mismatches allowed in barcode sequence [0]
  -n, --notrim               Split without trimming barcodes
  -st, --stats               Output summary stats only (w/o creating fastq files)
  -o, --outdir    DIR        Output file is saved in the specified directory
                              (or same directory as IN.FASTQ, if --outdir is not used)

NAMING OPTIONS
  --autoprefix               Append FASTQ file name onto output
  --autosuffix               Append barcode onto output
  -p, --prefix    PREFIX     Add custom prefix to output
  -su, --suffix   SUFFIX     Add custom suffix to output

OUTPUT
  An output file in fastq format is written for each barcode to the directory
  containing IN.FASTQ, unless an output directory is specified.
  The default name of the output file is SAMPLE_ID.fq. The output names can be
  customized using the Naming Options.

  Log files and a summary plot that aid in identification of problem libraries.

EXAMPLES
  barcode_split_trim.pl -i Charlotte -b GACTG kitten_DNA.fq
  barcode_split_trim.pl --id BigExperiment --barcode barcode.file --list *_DNA.fastq
  barcode_split_trim.pl --help
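
The --mismatches option implies tolerant barcode matching. Below is a minimal Python sketch of one way such matching could work. It assumes that mismatches are counted as Hamming distance over the barcode length, that the barcode sits at the start of the read, and that a read fitting more than one barcode is treated as unmatched. The function names are hypothetical, and this is an illustration rather than the script's actual implementation.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(read_seq, barcodes, max_mismatches=0):
    """Return the single barcode within max_mismatches of the read start, else None."""
    hits = []
    for bc in barcodes:
        prefix = read_seq[:len(bc)]
        if len(prefix) == len(bc) and hamming(prefix, bc) <= max_mismatches:
            hits.append(bc)
    # Assumption: a read that fits more than one barcode is treated as unmatched.
    return hits[0] if len(hits) == 1 else None


barcodes = ["TACGC", "ATCGT", "ATTCC"]
exact = match_barcode("TACGCAAGGT", barcodes)          # barcode matches exactly
strict = match_barcode("NACGCAAGGT", barcodes)         # one mismatch, none allowed
tolerant = match_barcode("NACGCAAGGT", barcodes, 1)    # one mismatch allowed
print(exact, strict, tolerant)
```

With the default of zero mismatches, only exact barcodes match; allowing one mismatch recovers the read whose first base was not called.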

Detailed Usage Example

A pair of sample FASTQ and barcode files is provided in the sample_files folder. They have been used to generate the output FASTQ, log, and summary plot files (sample_files/output). In this example, I show what these files look like and how to generate them.

Barcode splitting/trimming

The sample FASTQ file (sample_files/sequences.fq) contains 25,000 sequence reads (100,000 lines, at four lines per FASTQ record) from a pool of 14 samples. The corresponding barcodes (sample_files/barcode.list) are:

TACGC   marmot1
ATCGT   marmot2
ATTCC   marmot3
CCAGC   marmot4
GATAC   aardvark1
GGATG   aardvark2
TCGAT   tarsier1
AGCGC   tarsier2
CCAAT   tarsier3
CGCTG   tarsier4
CTAGT   puffin1
TAGAG   puffin2
TAGTC   puffin3
GGTCA   puffin4
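
The barcode list pairs each barcode with a sample ID, one pair per line. Here is a small Python sketch of reading such a file, assuming whitespace-separated columns as shown above. read_barcode_list is a hypothetical helper, not part of barcode_split_trim.pl.

```python
import io


def read_barcode_list(handle):
    """Map barcode -> sample ID from lines like 'TACGC   marmot1'."""
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 'BARCODE ID', got: {line!r}")
        barcode, sample_id = fields[0].upper(), fields[1]
        if barcode in table:
            raise ValueError(f"duplicate barcode: {barcode}")
        table[barcode] = sample_id
    return table


# In-memory stand-in for the first lines of sample_files/barcode.list.
example = io.StringIO("TACGC   marmot1\nATCGT   marmot2\nGATAC   aardvark1\n")
barcode_to_id = read_barcode_list(example)
print(barcode_to_id)
```

Rejecting duplicate barcodes up front avoids silently merging two samples into one output file.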

To split sample_files/sequences.fq with the barcodes in sample_files/barcode.list, we can run:

./barcode_split_trim.pl \
  --id demo \
  --barcode sample_files/barcode.list \
  --list \
  --outdir sample_files/output \
  sample_files/sequences.fq

This results in a FASTQ file for each barcode (barcodes are trimmed), a single FASTQ file containing all reads with unmatched barcodes (barcodes are left intact), two log files, and a summary plot:

# FASTQ file for each barcode
aardvark1.fq
aardvark2.fq
marmot1.fq
marmot2.fq
marmot3.fq
marmot4.fq
puffin1.fq
puffin2.fq
puffin3.fq
puffin4.fq
tarsier1.fq
tarsier2.fq
tarsier3.fq
tarsier4.fq

# FASTQ file containing all unmatched barcodes
unmatched.fq_sequences.bar_barcode.list.fq

# log files
log_barcode_counts.fq_sequences.bar_barcode.list
log_barcodes_observed.fq_sequences.bar_barcode.list

# plot summary
demo.barcodes.png
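
Conceptually, each read is assigned by the barcode at its start, and matched reads lose that barcode from both the sequence and the quality line. The Python sketch below illustrates the idea on in-memory records. It assumes four-line FASTQ records, exact matching, and barcodes of equal length at the start of the read. The names are hypothetical and the code is a simplified illustration, not the script's own logic.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def split_and_trim(records, barcode_to_id, trim=True):
    """Group reads by sample ID; reads without a known barcode go to 'unmatched'."""
    length = len(next(iter(barcode_to_id)))  # assumes all barcodes share one length
    bins = defaultdict(list)
    for header, seq, plus, qual in records:
        sample_id = barcode_to_id.get(seq[:length])
        if sample_id is None:
            bins["unmatched"].append((header, seq, plus, qual))  # left intact
        elif trim:
            # Trim the barcode from the sequence and its quality scores together.
            bins[sample_id].append((header, seq[length:], plus, qual[length:]))
        else:
            bins[sample_id].append((header, seq, plus, qual))
    return bins


# Two made-up reads: the first starts with the marmot1 barcode, the second does not.
fastq = io.StringIO(
    "@read1\nTACGCAAGGTT\n+\nIIIIIHHHHHH\n"
    "@read2\nGGGCAACGTAC\n+\nIIIIIGGGGGG\n"
)
bins = split_and_trim(read_fastq(fastq), {"TACGC": "marmot1", "ATCGT": "marmot2"})
for name, reads in sorted(bins.items()):
    print(name, reads)
```

Passing trim=False mirrors the effect described for --notrim: reads are still binned by barcode, but the barcode stays on the read.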

Log files

The first log file is the barcode splitting summary (sample_files/output/log_barcode_counts.fq_sequences.bar_barcode.list) and returns some basic stats with an emphasis on expected barcodes:

Barcode splitting summary for:
  sample_files/sequences.fq
---------------------------
matched   24,459 97.8%
unmatched    541  2.2%
---------------------------
barcodes    14
min      1,260   5.0%
max      2,200   8.8%
mean     1,747   7.0%
median   1,735.5 6.9%
---------------------------
id        barcode count percent
aardvark1 GATAC   1,595 6.4%
aardvark2 GGATG   1,839 7.4%
marmot1   TACGC   1,758 7.0%
marmot2   ATCGT   1,504 6.0%
marmot3   ATTCC   1,614 6.5%
marmot4   CCAGC   1,468 5.9%
puffin1   CTAGT   1,476 5.9%
puffin2   TAGAG   1,260 5.0%
puffin3   TAGTC   2,102 8.4%
puffin4   GGTCA   1,964 7.9%
tarsier1  TCGAT   2,010 8.0%
tarsier2  AGCGC   2,200 8.8%
tarsier3  CCAAT   1,713 6.9%
tarsier4  CGCTG   1,956 7.8%
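
The figures in this summary follow directly from the per-barcode counts. As a check, this Python sketch recomputes them from the counts in the table above, assuming that percentages are taken relative to all reads (matched plus unmatched).

```python
from statistics import mean, median

# Per-sample counts copied from the summary table above.
counts = {
    "aardvark1": 1595, "aardvark2": 1839, "marmot1": 1758, "marmot2": 1504,
    "marmot3": 1614, "marmot4": 1468, "puffin1": 1476, "puffin2": 1260,
    "puffin3": 2102, "puffin4": 1964, "tarsier1": 2010, "tarsier2": 2200,
    "tarsier3": 1713, "tarsier4": 1956,
}
unmatched = 541

matched = sum(counts.values())
total = matched + unmatched


def pct(n):
    """Percentage of all reads, to one decimal place."""
    return f"{100 * n / total:.1f}%"


values = list(counts.values())
print(f"matched   {matched:,} {pct(matched)}")
print(f"unmatched {unmatched:,} {pct(unmatched)}")
print(f"barcodes  {len(values)}")
print(f"min       {min(values):,} {pct(min(values))}")
print(f"max       {max(values):,} {pct(max(values))}")
print(f"mean      {mean(values):,.0f} {pct(mean(values))}")
print(f"median    {median(values):,} {pct(median(values))}")
```

The median falls between two counts (1,713 and 1,758) because the number of barcodes is even, which is why it is reported as 1,735.5.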

The second log file returns counts and percentages for all observed barcodes (both expected and unexpected). Below are the first 20 (of 215) observed barcodes from this log (sample_files/output/log_barcodes_observed.fq_sequences.bar_barcode.list):

barcode count percent id
AGCGC   2,200 8.8%     tarsier2
TAGTC   2,102 8.4%      puffin3
TCGAT   2,010 8.0%     tarsier1
GGTCA   1,964 7.9%      puffin4
CGCTG   1,956 7.8%     tarsier4
GGATG   1,839 7.4%    aardvark2
TACGC   1,758 7.0%      marmot1
CCAAT   1,713 6.9%     tarsier3
ATTCC   1,614 6.5%      marmot3
GATAC   1,595 6.4%    aardvark1
ATCGT   1,504 6.0%      marmot2
CTAGT   1,476 5.9%      puffin1
CCAGC   1,468 5.9%      marmot4
TAGAG   1,260 5.0%      puffin2
GGGCA      32 0.1%
GGTCC      27 0.1%
GGATT      15 0.1%
NAGAG      11 0.0%
NAGTC      11 0.0%
TCTAT      11 0.0%
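
A tally like this can be built by counting the leading bases of every read and looking each one up in the barcode list. Below is a minimal Python sketch, assuming 5-base barcodes at the start of each read. The function name and the sequences are made up for illustration.

```python
from collections import Counter


def tally_barcodes(sequences, barcode_to_id, length=5):
    """Return (barcode, count, percent, id) rows sorted by descending count."""
    observed = Counter(seq[:length] for seq in sequences)
    total = sum(observed.values())
    rows = []
    for barcode, count in observed.most_common():
        sample_id = barcode_to_id.get(barcode, "")  # blank ID for unexpected barcodes
        rows.append((barcode, count, 100 * count / total, sample_id))
    return rows


# Made-up reads: three carry an expected barcode, one has an N in the first base.
reads = ["TAGAGACGT", "TAGAGTTGA", "NAGAGCCGT", "GGTCAAATT"]
rows = tally_barcodes(reads, {"TAGAG": "puffin2", "GGTCA": "puffin4"})
for barcode, count, percent, sample_id in rows:
    print(f"{barcode} {count:>5,} {percent:.1f}% {sample_id:>10}")
```

In the log above, NAGAG and NAGTC differ from the expected barcodes TAGAG and TAGTC only by an uncalled first base, which is the kind of near miss the --mismatches option is presumably meant to rescue.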

Plotting barcode splitting summary

The logs are useful, but when many barcodes or experiments are analyzed at once, irregularities and problematic barcodes can be hard to spot. To make them easier to see, we can make a barcode frequency plot using R.

For this plot (saved to sample_files/output/demo.barcodes.png), barcodes are split into two groups: those that match an expected barcode and those that are unmatched. A boxplot of the observed barcode frequencies is drawn for each group, with the individual frequencies jitter-plotted on top.

Version 2.1.2

auto_barcode's People

Contributors

mfcovington

auto_barcode's Issues

help output needs minor update

If one runs the Perl script without proper input, a nice help message is returned. There is a small error in it, however.

The Examples do not tell you to include the --id tag, but that appears to be mandatory. The README is correct.
