msettles / dbcAmplicons
Analysis of Double Barcoded Illumina Amplicon Data
License: GNU Lesser General Public License v3.0
Our collaborators used 'dbcamplicons preprocess' in a 2018 analysis to demultiplex dual-barcoded sequences. We attempted to download dbcAmplicons but can no longer install it successfully (we encounter numerous Python errors).
Is this software still being maintained? If not, what alternative demultiplexing workflows would you suggest? Thank you.
Rather than reverse complementing barcode1 automatically, what do you think about a couple of switches (-rc1, -rc2 perhaps) that would allow flexibly reverse complementing barcode 1 or 2 depending on the library (or the spreadsheet provided by a client)? Currently it is a little confusing to check that barcode1 indeed matches the barcode sequences in the fastq files, and then have to reverse complement that barcode sequence in order to get dbcAmplicons to work.
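For reference, a minimal sketch of how the proposed -rc1/-rc2 switches might be applied per barcode column (function names here are illustrative, not dbcAmplicons' actual API):

```python
# Sketch only: none of these names exist in dbcAmplicons today.
def reverse_complement(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def normalize_barcode(seq, rc=False):
    # With an -rc switch set, the barcode column is flipped to match
    # the orientation actually observed in the fastq files.
    return reverse_complement(seq) if rc else seq
```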
We recently ran into a tricky problem where some primer names were accidentally left off the SampleSheet. The preprocess log showed a large drop-off between the number of reads identified to barcode and primer and the number successfully assigned to a sample, which helped tip us off to the problem. However, the Identified_Barcodes table listed all of the reads associated with each sample+barcode, even if the barcode wasn't listed under that sample (a significant difference between the total reads listed in the preprocess log and in the Identified_Barcodes table also helped with troubleshooting). It would be very useful to have one additional table generated by preprocess which listed:
Sample | PrimersExpected | PrimersIdentified | ReadsByBarcode | ReadsByBarcodeAndPrimer |
---|---|---|---|---|
sample1 | 96 | 96 | 1000000 | 800000 |
sample2 | 48 | 96 | 1000000 | 50000 |
This would make troubleshooting issues with samplesheet formatting much easier.
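A rough sketch of how such a per-sample summary could be tallied during preprocess (field names mirror the proposed columns; this is not existing dbcAmplicons code):

```python
from collections import defaultdict

def summarize(assignments, expected_primers):
    """assignments: (sample, primer) per read, primer None if unidentified.
    expected_primers: sample -> primer names listed on the sheet."""
    by_barcode = defaultdict(int)          # ReadsByBarcode
    by_barcode_primer = defaultdict(int)   # ReadsByBarcodeAndPrimer
    primers_seen = defaultdict(set)        # PrimersIdentified
    for sample, primer in assignments:
        by_barcode[sample] += 1
        if primer is not None:
            by_barcode_primer[sample] += 1
            primers_seen[sample].add(primer)
    return [(s, len(expected_primers.get(s, ())), len(primers_seen[s]),
             by_barcode[s], by_barcode_primer[s])
            for s in sorted(by_barcode)]
```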
Hello,
I notice that the update history for version 0.9.0 states that single barcoded reads can be processed, but I can't figure out how to do that.
Is there an option I need to set or a way to set up the input files to do this?
I have paired-end reads with a third read containing the barcodes.
For the Barcode file I put just two columns (BarcodeID, BarcodeSeq).
I then ran preprocess with --R1 Seq_R1.fastq --R2 Seq_R2.fastq --BC1 Barcodes.fastq
I get the following Traceback, which I interpret as the program missing the other barcode read:
File "build/bdist.linux-x86_64/egg/dbcAmplicons/preprocess_app.py", line 161, in start
self.run = FourReadIlluminaRun(fastq_file1, fastq_file2, fastq_file3, fastq_file4)
File "build/bdist.linux-x86_64/egg/dbcAmplicons/illuminaRun.py", line 48, in __init__
self.fbc2.append(misc.infer_read_file_name(fread, "3"))
File "build/bdist.linux-x86_64/egg/dbcAmplicons/misc.py", line 104, in infer_read_file_name
raise Exception("Error inferring read " + seakread + " from read 1, found " + str(len(read)) + " suitable matches.")
Exception: Error inferring read 3 from read 1, found 0 suitable matches.
Thanks for the help!
join is the only subcommand that requires you to specify both -1 and -2; allow it to search for -2 when only -1 is specified.
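In outline, the -2 file could be inferred from -1 by swapping the read token, similar in spirit to misc.infer_read_file_name; the sketch below assumes standard Illumina _R1_/_R2_ naming:

```python
import re

def infer_r2(r1_path):
    # Replace the first _R1 token (followed by "_" or ".") with _R2.
    candidate = re.sub(r"_R1(?=[_.])", "_R2", r1_path, count=1)
    if candidate == r1_path:
        raise ValueError("could not infer an R2 file from %s" % r1_path)
    return candidate
```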
Using an external path to the RDP classifier creates problems if the user is not familiar with the path to RDP. Determine whether the classifier.jar file can be packaged within the dbcAmplicons app and the classifier invoked from within it.
Support output of BIOM-formatted files to facilitate usage of downstream software.
http://biom-format.org/
This can be accomplished using their Python package.
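A minimal dense BIOM 1.0 (JSON) writer could look like the sketch below. The field names follow the BIOM 1.0 format specification rather than the biom-format package API, and the function itself is only an illustration, not proposed dbcAmplicons code:

```python
import datetime
import json

def to_biom_json(counts, observation_ids, sample_ids):
    """counts: dense matrix as a list of rows (observations x samples)."""
    table = {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "format_url": "http://biom-format.org",
        "type": "OTU table",
        "generated_by": "dbcAmplicons abundance (sketch)",
        "date": datetime.datetime.now().isoformat(),
        "matrix_type": "dense",
        "matrix_element_type": "int",
        "shape": [len(observation_ids), len(sample_ids)],
        "rows": [{"id": o, "metadata": None} for o in observation_ids],
        "columns": [{"id": s, "metadata": None} for s in sample_ids],
        "data": counts,
    }
    return json.dumps(table)
```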
Cleaning up.
A fatal error was encountered.
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/dbcAmplicons/abundance_app.py", line 244, in start
sampleList_md = [{'primers': ";".join(primers[v])} for v in sampleList]
TypeError: sequence item 0: expected string, NoneType found
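One possible guard for this TypeError (a sketch of one fix, not necessarily the intended one) is to substitute missing primer entries before joining:

```python
def primer_string(primers, sample):
    # primers[sample] may be None or contain None entries, which is
    # what triggers the TypeError in abundance_app above.
    values = primers.get(sample) or []
    return ";".join(p if p is not None else "" for p in values)
```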
Allow one to go through the pipeline along a single path and then split at the very end.
Using the following call:
dbcAmplicons/scripts/python/convert2ReadTo4Read.py -1 data/003237_H-2D_S30_R1_filtered.fastq.gz -2 data/003237_H-2D_S30_R2_filtered.fastq.gz --debug
I get the error:
ERROR:[TwoSequenceReadSet] Unknown error occured generating four read set
Cleaning up.
A fatal error was encountered.
Traceback (most recent call last):
File "dbcAmplicons/scripts/python/convert2ReadTo4Read.py", line 48, in start
self.run_out.addRead(read.getFourReads(bc1_length=barcode1, bc2_length=barcode2))
File "build/bdist.linux-x86_64/egg/dbcAmplicons/sequenceReads.py", line 355, in getFourReads
raise Exception("string in the barcode is not %s characters" % str(bc1_length + bc2_length))
Exception: string in the barcode is not 16 characters
When I try to add the -p or -q flags the error remains the same.
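A pre-flight check along these lines might surface the length mismatch more clearly (a sketch; the default lengths are assumptions, not the script's actual defaults):

```python
def check_barcode_length(barcode, bc1_length=8, bc2_length=8):
    # Mirrors the check in sequenceReads.getFourReads: the combined
    # barcode string must be exactly bc1_length + bc2_length long.
    expected = bc1_length + bc2_length
    if len(barcode) != expected:
        raise ValueError("barcode %r is %d characters, expected %d"
                         % (barcode, len(barcode), expected))
```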
Even if there are no primer sequences in the samples and no primers identified in the sample sheet, validation requires a primer sheet. A dummy sheet will pass validation.
Generate a new Python script in scripts/python that will take data processed by dbcAmplicons and split it by sample into a form suitable for upload to the SRA.
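In outline, such a script could bucket preprocessed records by the SampleID carried in the read header (the header layout assumed by get_sample is up to the caller; this is a sketch, not the script itself):

```python
from collections import defaultdict

def split_by_sample(records, get_sample):
    """records: iterable of fastq records (header, seq, qual);
    get_sample: callable mapping a header to its SampleID."""
    buckets = defaultdict(list)
    for record in records:
        buckets[get_sample(record[0])].append(record)
    # One per-sample fastq would then be written per key for SRA upload.
    return buckets
```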
Change the output to stdout to summarize the total percentage of reads identified in a run and their respective allocations to every project, including "unidentified" reads.
Currently "dbcAmplicons preprocess" processes all projects within a single set of fastq files in the same way. However, some clients prefer to receive files with primers intact, others wish to have primers removed, and some would prefer 4-read format (R1, R2, I1, I2). It would be great if a more sophisticated configuration option could be provided to generate the correct output per-project.
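One way to express that configuration would be a per-project options map (purely a proposal sketch; none of these option names exist in dbcAmplicons today):

```python
# Hypothetical per-project output settings keyed by project name.
project_output = {
    "ProjectA": {"keep_primers": True,  "four_read": False},
    "ProjectB": {"keep_primers": False, "four_read": False},
    "ProjectC": {"keep_primers": False, "four_read": True},  # R1/R2/I1/I2
}

def output_mode(project):
    # Fall back to today's uniform behavior for unlisted projects.
    return project_output.get(project,
                              {"keep_primers": False, "four_read": False})
```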
Fix parsing of FLASH output.
For two reads, add support for clipping of reads due to quality.
Test whether clipping reads improves the quality of classification.
phoebe11@c11-42:~/MattsPipeline/CarolynsQiime/CQiime_metadata$ dbcAmplicons validate -B Carolyn_M_dbcBarcodeTable.txt -S Carolyn_M_SampleSheet2.txt --debug
/share/apps/python-2.7.4/lib/python2.7/site-packages/pkg_resources.py:1031: UserWarning: /home/phoebe11/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
warnings.warn(msg, UserWarning)
A newer version (0.8.5) of dbcAmplicons is available at https://github.com/msettles/dbcAmplicons
barcode table length: 587
Cleaning up.
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/dbcAmplicons/validate_app.py", line 114, in start
prTable = primerTable(primerFile)
File "build/bdist.linux-x86_64/egg/dbcAmplicons/primers.py", line 39, in __init__
prfile = open(primerfile, 'r')
TypeError: coercing to Unicode: need string or buffer, NoneType found
A more graceful and helpful error message is needed here.
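Something like the following check (a sketch of the kind of guard meant, not the actual fix) would fail fast with an actionable message:

```python
def open_primer_file(primerfile):
    # primers.py currently passes None straight to open(), producing
    # the TypeError above whenever -P was omitted.
    if primerfile is None:
        raise SystemExit("ERROR: a primer file is required (-P); none was given")
    return open(primerfile, "r")
```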
Wrong order of files.
dbcAmplicons preprocess -B barcodeTable2.txt -P primerTable.txt -S Judelson-sample.txt -1 Illumina_RawData/Undetermined_S0_L001_R1_001.fastq.gz -2 Illumina_RawData/Undetermined_S0_L001_I1_001.fastq.gz -3 Illumina_RawData/Undetermined_S0_L001_R2_001.fastq.gz -4 Illumina_RawData/Undetermined_S0_L001_I2_001.fastq.gz
Error message:
/home/msettles/Python_venv/local/lib/python2.7/site-packages/pkg_resources.py:991: UserWarning: /home/jli/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
warnings.warn(msg, UserWarning)
A newer version (0.8.1-20160418) of dbcAmplicons is available at https://github.com/msettles/dbcAmplicons/tree/develop
barcode table length: 71
primer table length P5 Primer Sequences:8, P7 Primer Sequences:9
sample table length: 1, and 1 projects.
Cleaning up.
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/dbcAmplicons/preprocess_app.py", line 98, in start
bcsuccesscount += read.assignBarcode(bcTable, barcodeMaxDiff) # barcode
File "build/bdist.linux-x86_64/egg/dbcAmplicons/sequenceReads.py", line 130, in assignBarcode
bc2, bc2Mismatch = barcodeDist(bcTable.getP5(), self.bc_2, max_diff)
File "build/bdist.linux-x86_64/egg/dbcAmplicons/sequenceReads.py", line 31, in barcodeDist
bc_i, bc_mismatch = editdist.hamming_distance_list(b_l, b_2, max_diff+1)
SystemError: Bad Arguments
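A cheap sanity check could catch this kind of argument swap before it reaches the C extension: index reads are barcode-length, far shorter than the biological reads (the threshold below is an assumption):

```python
def looks_like_index_read(seq, max_barcode_len=12):
    # -2/-4 should hold short index reads; -1/-3 the long biological
    # reads. A swapped argument order makes this test fail loudly.
    return len(seq) <= max_barcode_len
```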
The default output of dbcAmplicons preprocess is documented in the DBC_ampliconsUserManual as being formatted like:
{Sequence ID (Illumina header) } : {SampleID} : {PrimerPairID} {Barcode1|#differences1|Barcode2|#differences2} {PrimerForward|#differences|bpTrimmed}
However, this format is only used when "--keepPrimers" is not passed to preprocess.
Example without "--keepPrimers" (format described in the manual):
@M01380:62:000000000-B547W:1:1102:20354:1000 1:N:0:AG_5856:ITS3_ITS4 NTATCGCT|1|NTCTCTAT|1 ITS3_CS1|1|20|
Example with "--keepPrimers" (differs from what is described in the manual):
@M01380:62:000000000-B547W:1:1102:12519:1279 1:N:0:AG_1105 ACGAATTC|0|CAGGACGT|0
This might be intended behavior; however, I couldn't find it documented in the manual, and it caused issues with a downstream pipeline. Additionally, because no information is reported about the target-specific primer (which primer was identified, number of differences, bp trimmed), it isn't clear whether dbcAmplicons is still looking for the primer within the read.
I think the preferred behavior would be to still report the primer and number of mismatches, along with 0 bp trimmed.
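For illustration, a parser of the documented annotation might look like this (a sketch based on the manual's description; under --keepPrimers the final primer field is absent, which is what tripped up our pipeline):

```python
def parse_annotation(comment):
    """Parse a preprocess header comment such as
    '1:N:0:AG_5856:ITS3_ITS4 NTATCGCT|1|NTCTCTAT|1 ITS3_CS1|1|20|'."""
    fields = comment.split(" ")
    ids = fields[0].split(":")
    info = {"sample": ids[3],
            "primer_pair": ids[4] if len(ids) > 4 else None,
            "barcodes": fields[1].split("|")}
    if len(fields) > 2:  # primer field present (without --keepPrimers)
        info["primer"] = fields[2].split("|")
    return info
```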
Try to replicate, then try to guard against it.
If the expected header is not identified, error out.
Removing spaces will prevent downstream errors.
Specifically, the RDP os.path.isfile check fails when the path is referenced with a tilde.
Hello! I've gotten almost all the way through the pipeline with no problem until the abundance step. When I run:
dbcAmplicons abundance -S SampleSheet.txt -O test-results/16sV4 -F NWRD001_1.classified.fixrank --biom > abundance.16sV4.log
I get this output, but all of my output files are empty. I checked with the HPC support people at my institution and they said that the error messages about NumPy were safe to ignore:
/global/home/users/rpduncan/src/dbcA_virtualenv/lib/python2.7/site-packages/scipy/special/__init__.py:640: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
from ._ufuncs import *
(the same RuntimeWarning repeats for dozens of other scipy and pandas modules, interleaved with the progress lines below)
processed 100000 total lines, 38930.0 lines/second
processed 200000 total lines, 41795.0 lines/second
processed 300000 total lines, 42559.0 lines/second
processed 400000 total lines, 43132.0 lines/second
processed 500000 total lines, 43520.0 lines/second
processed 600000 total lines, 43753.0 lines/second
processed 700000 total lines, 43928.0 lines/second
processed 800000 total lines, 44070.0 lines/second
processed 900000 total lines, 44183.0 lines/second
processed 1000000 total lines, 44272.0 lines/second
processed 1100000 total lines, 44343.0 lines/second
processed 1200000 total lines, 44403.0 lines/second
processed 1300000 total lines, 44431.0 lines/second
processed 1400000 total lines, 44371.0 lines/second
processed 1500000 total lines, 44361.0 lines/second
Writing output
Writing json formatted biom file to: results/16sV4.biom
Writing abundance file to: results/16sV4.abundance.txt
Writing proportions file to: results/16sV4.proportions.txt
finished in 0.57 minutes
Cleaning up.