qiime2 / q2-demux

License: BSD 3-Clause "New" or "Revised" License

Python 93.23% HTML 1.30% JavaScript 5.08% Makefile 0.18% TeX 0.21%

q2-demux's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-demux's People

Contributors

Watchers

Forkers

gregcaporaso jakereps ebolyen jairideout nervous-laughter maxvonhippel kestrelgorlick mahermassoud nbokulich chriskeefe wasade oddant1 emford andrewsanchez keegan-evans lizgehret cherman2 valentynbez hagenjp

q2-demux's Issues

Hover-over glitches in boxplots

Bug Description
When the boxes are not visible for a particular position, the hovering over doesn't work unless you hover over the ends of the whiskers. It would be great to make the entire element "hoverable" so that users don't have to fiddle with the mouse to see the information in the table.

Screenshots
This gif shows what I mean:

summarize should fail if sequences are different length

We might ultimately want to support different length sequences (in which case each box would somehow need to indicate how many sequences were summarized for that position), but that's not immediately necessary and the results are misleading if you pass in sequences with different lengths. I think it should therefore fail (or include a danger message, but probably fail) if the input sequences are not all the same length.
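A minimal sketch of such a check, assuming the sequences are available as plain Python strings. `check_equal_lengths` is a hypothetical helper for illustration, not part of q2-demux:

```python
def check_equal_lengths(sequences):
    """Return the shared sequence length, or raise if lengths differ.

    Hypothetical helper: `sequences` is any iterable of strings.
    """
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError(
            "Sequences have differing lengths (%s); positional summaries "
            "would be misleading."
            % ", ".join(str(n) for n in sorted(lengths)))
    # An empty input has no length to report; return 0 in that case.
    return lengths.pop() if lengths else 0
```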

empty samples should be (optionally) filtered after `subsample-*` actions

@brett-van-tussler discovered that if a sample has 0 reads following qiime demux subsample-paired (or subsample-single), the sample will still be retained in the output. There should be an option to drop these samples from the resulting artifact.

A workaround is as follows (once #151 is merged):

# subsample the Moving Pictures demux.qza such that all samples will likely have zero read count
qiime demux subsample-single --i-sequences demux.qza --p-fraction 0.0000000000000000000000001 --o-subsampled-sequences demux-empty.qza

# generate counts of reads on a per-sample basis (this is what we're waiting on #151 for)
qiime demux tabulate-read-counts --i-sequences demux-empty.qza --o-counts demux-empty-counts.qza

# tabulate those into a .qzv, just to confirm counts are zero
qiime metadata tabulate --m-input-file demux-empty-counts.qza --o-visualization demux-empty-counts.qzv

# filter all samples with a count of zero (this will fail if no samples have a count greater than 0)
qiime demux filter-samples --i-demux demux-empty.qza --m-metadata-file demux-empty-counts.qza --p-where "[Demultiplexed sequence count] > 0" --o-filtered-demux demux-empty-filtered.qza

# generate read counts on the filtered artifact
qiime demux tabulate-read-counts --i-sequences demux-empty-filtered.qza --o-counts demux-empty-filtered-counts.qza

# tabulate those into a .qzv to confirm that no samples are left
qiime metadata tabulate --m-input-file demux-empty-filtered-counts.qza --o-visualization demux-empty-filtered-counts.qzv

MAINT: migrate types/formats/transformers to q2-types for more general access

For example, see here.

Handle a single sample in summarize

This came up on the forum

Right now you get a strange error from sns.distplot.

add license to setup call in setup.py

[object Object] in `demux summarize` output

Bug Description
After mousing over the Interactive Quality Plot in demux summarize visualizations the "random sampling X out of Y sequences" turns into "random sampling of X out of [object Object] sequences"

Steps to reproduce the behavior

Go to https://view.qiime2.org/visualization/?type=html&src=https%3A%2F%2Fdocs.qiime2.org%2F2020.8%2Fdata%2Ftutorials%2Fmoving-pictures%2Fdemux.qzv
Click on the "Interactive Quality Plot" tab
Observe "10000 out of 263931 sequences"
Mouse over a bar in the plot
Observe "10000 out of [object Object] sequences"

Expected behavior
Show the correct numbers instead

Screenshots

Computation Environment

QIIME 2 2020.6, 2020.8, or latest dev as of writing

Allow users to download the backing data from the `summarize` interactive quality plot

This recently came up on the forum.

Add citations

Should use the new citation API in qiime2/qiime2#387

y-axis label truncated (summarize)

pyyaml dep is missing from ci build recipe

summarize unable to parse sample IDs with # in ID name

Bug Description
If a sample ID contains a # (which is completely valid), summarize will fail noisily when parsing the format's internal manifest file.

Steps to reproduce

Import a dataset with a # in at least one sample ID
Run summarize on those samples.

Expected behavior
The visualization shouldn't fail; the # should be included as part of the ID.

Screenshots

(qiime2-2018.6) eduroam121-33:16s_july26_Run Molly-Mac$ qiime demux summarize --i-data /Users/Molly-Mac/My_FILES/16s_july26_Run/Demulti_SmPro_PlasmidTrial_sequences.qza --o-visualization Dmultivis_SmPr_PlasmidTrial.qzv --verbose
Traceback (most recent call last):
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/q2cli/commands.py", line 274, in __call__
results = action(**arguments)
File "", line 2, in summarize
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/qiime2/sdk/action.py", line 232, in bound_callable
output_types, provenance)
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/qiime2/sdk/action.py", line 429, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py", line 121, in summarize
lambda x: os.path.join(str(data), x))
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/pandas/core/series.py", line 3194, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py", line 121, in
lambda x: os.path.join(str(data), x))
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/posixpath.py", line 89, in join
genericpath._check_arg_types('join', a, *p)
File "/Users/Molly-Mac/miniconda2/envs/qiime2-2018.6/lib/python3.5/genericpath.py", line 143, in _check_arg_types
(funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'float'

Plugin error from demux:

join() argument must be str or bytes, not 'float'

In this case, the offending sample ID: SmProDNA.Va#2Control
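One possible approach, sketched here with the standard library rather than the pandas-based parsing that summarize uses: treat a manifest line as a comment only when it begins with `#`, so that a `#` inside a sample ID is preserved. The `read_manifest` helper and the file name in the example are hypothetical.

```python
import csv
import io


def read_manifest(text):
    """Parse comma-separated manifest text into a list of row dicts.

    Only lines that *start* with '#' are skipped as comments; a '#'
    inside a field is kept as ordinary data.
    """
    lines = [line for line in text.splitlines()
             if line and not line.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))


# Hypothetical manifest content using the sample ID from this report.
manifest_text = (
    "sample-id,filename,direction\n"
    "# this whole line is a comment\n"
    "SmProDNA.Va#2Control,reads_0_L001_R1_001.fastq.gz,forward\n"
)
rows = read_manifest(manifest_text)
```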

Computation Environment

OS: macOS High Sierra
QIIME 2 Release: 2018.6

References

An action to align paired end reads with each other

Improvement Description
There are many cases where the sequence IDs of paired-end reads don't match, or where some reads need to be dropped because there is no matching complementary read. We should have some way to fix that situation.

Comments
Not sure where this would necessarily go, but this plugin seemed reasonable.

add subsample method

Addition Description
Implement a new action for creating a random subsample of demultiplexed sequence data as a user-specified percentage of the input data.

Current Behavior
We use similar subsets of these data for our tutorials, but have generated them externally to QIIME 2 to date.

Proposed Behavior
The user will provide SampleData[SequencesWithQuality] and a percentage of the sequences to randomly retain. A new SampleData[SequencesWithQuality] will be generated with roughly (for the sake of computational efficiency) that number of sequences.
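The "roughly that number" behavior can be sketched by keeping each record independently with the requested probability, which avoids counting the input first. This is an illustrative sketch under that assumption; `subsample` is a hypothetical name and the records are plain Python objects rather than a QIIME 2 artifact.

```python
import random


def subsample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    The output size is only approximately fraction * len(records),
    because every record is an independent coin flip.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so fraction=0 keeps nothing and
    # fraction=1 keeps everything.
    return [record for record in records if rng.random() < fraction]
```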

poor demux summarize error message when summarizing empty demux

Bug Description
When qiime demux summarize is run on an empty 'demux' Artifact, it produces a sub-optimal error message:

Plugin error from demux: Cannot describe a DataFrame without columns

This error message is likely being passed up from Pandas.

Steps to reproduce the behavior

Download or create an empty SampleData[SequencesWithQuality] or other acceptable Artifact. (e.g. this one)
Pass it as an input to qiime demux summarize

Expected behavior
demux summarize should test for an empty Artifact and render a clearer error message. e.g. 'Input data contains no sequences.'
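A sketch of the kind of guard this describes, assuming per-sample read counts are available as a dictionary before any plotting happens. `ensure_not_empty` is a hypothetical helper, not an existing q2-demux function:

```python
def ensure_not_empty(per_sample_counts):
    """Raise a clear error when no sample contains any sequences.

    `per_sample_counts` maps sample ID to number of reads (assumed input).
    """
    if sum(per_sample_counts.values()) == 0:
        raise ValueError("Input data contains no sequences.")
```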

Screenshots

Computation Environment

OS: linuxmint 18.3
QIIME 2 Release 2019.4

References
demux summarize source

emp using all available file descriptors on mac os

https://forum.qiime2.org/t/qiime-2-demux-emp-using-all-available-file-descriptors/99

summarize: report read counts for all samples in the metadata, even if that read count is zero

Comments
Came up in NH workshop

support for paired end, variable length demultiplexing

Comments
This could be a new method in this plugin.

References
A user requested this support on the forum here.

summarize: min-length is global and does not always match the random sample

In this example, the global min-length is 40, whereas the sample's min-length is 149.

emp_single fails when the `ec_details` list is empty

QIIME 2 Release 2019.4

emp_single fails when the ec_details list is empty. This appears to happen when golay_error_correction=False, because nothing is appended to ec_details. If the details data frame built in emp_single is empty, the transformer that converts it to ErrorCorrectionDetailsFmt fails with ValueError: Metadata must contain at least one ID.

If the details dataframe on this line is empty:

q2-demux/q2_demux/_demux.py

Line 347 in 9d4d4f6

details = pd.DataFrame(ec_details, columns=columns).set_index('id')

emp_single then returns an empty dataframe, and the transformer that converts it to ErrorCorrectionDetailsFmt fails:

from qiime2 import Metadata
from q2_demux._format import ErrorCorrectionDetailsFmt
import pandas as pd

columns = ['id',
           'sample',
           'barcode-sequence-id',
           'barcode-uncorrected',
           'barcode-corrected',
           'barcode-errors']
details = pd.DataFrame([], columns=columns).set_index('id')

ff = ErrorCorrectionDetailsFmt()
Metadata(details).save(str(ff))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-262a0b5c8838> in <module>
     12 
     13 ff = ErrorCorrectionDetailsFmt()
---> 14 Metadata(details).save(str(ff))

~/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/metadata/metadata.py in __init__(self, dataframe)
    360                 "%r" % (self.__class__.__name__, type(dataframe)))
    361 
--> 362         super().__init__(dataframe.index)
    363 
    364         self._dataframe, self._columns = self._normalize_dataframe(dataframe)

~/miniconda3/envs/qiime2-2019.4/lib/python3.6/site-packages/qiime2/metadata/metadata.py in __init__(self, index)
     91         if index.empty:
     92             raise ValueError(
---> 93                 "%s must contain at least one ID." % self.__class__.__name__)
     94 
     95         id_header = index.name

ValueError: Metadata must contain at least one ID.

IMP: More concise metadata error handling for qiime demux emp-single

Improvement Description
More concise error handling within q2-demux for metadata file handling when running qiime demux emp-single to include the specific column where any metadata.tsv (or sequences.qza) errors occurred. This will make troubleshooting easier for both QIIME 2 users and developers.

Current Behavior
When an error originates from an issue within the metadata file provided (that is due to invalid characters present in the --m-barcodes-column), the following message is provided to the QIIME 2 user:

Plugin error from demux:

 Invalid characters in sequence: ['1', '3', 'F', 'P', 'i', 'l']. 
 Valid characters: ['C', 'M', 'G', 'W', '-', 'H', 'T', 'K', 'S', 'Y', 'B', 'A', 'N', 'D', 'V', 'R', '.']
 Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.

This is an error message pulled from scikitbio, but there is no specific error message from q2-demux.

Proposed Behavior
The error above will also include a q2-demux specific message that includes the column name that caused the error, along with suggested resolution (make sure you've included the correct column in your metadata file, etc).

References
QIIME 2 Forum post that prompted this improvement

`summarize`: better handling of unequal sequence lengths

In QIIME 2 2017.5 we added a check to disallow unequal sequence lengths when using demux summarize, in order to avoid misleading positional boxplots. However, this is inconvenient because it is likely that raw Illumina data has sequences that are slightly different lengths (usually differing by 1bp in length). A user on the forum ran into this problem.

I'm creating this issue to discuss a fix so that 1) the boxplots aren't misleading and 2) the visualization can be used with Illumina data that may have slightly differing sequence lengths.

Proposal: remove the error, provide a warning box in the resulting viz if unequal seq lengths are detected in the subsample, and only draw boxplots for positions in the shortest sequence (i.e. positions that occur in all subsampled sequences). We could also have a threshold parameter that controls when to display this warning so that users aren't always seeing it when processing Illumina data.
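The core of this proposal can be sketched as follows, assuming the subsampled quality scores are available as one list of integers per read. `positions_for_boxplots` is a hypothetical helper; the warning threshold mentioned above is left out for brevity.

```python
def positions_for_boxplots(quality_scores):
    """Group quality scores by position, up to the shortest read.

    Returns (columns, unequal): `columns[i]` holds the scores at
    position i from every read, and `unequal` is True when the reads
    differ in length, which is when a warning would be shown.
    """
    lengths = [len(scores) for scores in quality_scores]
    if not lengths:
        return [], False
    min_len = min(lengths)
    unequal = min_len != max(lengths)
    columns = [[scores[i] for scores in quality_scores]
               for i in range(min_len)]
    return columns, unequal
```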

This issue needs resolution for the next release (2017.6).

support golay barcode correction

demux: should golay-error-correction be False by default?

Proposed Behavior
I think yes, default should be False, but I see both sides of the argument:

True: the emp-* methods are designed for EMP, after all. EMP uses golay barcodes (correct?) and hence anyone truly following that protocol would have golay (by default).
False: many different primers and barcoding schemes (i.e., non-EMP) are compatible with the "EMP"-style method and I think many novice users will stumble into this, as demonstrated on the forum. Setting the default to False will allow users to decide what they do.

Table header misplaced in demux summarize visualizations

Bug Description
The "Sample Name" and "Sequence Count" column headers in the Per-sample sequence counts table produced by qiime demux summarize are misaligned.

Steps to reproduce the behavior

Download demux.qza from the moving pictures tutorial. (or use any other demuxed sequences)
Open a terminal window and navigate to the folder it's in.
Activate a qiime2 environment and run the following:
qiime demux summarize --i-data demux.qza --o-visualization demux.qzv
Open the resulting demux.qzv with q2-view, and take note of the positioning of the column headers (below "Per-sample sequence counts").

Expected behavior
Column headers should be aligned with their columns.

Screenshots

Computation Environment

OS: LinuxMint 18.3 Cinnamon
QIIME 2 2018.11
Firefox 64.0 and Chrome 67.0.3396.62

References

The table produced by feature-table tabulate-seqs might provide a useful reference.

invalid sample-id values run successfully through demux without errors, probably shouldn't.

Current Behavior
I recently noticed that an invalid sample-id name, in my case a value with a comma in it, will not break demux or yield a warning. So demux runs seemingly successfully but if you try to visualize the results you get an error like so:

Traceback (most recent call last):
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2cli/commands.py", line 328, in __call__
results = action(**arguments)
File "</home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/decorator.py:decorator-gen-436>", line 2, in summarize
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
output_types, provenance)
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/qiime2/sdk/action.py", line 452, in callable_executor
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/q2_demux/_summarize/_visualizer.py", line 119, in summarize
header=0, comment='#')
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/home/mestaki/miniconda3/envs/qiime2-2020.2/lib/python3.6/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 74, saw 5

The comma would obviously not pass Keemei/validation, but demux doesn't recognize any issues. Only the visualization does.

Proposed Behavior
I know metadata validation would be the easiest solution but just wanted to bring it up in case it can be improved in some way, for example:

A) extract and validate sample-id column before running demux. <- has a tiny bit oversight
B) enhance the "tokenizing data." warning message to suggest invalid sample_id and to make sure the file is validated.

q2-2020.2

Golay error correction assumes barcodes are reverse complemented

Running emp_single with my data, I was receiving errors that no samples were assigned sequences when using the golay_error_correction=True flag.

After investigating, I believe the issue lies with the complement of the barcode. Here I wrote a for loop that counts the barcodes with 4 or more error corrections (uncorrectable), both as given and after reverse complementing. The barcodes as given (which is how we maintain them) are returned almost exclusively as invalid.

I am not sure if there is currently a way to account for the complement; I may have missed it in the code.

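from q2_demux._ecc import GolayDecoder


def rev_comp(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    # Replace with lowercase letters first so that bases already
    # complemented are not swapped back, then uppercase and reverse.
    seq = (seq
           .replace('A', 't')
           .replace('T', 'a')
           .replace('G', 'c')
           .replace('C', 'g')
           .upper()[::-1])
    return seq


decoder = GolayDecoder()

n_normal = 0    # barcodes uncorrectable as given
n_rev_comp = 0  # barcodes uncorrectable after reverse complementing

# golay_barcodes is a list of valid barcodes (see the note below).
for gol_code in golay_barcodes:
    rev_gol_code = rev_comp(gol_code)
    seq, n = decoder.decode(gol_code)
    if n == 4:  # an error count of 4 is treated as uncorrectable here
        n_normal += 1

    seq, n = decoder.decode(rev_gol_code)
    if n == 4:
        n_rev_comp += 1

print(n_normal)
print(n_rev_comp)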

1698
0

Where golay_barcodes is a list of valid barcodes.

Support Dual Indexing

Comments
Right now we don't have a way to manage this data. I'm not aware of any tools yet that manipulate this either, so we really don't have a good place to start.

Possible demux enhancement (RFC?)

Improvement Description
Python has multiprocessing.pool.Pool.map which lets you "spin up" processes, and map a list of variables one-by-one to a function.

Is this something that could be utilized by demux w.r.t. demultiplexing one barcode at a time, but in parallel?

Current Behavior
Currently it goes through every sequence in the main read set file and writes it to its respective sample file, but it seems like you could spin up a few processes that read through the sequences.fastq.gz file in parallel and strip only what they are interested in, for writing to their respective sample files (therefore eliminating any race conditions, as each barcode/sample will only be dealt with once and won't step on another per-sample file's foot).

Comments

This could allow buffering of the reading from the main file, preventing it from having to do so many writes to disk, and potentially greatly improving the runtime speeds.
Mainly just a question as a possible enhancement. Totally unsure if it would work the way I'm picturing it, or if there were any better way to "thread" with python and parallelize this process.
One example reason for this enhancement would be that a MiSeq run (~15mil seqs) takes about an hour to demultiplex (the longest step in the process from getting raw sequences through to taxonomy classification).
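The idea can be sketched with `multiprocessing.Pool.map`, with one task per barcode. This is only an illustration under simplifying assumptions: `select_reads` and `demux_parallel` are hypothetical names, and the reads are held in memory as `(barcode, sequence)` pairs instead of being streamed from sequences.fastq.gz as the issue describes.

```python
from functools import partial
from multiprocessing import Pool


def select_reads(barcode, records):
    """Return (barcode, sequences) for the reads carrying this barcode."""
    return barcode, [seq for bc, seq in records if bc == barcode]


def demux_parallel(records, barcodes, processes=2):
    """Map each barcode to its reads, one worker task per barcode.

    Each task scans all records but keeps only its own barcode, so no
    two tasks ever produce output for the same sample.
    """
    worker = partial(select_reads, records=records)
    with Pool(processes) as pool:
        return dict(pool.map(worker, barcodes))
```

On platforms that spawn rather than fork worker processes, `demux_parallel` should be called from under an `if __name__ == "__main__":` guard. Note also that every task rescans the full read set, so whether this is faster than a single pass depends on the number of barcodes and on I/O costs.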

References
multiprocessing.pool.Pool.map

`summarize` errors when provided single sample

If the input data consists of only a single sample, the summarize visualizer errors:

$ qiime demux summarize --i-data bar.qza --o-visualization bleh
Traceback (most recent call last):
  File "/Users/jairideout/miniconda3/envs/qiime2/bin/qiime", line 9, in <module>
    load_entry_point('q2cli', 'console_scripts', 'qiime')()
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/jairideout/dev/qiime2/q2cli/q2cli/commands.py", line 210, in __call__
    results = action(**arguments)
  File "<decorator-gen-190>", line 2, in summarize
  File "/Users/jairideout/dev/qiime2/qiime2/qiime/sdk/action.py", line 170, in callable_wrapper
    output_types, provenance)
  File "/Users/jairideout/dev/qiime2/qiime2/qiime/sdk/action.py", line 301, in _callable_executor_
    ret_val = callable(output_dir=temp_dir, **view_args)
  File "/Users/jairideout/dev/qiime2/q2-demux/q2_demux/_demux.py", line 116, in summarize
    ax = sns.distplot(result, kde=False)
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/seaborn/distributions.py", line 209, in distplot
    bins = min(_freedman_diaconis_bins(a), 50)
  File "/Users/jairideout/miniconda3/envs/qiime2/lib/python3.5/site-packages/seaborn/distributions.py", line 30, in _freedman_diaconis_bins
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
TypeError: len() of unsized object

Originally reported on the forum here.

Summarize clips end of the sequence (sometimes?)

There seems to be an extra position that is getting clipped with the FMT 1 percent tutorial data.

Zoomed Out

Zoomed In

move RawSequences to q2-types, rename to MultiplexedSequences

Improvement Description
RawSequences is a pretty common/important type so it makes sense to move to q2-types. It would be great to rename it to something like MultiplexedSequences (or possibly a more specific name) since we already have the concept of demultiplexed sequences, and "raw sequences" isn't very informative of a name.

Proposed Behavior
When this move and rename happens, we'll need to handle it in a backward-compatible way so that existing artifacts with this type will continue to work. I think this is the first case of a type move/rename that needs backward-compatibility.

`summarize` incorrectly handles sample IDs with underscores

Came up on the forum: if there are sample IDs containing one or more underscores in the input .qza (SampleData[PairedEndSequencesWithQuality | SequencesWithQuality]), summarize will treat the sample ID as all characters up to the first underscore, and will truncate the rest of the sample ID. This results in a single truncated sample ID being reported in the plot. If there are multiple sample IDs with underscores that are truncated to the same sample ID, the behavior is undefined as to which sample data is used for the truncated sample ID (it's whichever file is found last when walking the per-sample fastq files).

This line of code is parsing the filename using .split('_', 1)[0], which truncates IDs with underscores. I think the code can be updated to use the sample IDs in the manifest -- I believe there's an API on the manifest object to obtain sample IDs and their respective filepaths, which avoids the ad-hoc parsing happening in summarize.

Note that this bug affects both single- and paired-end data.
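The truncation is easy to reproduce on a file name alone. The sketch below assumes per-sample files are named `<sample-id>_<barcode-id>_L<lane>_R<read>_001.fastq.gz`, and the file name itself is made up. Splitting from the right is shown only for contrast; reading sample IDs from the manifest, as suggested above, avoids parsing file names at all.

```python
# Hypothetical per-sample file name with underscores in the sample ID.
filename = "my_sample_id_0_L001_R1_001.fastq.gz"

# Current behavior: everything after the first underscore is lost.
truncated = filename.split("_", 1)[0]

# Splitting off the four trailing fields from the right keeps the full ID
# (this relies on the assumed file-name layout described above).
full_id = filename.rsplit("_", 4)[0]
```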

demux is very slow

In the short-term, we'll put the optimizations in place that are described here. In the longer term, we need to improve this in scikit-bio (scikit-bio/scikit-bio#907).

Thanks for the input @jairideout.

Support paired-end demux with two barcode files

References

This came up on the forum.
QIIME 1 handled this with extract_barcodes.py.

Improve warning in demux command

This discussion was started here.

If the user doesn't have golay barcodes, the following error will be reported

Plugin error from demux:

  No sequences were mapped to samples. Check that your barcodes are in the correct orientation (see the rev_comp_barcodes and/or rev_comp_mapping_barcodes options).

The solution is to specify --p-no-golay-error-correction, but it isn't immediately obvious that this should be the case given the error. It may be worthwhile to also note this in the error message.

update summarize to take `SampleData[JoinedSequencesWithQuality]` as input

Plot and 7-number summaries off-by-one

Bug Description
In the "Interactive Quality Plot" tab of the summarize viz, the index numbering scheme varies between the 7-number csv summary (0-start), and the D3 plot (1-start).

Steps to reproduce the behavior

Go to https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2019.10%2Fdata%2Ftutorials%2Fmoving-pictures%2Fdemux.qzv

Expected behavior
The viz should utilize consistent indexing.

References

Original forum post

Will qiime2 support demultiplexing barcodes with mutations?

Improvement Description
Mutation Types:

Indel
Substitution

Distance calculation:

Hamming
Levenshtein
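For reference, the two distances differ in what they can express: Hamming distance counts substitutions between equal-length barcodes, while Levenshtein distance also allows insertions and deletions. A plain-Python sketch of both (not taken from q2-demux):

```python
def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length barcodes")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    # previous[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y
            ))
        previous = current
    return previous[-1]
```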

Better validation of barcodes column

Current Behavior
Right now the emp methods in this plugin allow null values in the barcodes column specified.

Questions
Should we ignore the samples without a barcode, raise an error, or do something else entirely?
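Whichever option is chosen, the first step is to find the affected samples. A minimal sketch, assuming the barcodes column has been read into a dictionary of sample ID to barcode; `samples_missing_barcodes` is a hypothetical helper:

```python
def samples_missing_barcodes(barcodes):
    """Return the sorted sample IDs whose barcode is None or blank."""
    return sorted(sample_id for sample_id, barcode in barcodes.items()
                  if barcode is None or not barcode.strip())
```

The caller could then either drop these samples or raise an error that lists them.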

References
This recently came up on the forum.

new action to generate `ImmutableMetadata` of sequence per sample counts for `SampleData[*SequencesWithQuality]` artifacts

This should generate a single ImmutableMetadata artifact containing the sequence per sample counts from one or more inputs.

In the future, the summarize action could become a pipeline that includes this output as well.
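The counting itself can be sketched as follows, assuming each sample's reads are available as an iterable of FASTQ text lines with four lines per record. Both function names are hypothetical, and building the ImmutableMetadata artifact is not shown.

```python
def count_fastq_records(lines):
    """Count FASTQ records, assuming exactly four lines per record."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("FASTQ input does not contain a multiple of 4 lines")
    return n_lines // 4


def per_sample_counts(sample_lines):
    """Map each sample ID to its record count.

    `sample_lines` maps sample ID to an iterable of FASTQ lines.
    """
    return {sample_id: count_fastq_records(lines)
            for sample_id, lines in sample_lines.items()}
```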

New method: filter SampleData[*SequencesWithQuality]

Addition Description
We need a method to filter SampleData[*SequencesWithQuality] artifacts. Several users have brought this up — e.g., multiple studies are included in a single sequencing run. They want to demultiplex once and then split by study for downstream analysis.

Current Behavior
There is no such method.

Proposed Behavior
New method to filter based on metadata.
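As a rough illustration of metadata-based filtering (the data structures and the `filter_samples` name are assumptions made for this sketch, not the plugin's API), samples can be kept when a predicate over their metadata row is true:

```python
def filter_samples(sample_files, metadata, keep):
    """Keep samples whose metadata row satisfies `keep`.

    `sample_files` maps sample ID to its file(s); `metadata` maps sample
    ID to a dict of column values; `keep` is a predicate over that dict.
    Samples absent from the metadata are dropped.
    """
    return {sample_id: files
            for sample_id, files in sample_files.items()
            if sample_id in metadata and keep(metadata[sample_id])}
```

For the multi-study case described above, `keep` could be `lambda row: row["study"] == "A"`, where the `study` column is hypothetical.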

References
forum xref

New/Improved visualization for Golay statistics

Improvement Description
User "rrohwer" on the QIIME 2 Forum reported difficulty visualizing the Golay statistics output. A reasonable improvement would be a better way to interpret these statistics.

Current Behavior
The issue is the output is per-read. q2-metadata's tabulate was the original target for this, but it may not be suited given the number of rows in the output.

add seven-number summary of read length distribution to `summarize`

This is useful after read joining, as the reads will differ in length, and this information is essential to have for setting the trim length parameter for deblur denoise-*.

If you need convincing of why we should add this, check out how awful the Viewing summary of joined data with read quality section of the read joining community tutorial has to be without it.
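A sketch of such a summary using only the standard library, assuming the seven numbers are the 2nd, 9th, 25th, 50th, 75th, 91st, and 98th percentiles of read length. The `seven_number_summary` name is hypothetical, and the interpolation method is a choice made for this example.

```python
import statistics

# Percentiles assumed for the seven-number summary.
PERCENTILES = (2, 9, 25, 50, 75, 91, 98)


def seven_number_summary(lengths):
    """Return {percentile: read length} for the seven percentiles.

    statistics.quantiles with n=100 returns the 99 cut points between
    percentiles, so the p-th percentile sits at index p - 1.
    """
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in PERCENTILES}
```

The 2nd or 9th percentile from a summary like this gives a starting point for choosing a trim length that retains most reads.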

expand tests of transformers

Current Behavior
The paired-end read transformers are not tested, and the tests of the transformers in general are a bit light.

Bespoke viz for error-correction-details?

Bug Description
metadata tabulate struggles to visualize demux error-correction-details unless data sets are small. Even when your data set is of limited size (e.g. Moving Pictures Tutorial data), the result is an unwieldy 300+ page qzv.

We need a new visualization that presents this data in a more usable, and less resource-intensive way.

Steps to reproduce the behavior

demux a large data set using q2-demux
use metadata tabulate on the resulting error-correction-details.qza

Expected behavior
The new viz should present error correction details in a meaningful way, without requiring an HPC cluster's resources and/or long lead-time to create.

Computation Environment

QIIME 2 2019.7, 2019.10

Questions

What data do users need most from this report? Best to provide summary statistics or selected results in this viz, with a link to download a csv?
Do the resources required to build this viz constrain what we are able to provide here?
Is there any way to double-dip with #105 ?

References

Forum x-ref

Add color legend for boxplot

Improvement Description
Adding a legend or a line to the plot would make this immediately visible without the need to compare the "warning text" that's shown when hovering over.

Current Behavior
Currently boxplots are colored in blue and red. Boxplots for any given position are colored red when the subsampling included sequences that were shorter than that position. This information is only visible when hovering over the boxes.

support single-end demux from EMPPairedEndSequences

Comments
This would be convenient, and similar to how dada2 denoise-single can take SampleData[PairedEndSequencesWithQuality] and process only the forward reads.

Update imports for TestPluginBase class to use qiime.plugin.testing

TestPluginBase has been moved to qiime.plugin.testing in qiime2/qiime2/pull/152.

I will open a PR to change this soon.
