bokulich-lab / q2-assembly
QIIME 2 plugin for (meta)genome assembly.
License: BSD 3-Clause "New" or "Revised" License
I didn't notice this during my review of #32, but I believe this text should now be updated:
$ qiime assembly evaluate-contigs --help
Usage: qiime assembly evaluate-contigs [OPTIONS]
...
--p-threads INTEGER Maximum number of parallel jobs. Default: 1.
Range(1, None) Currently disabled - only 1 CPU is supported.
[default: 1]
...
I think that should indicate that it's disabled on platforms other than Linux, right?
Hello,
I tried to run qiime assembly generate-reads --output-dir test_data and was expecting data from the sampled genomes to be placed in a new directory, test_data. Instead, I got:
Plugin error from assembly:
'NoneType' object is not iterable
It seems that ncbi, n-genomes-ncbi, and sample-names are needed for this to run.
Thanks!
I am testing with samples that have underscores in their identifiers, and am running into a failure due to an assumption that only the text before the _ is the sample identifier. This looks to be traceable to this line.
Note that QIIME 2 does allow for underscores in identifiers (see the documentation here).
Is there a different way to get the sample ids from the filenames in this case? Based on a quick look at the files in the data artifact, it looks like you could change line 113 to:
return os.path.basename(fp.replace('_contigs.fa', ''))
I recommend adding a test of this function that includes a sample id with an underscore in it.
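A minimal sketch of the suggested change together with the kind of test recommended above. The function name `sample_id_from_contig_path` is hypothetical and only the `_contigs.fa` suffix comes from this issue; the plugin's actual helper may look different.

```python
import os


def sample_id_from_contig_path(fp: str, suffix: str = "_contigs.fa") -> str:
    """Return the sample ID encoded in a contig file name.

    Only the trailing suffix is stripped, so underscores inside the
    sample ID are preserved.
    """
    name = os.path.basename(fp)
    if not name.endswith(suffix):
        raise ValueError(f"Unexpected contig file name: {name!r}")
    return name[: -len(suffix)]


# Checks of the kind suggested above, including IDs with underscores.
assert sample_id_from_contig_path("/data/sample1_contigs.fa") == "sample1"
assert sample_id_from_contig_path("/data/NEC_EF_01_contigs.fa") == "NEC_EF_01"
```

Stripping a known suffix rather than splitting on the first underscore keeps the function correct for any identifier QIIME 2 accepts.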
We need an action supporting contig/MAG indexing using bowtie2.
Acceptance criteria:
SampleData[Contigs] or SampleData[MAG] as input
SampleData[Bowtie2Index] or SampleData[MultiBowtie2Index]
We need an action supporting metagenome assembly using the metaSPAdes assembler.
Acceptance criteria:
SampleData[*SequencesWithQuality*] as input
SampleData[Contigs] artifact
Only .fa contig files are collected, but .fasta files should also be collected:
See bokulich-lab/q2-moshpit#76 and the closing PR for a possible solution.
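One way to collect both extensions, shown as a rough sketch. The helper `find_contig_files` is hypothetical, and it assumes `.fasta` files follow the same `<sample>_contigs` naming as the `.fa` files; the solution in the linked PR may differ.

```python
import glob
import os
import tempfile


def find_contig_files(data_dir: str, extensions=(".fa", ".fasta")) -> list:
    """Return sorted paths of contig files with any accepted extension."""
    found = []
    for ext in extensions:
        # The pattern must match the whole file name, so "*_contigs.fa"
        # does not also pick up "*_contigs.fasta".
        found.extend(glob.glob(os.path.join(data_dir, f"*_contigs{ext}")))
    return sorted(found)


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1_contigs.fa", "s2_contigs.fasta", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(fp) for fp in find_contig_files(tmp)]

print(found)
```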
In doing some experiments with q2-assembly, I noticed that the default number of CPUs is set a few different ways. For example:
qiime assembly assemble-megahit: --p-num-cpu-threads
Number of CPU threads. Default: # of logical processors.
qiime assembly assemble-spades: --p-threads
Number of threads. Default: 16.
In the core distribution QIIME 2 plugins, we tend to set these types of values with a default of 1, forcing the user to intentionally request more resources. This is because users will often just go with the default setting. If it's set to 1, they'll notice that it's too slow and increase the value. If it's set high by default though, it becomes easy for users to not notice and overload a system. For example, if they request a single CPU on their cluster, and then 16 subprocesses spin up, that can overload the cluster node and get them in trouble with the sys admin (and potentially give QIIME 2 a bad reputation with the sys admin if it happens regularly).
I recommend always setting the defaults for these parameters to 1, and letting the user override them.
The following command successfully runs and generates a QZV.
qiime assembly evaluate-contigs \
--i-contigs megahit-contigs.qza \
--p-min-contig 1000 \
--p-threads 56 \
--o-visualization megahit-contigs.qzv \
--verbose
However, clicking on any of the "Downloads" buttons does not invoke a "download" of the respective PDFs & TSVs. I can confirm, though, that these files do exist within the extracted QZV under the quast_data/ and quast_data/basic_stats/ folders, using the command:
qiime tools extract \
--input-path megahit-contigs.qzv \
--output-path megahit-contigs-extract
Tested within the Chrome and Safari browsers. Code was run within the qiime2-shotgun-2023.9 environment.
When I open the attached visualization, I am unable to download any attached images.
Steps to reproduce:
qiime assembly evaluate-contigs with the default parameters (without providing reads as input) and the contigs.qza artifact provided in the attached files
Expected behaviour:
Plots get downloaded.
Actual behaviour:
An error message is displayed:
Attached files: https://polybox.ethz.ch/index.php/s/0MwYFnTJS5M1k8p/download
When running the evaluate-contigs action in an environment with the most recent version of QUAST installed (5.2.0), it is impossible to generate the QC visualization due to the following error: ValueError: invalid literal for int() with base 10: 'START_A'.
This seems to be caused by ablab/quast#230 (and fixed by ablab/quast#244). Unfortunately, the previous conda-installable version of QUAST (5.0.2) is not compatible with our environment (it needs Python<3.7), so until a new, fixed version is released, the only solution would be to pip install QUAST directly.
When running the command within the conda environment qiime2-shotgun-2023.9:
qiime assembly assemble-spades \
--i-seqs fondue-output/single_reads.qza \
--p-threads 8 \
--p-phred-offset 33 \
--p-memory 60 \
--p-meta \
--o-contigs contigs.qza \
--verbose
The following error was returned:
Plugin error from assembly:
SPAdes v3.15.2 in "meta" mode supports only paired-end reads.
See above for debug info.
Update the help text to reflect this.
As a plugin user,
I want to be able to co-assemble genomes across all the samples rather than per-sample
so that I can use all the available sequence information.
Tasks:
Multiprocessing was temporarily disabled here:
q2-assembly/q2_assembly/quast/quast.py
Lines 57 to 63 in a562844
We need an action supporting mapping reads to indexed contigs/MAGs using bowtie2.
Acceptance criteria:
SampleData[*SequencesWithQuality]
and SampleData[MultiBowtie2Index | Bowtie2Index]
SampleData[AlignmentMaps | AlignmentMap]
Notes:
Describe the bug
When I try to run the generate-reads action, I get an error stating that the pysam module is missing.
To Reproduce
Steps to reproduce the behavior:
mamba env create -n q2-shotgun --file https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-linux-conda.yml
conda activate q2-shotgun
qiime dev refresh-cache
qiime assembly generate-reads \
--i-genomes genomes.qza \
--p-sample-names sample1 sample2 sample3 sample4 \
--p-n-reads 2000000 \
--p-abundance uniform \
--p-n-genomes 5 \
--p-cpus 10 \
--output-dir reads \
--verbose
Template genome sequences were provided - "n-genomes-ncbi" and "ncbi" parameters will be ignored.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: iss generate --compress --genomes /scratch/mziemski/tmp/qiime2/mziemski/data/0619c364-72ba-492a-9d8f-7043e825bf15/data/dna-sequences.fasta --n_genomes 5 --abundance uniform --n_reads 2000000 --mode kde --model HiSeq --cpus 10 --output /scratch/mziemski/tmp/tmpowd79ql2/sample1_00_L001
Traceback (most recent call last):
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/bin/iss", line 6, in <module>
from iss.app import main
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/app.py", line 4, in <module>
from iss import bam
File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/bam.py", line 14, in <module>
import pysam
ModuleNotFoundError: No module named 'pysam'
Expected behavior
The action runs without errors.
Please complete the following information:
Additional context
This happened on both Linux and macOS. I used the 2023.9 distro - for some reason the pysam package is not included there. Before, when I used to install q2-assembly "directly" from our conda channel, it all seemed to work fine. @lizgehret, @ebolyen would you have an idea why that could be happening?
We need a QIIME 2 visualisation displaying assembly quality control results.
Acceptance criteria:
SampleData[Contigs] as input
On two separate machines (both HPC clusters), I've had evaluate-contigs fail due to a lack of available disk space. I can see from the log that it downloads a lot of files. @misialq, have you run into this? Any ideas on how to address this? I haven't actually got this command to complete yet, after testing on two different studies (and two different systems, as I mentioned). Let me know if you'd like the error log - I can send that by email.
It seems that QUAST generates many more visualizations than the ones that are currently displayed in the visualization produced by evaluate-contigs (most, if not all, of them are generated based on alignments to reference sequences, either provided by the user - not yet supported, see #35 - or fetched by QUAST automatically). Some of the interesting ones include:
Update the following information:
We need an action supporting metagenome assembly using the MEGAHIT assembler.
Acceptance criteria:
SampleData[*SequencesWithQuality*] as input
SampleData[Contigs] artifact
As a user,
I want more QUAST params to be exposed in the evaluate_contigs action,
so that I can have more control over how the tool is being run.
Notes
There are options like --memory-efficient but also a couple of other ones which have to do with aligning contigs to references.
When running:
qiime assembly assemble-spades \
> --i-seqs fondue-output/single_reads.qza \
> --p-threads 8 \
> --o-contigs contigs.qza \
within the conda env qiime2-shotgun-2023.9, the following error appears within the --verbose output:
Usage: spades.py [options] -o <output_dir>
spades.py: error: argument --phred-offset: invalid qvoffset value: 'auto-detect'
...
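The error above shows spades.py rejecting 'auto-detect' as a value for --phred-offset. One possible workaround, sketched here on the assumption that the plugin receives the offset as a string, is to pass the flag only when 33 or 64 was requested and otherwise leave detection to SPAdes. The function name `spades_phred_args` is hypothetical, and this is not necessarily how the plugin builds its command.

```python
def spades_phred_args(phred_offset: str = "auto-detect") -> list:
    """Build the --phred-offset part of a spades.py command line."""
    if phred_offset == "auto-detect":
        # Omit the flag entirely so that SPAdes detects the offset itself.
        return []
    if phred_offset not in ("33", "64"):
        raise ValueError(f"Unsupported PHRED offset: {phred_offset!r}")
    return ["--phred-offset", phred_offset]


print(spades_phred_args())      # no flag is added
print(spades_phred_args("33"))  # flag and value are added
```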
I kept running into memory issues with a test data set I am using. After reading the SPAdes manual for release 3.15.2, which qiime2-shotgun-2023.9 uses:
SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM.
I think my issue was not realizing the increased memory usage incurred when using multiple threads. I am in the process of validating this now.
If a user specifies 32 cores, they'll be using up ~16 GB of RAM for buffers. This is analogous to feature-classifier, in which more memory is used with increasing thread count. Conversely, the user may specify too little memory to get anything to run. For example, setting the maximum memory usage to 100 GB and using 16 threads means much smaller buffers / RAM per thread.
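The arithmetic behind the ~16 GB figure can be checked with a short sketch. It uses only the 512 Mb per thread value quoted from the manual above and assumes no manual memory limit is set (in which case, per the manual, SPAdes uses smaller buffers). The helper name is made up for illustration.

```python
BUFFER_MB_PER_THREAD = 512  # value quoted from the SPAdes manual above


def buffer_memory_gb(threads: int) -> float:
    """Estimate buffer memory in GB for a given thread count."""
    return threads * BUFFER_MB_PER_THREAD / 1024


for threads in (1, 8, 16, 32):
    print(f"{threads:>2} threads: {buffer_memory_gb(threads):.1f} GB for buffers")
```

This covers buffers only; the memory needed by the assembly itself comes on top of it.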
Perhaps update the help text like so:
--p-threads: Number of threads. By default SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. This can be further affected by the --p-memory option.
--p-memory: RAM limit for SPAdes in Gb (terminates if exceeded). If a smaller memory limit is set, SPAdes will use smaller buffers and thus less memory per --p-threads.
Is it easier for everyone to post these types of suggestions as an issue like this, or should I simply wait, compile a set of these suggestions, and then submit them as a PR? I've not dived into the code yet, so I figured I'd recommend these simple fixes as I work through testing the tools. I'd imagine that these are easy enough to wrap into any other existing PRs.
As a user,
I want the evaluate_contigs action to accept reference sequences and/or at least custom BLAST dbs
so that the 16S Silva reference does not need to be downloaded by QUAST on every execution.
Notes:
Maybe in the beginning we could just allow passing the pre-created Silva QZA that we can easily convert to a blast database to avoid those constant re-downloads.
Inside of index_mags here, mags (of type MultiMAGSequencesDirFmt) does not have the expected manifest attached to it. E.g. adding mags.validate() in this function causes tests to fail.
It also doesn't seem like there's a way to generate a manifest in the way that we do e.g. for CasavaOneEightSingleLanePerSampleDirFmt in q2-types, from which some of the parent classes are borrowed here.
If no contigs are formed for one or more samples during assembly, and a SampleData[Contigs] with some .fa files of size zero is therefore passed as input to index-contigs, index-contigs fails with a fairly uninformative error message:
An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
The --verbose output was more useful, but still only "warned" about an empty fasta file:
Input files DNA, FASTA:
/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa
Warning: Empty fasta file: '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/bin/bowtie2-build-s --wrapper basic-0 --bmaxdivn 4 --dcv 1024 --offrate 5 --ftabchars 10 --threads 40 /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa /scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index
Traceback (most recent call last):
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 50, in _index_seqs
run_command(cmd, verbose=True)
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/_utils.py", line 28, in run_command
subprocess.run(cmd, check=True)
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bowtie2-build', '--bmaxdivn', '4', '--dcv', '1024', '--offrate', '5', '--ftabchars', '10', '--threads', '40', '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa', '/scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 468, in __call__
results = action(**arguments)
File "<decorator-gen-736>", line 2, in index_contigs
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 274, in bound_callable
outputs = self._callable_executor_(
File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 509, in _callable_executor_
output_views = self._callable(**view_args)
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 85, in index_contigs
_index_seqs(contig_fps, str(result), common_args, "contigs")
File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 52, in _index_seqs
raise Exception(
Exception: An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
Plugin error from assembly:
An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.
See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
I came across this because I had a couple of control samples which had very few (<10) demultiplexed sequences in my input to assemble-megahit, and these unsurprisingly didn't form any contigs. When I ran index-contigs I got the error.
I'm not sure what the best pathway forward is for this - at the very least we probably want a more informative error message, but we also might want a way to filter the SampleData[Contigs] so the user doesn't have to generate contigs again (which can take a while). I got around it this time by filtering my input to assemble-megahit to drop the two samples that were causing problems with qiime demux filter.
EDIT: I just hit this again, on a different data set. (Aug 21 2023)
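A more informative error could come from checking for empty contig files before Bowtie2 is invoked. The sketch below is only one possible approach; the function names and the wording of the message are hypothetical and not taken from the plugin.

```python
import os
import tempfile


def find_empty_fastas(paths) -> list:
    """Return the paths whose files have a size of zero bytes."""
    return [fp for fp in paths if os.path.getsize(fp) == 0]


def check_contigs_not_empty(paths) -> None:
    """Raise a descriptive error if any contig file is empty."""
    empty = find_empty_fastas(paths)
    if empty:
        names = ", ".join(os.path.basename(fp) for fp in sorted(empty))
        raise ValueError(
            f"No contigs were found in the following files: {names}. "
            "Remove the affected samples before indexing."
        )


# Demonstration with one populated and one empty contig file.
with tempfile.TemporaryDirectory() as tmp:
    populated = os.path.join(tmp, "sampleA_contigs.fa")
    empty_fp = os.path.join(tmp, "control_1_contigs.fa")
    with open(populated, "w") as fh:
        fh.write(">contig1\nACGT\n")
    open(empty_fp, "w").close()
    try:
        check_contigs_not_empty([populated, empty_fp])
        message = ""
    except ValueError as err:
        message = str(err)

print(message)
```

The same `find_empty_fastas` helper could also back a filtering step, so that users would not need to rerun the assembly to drop the affected samples.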
At present, it looks like you tend to define default values implicitly, and it would be better to do this explicitly (i.e., when you define the function the action is mapped to).
I suspect that you are doing this so that the defaults are actually set by the underlying code if not overridden by the user, which makes sense intuitively but is not ideal for a few reasons. First, it could result in your help text becoming outdated and misleading (e.g., if you specify the default is 1, and the underlying code is changed so it's 16, the help text your user is referring to will be wrong). Next, and probably most importantly, if not specified explicitly the parameter values won't be stored in data provenance. And finally, if you define explicitly on function definition, the default value will autopopulate in the help text for your action.
As an example of how I recommend doing this, take a look at this snippet of the help text from dada2 denoise-single:
--p-n-threads INTEGER The number of threads to use for multithreaded
processing. If 0 is provided, all available cores
will be used. [default: 1]
That default value is specified here.
I recommend ultimately making this change across the whole code base.
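To illustrate the idea, here is a minimal sketch of a function with explicit defaults, plus a helper that reads those defaults back from the signature. Both `assemble_example` and `explicit_defaults` are hypothetical names, as are the `assembler` command and its flags; the sketch only shows why defaults declared in the signature are visible to tooling, while defaults left to the wrapped program are not.

```python
import inspect


def assemble_example(seqs: str, threads: int = 1, min_contig_len: int = 200) -> list:
    """Hypothetical action function with explicit, conservative defaults."""
    # Every value handed to the wrapped tool is spelled out here, so nothing
    # depends on the tool's own built-in defaults.
    return ["assembler", "--threads", str(threads),
            "--min-contig-len", str(min_contig_len), seqs]


def explicit_defaults(func) -> dict:
    """Collect the default values declared in a function signature."""
    return {
        name: param.default
        for name, param in inspect.signature(func).parameters.items()
        if param.default is not inspect.Parameter.empty
    }


print(explicit_defaults(assemble_example))
```

Because the values live in the signature, they can be shown in help text and recorded alongside a result without consulting the underlying tool.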
The CI can be simplified by moving coverage testing to the package-building step, as was done for q2-moshpit: bokulich-lab/q2-moshpit#12.
This is very minor, but might be worth addressing before an alpha release.
These two actions in assembly use different names for the same input type. It might be nice to pick one and use that for all actions in the plugin for consistency. This happens all over the place in the core distro, but it would be a breaking change to address it, so it hasn't been worth the trouble.
qiime assembly assemble-megahit --i-seqs demux.qza ...
qiime assembly map-reads-to-contigs --i-reads demux.qza ...