bokulich-lab / q2-assembly Goto Github PK

View Code? Open in Web Editor NEW

4.0 4.0 12.0 32.57 MB

QIIME 2 plugin for (meta)genome assembly.

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.12% Python 94.43% TeX 2.72% HTML 2.73%

q2-assembly's People

Contributors

Stargazers

Watchers

Forkers

misialq ebolyen gregcaporaso keegan-evans colinvwood nbokulich oddant1 lizgehret christosmatzoros sann5 mikerobeson dorielagrabocka

q2-assembly's Issues

outdated help text on evaluate-contigs

I didn't notice this during my review of #32, but I believe this text should now be updated:

$ qiime assembly evaluate-contigs --help
Usage: qiime assembly evaluate-contigs [OPTIONS]
...
  --p-threads INTEGER     Maximum number of parallel jobs. Default: 1.
    Range(1, None)        Currently disabled - only 1 CPU is supported.
                                                                  [default: 1]
...

I think that should indicate that it's disabled on platforms other than Linux, right?

Generate Reads unexpected inputs needed

Hello,

I tried to run qiime assembly generate-reads --output-dir test_data and was expecting data sampled genomes to be placed in a new dir test_data. Instead, I got

Plugin error from assembly:

  'NoneType' object is not iterable

It seems that ncbi, n-genomes-ncbi, and sample-names are needed for this to run.

Thanks!

invalid sample name assumption causes failure in `evaluate-contigs`

I am testing with samples that have underscores in their identifiers, and am running into a failure due to an assumption that only the text before the _ is the sample identifier. This looks to be traceable to this line.

Note that QIIME 2 does allow for underscores in identifiers (see the documentation here).

Is there a different way to get the sample ids from the filenames in this case? Based on a quick look at the files in the data artifact, it looks like you could change line 113 to:

return os.path.basename(fp.replace('_contigs.fa', ''))

I recommend adding a test of this function that includes a sample id with an underscore in it.

Implement contig read indexing action

We need an action supporting contig/MAG indexing using bowtie2.

Acceptance criteria:

uses SampleData[Contigs] or SampleData[MAG] as input
uses bowtie2-build to generate the indices for contig/MAG files
outputs SampleData[Bowtie2Index] or SampleData[MultiBowtie2Index]

Implement assemble-metaspades action

We need an action supporting metagenome assembly using the metaSPAdes assembler.

Acceptance criteria:

uses metaSPAdes assembler (link)
uses SampleData[*SequencesWithQuality*] as input
can handle single- and paired-end reads
outputs SampleData[Contigs] artifact

BUG: index-contigs collects only .fa files

Only .fa contig files are collected but .fasta files should also be collected:

q2-assembly/q2_assembly/bowtie2/indexing.py

Line 84 in 5afdeb6

contig_fps = sorted(glob.glob(os.path.join(str(contigs), "*_contigs.fa")))

See bokulich-lab/q2-moshpit#76 and the closing PR for a possible solution.

default values for number of CPU threads

In doing some experiments with q2-assembly, I noticed that the default number of CPUs is set a few different ways. For example:

qiime assembly assemble-megahit: --p-num-cpu-threads Number of CPU threads. Default: # of logical processors.
qiime assembly assemble-spades: --p-threads Number of threads. Default: 16.

In the core distribution QIIME 2 plugins, we tend to set these types of values with a default of 1, forcing the user to intentionally request more resources. This is because users will often just go with the default setting. If it's set to 1, they'll notice that it's too slow and increase the value. If it's set high by default though, it becomes easy for users to not notice and overload a system. For example, if they request a single CPU on their cluster, and then 16 subprocesses spin up, that can overload the cluster node and get them in trouble with the sys admin (and potentially give QIIME 2 a bad reputation with the sys admin if it happens regularly).

I recommend always setting the defaults for these parameters to 1, and letting the user override them.

BUG: Download buttons for the `assembly evaluate-contigs` visualization not producing PDFs, etc..

The following command successfully runs and generates a QZV.

qiime assembly evaluate-contigs \
  --i-contigs megahit-contigs.qza \
  --p-min-contig 1000 \
  --p-threads 56 \
  --o-visualization megahit-contigs.qzv \
  --verbose

However, clicking on any of the "Downloads" buttons does not invoke a "download" of the respective PDFs & TSVs. Howewver, I can confirm that these files do exist within the extracted QZV under the quast_data/ and quast_data/basic_stats/ folder, using the command:

qiime tools extract \
--input-path megahit-contigs.qzv \
--output-path megahit-contigs-extract

Tested within the Chrome and Safari browsers. Code ran within the qiime2-shotgun-2023.9 environment.

Cannot download some files from the `evaluate-contigs` qzv

When I open the attached visualization, I am unable to download any attached images.

Steps to reproduce:

run qiime assembly evaluate-contigs with the default parameters (without providing reads as input) and the contigs.qza artifact provided in the attached files
open the visualization
on the "QC report" tab:
- try to download the "GC content plot" for one sample
- try to download the full report (green button)
- try to download any of the other summary plots (next to the green button)

Expected behaviour:
Plots get downloaded.

Actual behaviour:
An error message is displayed:

Attached files: https://polybox.ethz.ch/index.php/s/0MwYFnTJS5M1k8p/download

`evaluate-contigs` fails on the most recent QUAST version

When running the evaluate-contigs action in an environment with the most recent version of QUAST installed (5.2.0), it is impossible to generate the QC visualization due to the following error:
ValueError: invalid literal for int() with base 10: 'START_A'.

This seems to be caused by ablab/quast#230 (and fixed by ablab/quast#244). Unfortunately, the previous conda-installable version of QUAST (5.0.2) is not compatible with our environment (it needs Python<3.7) so until a new, fixed version is released, the only solution would be to pip install QUAST directly.

update help text for `assemble-spades --p-meta`

When running the command within the conda environment qiime2-shotgun-2023.9:

qiime assembly assemble-spades \
    --i-seqs fondue-output/single_reads.qza \
    --p-threads 8   \
    --p-phred-offset 33 \
    --p-memory 60 \
    --p-meta \
    --o-contigs contigs.qza  \
    --verbose

The following error was returned:

Plugin error from assembly:
SPAdes v3.15.2 in "meta" mode supports only paired-end reads.
See above for debug info.

Update the help text to reflect this.

Implement co-assembly option

As a plugin user,
I want to be able to co-assemble genomes across all the samples rather than per-sample
so that I can use all the available sequence information.

Tasks:

Implement co-assembly for MEGAHIT
Implement co-assembly for metaSPAdes

Enable multiprocessing in `evaluate-contigs`

Multiprocessing was temporarily disabled here:

q2-assembly/q2_assembly/quast/quast.py

Lines 57 to 63 in a562844

 elif arg_key == "threads" and (not arg_val or arg_val > 1): 

 # TODO: this needs to be fixed (to allow multiprocessing) 

 print( 

 "Multiprocessing is currently not supported. Resetting " 

 "number of threads to 1." 

 ) 

 return [_construct_param(arg_key), "1"]

as it was not working before but should be possible now - that branch can be removed.

Implement read mapping action

We need an action supporting mapping reads to indexed contigs/MAGs using bowtie2.

Acceptance criteria:

uses SampleData[*SequencesWithQuality] and SampleData[MultiBowtie2Index | Bowtie2Index]
uses bowtie2 to align reads
outputs SampleData[AlignmentMaps | AlignmentMap]

Notes:

perhaps this needs to be split into two actions: one for single- and one for paired-end reads...? (perhaps not)

BUG: action `generate-reads` is missing the `pysam` dependency

Describe the bug
When I try to run the generate-reads action, I get an error stating that the pysam module is missing.

To Reproduce
Steps to reproduce the behavior:

Create and activate the environment:

mamba env create -n q2-shotgun --file https://data.qiime2.org/distro/shotgun/qiime2-shotgun-2023.9-py38-linux-conda.yml
conda activate q2-shotgun
qiime dev refresh-cache

Execute the command

qiime assembly generate-reads \
    --i-genomes genomes.qza \
    --p-sample-names sample1 sample2 sample3 sample4 \
    --p-n-reads 2000000 \
    --p-abundance uniform \
    --p-n-genomes 5 \
    --p-cpus 10 \
    --output-dir reads \
    --verbose

where genomes.qza is any FeatureData[Sequence] artifact.

See error:

Template genome sequences were provided - "n-genomes-ncbi" and "ncbi" parameters will be ignored.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: iss generate --compress --genomes /scratch/mziemski/tmp/qiime2/mziemski/data/0619c364-72ba-492a-9d8f-7043e825bf15/data/dna-sequences.fasta --n_genomes 5 --abundance uniform --n_reads 2000000 --mode kde --model HiSeq --cpus 10 --output /scratch/mziemski/tmp/tmpowd79ql2/sample1_00_L001

Traceback (most recent call last):
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/bin/iss", line 6, in <module>
    from iss.app import main
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/app.py", line 4, in <module>
    from iss import bam
  File "/home/mziemski/miniconda3/envs/q2-shotgun-tutorial/lib/python3.8/site-packages/iss/bam.py", line 14, in <module>
    import pysam
ModuleNotFoundError: No module named 'pysam'

Expected behavior
The action runs without errors.

Please complete the following information:

OS: CentOS
QIIME 2 version: 2023.9

Additional context
This happened on both, Linux and macOS. I used the 2023.9 distro - for some reason the pysam package is not included there. Before, when I used to install q2-assembly "directly" from our conda channel, it seemed to work all fine. @lizgehret, @ebolyen would you have an idea why that could be happening? 🤔

Implement assembly QC visualisation

We need a QIIME 2 visualisation displaying assembly quality control results.

Acceptance criteria:

uses metaQUAST (link)
uses SampleData[Contigs] as input
wraps the output report and the Icarus browser as a qzv visualization

`evaluate-contigs` fails with `OSError: [Errno 28] No space left on device`

On two separate machines (both HPC clusters), I've had evaluate-contigs fail due to available disk space. I can see from the log that it downloads a lot of files. @misialq, have you run into this? Any ideas on how to address this? I haven't actually got this command to complete yet, after testing on two different studies (and two different systems, as I mentioned). Let me know if you'd like the error log - I can send that by email.

expose additional visualizations generated by QUAST in the `evaluate-contigs` visualization

It seems that QUAST generates many more visualizations than the ones that are currently displayed in the visualization produced by evaluate-contigs (most, if not all, of them are generated based on alignments to reference sequences, either provided by the user - not yet supported, see #35 - or fetched by QUAST automatically). Some of the interesting ones include:

Krona plots per sample
more Icarus reports (per reference sequence - show coverage per reference genome in the contig browser)
reports for not aligned sequences

Update README

Update the following information:

add new conda installation instructions
add dev section (hooks etc.)
add a new section describing functionality

Implement assemble-megahit action

We need an action supporting metagenome assembly using the MEGAHIT assembler.

Acceptance criteria:

uses MEGAHIT assembler (link)
uses SampleData[*SequencesWithQuality*] as input
can handle single- and paired-end reads
outputs SampleData[Contigs] artifact

Expose more QUAST parameters

As a user,
I want more QUAST params to be exposed in the evaluate_contigs actions,
so that I can have more control over how the tool is being run.

Notes
There are options like --memory-efficient but also a couple of other ones which have to do with aligning contigs to references.

BUG: `assembly assemble-spades` auto-detection of phred-offset error

When running:

qiime assembly assemble-spades \
>     --i-seqs fondue-output/single_reads.qza \
>     --p-threads 8 \
>     --o-contigs contigs.qza \

within the conda env qiime2-shotgun-2023.9, the following error appears within the --verbose output:

Usage: spades.py [options] -o <output_dir>
spades.py: error: argument --phred-offset: invalid qvoffset value: 'auto-detect'
...

Update help text for `assemble-spades --p-threads --p-memory`

I kept running into memory issues with a test data set I am using. After reading the the Spades manual, for release 3.15.2 which qiime2-shotgun-2023.9 uses:

SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM.

I think my issue was not realizing the increased memory usage incurred when using multiple threads. I am in the process of validating this now.

If a user specifies 32 cores, they'll be using up ~ 16 GB of RAM for buffers. This is analogous to feature-classifier, in which more memory is used with increasing thread count. Conversely, the user may specify too little memory to get anything to run. For example, setting the maximum memory usage to 100 GB and using 16 threads, means much smaller buffers / RAM per thread.

Perhaps update the help text like so:

--p-threads: Number of threads. By default SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption. This can be further affected by the --p-memory option.
--p-memory: RAM limit for SPAdes in Gb (terminates if exceeded). If a smaller memory limit is set, SPAdes will use smaller buffers and thus less memory per --p-threads.

Is it easier for everyone to post these types of suggestions as an issue like this, or should I simply wait and compile a set of these suggestions and then and submit them as PR? I've not dived into the code yet, so I figured I'd recommend these simple fixes as I work through testing the tools. I'd imagine that these are easy enough to wrap into any other existing PRs.

`evaluate-contigs` should accept reference seqs

As a user,
I want the evaluate_contigs action to accept reference sequences and/or at least custom BLAST dbs
so that the 16S Silva reference does not need to be downloaded by QUAST on every execution.

Notes:
Maybe in the beginning we could just allow passing the pre-created Silva QZA that we can easily convert to a blast database to avoid those constant re-downloads.

BUG: `MultiMAGSequencesDirFmt` does not have a manifest in `index_mags`

Inside of index_mags here mags (of type MultiMAGSequencesDirFmt) does not have the expected manifest attached to it. E.g. adding mags.validate() in this function causes tests to fail.

It also doesn't seem like there's a way to generate a manifest in the way that we do e.g. for CasavaOneEightSingleLanePerSampleDirFmt in q2-types, from which some of the parent classes are borrowed as parent classes here.

BUG: `index-contigs` fails if any input files are empty

If no contigs are formed for any samples during assembly, and a SampleData[Contigs] with some .fa files of size zero is therefore passed as input to index-contigs, index-contigs fails with a fairly uninformative error message:

  An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

The --verbose output was more useful, but still only "warned" about an empty fasta file:

Input files DNA, FASTA:
  /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa
Warning: Empty fasta file: '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa'
Warning: All fasta inputs were empty
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: /home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/bin/bowtie2-build-s --wrapper basic-0 --bmaxdivn 4 --dcv 1024 --offrate 5 --ftabchars 10 --threads 40 /scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa /scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index
Traceback (most recent call last):
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 50, in _index_seqs
    run_command(cmd, verbose=True)
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/_utils.py", line 28, in run_command
    subprocess.run(cmd, check=True)
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bowtie2-build', '--bmaxdivn', '4', '--dcv', '1024', '--offrate', '5', '--ftabchars', '10', '--threads', '40', '/scratch/gcaporaso/temp/qiime2/gcaporaso/data/b1c35261-68ee-4d73-864e-80ca50a04069/data/NEC-EF_contigs.fa', '/scratch/gcaporaso/temp/q2-Bowtie2IndexDirFmt-35dkmvkk/NEC-EF/index']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/q2cli/commands.py", line 468, in __call__
    results = action(**arguments)
  File "<decorator-gen-736>", line 2, in index_contigs
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 274, in bound_callable
    outputs = self._callable_executor_(
  File "/home/gcaporaso/mambaforge/envs/q2dev-20235-shotgun/lib/python3.8/site-packages/qiime2/sdk/action.py", line 509, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 85, in index_contigs
    _index_seqs(contig_fps, str(result), common_args, "contigs")
  File "/home/gcaporaso/4-git-repos/qiime2/q2-assembly/q2_assembly/bowtie2/indexing.py", line 52, in _index_seqs
    raise Exception(
Exception: An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

Plugin error from assembly:

  An error was encountered while running Bowtie2, (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

I came across this because I had a couple of control samples which had very few (<10) demultiplexed sequences in my input to assemble-megahit, and these unsurprisingly didn't form any contigs. When I ran index-contigs I got the error.

I'm not sure what the best pathway forward is for this - at the very least we probably want a more informative error message, but we also might want a way to filter the SampleData[Contigs] so the user doesn't have to generate contigs again (which can take a while). I got around it this time by filtering my input to assemble-megahit to drop the two samples that were causing problems with qiime demux filter.

EDIT: I just hit this again, on a different data set. (Aug 21 2023)

specify default values in function definition

At present, it looks like you tend to define default values implicitly, and it would be better to do this explicitly (i.e., when you define the function the action is mapped to).

I suspect that you are doing this so that the defaults are actually set by the underlying code if not overridden by the user, which makes sense intuitively but is not ideal for a few reasons. First, it could result in your help text becoming outdated and misleading (e.g., if you specify the default is 1, and the underlying code is changed so it's 16, the help text your user is referring to will be wrong). Next, and probably most importantly, if not specified explicitly the parameter values won't be stored in data provenance. And finally, if you define explicitly on function definition, the default value will autopopulate in the help text for your action.

As an example of how I recommend doing this, take a look at this snippet of the help text from dada2 denoise-single:

  --p-n-threads INTEGER  The number of threads to use for multithreaded
                         processing. If 0 is provided, all available cores
                         will be used.                            [default: 1]

That default value is specified here.

I recommend ultimately making this change across the whole code base.

Clean up the CI

The CI can be simplified by moving coverage testing to the package-building step, as was done for q2-moshpit: bokulich-lab/q2-moshpit#12.

Update dependencies

inconsistent input naming

This is very minor, but might be worth addressing before an alpha release.

These two actions in assembly use different names for the same input type. It might be nice to pick one and use that for all actions in the plugin for consistency. This happens all over the place in the core distro, but it would be a breaking change to address it, so it hasn't been worth the trouble.

qiime assembly assemble-megahit --i-seqs demux.qza ...

qiime assembly map-reads-to-contigs --i-reads demux.qza ...

	elif arg_key == "threads" and (not arg_val or arg_val > 1):
	# TODO: this needs to be fixed (to allow multiprocessing)
	print(
	"Multiprocessing is currently not supported. Resetting "
	"number of threads to 1."
	)
	return [_construct_param(arg_key), "1"]

bokulich-lab / q2-assembly Goto Github PK

q2-assembly's People

Contributors

Stargazers

Watchers

Forkers

q2-assembly's Issues

Recommend Projects

Recommend Topics

Recommend Org