nf-core / pathogensurveillance

Surveillance of pathogens using population genomics and sequencing

Home Page: https://nf-co.re/pathogensurveillance

License: MIT License

HTML 1.13% Python 7.32% Nextflow 58.56% Groovy 20.35% Perl 7.16% R 3.43% TeX 1.97% CSS 0.08%

nextflow nf-core pipeline workflow

NOTE: THIS PROJECT IS UNDER DEVELOPMENT AND MAY NOT FUNCTION AS EXPECTED UNTIL THIS MESSAGE GOES AWAY

Introduction

nf-core/pathogensurveillance is a population genomic pipeline for pathogen diagnosis, variant detection, and biosurveillance. The pipeline accepts the paths to raw reads for one or more organisms and creates reports in the form of interactive HTML reports or PDF documents. Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by a more robust core genome phylogeny.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world data sets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

Quick Start

Install Nextflow (>=21.10.3)
Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (you can use Conda both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see docs).
Download the pipeline and test it on a minimal dataset with a single command:
```
nextflow run nf-core/pathogensurveillance -profile test,YOURPROFILE --outdir <OUTDIR>
```
Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.
- The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
- Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
- If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
- If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.

Start running your own analysis!

nextflow run nf-core/pathogensurveillance --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>

Documentation

The nf-core/pathogensurveillance pipeline comes with documentation about the pipeline usage, parameters and output.

Credits

nf-core/pathogensurveillance was originally written by Zachary S.L. Foster, Martha Sudermann, Nicholas C. Cauldron, Fernanda I. Bocardo, Hung Phan, Jeff H. Chang, Niklaus J. Grünwald.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #pathogensurveillance channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

pathogensurveillance's Issues

CUSTOM_DUMPSOFTWAREVERSIONS stops -resume from working

Not a big deal, since this happens at the end of the pipeline, but it makes MultiQC run every time, so it would be nice to have it work with -resume. This seems to be due to .collectFile making a new file each time, even if the contents are the same.

    CUSTOM_DUMPSOFTWAREVERSIONS (                                               
        ch_versions.unique().toSortedList().flatten().collectFile(name: 'collated_versions.yml')
    )
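
If the cause is as described (a freshly written file looks "new" to the cache even when its contents are identical), the general remedy is to make the collated text deterministic and to avoid rewriting it when nothing changed. The following is a language-neutral sketch in Python, not Nextflow code; the helper names `collate_versions` and `write_if_changed` are hypothetical and do not exist in the pipeline.

```python
from pathlib import Path


def collate_versions(snippets):
    """Join YAML snippets into one string that is independent of input order."""
    # De-duplicate and sort so repeated runs produce byte-identical text.
    unique = sorted({s.strip() for s in snippets if s.strip()})
    return "\n".join(unique) + "\n"


def write_if_changed(path, content):
    """Write content only when it differs from what is already on disk.

    Returns True if the file was (re)written, False if it was left alone.
    """
    path = Path(path)
    if path.exists() and path.read_text() == content:
        return False  # file and its modification time stay untouched
    path.write_text(content)
    return True
```

Whether this maps cleanly onto `.collectFile` depends on how Nextflow stages the collected file, so treat it as an illustration of the idea rather than a fix.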

Missing sample name error

After deleting the sample name of the first sample, the pipeline failed with this error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

The pipeline would benefit from a more descriptive error message.
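
As a rough idea of what a more descriptive message could look like, here is a minimal sketch that reports the line and column of each empty required value. It is not the pipeline's check_samplesheet.py; the function name `find_samplesheet_problems` is hypothetical, and treating `sample` and `fastq_1` as the required columns is an assumption made for the example.

```python
import csv
import io


def find_samplesheet_problems(text, required=("sample", "fastq_1")):
    """Return a list of human-readable problems found in a CSV sample sheet."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        return [f"Missing column header(s): {', '.join(missing)}"]

    problems = []
    for row in reader:
        # reader.line_num is the line of the input that was just read.
        line_no = reader.line_num
        for col in required:
            # Short rows yield None for absent cells, so guard against that.
            if not (row[col] or "").strip():
                problems.append(f"Line {line_no}: column '{col}' is empty")
    return problems


if __name__ == "__main__":
    sheet = "sample,fastq_1,fastq_2\n,a_1.fq.gz,a_2.fq.gz\n"
    for problem in find_samplesheet_problems(sheet):
        print(problem)
```

Collecting every problem before exiting lets a user fix the whole sheet in one pass instead of rerunning after each error.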

Add option to split samples within a group into multiple core gene phylogenies if diverse enough

This splitting could be done based on the families found in the sendsketch results.

It would require plotting multiple phylogenies in a single report, similar to how the vcf snp trees will need to be handled.
We could even combine the trees into a single tree, using taxonomic relationships for the root nodes and coloring those parts of the tree differently.
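
A minimal sketch of the splitting step, assuming each sample has already been assigned a family name from the sendsketch results. The function name `split_group_by_family`, the input mapping, and the `min_size` threshold of 3 are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def split_group_by_family(sample_families, min_size=3):
    """Split one sample group into per-family subsets for separate trees.

    sample_families maps sample ID -> family name.
    Returns (trees, leftover): trees maps family -> sorted sample IDs for
    families with at least min_size samples; leftover lists the samples
    from families too small to get a tree of their own.
    """
    by_family = defaultdict(list)
    for sample, family in sample_families.items():
        by_family[family].append(sample)

    trees, leftover = {}, []
    for family, samples in sorted(by_family.items()):
        if len(samples) >= min_size:
            trees[family] = sorted(samples)
        else:
            leftover.extend(samples)
    return trees, sorted(leftover)
```

How the leftover samples should be reported (a combined tree, a table, or omitted) is left open here, as it is in the proposal.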

check that core gene filter is reporting removed samples

weird error when downloading genomes?

curl: (3) URL using bad/illegal format or missing URL
ERROR: curl command failed ( Sun Oct 8 04:11:28 PM PDT 2023 ) with: 3
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?query_key=1&WebEnv=MCID_6523371be8af48640e2d92e7&retstart=0&retmax=1580&db=taxonomy&rettype=uilist&retmode=text&tool=edirect&edirect=16.2&edirect_os=Linux&email=changlab%40fileserver%20has%20address%2010.206.67.17
fileserver
ERROR: ELink failure
ERROR: Missing -db argument
ERROR: Missing -db argument

Test data does not match the input check requirements

Description of the bug

Hi @zachary-foster ,

I know you mentioned you were working on fixing the sample check script and converting it to an R script.

I was testing the pipeline to generate the record_message output for the report. I encounter the following issue:

nextflow.config.Manifest@3a7f7201
[3c/caa1f5] NOTE: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)` terminated with an error exit status (1) -- Execution is retried (1)
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      metadata_full.csv \
      samplesheet.valid.csv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [CRITICAL] The sample sheet **must** contain these column headers: fastq_1, sample, fastq_2.

Work dir:
  /nfs5/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/ca/00ab1aa2b9a3a2f8e72580320b41e7

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

Command used and terminal output

nextflow run nf-core/pathogensurveillance --input https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_small.csv --outdir test_out --download_bakta_db true -profile mamba -resume

Relevant files

No response

System information

CQLS

Circos error within bakta

While testing the pipeline with a larger dataset and more realistic files, one of the samples produced an assembly with a very large number of contigs. Based on oschwengers/bakta#174, this could cause problems for circos. I am testing whether adding the --skip-plot flag helps.

Oct-11 20:03:40.398 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA (H1_10_30-2_S99)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA (H1_10_30-2_S99)` terminated with an error exit status (1)

Command executed:

  bakta \
      H1_10_30-2_S99_filtered.fasta \
      --force \
      --threads 4 \
      --prefix H1_10_30-2_S99 \
       \
       \
      --db db-light
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA":
      bakta: $(echo $(bakta --version) 2>&1 | cut -f '2' -d ' ')
  END_VERSIONS

Command exit status:
  1

Command output:
        discarded spurious: 15
        detected IPSs: 0
        found PSCCs: 0
        lookup annotations...
        filter and combine annotations...
        filtered sORFs: 0
  detect gaps...
        found: 115
  detect oriCs/oriVs...
        found: 1
  detect oriTs...
        found: 0
  apply feature overlap filters...
  select features and create locus tags...
  selected: 14391
  improve annotations...
        revised gene symbols: 0
  
  genome statistics:
        Genome size: 13,369,953 bp
        Contigs/replicons: 20891
        GC: 64.2 %
        N50: 1,121
        N ratio: 0.1 %
        coding density: 50.6 %
  
  annotation summary:
        tRNAs: 106
        tmRNAs: 1
        rRNAs: 3
        ncRNAs: 27
        ncRNA regions: 22
        CRISPR arrays: 0
        CDSs: 14116
                hypotheticals: 9749
                pseudogenes: 0
                signal peptides: 0
        sORFs: 0
        gaps: 115
        oriCs/oriVs: 1
        oriTs: 0
  
  export annotation results to: /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e
        human readable TSV...
        GFF3...
        INSDC GenBank & EMBL...
        genome sequences...
        feature nucleotide sequences...
        translated CDS sequences...
        circular genome plot...

Command error:
        found: 1
  detect oriTs...
        found: 0
  apply feature overlap filters...
  select features and create locus tags...
  selected: 14391
  improve annotations...
        revised gene symbols: 0
  
  genome statistics:
        Genome size: 13,369,953 bp
        Contigs/replicons: 20891
        GC: 64.2 %
        N50: 1,121
        N ratio: 0.1 %
        coding density: 50.6 %
  
  annotation summary:
        tRNAs: 106
        tmRNAs: 1
        rRNAs: 3
        ncRNAs: 27
        ncRNA regions: 22
        CRISPR arrays: 0
        CDSs: 14116
                hypotheticals: 9749
                pseudogenes: 0
                signal peptides: 0
        sORFs: 0
        gaps: 115
        oriCs/oriVs: 1
        oriTs: 0
  
  export annotation results to: /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e
        human readable TSV...
        GFF3...
        INSDC GenBank & EMBL...
        genome sequences...
        feature nucleotide sequences...
        translated CDS sequences...
        circular genome plot...
  Traceback (most recent call last):
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/bin/bakta", line 10, in <module>
      sys.exit(main())
               ^^^^^^
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/lib/python3.11/site-packages/bakta/main.py", line 562, in main
      plot.write_plot(features, contigs, cfg.output_path)
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/lib/python3.11/site-packages/bakta/plot.py", line 259, in write_plot
      raise Exception(f'circos error! error code: {proc.returncode}')
  Exception: circos error! error code: 255

Work dir:
  /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Ensure that reference IDs are assigned to a single unique reference path

However, it is fine if multiple IDs are assigned to the same path
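
The rule can be stated as a small check: an ID may map to only one path, while a path may be shared by several IDs. This is a sketch under the assumption that the sample sheet has been reduced to (reference ID, reference path) pairs; the function name `find_ambiguous_reference_ids` is hypothetical.

```python
from collections import defaultdict


def find_ambiguous_reference_ids(pairs):
    """Return {reference_id: sorted paths} for IDs assigned to more than one path."""
    paths_by_id = defaultdict(set)
    for ref_id, path in pairs:
        paths_by_id[ref_id].add(path)
    # Several IDs sharing one path is allowed, so only the ID side is checked.
    return {
        ref_id: sorted(paths)
        for ref_id, paths in paths_by_id.items()
        if len(paths) > 1
    }
```

An empty result means the sheet satisfies the rule; otherwise the returned mapping can be turned directly into an error message.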

Ensure column order in input spreadsheet does not matter

Only the names of the columns should matter
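
One way to get this behavior is to read rows by header name rather than by position, for example with Python's `csv.DictReader`. The sketch below is illustrative only; `read_rows_by_name` is a hypothetical helper, not a function from the pipeline.

```python
import csv
import io


def read_rows_by_name(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    a = read_rows_by_name("sample,fastq_1,fastq_2\ns1,r1.fq,r2.fq\n")
    b = read_rows_by_name("fastq_2,sample,fastq_1\nr2.fq,s1,r1.fq\n")
    # Both sheets describe the same record despite the different column order.
    print(a == b)
```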

Missing fastq2 error

Deleted fastq2 (third column) of the first sample and got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND (22-324_T1)'

This is interesting because the pipeline still ran for quite a bit of time, completing the first 5 processes and a few others. I deleted the fastq2 of sample 22-299, but the one that resulted in the error was 22-324.

I ran this a second time to see if it was reproducible and got farther in the pipeline than I normally get even with no modification to the sample sheet input. This time I got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES (22-330_T1)'

When I ran it a third time I got the same error message, but with the sample 22-324_T1.

Remove code that adds "_T#" to sample IDs

Description of the bug

This is meant to force the sample IDs to be unique, but in our case we want to allow duplicated rows with different group/reference combinations. Also, we don't want the cache to think that two samples are different just because their sample IDs differ when their reads/reference do not.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Add more test datasets

- contamination (host and non-intentional)
- low read coverage
- diverse organisms

Add read subsampling step to each major subworkflow based on reference genome size

Description of feature

Some of the steps probably don't need to use all the reads, particularly the reference selection step. It would save time to subsample them, ideally based on predicted genome size/complexity. After FIND_ASSEMBLIES, we should be able to get some stats on genome size from assemblies of related organisms. If no close match is found (no genus-level match), we use all the reads. If we find some reference sequences, we can estimate the genome size of the sample from the mean assembly size of the closely related reference assemblies. The degree of subsampling should be controlled by a command line parameter.

This would speed up ASSIGN_REFERENCES:KHMER_TRIMLOWABUND
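
The arithmetic behind this proposal can be sketched as follows. The function name `subsample_fraction` and the `target_depth` parameter (standing in for the proposed command line parameter, with 30x as an arbitrary placeholder) are assumptions, not existing pipeline options.

```python
from statistics import mean


def subsample_fraction(total_read_bases, reference_sizes, target_depth=30.0):
    """Fraction of reads to keep so that coverage is about target_depth.

    total_read_bases: total number of sequenced bases for the sample.
    reference_sizes: assembly sizes (bp) of closely related references;
        an empty list means no close match was found.
    """
    if not reference_sizes:
        return 1.0  # no genus-level match: use all the reads
    estimated_genome_size = mean(reference_sizes)
    estimated_depth = total_read_bases / estimated_genome_size
    if estimated_depth <= target_depth:
        return 1.0  # already at or below the target, nothing to discard
    return target_depth / estimated_depth
```

For example, 600 Mbp of reads against references averaging 5 Mbp gives an estimated depth of 120x, so a 30x target keeps a quarter of the reads.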

Make sure variant calling for eukaryotes takes into account heterozygous variants

Break up long reads into 150mers to enable long read data for variant calling

A collaborator has tested a lot of methods for variant calling with long read data and determined that breaking up reads into 150mers and using graphtyper works the best of the methods tested. This would be an easy way to accept long reads into the variant calling portion of the pipeline.
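
A minimal sketch of the read-splitting step, assuming non-overlapping windows and that a trailing fragment shorter than 150 bases is dropped by default. Both choices are assumptions, since the collaborator's exact procedure is not described here, and `split_read` is a hypothetical helper.

```python
def split_read(seq, qual, size=150, keep_short_tail=False):
    """Cut one read into consecutive, non-overlapping chunks of `size` bases.

    Returns a list of (sequence, quality) tuples. The final chunk is
    dropped when it is shorter than `size`, unless keep_short_tail is True.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    chunks = []
    for start in range(0, len(seq), size):
        sub_seq = seq[start:start + size]
        if len(sub_seq) < size and not keep_short_tail:
            break  # only the last window can be short
        chunks.append((sub_seq, qual[start:start + size]))
    return chunks
```

Each chunk would also need a unique read name (for example the original name plus the chunk index) before being written back to FASTQ.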

Test that low quality/abundance reads don't break pipeline

Possible program to simulate bad reads:

https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm

Error picking assemblies

Description of the bug

ERROR ~ Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (13A4_3_S77_T1)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (13A4_3_S77_T1)` terminated with an error exit status (137)

Command executed:

  pick_assemblies.R families.txt genera.txt species.txt merged_assembly_stats.tsv 5 13A4_3_S77_T1.tsv

  tail -n +2 13A4_3_S77_T1.tsv | cut -f2 > 13A4_3_S77_T1_ids.txt

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES":
      r-base: $(echo $(R --version 2>&1) | sed 's/^.*R version //; s/ .*$//')
  END_VERSIONS

Command exit status:
  137

Command output:
  (empty)

Command error:
  .command.sh: line 2: 31360 Killed                  pick_assemblies.R families.txt genera.txt species.txt merged_assembly_stats.tsv 5 13A4_3_S77_T1.tsv

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/logs/work/48/ebb72ad3db2c128eb8379c34bd1a2e

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
-[nf-core/plantpathsurveil] Pipeline completed with errors-
WARN: Killing running tasks (83)
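
A note on reading the exit status above: by shell convention a status above 128 means the process was terminated by a signal, with the signal number being the status minus 128, so 137 corresponds to signal 9 (SIGKILL). A process being killed this way is often, though not always, the result of exceeding a memory limit. The small helper below is a hypothetical illustration of that decoding and is not part of the pipeline.

```python
import signal


def describe_exit_status(status):
    """Translate a shell exit status into a short description."""
    if status == 0:
        return "success"
    if status == 127:
        return "command not found"
    if 128 < status < 256:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a named signal on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```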

Command used and terminal output

/local/cluster/bin/nextflow run /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/main.nf -w /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/logs/work --input /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/data/metadata/metadata.csv --outdir /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/results -profile conda -resume

Relevant files

nextflow.log
execution_trace_2023-08-07_07-31-19.txt

System information

CQLS

Glitch at the download genome step

Description of the bug

This may be nothing at all, but it may be important to mention in the documentation that this process may require a restart.
At the download genome assemblies step, I got this error:

ERROR ~ Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCA_016864655.1)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCA_016864655.1)` terminated with an error exit status (2)

Command executed:

  # Download assemblies as zip archives
  datasets download genome accession GCA_016864655.1 --include gff3,rna,cds,protein,genome,seq-report --filename GCA_016864655.1.zip

  # Unzip
  unzip GCA_016864655.1.zip

  # Rename files with assembly name
  if [ -f ncbi_dataset/data/GCA_016864655.1/genomic.gff ]; then
      mv ncbi_dataset/data/GCA_016864655.1/genomic.gff ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1.gff
  fi
  if [ -f ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna ]; then
      mv ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_cds.fna
  fi
  if [ -f ncbi_dataset/data/GCA_016864655.1/protein.faa ]; then
      mv ncbi_dataset/data/GCA_016864655.1/protein.faa ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1.faa
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES":
      datasets: $(datasets --version | sed -e "s/datasets version: //")
  END_VERSIONS

Command exit status:
  2

Command output:
  Archive:  GCA_016864655.1.zip
    inflating: README.md
    inflating: ncbi_dataset/data/assembly_data_report.jsonl
    inflating: ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_ASM1686465v1_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/genomic.gff    inflating: ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/protein.faa
    inflating: ncbi_dataset/data/GCA_016864655.1/sequence_report.jsonl
    inflating: ncbi_dataset/data/dataset_catalog.json
  Collecting 1  records [================================================] 100% 1/1
  Downloading: GCA_016864655.1.zip    41MB done
  Archive:  GCA_016864655.1.zip
    inflating: README.md
    inflating: ncbi_dataset/data/assembly_data_report.jsonl
    inflating: ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_ASM1686465v1_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/genomic.gff
    error:  invalid compressed data to inflate
    inflating: ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/protein.faa
    inflating: ncbi_dataset/data/GCA_016864655.1/sequence_report.jsonl
    inflating: ncbi_dataset/data/dataset_catalog.json

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/00/7f5d8df6693888570725145aa13835

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

It could be a glitch in the download or unzip step. I ran it again (with -resume) and the download worked just fine.
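
If transient corruption like this turns out to be common, one option is to verify the archive before unpacking and fetch it again on failure. The sketch below uses Python's standard `zipfile` module; `archive_is_intact` and `fetch_with_retry` are hypothetical helpers, and `fetch` stands in for whatever performs the actual download (for example a call to the `datasets` tool).

```python
import zipfile
import zlib


def archive_is_intact(path):
    """Return True if every member of the zip archive passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError):
        return False


def fetch_with_retry(fetch, path, attempts=3):
    """Call fetch(path) until the archive at path is intact.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, attempts + 1):
        fetch(path)
        if archive_is_intact(path):
            return attempt
    raise RuntimeError(f"{path} still corrupt after {attempts} attempts")
```

Checking integrity explicitly makes the failure mode visible in the log, rather than surfacing later as an unzip error.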

Command used and terminal output

No response

Relevant files

No response

System information

No response

Including number of genes and SNPs used to build trees

Include the number of SNPs used to build the SNP tree and the number of genes used to build the core gene phylogeny.

Add test dataset with a complex data set

- mixture of organisms
- contamination
- low depth

rename the 1/2/etc folders to more descriptive names.

sourmash not found

Description of the bug

When I ran the test dataset, I got the error below.

Aug-09 21:12:34.834 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (22_331_assembly)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (22_331_assembly)` terminated with an error exit status (127)

Command executed:

  sourmash sketch \
      dna --param-string 'scaled=1000,k=21,k=31,k=51' \
      --merge '22_331_assembly' \
      --output '22_331_assembly.sig' \
      reference-22-331.fna

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME":
      sourmash: $(echo $(sourmash --version 2>&1) | sed 's/^sourmash //' )
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 2: sourmash: command not found

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/a2/480383e27aa8caca0e987eaf408c31

If I add the path to the sourmash executable installed in my conda env, it works with no problem. So this may be an issue with my conda installation or with the way the pipeline calls sourmash through conda.

Command used and terminal output

nextflow run main.nf -profile test,mamba --outdir test -resume

Relevant files

No response

System information

No response

Add column to input to group samples

Description of feature

This would be used to structure how samples are displayed in reports and may also affect parts of the analysis. For example, each group would have its own phylogeny report.

ASSIGN_REFERENCES:SOURMASH_COMPARE error when inputting newly formatted test metadata file

Description of the bug

I was running some of the test datasets in preparation to input a high complexity dataset, and encountered an error with PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE

The data input was metadata_PRJNA523365_small.csv

ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE (1)'

Caused by:
Process PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE input file name collision -- There are multiple input files for each of the following file names: GCF_017189435_1.sig

The metadata file has some new columns compared to the other test datasets, and I wonder if this contributed.
Everything else leading up to this ran as expected.
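
The message says that two or more distinct input files share the name GCF_017189435_1.sig. One plausible reading is that the same assembly reached the comparison step by two routes, but that is a guess. The sketch below shows how such collisions could be detected before the files are staged together; `find_name_collisions` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Return {file name: sorted paths} for names shared by distinct paths."""
    by_name = defaultdict(set)
    for path in paths:
        by_name[PurePosixPath(path).name].add(str(path))
    return {
        name: sorted(group)
        for name, group in by_name.items()
        if len(group) > 1
    }


if __name__ == "__main__":
    staged = [
        "work/aa/111/GCF_017189435_1.sig",  # hypothetical path
        "work/bb/222/GCF_017189435_1.sig",  # hypothetical path
        "work/cc/333/SRR12574846.sig",      # hypothetical path
    ]
    print(find_name_collisions(staged))
```

De-duplicating by file name, or giving each signature a name unique to its origin, would both avoid the collision; which is appropriate depends on whether the two files are meant to be identical.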

Command used and terminal output

(nf-core) marthasudermann@pop-os:~/pathogensurveillance$ nextflow run main.nf --input 'https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_PRJNA523365_small.csv' --outdir test_out4 --bakta_db /home/marthasudermann/Software/bakta_db_02_2024/db/ -profile docker -resume
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [confident_engelbart] DSL2 - revision: cc83aa0c27


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/plantpathsurveil v1.0dev
------------------------------------------------------
Core Nextflow options
  runName        : confident_engelbart
  containerEngine: docker
  launchDir      : /home/marthasudermann/pathogensurveillance
  workDir        : /home/marthasudermann/pathogensurveillance/work
  projectDir     : /home/marthasudermann/pathogensurveillance
  userName       : marthasudermann
  profile        : docker
  configFiles    : /home/marthasudermann/pathogensurveillance/nextflow.config

Input/output options
  input          : https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_PRJNA523365_small.csv
  outdir         : test_out4
  bakta_db       : /home/marthasudermann/Software/bakta_db_02_2024/db/

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/plantpathsurveil for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/plantpathsurveil/blob/master/CITATIONS.md
------------------------------------------------------
[-        ] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK                                    -
[-        ] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP                                             -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES                                              -
[-        ] process > PATHOGENSURVEILLANCE:FASTQC                                                           -
[-        ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH                          -
[-        ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:INITIAL_CLASSIFICATION                    -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:FIND_ASSEMBLIES                              -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES                              -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES                          -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:MAKE_GFF_WITH_FASTA                          -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:SOURMASH_SKETCH_GENOME                       -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SUBSET_READS                                   -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND                             -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_READS                          -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME                         -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE                               -
[f5/6125a3] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_PRJNA523365_small.csv)   [100%] 1 of 1, cached: 1 ✔
[f5/6125a3] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_PRJNA523365_small.csv)   [100%] 1 of 1, cached: 1 ✔
[21/b6b96f] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP (SRR12574846)                               [100%] 3 of 3, cached: 3 ✔
[42/f6d4e0] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES (GCF_017189435_1)                            [100%] 1 of 1, cached: 1 ✔
[63/a974dd] process > PATHOGENSURVEILLANCE:FASTQC (SRR12574847)                                             [100%] 3 of 3, cached: 3 ✔
[c2/8679ec] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH (SRR12574848)            [100%] 3 of 3, cached: 3 ✔
[a4/18385c] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:INITIAL_CLASSIFICATION (SRR12574846)      [100%] 3 of 3, cached: 3 ✔
[24/1a3d6e] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:FIND_ASSEMBLIES (Mycobacteriaceae)           [100%] 1 of 1, cached: 1 ✔
[6a/673614] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (SRR12574846)                [100%] 3 of 3, cached: 3 ✔
[1d/f003e6] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCF_001677215_1)        [100%] 9 of 9, cached: 9 ✔
[65/b881e2] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:MAKE_GFF_WITH_FASTA (GCF_001456355_1)        [100%] 9 of 9, cached: 9 ✔
[db/0526c0] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:SOURMASH_SKETCH_GENOME (GCF_001456355_1)     [100%] 9 of 9, cached: 9 ✔
[6f/fce62b] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SUBSET_READS (SRR12574846)                     [100%] 3 of 3, cached: 3 ✔
[cc/a2f30c] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND (SRR12574847)               [100%] 3 of 3, cached: 3 ✔
[9e/09a4cb] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_READS (SRR12574847)            [100%] 3 of 3, cached: 3 ✔
[d3/a6b247] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (GCF_017189435_1)       [100%] 3 of 3, cached: 3 ✔
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE                               -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:ASSIGN_GROUP_REFERENCES                        -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:SAMTOOLS_FAIDX                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:BWA_INDEX                       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:CALCULATE_DEPTH                     -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SUBSET_READS                        -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM                             -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_ADDORREPLACEREADGROUPS       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_1                    -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_MARKDUPLICATES               -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_2                    -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SAMTOOLS_INDEX                      -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:MAKE_REGION_FILE                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_GENOTYPE               -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_VCFCONCATENATE         -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:TABIX_TABIX                       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:BGZIP_MAKE_GZIP                   -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GATK4_VARIANTFILTRATION           -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:VCFLIB_VCFFILTER                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_TAB                                      -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_SNPALN                                   -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:IQTREE2_SNP                                     -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SUBSET_READS                                     -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FASTP                                            -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES                                           -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FILTER_ASSEMBLY                                  -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:QUAST                                            -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA                                      -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:PIRATE                                     -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:REFORMAT_PIRATE_RESULTS                    -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:ALIGN_FEATURE_SEQUENCES                    -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:RENAME_CORE_GENE_HEADERS                   -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:SUBSET_CORE_GENES                          -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:MAFFT_SMALL                                -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:IQTREE2_CORE                               -
[-        ] process > PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS                                      -
[-        ] process > PATHOGENSURVEILLANCE:MULTIQC                                                          -
[-        ] process > PATHOGENSURVEILLANCE:RECORD_MESSAGES                                                  -
[-        ] process > PATHOGENSURVEILLANCE:MAIN_REPORT                                                      -
ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE (1)'

Caused by:
  Process `PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE` input file name collision -- There are multiple input files for each of the following file names: GCF_017189435_1.sig


Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Relevant files

nextflow.log

System information

Nextflow 23.10.1.5891
Desktop
local
Docker
Linux

If there are no messages, then the report is not made

The channels need to be changed so that if the messages channel is empty the report still runs.

Not all subworkflows are collecting version info

All the version info needs to converge to the same CUSTOM_DUMPSOFTWAREVERSIONS process at the end of the pipeline.

fastq1 and fastq2 file type error

I deleted the .fastq.gz from the end of the fastq1 file name and then did the same for the fastq2. The error message that is printed is:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

There's a more descriptive message farther below that says:

Command error:
[CRITICAL] The FASTQ file has an unrecognized extension: test/data/reads/22-301_R2
It should be one of: .fq.gz, .fastq.gz On line 3.

This might be sufficient because it is fairly self-explanatory, but it could be overwhelming for people who do not know how to read error messages. It is up to you whether to make it a simpler message.
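A simpler message could be sketched as follows. This is only an illustration, not the code in bin/check_samplesheet.py: the function name `describe_fastq_problem` and the wording of the message are hypothetical, and only the two accepted extensions come from the error output above.

```python
from typing import Optional

# Accepted extensions, taken from the error message quoted above.
VALID_EXTENSIONS = (".fq.gz", ".fastq.gz")


def describe_fastq_problem(path: str, line_number: int) -> Optional[str]:
    """Return a plain-language message if the path lacks a valid extension, else None.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    if path.endswith(VALID_EXTENSIONS):
        return None
    return (
        f"Line {line_number} of the sample sheet: the read file '{path}' "
        f"must end in one of: {', '.join(VALID_EXTENSIONS)}. "
        "Check that the file name was not shortened by mistake."
    )


print(describe_fastq_problem("test/data/reads/22-301_R2", 3))
```

Naming the sample sheet line first and suggesting a likely cause may be easier to act on than the `[CRITICAL]` line buried in the Nextflow output.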

GENOME_ASSEMBLY waits on ASSIGN_REFERENCES because of QUAST

Right now genome assembly does not start until references are chosen, even though references are not needed for SPAdes. We could move QUAST out of GENOME_ASSEMBLY or redo how the channels are connected inside GENOME_ASSEMBLY.

DOWNLOAD_ASSEMBLIES error

ERROR

When running test dataset, I encountered the following error message:

Oct-01 20:44:22.606 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (); work-dir=/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/61/f103677a0f4221ac1d21792a1828ff
  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES ()` terminated with an error exit status (1)
Oct-01 20:44:22.606 [Task monitor] INFO  nextflow.processor.TaskProcessor - [61/f10367] NOTE: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES ()` terminated with an error exit status (1) -- Error is ignore
Oct-01 20:44:22.620 [Actor Thread 121] ERROR nextflow.extension.OperatorImpl - @unknown
java.lang.NullPointerException: Cannot execute null+[/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/7b/d1bb304e828bb1dca4e122b05b0b4a/22_331_assembly.sig]
        at org.codehaus.groovy.runtime.NullObject.plus(NullObject.java:135)
        at org.codehaus.groovy.runtime.NullObject$plus$0.call(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
        at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.call(PogoMetaMethodSite.java:74)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
        at Script_abdb686a$_runScript_closure1$_closure2$_closure13.doCall(Script_abdb686a:78)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
        at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
        at nextflow.extension.MapOp$_apply_closure1.doCall(MapOp.groovy:56)
        at jdk.internal.reflect.GeneratedMethodAccessor216.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
        at groovy.lang.Closure.call(Closure.java:412)
        at groovyx.gpars.dataflow.operator.DataflowOperatorActor.startTask(DataflowOperatorActor.java:120)
        at groovyx.gpars.dataflow.operator.DataflowOperatorActor.onMessage(DataflowOperatorActor.java:108)
        at groovyx.gpars.actor.impl.SDAClosure$1.call(SDAClosure.java:43)
        at groovyx.gpars.actor.AbstractLoopingActor.runEnhancedWithoutRepliesOnMessages(AbstractLoopingActor.java:293)
        at groovyx.gpars.actor.AbstractLoopingActor.access$400(AbstractLoopingActor.java:30)
        at groovyx.gpars.actor.AbstractLoopingActor$1.handleMessage(AbstractLoopingActor.java:93)
        at groovyx.gpars.util.AsyncMessagingCore.run(AsyncMessagingCore.java:132)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Oct-01 20:44:22.632 [Actor Thread 121] DEBUG nextflow.Session - Session aborted -- Cause: Cannot execute null+[/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/7b/d1bb304e828bb1dca4e122b05b0b4a/22_331_assembly.sig]

Map with locations of same taxa found in input

colored by taxon and whether they are inputs

Optional columns with commas cause incorrect CSV parsing

This also applies to TSV input oddly.
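For context, the snippet below shows how a quoted field containing a comma behaves under naive splitting versus Python's standard `csv` module. It is a generic illustration and makes no claim about how the pipeline currently parses its input; the `note` column and its value are invented.

```python
import csv
import io

# Hypothetical sample sheet; "note" is an invented optional column.
text = 'sample,fastq_1,note\ns1,reads/s1_R1.fastq.gz,"collected in Salem, Oregon"\n'

# Naive splitting breaks the quoted field into two pieces.
naive = text.splitlines()[1].split(",")

# The csv module honours the quotes and keeps the field whole.
parsed = list(csv.reader(io.StringIO(text)))[1]

print(len(naive), naive)
print(len(parsed), parsed)
```

If the parser splits on the delimiter without honouring quotes, any optional free-text column can shift the remaining columns, which would match the symptom described here.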

Consider using GBIF/refseq occurrences to make a map of species occurrence for species found

Add way for users to specify API keys for NCBI

This is needed to keep reference downloads from failing due to rate limits.

picard error - can't find the command

-[nf-core/plantpathsurveil] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY (GCF_000261805.1)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY (GCF_000261805.1)` terminated with an error exit status (127)

Command executed:

  picard \
      -Xmx8g \
      CreateSequenceDictionary  \
       \
      --REFERENCE GCF_000261805.1_25021_genomic.fna \
      --OUTPUT GCF_000261805.1_25021_genomic.dict
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY":
      picard: $(picard CreateSequenceDictionary --version 2>&1 | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 3: picard: command not found

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/6b/d9f7ba5b69255b5f7d8600d80c2e52

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Add check to make sure that sample IDs are unique

Description of feature

No response
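One possible shape for such a check, as a minimal sketch: the helper name `find_duplicate_ids` is hypothetical, and where the check would live in the pipeline is left open.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(sample_ids: Iterable[str]) -> List[str]:
    """Return every sample ID that occurs more than once, sorted for stable output."""
    counts = Counter(sample_ids)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)


duplicates = find_duplicate_ids(["22-301", "22-302", "22-301"])
if duplicates:
    print("Sample IDs must be unique; repeated IDs: " + ", ".join(duplicates))
```

Reporting all repeated IDs at once saves the user from fixing them one run at a time.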

Figure out good default thresholds for filtering sendsketch results

These are used to determine how good a hit needs to be for its taxonomy to be considered useful at a given level. We need thresholds for at least the output columns "ANI" and "Complt", if not others as well, for each of the following taxonomic levels: species, genus, family.
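To make the intended use concrete, here is a sketch of how per-level thresholds might be applied once they are chosen. The numbers are placeholders invented for this example, not proposed defaults, and the function name `finest_supported_level` is hypothetical. Only the column names "ANI" and "Complt" and the three levels come from the description above.

```python
from typing import Dict, Optional

# PLACEHOLDER values only: picking real defaults is the open question of this issue.
PLACEHOLDER_THRESHOLDS: Dict[str, Dict[str, float]] = {
    "species": {"ANI": 95.0, "Complt": 80.0},
    "genus": {"ANI": 90.0, "Complt": 60.0},
    "family": {"ANI": 80.0, "Complt": 40.0},
}


def finest_supported_level(
    hit: Dict[str, float],
    thresholds: Dict[str, Dict[str, float]] = PLACEHOLDER_THRESHOLDS,
) -> Optional[str]:
    """Return the most specific level whose thresholds the hit meets, or None."""
    for level in ("species", "genus", "family"):
        limits = thresholds[level]
        if all(hit[column] >= minimum for column, minimum in limits.items()):
            return level
    return None


print(finest_supported_level({"ANI": 92.0, "Complt": 70.0}))
```

Keeping the thresholds in one table per level makes it straightforward to add further sendsketch columns later.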

Sample sheet does not appear to contain a header

Description of the bug

The error says I don't have a header in my CSV file, but I have the exact header from the tutorial.

sample,fastq_1,fastq_2
6_13_22,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_13_22_S36_L003_R1_001.fastq.gz,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_13_22_S36_L003_R2_001.fastq.gz
6_14_22,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_14_22_S37_L003_R1_001.fastq.gz,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_14_22_S37_L003_R2_001.fastq.gz

Command used and terminal output

nextflow run nf-core/pathogensurveillance -r dev -profile singularity --input pathogen_manifest.csv --outdir pathogen_results


Terminal Output:
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/plantpathsurveil] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (pathogen_manifest.csv)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (pathogen_manifest.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      pathogen_manifest.csv \
      samplesheet.valid.csv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  WARNING: Skipping mount /home/elysebarker/miniconda3/envs/nextflow/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [CRITICAL] The given sample sheet does not appear to contain a header.
  INFO:    Cleaning up image...

Work dir:
  /hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/work/d6/a0cdf03190be8bfe860c27bc6d4ebb

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

I am using nextflow version 23.10.1 on linux.
nf-core/plantpathsurveil v1.0dev

Quality filters for the core gene phylogeny

Three user-adjustable filters:

--min_core_genes: Minimum number of core genes (e.g. 10)
--min_core_samps: Minimum proportion of samples (e.g. 0.9)
--min_core_refs: Minimum proportion of references (e.g. 0.5)

Plan:

Iteratively remove the sample with the fewest genes shared with all other samples until --min_core_genes is reached or the --min_core_samps/--min_core_refs thresholds are reached.
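The plan can be sketched as a greedy loop. This is one reading of it, not the pipeline's implementation: the function names, the input layout (a mapping from genome ID to its set of gene names), and the removal rule are assumptions. In particular, "fewest genes shared with all others" is interpreted here as "the genome whose removal leaves the largest core gene set".

```python
from typing import Dict, Iterable, Set, Tuple


def core_genes(genes_by_id: Dict[str, Set[str]], ids: Iterable[str]) -> Set[str]:
    """Genes present in every listed genome (empty set if no genomes are listed)."""
    sets = [genes_by_id[i] for i in ids]
    return set.intersection(*sets) if sets else set()


def filter_genomes(
    genes_by_id: Dict[str, Set[str]],
    ref_ids: Iterable[str],
    min_core_genes: int = 10,
    min_core_samps: float = 0.9,
    min_core_refs: float = 0.5,
) -> Tuple[Set[str], Set[str]]:
    """Greedily drop genomes until enough core genes remain or no removal is allowed."""
    ref_ids = set(ref_ids)
    kept = set(genes_by_id)
    total_refs = len(kept & ref_ids)
    total_samps = len(kept) - total_refs

    def removable(genome: str) -> bool:
        # A genome may go only if its group stays at or above its minimum proportion.
        if genome in ref_ids:
            return (len(kept & ref_ids) - 1) / total_refs >= min_core_refs
        return (len(kept - ref_ids) - 1) / total_samps >= min_core_samps

    while len(core_genes(genes_by_id, kept)) < min_core_genes:
        candidates = [g for g in sorted(kept) if removable(g)]
        if not candidates:
            break  # thresholds reached before min_core_genes could be met
        # Assumed rule: remove the genome whose absence leaves the largest core.
        worst = max(candidates, key=lambda g: len(core_genes(genes_by_id, kept - {g})))
        kept.remove(worst)
    return kept, core_genes(genes_by_id, kept)
```

The loop can end without reaching --min_core_genes, so a caller would still need to check the size of the returned core set and report which genomes were dropped.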

Missing fastq1 error

Deleted fastq1 (second column) of the first sample and got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

The pipeline would benefit from a more descriptive error message. Also note that this is different from having a file path that doesn't exist (including one with special characters in it). That error does have a descriptive message.

Rmarkdown report outline

The report needs to be renderable at any stage while the pipeline is running, so that early results like the sendsketch matches can be reported while waiting on long-running processes like the core genome phylogeny.

The main report (one for each report_group) should be a mix of natural-language descriptions and interactive tables and graphics. Each section should answer one question a user might want to know, with whatever info is available at the time the report is rendered. Potential sections:

What are my samples?

Data available from fastest to slowest:

Sendsketch outputs (seconds to minutes)
Tree made from reference/sample sourmash ANI matrix (minutes, with #6)
Bacteria: Core genome phylogeny with references (hours)
Eukaryotes: Read2tree results with references (hours)

How do my samples relate to each other?

Geographic distribution?
Tree made from sample sourmash ANI matrix (minutes, with #6)
Variant calling analysis, minimum spanning network and SNP tree (hours)
Bacteria: Core genome phylogeny without references (hours)
Eukaryotes: Read2tree results without references (hours)

Check what happens when either the reference or reference_id columns are empty

... but the other is present.

If ID is missing, then the ID should be inferred from the reference file name.
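Inferring the ID from the file name could look like the sketch below. The helper name `infer_reference_id` and the exact stripping rule (drop a trailing `.gz`, then one more extension) are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def infer_reference_id(reference_path: str) -> str:
    """Derive an ID from a reference file name by dropping '.gz' and one extension."""
    name = PurePosixPath(reference_path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    stem, dot, _extension = name.rpartition(".")
    # Names without any extension are returned unchanged.
    return stem if dot else name


print(infer_reference_id("refs/GCF_000261805.1_25021_genomic.fna.gz"))
```

Only the last extension is removed, so accession versions such as the `.1` in the example name survive.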

Add ability to use SRA as sample input

This will let users easily include datasets offline for context and will aid in the creation of test datasets (#35)

Error in dump software versions

Description of the bug

When testing the small test dataset I got this error.

Aug-09 22:46:16.934 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:CUSTOM_DUMPSOFTWAREVERSION

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)

Command executed [/nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/./workflows/../modules/nf-core/custom/dumpsoftwareversions/templates/dumpsof

  #!/usr/bin/env python3
  
  import platform
  from textwrap import dedent
  
  import yaml
  
  
  def _make_versions_html(versions):
      html = [
          dedent(
              """\
              <style>
              #nf-core-versions tbody:nth-child(even) {
                  background-color: #f2f2f2;
              }
              </style>
              <table class="table" style="width:100%" id="nf-core-versions">
                  <thead>
                      <tr>
                          <th> Process Name </th>
                          <th> Software </th>
                          <th> Version  </th>
                      </tr>
                  </thead>
              """
          )
      ]
      for process, tmp_versions in sorted(versions.items()):
          html.append("<tbody>")
          for i, (tool, version) in enumerate(sorted(tmp_versions.items())):
              html.append(
                  dedent(
                      f"""\
                      <tr>
  for process, process_versions in versions_by_process.items():
      module = process.split(":")[-1]
      try:
          if versions_by_module[module] != process_versions:
              raise AssertionError(
                  "We assume that software versions are the same between all modules. "
                  "If you see this error-message it means you discovered an edge-case "
                  "and should open an issue in nf-core/tools. "
              )
      except KeyError:
          versions_by_module[module] = process_versions
  
  versions_by_module["Workflow"] = {
      "Nextflow": "23.04.2",
      "nf-core/plantpathsurveil": "1.0dev",
  }
  
  versions_mqc = {
      "id": "software_versions",
      "section_name": "nf-core/plantpathsurveil Software Versions",
      "section_href": "https://github.com/nf-core/plantpathsurveil",
      "plot_type": "html",
      "description": "are collected at run time from the software output.",
      "data": _make_versions_html(versions_by_module),
  }
  
  with open("software_versions.yml", "w") as f:
      yaml.dump(versions_by_module, f, default_flow_style=False)
  with open("software_versions_mqc.yml", "w") as f:
      yaml.dump(versions_mqc, f, default_flow_style=False)
  
  with open("versions.yml", "w") as f:
      yaml.dump(versions_this_module, f, default_flow_style=False)

Command exit status:
  1

Command output:
  (empty)

Command error:
    File ".command.sh", line 59
      versions_by_process = {**yaml.load(f, Loader=yaml.BaseLoader), **versions_by_process}
                                                                                          ^
  TabError: inconsistent use of tabs and spaces in indentation

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/27/8b6f3ed6a6cbe122d597022cb5c159

Command used and terminal output

No response

Relevant files

I ended up fixing the script by replacing the line that gave the error with:

versions_by_process = {**yaml.load(f, Loader=yaml.BaseLoader), **versions_by_process}

This actually worked, but I am not experienced with Python.
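For readers unfamiliar with the syntax, the replacement line merges two dictionaries by unpacking, with entries from the right-hand dictionary winning on duplicate keys. The small example below uses invented keys and values. The `|` operator does the same merge but only exists for dictionaries on Python 3.9 and later; whether the Python version played any role in this report is not known, since the error shown is a TabError.

```python
import sys

# Invented example data; not real version information.
loaded = {"tool_a": "1.0", "tool_b": "2.0"}
current = {"tool_b": "2.1"}

# Unpacking merge: works on older Python 3 releases, right-hand side wins.
merged_unpacking = {**loaded, **current}
print(merged_unpacking)

# The | operator gives the same result but requires Python 3.9 or later.
if sys.version_info >= (3, 9):
    merged_operator = loaded | current
    print(merged_operator == merged_unpacking)
```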

System information

No response

DUMPSOFTWAREVERSIONS! error

This is an error with dumping software versions:

Oct-09 14:30:09.344 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS; work-dir=/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/a3/5c02e88440b42fdba1e00c2c65b520
  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)
Oct-09 14:30:09.365 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)

Command executed [/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/./workflows/../modules/nf-core/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py]:

  #!/usr/bin/env python
  
  import platform
  from textwrap import dedent
  
  import yaml
  
  
  def _make_versions_html(versions):
      html = [
          dedent(
              """\
              <style>
              #nf-core-versions tbody:nth-child(even) {
                  background-color: #f2f2f2;
              }
              </style>
              <table class="table" style="width:100%" id="nf-core-versions">
                  <thead>
                      <tr>
                          <th> Process Name </th>
                          <th> Software </th>
                          <th> Version  </th>
                      </tr>
                  </thead>
              """
          )
      ]
      for process, tmp_versions in sorted(versions.items()):
          html.append("<tbody>")
          for i, (tool, version) in enumerate(sorted(tmp_versions.items())):
              html.append(
                  dedent(
                      f"""\
                      <tr>
                          <td><samp>{process if (i == 0) else ''}</samp></td>
                          <td><samp>{tool}</samp></td>
                          <td><samp>{version}</samp></td>
                      </tr>
                      """
                  )
              )
          html.append("</tbody>")
      html.append("</table>")
      return "\n".join(html)
  
  
  versions_this_module = {}
  versions_this_module["NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS"] = {
      "python": platform.python_version(),
      "yaml": yaml.__version__,
  }
  
  versions_by_process = versions_this_module
  
  print("1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.
  for path in "1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/ver
      with open(path) as f:
          versions_by_process = yaml.load(f, Loader=yaml.BaseLoader) | versions_by_process
  
  # aggregate versions by the module name (derived from fully-qualified process name)
  versions_by_module = {}
  for process, process_versions in versions_by_process.items():
      module = process.split(":")[-1]
      try:
          if versions_by_module[module] != process_versions:
              raise AssertionError(
                  "We assume that software versions are the same between all modules. "
                  "If you see this error-message it means you discovered an edge-case "
                  "and should open an issue in nf-core/tools. "
              )
      except KeyError:
          versions_by_module[module] = process_versions
  
  versions_by_module["Workflow"] = {
      "Nextflow": "23.04.4",
      "nf-core/plantpathsurveil": "1.0dev",
  }
  
  versions_mqc = {
      "id": "software_versions",
      "section_name": "nf-core/plantpathsurveil Software Versions",
      "section_href": "https://github.com/nf-core/plantpathsurveil",
      "plot_type": "html",
      "description": "are collected at run time from the software output.",
      "data": _make_versions_html(versions_by_module),
  }
  
  with open("software_versions.yml", "w") as f:
      yaml.dump(versions_by_module, f, default_flow_style=False)
  with open("software_versions_mqc.yml", "w") as f:
      yaml.dump(versions_mqc, f, default_flow_style=False)
  
  with open("versions.yml", "w") as f:
      yaml.dump(versions_this_module, f, default_flow_style=False)

Command exit status:
  1

Command output:
  1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.yml 30/

Command error:
  1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.yml 30/
  Traceback (most recent call last):
    File ".command.sh", line 58, in <module>
      with open(path) as f:
           ^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: '50/nfs1'

Work dir:
  /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/a3/5c02e88440b42fdba1e00c2c65b520

Ensure that there is an error when no report group is defined

If no intelligible error is given:

Tasks

Add to the required columns on line 194 of bin/check_samplesheet.py.
Modify bin/check_samplesheet.py to check for this column and give an error if it is empty.
Add a new check to validate_and_transform on line 63.
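The empty-column check could be sketched like this. It is not the code in bin/check_samplesheet.py: the function name `rows_missing_report_group` is hypothetical, and the column name `report_group` is assumed from the report outline elsewhere on this page.

```python
import csv
import io
from typing import Dict, Iterable, List


def rows_missing_report_group(
    rows: Iterable[Dict[str, str]], column: str = "report_group"
) -> List[int]:
    """Return sample sheet line numbers whose report group is absent or blank."""
    # Line 1 is the header, so data rows start at line 2.
    return [
        line_number
        for line_number, row in enumerate(rows, start=2)
        if not (row.get(column) or "").strip()
    ]


# Hypothetical sample sheet with one blank report group.
sheet = "sample,report_group\ns1,groupA\ns2,\n"
missing = rows_missing_report_group(csv.DictReader(io.StringIO(sheet)))
if missing:
    print(f"No report group defined on line(s): {missing}")
```

Because a missing column yields `None` for every row, the same check also covers the case where the column is absent altogether.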

If SPAdes or Bakta fails for a small number of samples, is there a way to proceed without these samples?

Description of the bug

I have encountered this twice with larger bacterial datasets. For whatever reason, whether the raw read quality is bad or SPAdes cannot produce a decent assembly, Bakta then fails (not surprisingly), and after a few retries the whole pipeline stops until I remove the samples in question.

Is there a way to proceed with the analysis even if SPAdes or Bakta fails for a small number of samples, making note of which samples couldn't go through the pipeline?

Command used and terminal output

No response

Relevant files

No response

System information

No response

Description of feature

This could either be put in the main workflow or repeated in each organism-specific workflow.
