nf-core/pathogensurveillance

NOTE: THIS PROJECT IS UNDER DEVELOPMENT AND MAY NOT FUNCTION AS EXPECTED UNTIL THIS MESSAGE GOES AWAY

Introduction

nf-core/pathogensurveillance is a population genomic pipeline for pathogen diagnosis, variant detection, and biosurveillance. The pipeline accepts the paths to raw reads for one or more organisms and produces interactive HTML reports or PDF documents. Significant features include the ability to analyze unidentified eukaryotic and prokaryotic samples, creation of reports for multiple user-defined groupings of samples, automated discovery and downloading of reference assemblies from NCBI RefSeq, and rapid initial identification based on k-mer sketches followed by a more robust core genome phylogeny.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world data sets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

Pipeline flowchart

Quick Start

  1. Install Nextflow (>=21.10.3)

  2. Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please only use it within pipelines as a last resort (see docs).

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run nf-core/pathogensurveillance -profile test,YOURPROFILE --outdir <OUTDIR>

    Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.

    • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
    • Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
    • If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
    • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
  4. Start running your own analysis!

    nextflow run nf-core/pathogensurveillance --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
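The cache settings mentioned in step 3 can be exported once in the shell before launching the pipeline. This is a minimal sketch: `NXF_SINGULARITY_CACHEDIR` and `NXF_CONDA_CACHEDIR` are the Nextflow variables named above, while `NXF_CACHE_ROOT` and the default directory are hypothetical names chosen for illustration.

```shell
# Hypothetical shared location; NXF_CACHE_ROOT is not a Nextflow variable,
# it only lets you override the illustrative default below.
cache_root="${NXF_CACHE_ROOT:-$HOME/.nextflow-cache}"

# Variables read by Nextflow, as described in the Quick Start notes.
export NXF_SINGULARITY_CACHEDIR="$cache_root/singularity"
export NXF_CONDA_CACHEDIR="$cache_root/conda"

# Make sure both directories exist before the first run.
mkdir -p "$NXF_SINGULARITY_CACHEDIR" "$NXF_CONDA_CACHEDIR"
```

Placing these lines in a shell startup file would make every later run reuse the same images and environments.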

Documentation

The nf-core/pathogensurveillance pipeline comes with documentation about the pipeline usage, parameters and output.

Credits

nf-core/pathogensurveillance was originally written by Zachary S.L. Foster, Martha Sudermann, Nicholas C. Cauldron, Fernanda I. Bocardo, Hung Phan, Jeff H. Chang, Niklaus J. Grünwald.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #pathogensurveillance channel (you can join with this invite).

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.


pathogensurveillance's Issues

CUSTOM_DUMPSOFTWAREVERSIONS stops -resume from working

Not a big deal since this happens at the end of the pipeline, but it makes MultiQC run each time, so it would be nice to have it work with -resume. This seems to be due to .collectFile creating a new file each time, even if the contents are the same.

    CUSTOM_DUMPSOFTWAREVERSIONS (                                               
        ch_versions.unique().toSortedList().flatten().collectFile(name: 'collated_versions.yml')
    )                                                                           
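One way to check this diagnosis is to compare the collated file from two runs: if the bytes match but -resume still reruns the downstream steps, the cache miss probably comes from the file's path or metadata rather than its content. A minimal sketch follows; the helper name `same_content` is illustrative and not part of the pipeline.

```shell
# Return 0 when two files have identical bytes, regardless of path or mtime.
# cksum reading from stdin prints only the checksum and byte count.
same_content() {
    [ "$(cksum < "$1")" = "$(cksum < "$2")" ]
}
```

For example, `same_content run1/collated_versions.yml run2/collated_versions.yml && echo identical`, where the two paths stand for copies of the file taken from the work directories of consecutive runs.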

Missing sample name error

Deleted sample name of the first sample and got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

The pipeline would benefit from a more descriptive error message.
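As a rough idea of what a more descriptive message could look like, the sketch below reports the line number of every row whose sample name is empty. It assumes a simple comma-separated sheet without quoted commas and a column literally named `sample`; the function name `check_sample_names` is hypothetical and this is not the pipeline's actual check script.

```shell
# Print one error per row with an empty "sample" value; exit status 1 if any.
check_sample_names() {
    awk -F',' '
        NR == 1 {
            # Locate the "sample" column by header name.
            for (i = 1; i <= NF; i++) if ($i == "sample") col = i
            if (!col) { print "ERROR: no \"sample\" column in header"; bad = 1; exit }
            next
        }
        $col ~ /^[[:space:]]*$/ {
            printf "ERROR: line %d: empty value in \"sample\" column\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_sample_names metadata_medium.csv` before launching the pipeline would then point at the offending row directly.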

weird error when downloading genomes?

curl: (3) URL using bad/illegal format or missing URL
ERROR: curl command failed ( Sun Oct 8 04:11:28 PM PDT 2023 ) with: 3
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?query_key=1&WebEnv=MCID_6523371be8af48640e2d92e7&retstart=0&retmax=1580&db=taxonomy&rettype=uilist&retmode=text&tool=edirect&edirect=16.2&edirect_os=Linux&email=changlab%40fileserver%20has%20address%2010.206.67.17
fileserver
ERROR: ELink failure
ERROR: Missing -db argument
ERROR: Missing -db argument

Test data does not match the input check requirements

Description of the bug

Hi @zachary-foster ,

I know you mentioned you were working on fixing the sample check script and converting it to an R script.

I was testing the pipeline to generate the record_message output for the report. I encountered the following issue:

nextflow.config.Manifest@3a7f7201
[3c/caa1f5] NOTE: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)` terminated with an error exit status (1) -- Execution is retried (1)
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_full.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      metadata_full.csv \
      samplesheet.valid.csv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [CRITICAL] The sample sheet **must** contain these column headers: fastq_1, sample, fastq_2.

Work dir:
  /nfs5/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/ca/00ab1aa2b9a3a2f8e72580320b41e7

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

Command used and terminal output

nextflow run nf-core/pathogensurveillance --input https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_small.csv --outdir test_out --download_bakta_db true -profile mamba -resume

Relevant files

No response

System information

CQLS
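To see quickly which test sheets fail this requirement, the header row can be compared against the column names listed in the error above. This is a sketch under the assumption of a plain comma-separated header; `check_required_headers` is a hypothetical helper, not the pipeline's check_samplesheet.py.

```shell
# Usage: check_required_headers FILE COLUMN...
# Reports every required column that is absent from the first line of FILE.
check_required_headers() {
    local file=$1; shift
    local header missing="" name
    # Drop a trailing carriage return in case the sheet has Windows line endings.
    header=$(head -n 1 "$file" | tr -d '\r')
    for name in "$@"; do
        case ",$header," in
            *",$name,"*) ;;                      # column present
            *) missing="$missing $name" ;;       # column absent
        esac
    done
    if [ -n "$missing" ]; then
        echo "ERROR: $file is missing required column(s):$missing"
        echo "       header found: $header"
        return 1
    fi
}
```

For example, `check_required_headers metadata_full.csv sample fastq_1 fastq_2`.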

Circos error within bakta

As part of testing the pipeline with a larger dataset and more realistic files, it seems that one of the samples produced an assembly with lots of contigs. Based on the issue here: oschwengers/bakta#174, this could cause circos to fail. I am testing whether adding the --skip-plot flag actually helps.
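If --skip-plot does help, one option would be to add it only for highly fragmented assemblies. The sketch below counts FASTA headers and prints the flag when the count exceeds a threshold; the function name and the default of 1000 contigs are assumptions for illustration, not values taken from bakta or the pipeline.

```shell
# Usage: bakta_plot_flag ASSEMBLY.fasta [MAX_CONTIGS]
# Prints "--skip-plot" when the assembly has more contigs than MAX_CONTIGS.
bakta_plot_flag() {
    local fasta=$1 max_contigs=${2:-1000}
    local n
    # grep -c exits non-zero on zero matches but still prints 0.
    n=$(grep -c '^>' "$fasta" || true)
    if [ "$n" -gt "$max_contigs" ]; then
        echo "--skip-plot"
    fi
}
```

The output could then be spliced into the command, for example `bakta sample.fasta $(bakta_plot_flag sample.fasta) --db db-light`.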

Oct-11 20:03:40.398 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA (H1_10_30-2_S99)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA (H1_10_30-2_S99)` terminated with an error exit status (1)

Command executed:

  bakta \
      H1_10_30-2_S99_filtered.fasta \
      --force \
      --threads 4 \
      --prefix H1_10_30-2_S99 \
       \
       \
      --db db-light
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA":
      bakta: $(echo $(bakta --version) 2>&1 | cut -f '2' -d ' ')
  END_VERSIONS

Command exit status:
  1

Command output:
        discarded spurious: 15
        detected IPSs: 0
        found PSCCs: 0
        lookup annotations...
        filter and combine annotations...
        filtered sORFs: 0
  detect gaps...
        found: 115
  detect oriCs/oriVs...
        found: 1
  detect oriTs...
        found: 0
  apply feature overlap filters...
  select features and create locus tags...
  selected: 14391
  improve annotations...
        revised gene symbols: 0
  
  genome statistics:
        Genome size: 13,369,953 bp
        Contigs/replicons: 20891
        GC: 64.2 %
        N50: 1,121
        N ratio: 0.1 %
        coding density: 50.6 %
  
  annotation summary:
        tRNAs: 106
        tmRNAs: 1
        rRNAs: 3
        ncRNAs: 27
        ncRNA regions: 22
        CRISPR arrays: 0
        CDSs: 14116
                hypotheticals: 9749
                pseudogenes: 0
                signal peptides: 0
        sORFs: 0
        gaps: 115
        oriCs/oriVs: 1
        oriTs: 0
  
  export annotation results to: /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e
        human readable TSV...
        GFF3...
        INSDC GenBank & EMBL...
        genome sequences...
        feature nucleotide sequences...
        translated CDS sequences...
        circular genome plot...

Command error:
        found: 1
  detect oriTs...
        found: 0
  apply feature overlap filters...
  select features and create locus tags...
  selected: 14391
  improve annotations...
        revised gene symbols: 0
  
  genome statistics:
        Genome size: 13,369,953 bp
        Contigs/replicons: 20891
        GC: 64.2 %
        N50: 1,121
        N ratio: 0.1 %
        coding density: 50.6 %
  
  annotation summary:
        tRNAs: 106
        tmRNAs: 1
        rRNAs: 3
        ncRNAs: 27
        ncRNA regions: 22
        CRISPR arrays: 0
        CDSs: 14116
                hypotheticals: 9749
                pseudogenes: 0
                signal peptides: 0
        sORFs: 0
        gaps: 115
        oriCs/oriVs: 1
        oriTs: 0
  
  export annotation results to: /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e
        human readable TSV...
        GFF3...
        INSDC GenBank & EMBL...
        genome sequences...
        feature nucleotide sequences...
        translated CDS sequences...
        circular genome plot...
  Traceback (most recent call last):
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/bin/bakta", line 10, in <module>
      sys.exit(main())
               ^^^^^^
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/lib/python3.11/site-packages/bakta/main.py", line 562, in main
      plot.write_plot(features, contigs, cfg.output_path)
    File "/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/conda/env-a426d411c02c15e446ce4ad992ac0c99/lib/python3.11/site-packages/bakta/plot.py", line 259, in write_plot
      raise Exception(f'circos error! error code: {proc.returncode}')
  Exception: circos error! error code: 255

Work dir:
  /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/c0/b4330e92b8805ce2508a106c25313e

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

Missing fastq2 error

Deleted fastq2 (third column) of the first sample and got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND (22-324_T1)'

This is interesting because the pipeline still ran for quite a bit of time, with the first 5 processes and a few others completing. I deleted the fastq2 of sample 22-299, but the sample that triggered the error was 22-324.

I ran this a second time to see if it was reproducible and got farther in the pipeline than I normally get even with no modification to the sample sheet input. This time I got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES (22-330_T1)'

When I ran it a third time I got the same error message, but with the sample 22-324_T1.

Remove code that adds "_T#" to sample IDs

Description of the bug

This is meant to force the sample IDs to be unique, but in our case we want to allow for duplicated rows with different group/reference combinations. Also, we don't want the cache to think that two samples are different just because their sample IDs differ when their reads/reference do not.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Add read subsampling step to each major subworkflow based on reference genome size

Description of feature

Some of the steps probably don't need to use all the reads, particularly the reference selection step. It would save time to subsample them, ideally based on predicted genome size/complexity. After FIND_ASSEMBLIES, we should be able to get some stats on genome size from assemblies of related organisms. If no close match is found (no genus-level match), we use all the reads. If we find some reference sequences, we can estimate the genome size of the sample from the mean assembly size of the closely related reference assemblies. The degree of subsampling should be controlled by a command line parameter.

This would speed up ASSIGN_REFERENCES:KHMER_TRIMLOWABUND
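The arithmetic behind this proposal can be sketched as follows: estimate the genome size as the mean of the related assembly sizes, then keep roughly depth × genome size / read length reads. The function names, and the idea that the command line parameter would be a target depth, are assumptions for illustration.

```shell
# Read one assembly size (bp) per line on stdin and print the integer mean.
mean_genome_size() {
    awk '{ sum += $1 } END { if (NR) printf "%d\n", sum / NR }'
}

# Usage: reads_for_depth GENOME_SIZE_BP TARGET_DEPTH READ_LENGTH_BP
# Prints the number of reads needed to reach the target depth (rounded up).
reads_for_depth() {
    local genome_size=$1 depth=$2 read_length=$3
    echo $(( (genome_size * depth + read_length - 1) / read_length ))
}
```

For example, `reads_for_depth "$(cut -f1 sizes.txt | mean_genome_size)" 30 150`, where sizes.txt is a hypothetical file of reference assembly sizes. When no genus-level match exists there is nothing to average, and all reads would be used as described above.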

Error picking assemblies

Description of the bug

ERROR ~ Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (13A4_3_S77_T1)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (13A4_3_S77_T1)` terminated with an error exit status (137)

Command executed:

  pick_assemblies.R families.txt genera.txt species.txt merged_assembly_stats.tsv 5 13A4_3_S77_T1.tsv

  tail -n +2 13A4_3_S77_T1.tsv | cut -f2 > 13A4_3_S77_T1_ids.txt

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES":
      r-base: $(echo $(R --version 2>&1) | sed 's/^.*R version //; s/ .*$//')
  END_VERSIONS

Command exit status:
  137

Command output:
  (empty)

Command error:
  .command.sh: line 2: 31360 Killed                  pick_assemblies.R families.txt genera.txt species.txt merged_assembly_stats.tsv 5 13A4_3_S77_T1.tsv

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/logs/work/48/ebb72ad3db2c128eb8379c34bd1a2e

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
-[nf-core/plantpathsurveil] Pipeline completed with errors-
WARN: Killing running tasks (83)
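An exit status above 128 conventionally means the process was terminated by a signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL). That is often, though not always, what happens when a job exceeds its memory limit. A small helper to decode such statuses, with `describe_exit_status` being an illustrative name:

```shell
# Usage: describe_exit_status STATUS
# Translates a shell exit status into a short human-readable description.
describe_exit_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local sig=$(( status - 128 ))
        # kill -l NUMBER prints the signal name, e.g. KILL for 9.
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}
```

If memory is indeed the cause, raising the memory request for PICK_ASSEMBLIES would be the first thing to try.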

Command used and terminal output

/local/cluster/bin/nextflow run /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/main.nf -w /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/logs/work --input /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/data/metadata/metadata.csv --outdir /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/results -profile conda -resume

Relevant files

nextflow.log
execution_trace_2023-08-07_07-31-19.txt

System information

CQLS

Glitch at the download genome step

Description of the bug

This may be nothing at all, but it may be important to mention in the documentation that this process may require a restart.
At the download genome assemblies step, I got this error:

ERROR ~ Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCA_016864655.1)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCA_016864655.1)` terminated with an error exit status (2)

Command executed:

  # Download assemblies as zip archives
  datasets download genome accession GCA_016864655.1 --include gff3,rna,cds,protein,genome,seq-report --filename GCA_016864655.1.zip

  # Unzip
  unzip GCA_016864655.1.zip

  # Rename files with assembly name
  if [ -f ncbi_dataset/data/GCA_016864655.1/genomic.gff ]; then
      mv ncbi_dataset/data/GCA_016864655.1/genomic.gff ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1.gff
  fi
  if [ -f ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna ]; then
      mv ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_cds.fna
  fi
  if [ -f ncbi_dataset/data/GCA_016864655.1/protein.faa ]; then
      mv ncbi_dataset/data/GCA_016864655.1/protein.faa ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1.faa
  fi

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES":
      datasets: $(datasets --version | sed -e "s/datasets version: //")
  END_VERSIONS

Command exit status:
  2

Command output:
  Archive:  GCA_016864655.1.zip
    inflating: README.md
    inflating: ncbi_dataset/data/assembly_data_report.jsonl
    inflating: ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_ASM1686465v1_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/genomic.gff    inflating: ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/protein.faa
    inflating: ncbi_dataset/data/GCA_016864655.1/sequence_report.jsonl
    inflating: ncbi_dataset/data/dataset_catalog.json
  Collecting 1  records [================================================] 100% 1/1
  Downloading: GCA_016864655.1.zip    41MB done
  Archive:  GCA_016864655.1.zip
    inflating: README.md
    inflating: ncbi_dataset/data/assembly_data_report.jsonl
    inflating: ncbi_dataset/data/GCA_016864655.1/GCA_016864655.1_ASM1686465v1_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/genomic.gff
    error:  invalid compressed data to inflate
    inflating: ncbi_dataset/data/GCA_016864655.1/cds_from_genomic.fna
    inflating: ncbi_dataset/data/GCA_016864655.1/protein.faa
    inflating: ncbi_dataset/data/GCA_016864655.1/sequence_report.jsonl
    inflating: ncbi_dataset/data/dataset_catalog.json

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/00/7f5d8df6693888570725145aa13835

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

It could be a glitch in the download or unzip step. I ran it again (-resume) and the download worked just fine.
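Since a rerun fixed it, a retry around the download plus an archive integrity test might make the step more robust. The sketch below is a generic retry wrapper; the name `retry` and the attempt count are illustrative, and Nextflow's own errorStrategy 'retry' and maxRetries directives would be an alternative at the process level.

```shell
# Usage: retry ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
    local attempts=$1; shift
    local i
    for i in $(seq 1 "$attempts"); do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
    done
    return 1
}

# Possible use inside the download step (unzip -t tests the archive):
#   retry 3 sh -c 'datasets download genome accession "$0" --filename "$0.zip" \
#       && unzip -tq "$0.zip"' GCA_016864655.1
```

Testing the archive before unpacking means a corrupt download triggers another attempt instead of a half-extracted directory.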

Command used and terminal output

No response

Relevant files

No response

System information

No response

sourmash not found

Description of the bug

When I ran the test dataset, I got the error below.

Aug-09 21:12:34.834 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (22_331_assembly)'

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (22_331_assembly)` terminated with an error exit status (127)

Command executed:

  sourmash sketch \
      dna --param-string 'scaled=1000,k=21,k=31,k=51' \
      --merge '22_331_assembly' \
      --output '22_331_assembly.sig' \
      reference-22-331.fna

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME":
      sourmash: $(echo $(sourmash --version 2>&1) | sed 's/^sourmash //' )
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 2: sourmash: command not found

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/a2/480383e27aa8caca0e987eaf408c31

If I add the path to the sourmash executable installed in my conda env, it works with no problem. So this may be an issue with my conda installation or with the way the pipeline calls the sourmash conda environment.
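Exit status 127 is the shell's "command not found" status, so the question is whether the tool is on PATH inside the task environment. A quick check that could be run from the process work directory is sketched below; `require_tool` is a hypothetical helper name.

```shell
# Usage: require_tool NAME
# Prints where NAME resolves, or an error showing the PATH that was searched.
require_tool() {
    local tool=$1
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "ERROR: '$tool' not found on PATH ($PATH)" >&2
        return 127
    fi
    echo "$tool -> $(command -v "$tool")"
}
```

Comparing the output of `require_tool sourmash` in the interactive shell and inside the activated task environment should show whether the conda environment was created without sourmash or simply not activated.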

Command used and terminal output

nextflow run main.nf -profile test,mamba --outdir test -resume

Relevant files

No response

System information

No response

Add column to input to group samples

Description of feature

This would be used to structure how samples are displayed in reports and may also affect parts of the analysis. For example, each group would have its own phylogeny report.
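To make the idea concrete, the sketch below lists the samples that would fall into each group, assuming the new column is called `group` and sits next to a `sample` column in a simple comma-separated sheet. Both the column name and the function name are assumptions, not the final design.

```shell
# Usage: samples_by_group SAMPLESHEET.csv
# Prints one line per group: "GROUP: sample1 sample2 ...", sorted by group.
samples_by_group() {
    awk -F',' '
        NR == 1 {
            # Find the two columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "sample") s = i
                if ($i == "group")  g = i
            }
            next
        }
        s && g {
            if ($g in groups) groups[$g] = groups[$g] " " $s
            else              groups[$g] = $s
        }
        END { for (k in groups) print k ": " groups[k] }
    ' "$1" | sort
}
```

Each output line would correspond to one report, for example one phylogeny per group.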

ASSIGN_REFERENCES:SOURMASH_COMPARE error when inputting newly formatted test metadata file

Description of the bug

I was running some of the test datasets in preparation to input a high complexity dataset, and encountered an error with PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE

The data input was metadata_PRJNA523365_small.csv

ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE (1)'

Caused by:
Process PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE input file name collision -- There are multiple input files for each of the following file names: GCF_017189435_1.sig

The metadata file has some new columns compared to the other test datasets, and I wonder if this contributed.
Everything else leading up to this ran as expected.
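Nextflow stages input files into the task directory by file name, so two different paths ending in the same name collide. A quick way to spot such duplicates in a list of signature paths is sketched here; `find_name_collisions` is an illustrative helper, not part of the pipeline.

```shell
# Usage: find_name_collisions PATH...
# Prints every base name that occurs more than once among the given paths.
find_name_collisions() {
    local path
    for path in "$@"; do
        basename "$path"
    done | sort | uniq -d
}
```

For example, `find_name_collisions work/*/*/*.sig` would list GCF_017189435_1.sig if it is produced by more than one task, which may be the case here since that assembly appears both as a user-supplied reference and as a possible downloaded one.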

Command used and terminal output

(nf-core) marthasudermann@pop-os:~/pathogensurveillance$ nextflow run main.nf --input 'https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_PRJNA523365_small.csv' --outdir test_out4 --bakta_db /home/marthasudermann/Software/bakta_db_02_2024/db/ -profile docker -resume
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [confident_engelbart] DSL2 - revision: cc83aa0c27


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/plantpathsurveil v1.0dev
------------------------------------------------------
Core Nextflow options
  runName        : confident_engelbart
  containerEngine: docker
  launchDir      : /home/marthasudermann/pathogensurveillance
  workDir        : /home/marthasudermann/pathogensurveillance/work
  projectDir     : /home/marthasudermann/pathogensurveillance
  userName       : marthasudermann
  profile        : docker
  configFiles    : /home/marthasudermann/pathogensurveillance/nextflow.config

Input/output options
  input          : https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata_PRJNA523365_small.csv
  outdir         : test_out4
  bakta_db       : /home/marthasudermann/Software/bakta_db_02_2024/db/

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/plantpathsurveil for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/plantpathsurveil/blob/master/CITATIONS.md
------------------------------------------------------
[-        ] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK                                    -
[-        ] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP                                             -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES                                              -
[-        ] process > PATHOGENSURVEILLANCE:FASTQC                                                           -
[-        ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH                          -
[-        ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:INITIAL_CLASSIFICATION                    -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:FIND_ASSEMBLIES                              -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES                              -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES                          -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:MAKE_GFF_WITH_FASTA                          -
[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:SOURMASH_SKETCH_GENOME                       -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SUBSET_READS                                   -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND                             -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_READS                          -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME                         -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE                               -
[f5/6125a3] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_PRJNA523365_small.csv)   [100%] 1 of 1, cached: 1 ✔
[21/b6b96f] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP (SRR12574846)                               [100%] 3 of 3, cached: 3 ✔
[42/f6d4e0] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES (GCF_017189435_1)                            [100%] 1 of 1, cached: 1 ✔
[63/a974dd] process > PATHOGENSURVEILLANCE:FASTQC (SRR12574847)                                             [100%] 3 of 3, cached: 3 ✔
[c2/8679ec] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH (SRR12574848)            [100%] 3 of 3, cached: 3 ✔
[a4/18385c] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:INITIAL_CLASSIFICATION (SRR12574846)      [100%] 3 of 3, cached: 3 ✔
[24/1a3d6e] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:FIND_ASSEMBLIES (Mycobacteriaceae)           [100%] 1 of 1, cached: 1 ✔
[6a/673614] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (SRR12574846)                [100%] 3 of 3, cached: 3 ✔
[1d/f003e6] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCF_001677215_1)        [100%] 9 of 9, cached: 9 ✔
[65/b881e2] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:MAKE_GFF_WITH_FASTA (GCF_001456355_1)        [100%] 9 of 9, cached: 9 ✔
[db/0526c0] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:SOURMASH_SKETCH_GENOME (GCF_001456355_1)     [100%] 9 of 9, cached: 9 ✔
[6f/fce62b] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SUBSET_READS (SRR12574846)                     [100%] 3 of 3, cached: 3 ✔
[cc/a2f30c] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND (SRR12574847)               [100%] 3 of 3, cached: 3 ✔
[9e/09a4cb] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_READS (SRR12574847)            [100%] 3 of 3, cached: 3 ✔
[d3/a6b247] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME (GCF_017189435_1)       [100%] 3 of 3, cached: 3 ✔
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE                               -
[-        ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:ASSIGN_GROUP_REFERENCES                        -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:SAMTOOLS_FAIDX                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:BWA_INDEX                       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:CALCULATE_DEPTH                     -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SUBSET_READS                        -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM                             -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_ADDORREPLACEREADGROUPS       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_1                    -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_MARKDUPLICATES               -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_2                    -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SAMTOOLS_INDEX                      -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:MAKE_REGION_FILE                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_GENOTYPE               -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_VCFCONCATENATE         -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:TABIX_TABIX                       -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:BGZIP_MAKE_GZIP                   -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GATK4_VARIANTFILTRATION           -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:VCFLIB_VCFFILTER                  -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_TAB                                      -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_SNPALN                                   -
[-        ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:IQTREE2_SNP                                     -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SUBSET_READS                                     -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FASTP                                            -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES                                           -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FILTER_ASSEMBLY                                  -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:QUAST                                            -
[-        ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA                                      -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:PIRATE                                     -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:REFORMAT_PIRATE_RESULTS                    -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:ALIGN_FEATURE_SEQUENCES                    -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:RENAME_CORE_GENE_HEADERS                   -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:SUBSET_CORE_GENES                          -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:MAFFT_SMALL                                -
[-        ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:IQTREE2_CORE                               -
[-        ] process > PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS                                      -
[-        ] process > PATHOGENSURVEILLANCE:MULTIQC                                                          -
[-        ] process > PATHOGENSURVEILLANCE:RECORD_MESSAGES                                                  -
[-        ] process > PATHOGENSURVEILLANCE:MAIN_REPORT                                                      -
ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE (1)'

Caused by:
  Process `PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE` input file name collision -- There are multiple input files for each of the following file names: GCF_017189435_1.sig


Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
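
The collision above means that two inputs staged into the same task share one base name. As a minimal sketch (not the pipeline's code), a list of paths can be checked for such collisions before it is handed to a process. The function name and the example paths below are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_collisions(paths):
    """Group paths by base name and return only the names that occur more than once."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Hypothetical inputs: the same signature file name arriving from two work directories.
staged = [
    "work/aa/111111/GCF_017189435_1.sig",
    "work/bb/222222/GCF_017189435_1.sig",
    "work/cc/333333/SRR12574847.sig",
]
collisions = find_name_collisions(staged)
print(collisions)
```

Any name returned here would need to be deduplicated or renamed before staging.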

Relevant files

nextflow.log

System information

Nextflow 23.10.1.5891
Desktop
local
Docker
Linux

fastq1 and fastq2 file type error

I deleted the .fastq.gz from the end of the fastq1 file name and then did the same for the fastq2. The error message that is printed is:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

There's a more descriptive message farther below that says:

Command error:
[CRITICAL] The FASTQ file has an unrecognized extension: test/data/reads/22-301_R2
It should be one of: .fq.gz, .fastq.gz On line 3.

This might be sufficient because it is fairly self-explanatory, but it could be overwhelming for people who do not know how to read error messages. It is up to you whether to make the message simpler.
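
If a simpler message is wanted, one possible shape is sketched below. The accepted extensions are taken from the error message above. The function name, the column label, and the wording of the message are assumptions made for illustration, not the pipeline's actual check.

```python
VALID_EXTENSIONS = (".fq.gz", ".fastq.gz")  # as listed in the error message above


def check_fastq_extension(path, column, line_number):
    """Return None if the extension is accepted, otherwise a plain-language message."""
    if path.endswith(VALID_EXTENSIONS):
        return None
    return (
        f"Line {line_number} of the sample sheet, column '{column}': "
        f"the file name '{path}' must end in one of {', '.join(VALID_EXTENSIONS)}. "
        "Check that the full file name, including its extension, was entered."
    )


# The path and line number come from the report above; 'fastq2' is the column it names.
message = check_fastq_extension("test/data/reads/22-301_R2", "fastq2", 3)
print(message)
```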

DOWNLOAD_ASSEMBLIES error

ERROR

When running the test dataset, I encountered the following error message:

Oct-01 20:44:22.606 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (); work-dir=/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/61/f103677a0f4221ac1d21792a1828ff
  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES ()` terminated with an error exit status (1)
Oct-01 20:44:22.606 [Task monitor] INFO  nextflow.processor.TaskProcessor - [61/f10367] NOTE: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES ()` terminated with an error exit status (1) -- Error is ignore
Oct-01 20:44:22.620 [Actor Thread 121] ERROR nextflow.extension.OperatorImpl - @unknown
java.lang.NullPointerException: Cannot execute null+[/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/7b/d1bb304e828bb1dca4e122b05b0b4a/22_331_assembly.sig]
        at org.codehaus.groovy.runtime.NullObject.plus(NullObject.java:135)
        at org.codehaus.groovy.runtime.NullObject$plus$0.call(Unknown Source)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
        at org.codehaus.groovy.runtime.callsite.NullCallSite.call(NullCallSite.java:34)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
        at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.call(PogoMetaMethodSite.java:74)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
        at Script_abdb686a$_runScript_closure1$_closure2$_closure13.doCall(Script_abdb686a:78)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
        at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
        at nextflow.extension.MapOp$_apply_closure1.doCall(MapOp.groovy:56)
        at jdk.internal.reflect.GeneratedMethodAccessor216.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
        at groovy.lang.Closure.call(Closure.java:412)
        at groovyx.gpars.dataflow.operator.DataflowOperatorActor.startTask(DataflowOperatorActor.java:120)
        at groovyx.gpars.dataflow.operator.DataflowOperatorActor.onMessage(DataflowOperatorActor.java:108)
        at groovyx.gpars.actor.impl.SDAClosure$1.call(SDAClosure.java:43)
        at groovyx.gpars.actor.AbstractLoopingActor.runEnhancedWithoutRepliesOnMessages(AbstractLoopingActor.java:293)
        at groovyx.gpars.actor.AbstractLoopingActor.access$400(AbstractLoopingActor.java:30)
        at groovyx.gpars.actor.AbstractLoopingActor$1.handleMessage(AbstractLoopingActor.java:93)
        at groovyx.gpars.util.AsyncMessagingCore.run(AsyncMessagingCore.java:132)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Oct-01 20:44:22.632 [Actor Thread 121] DEBUG nextflow.Session - Session aborted -- Cause: Cannot execute null+[/nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/7b/d1bb304e828bb1dca4e122b05b0b4a/22_331_assembly.sig]

picard error - can't find the command

-[nf-core/plantpathsurveil] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY (GCF_000261805.1)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY (GCF_000261805.1)` terminated with an error exit status (127)

Command executed:

  picard \
      -Xmx8g \
      CreateSequenceDictionary  \
       \
      --REFERENCE GCF_000261805.1_25021_genomic.fna \
      --OUTPUT GCF_000261805.1_25021_genomic.dict
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY":
      picard: $(picard CreateSequenceDictionary --version 2>&1 | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  127

Command output:
  (empty)

Command error:
  .command.sh: line 3: picard: command not found

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/brady/scripts/pathogensurveillance/work/6b/d9f7ba5b69255b5f7d8600d80c2e52

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Figure out good default thresholds for filtering sendsketch results

These are used to determine how good a hit needs to be for its taxonomy to be considered useful at a given level. We need thresholds for at least the output columns "ANI" and "Complt", if not others as well, for each of the following taxonomic levels: species, genus, family.
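
Once values are chosen, the thresholds could be applied roughly as sketched below. The numbers are placeholders only, not recommended defaults, since choosing them is the open question here. The function name and data layout are likewise assumptions.

```python
# Placeholder thresholds: these values are NOT recommendations, only stand-ins
# showing the shape of the configuration. Keys follow the sendsketch column names.
THRESHOLDS = {
    "species": {"ANI": 95.0, "Complt": 80.0},
    "genus": {"ANI": 90.0, "Complt": 60.0},
    "family": {"ANI": 80.0, "Complt": 40.0},
}
LEVELS = ("species", "genus", "family")  # ordered from most to least specific


def finest_supported_level(hit, thresholds=THRESHOLDS):
    """Return the most specific level whose thresholds the hit meets, or None."""
    for level in LEVELS:
        limits = thresholds[level]
        if all(hit[column] >= minimum for column, minimum in limits.items()):
            return level
    return None


print(finest_supported_level({"ANI": 92.0, "Complt": 70.0}))
```

Keeping the thresholds in one table per level makes it easy to add further columns later without changing the lookup logic.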

Sample sheet does not appear to contain a header

Description of the bug

The error says that my CSV file has no header, but it has exactly the same header as the tutorial.

sample,fastq_1,fastq_2
6_13_22,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_13_22_S36_L003_R1_001.fastq.gz,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_13_22_S36_L003_R2_001.fastq.gz
6_14_22,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_14_22_S37_L003_R1_001.fastq.gz,/hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/6_14_22_S37_L003_R2_001.fastq.gz
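
Header detection that relies on a heuristic can misjudge a sheet whose first row looks like its data rows. Python's `csv.Sniffer.has_header` is one such heuristic, though whether the pipeline's check uses it is an assumption here. A stricter alternative, sketched below, compares the first row with the expected column names directly. The column names are the ones in the sheet above; the function name and the shortened paths are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")  # column names from the sheet above


def missing_header_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the first row."""
    reader = csv.reader(handle)
    first_row = next(reader, [])
    found = {cell.strip() for cell in first_row}
    return [name for name in required if name not in found]


# Hypothetical, shortened paths; only the header row matters for this check.
sheet = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "6_13_22,reads/6_13_22_R1.fastq.gz,reads/6_13_22_R2.fastq.gz\n"
)
print(missing_header_columns(sheet))
```

An empty result means every expected column name was found. A non-empty result names exactly what is missing, which is more actionable than a generic "no header" message.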

Command used and terminal output

nextflow run nf-core/pathogensurveillance -r dev -profile singularity --input pathogen_manifest.csv --outdir pathogen_results


Terminal Output:
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/plantpathsurveil] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (pathogen_manifest.csv)'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (pathogen_manifest.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      pathogen_manifest.csv \
      samplesheet.valid.csv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  WARNING: Skipping mount /home/elysebarker/miniconda3/envs/nextflow/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [CRITICAL] The given sample sheet does not appear to contain a header.
  INFO:    Cleaning up image...

Work dir:
  /hdd/2/elyse/microbiome/Sweden_MSU_metagenomics_shotgun/20240209_DNASeq_PE150/work/d6/a0cdf03190be8bfe860c27bc6d4ebb

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

I am using nextflow version 23.10.1 on linux.
nf-core/plantpathsurveil v1.0dev

Quality filters for the core gene phylogeny

Three user-adjustable filters:

  • --min_core_genes: Minimum number of core genes (e.g. 10)
  • --min_core_samps: Minimum proportion of samples (e.g. 0.9)
  • --min_core_refs: Minimum proportion of references (e.g. 0.5)

Plan:

Iteratively remove the sample that shares the fewest genes with all other samples until --min_core_genes is reached or the --min_core_samps/--min_core_refs thresholds are reached.
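
The plan can be sketched as a greedy loop. This is one possible reading, not a settled design. It assumes that "genes shared with all other samples" is scored as the total number of genes a genome shares pairwise with every other kept genome, and that the two proportion options are lower bounds on the fraction of samples and references retained. All names and the toy data are hypothetical.

```python
def filter_for_core_genes(genes, is_ref, min_core_genes=10,
                          min_core_samps=0.9, min_core_refs=0.5):
    """Greedily drop genomes until enough core genes remain or a limit is hit.

    genes  : dict mapping genome name -> set of gene identifiers
    is_ref : dict mapping genome name -> True for references, False for samples
    Returns (kept genome names, core genes shared by all kept genomes).
    """
    kept = set(genes)
    n_samps = sum(1 for name in genes if not is_ref[name])
    n_refs = sum(1 for name in genes if is_ref[name])

    def core(names):
        sets = [genes[name] for name in names]
        return set.intersection(*sets) if sets else set()

    def shared(name):
        # Assumed score: genes shared pairwise with every other kept genome, summed.
        return sum(len(genes[name] & genes[other]) for other in kept if other != name)

    def removable(name):
        # Removal is allowed only while retained proportions stay at or above the minimums.
        remaining = kept - {name}
        samps = sum(1 for other in remaining if not is_ref[other])
        refs = sum(1 for other in remaining if is_ref[other])
        samps_ok = n_samps == 0 or samps / n_samps >= min_core_samps
        refs_ok = n_refs == 0 or refs / n_refs >= min_core_refs
        return samps_ok and refs_ok

    while len(core(kept)) < min_core_genes:
        candidates = [name for name in kept if removable(name)]
        if not candidates:
            break  # a proportion threshold was reached before enough core genes
        kept.remove(min(candidates, key=lambda name: (shared(name), name)))
    return kept, core(kept)


# Toy data: sample s4 shares nothing with the others and blocks any core genome.
genes = {
    "s1": {"a", "b", "c"}, "s2": {"a", "b", "c"}, "s3": {"a", "b", "c", "d"},
    "s4": {"x"}, "ref1": {"a", "b", "c"}, "ref2": {"a", "b"},
}
is_ref = {"s1": False, "s2": False, "s3": False, "s4": False, "ref1": True, "ref2": True}
kept, core_genes = filter_for_core_genes(
    genes, is_ref, min_core_genes=2, min_core_samps=0.75, min_core_refs=0.5
)
print(sorted(kept), sorted(core_genes))
```

Ties are broken by name so that the result is deterministic. A different scoring rule, for example choosing the removal that maximises the remaining core, would slot into `shared` without changing the loop.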

Missing fastq1 error

Deleted fastq1 (second column) of the first sample and got the error message:

ERROR ~ Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (metadata_medium.csv)'

The pipeline would benefit from a more descriptive error message here. Note that this case is different from a file path that does not exist (including paths with special characters in them); that error already has a descriptive message.
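
A row-level check for empty required fields could produce the kind of message requested here. The sketch below is illustrative only. The column names, the function name, and the message wording are assumptions and would need to match the real sample sheet format.

```python
import csv
import io

REQUIRED = ("sample", "fastq_1")  # hypothetical column names; adjust to the real sheet


def find_empty_fields(handle, required=REQUIRED):
    """Return one readable message for each required field that is empty."""
    problems = []
    reader = csv.DictReader(handle)
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append(
                    f"Line {line_number}: column '{column}' is empty. "
                    "Every sample needs a value in this column."
                )
    return problems


# Hypothetical sheet in which the first sample has no value for fastq_1.
sheet = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "A,,a_R2.fastq.gz\n"
    "B,b_R1.fastq.gz,b_R2.fastq.gz\n"
)
problems = find_empty_fields(sheet)
print(problems)
```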

Rmarkdown report outline

The report must be renderable at any stage while the pipeline is running, so that early results such as the sendsketch matches can be reported while waiting on long-running processes such as the core genome phylogeny.

The main report (one for each report_group) should be a mix of natural-language descriptions and interactive tables and graphics. Each section should answer one question a user might want to know, with whatever info is available at the time the report is rendered. Potential sections:

What are my samples?

Data available from fastest to slowest:

  • Sendsketch outputs (seconds to minutes)
  • Tree made from reference/sample sourmash ANI matrix (minutes, with #6)
  • Bacteria: Core genome phylogeny with references (hours)
  • Eukaryotes: Read2tree results with references (hours)

How do my samples relate to each other?

  • geographic distribution?
  • Tree made from sample sourmash ANI matrix (minutes, with #6)
  • Variant calling analysis, minimum spanning network and SNP tree (hours)
  • Bacteria: Core genome phylogeny without references (hours)
  • Eukaryotes: Read2tree results without references (hours)

Error in dump software versions

Description of the bug

When testing the small test dataset I got this error.

Aug-09 22:46:16.934 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:CUSTOM_DUMPSOFTWAREVERSION

Caused by:
  Process `NFCORE_PLANTPATHSURVEIL:PLANTPATHSURVEIL:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)

Command executed [/nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/./workflows/../modules/nf-core/custom/dumpsoftwareversions/templates/dumpsof

  #!/usr/bin/env python3
  
  import platform
  from textwrap import dedent
  
  import yaml
  
  
  def _make_versions_html(versions):
      html = [
          dedent(
              """\
              <style>
              #nf-core-versions tbody:nth-child(even) {
                  background-color: #f2f2f2;
              }
              </style>
              <table class="table" style="width:100%" id="nf-core-versions">
                  <thead>
                      <tr>
                          <th> Process Name </th>
                          <th> Software </th>
                          <th> Version  </th>
                      </tr>
                  </thead>
              """
          )
      ]
      for process, tmp_versions in sorted(versions.items()):
          html.append("<tbody>")
          for i, (tool, version) in enumerate(sorted(tmp_versions.items())):
              html.append(
                  dedent(
                      f"""\
                      <tr>
  for process, process_versions in versions_by_process.items():
      module = process.split(":")[-1]
      try:
          if versions_by_module[module] != process_versions:
              raise AssertionError(
                  "We assume that software versions are the same between all modules. "
                  "If you see this error-message it means you discovered an edge-case "
                  "and should open an issue in nf-core/tools. "
              )
      except KeyError:
          versions_by_module[module] = process_versions
  
  versions_by_module["Workflow"] = {
      "Nextflow": "23.04.2",
      "nf-core/plantpathsurveil": "1.0dev",
  }
  
  versions_mqc = {
      "id": "software_versions",
      "section_name": "nf-core/plantpathsurveil Software Versions",
      "section_href": "https://github.com/nf-core/plantpathsurveil",
      "plot_type": "html",
      "description": "are collected at run time from the software output.",
      "data": _make_versions_html(versions_by_module),
  }
  
  with open("software_versions.yml", "w") as f:
      yaml.dump(versions_by_module, f, default_flow_style=False)
  with open("software_versions_mqc.yml", "w") as f:
      yaml.dump(versions_mqc, f, default_flow_style=False)
  
  with open("versions.yml", "w") as f:
      yaml.dump(versions_this_module, f, default_flow_style=False)

Command exit status:
  1

Command output:
  (empty)

Command error:
    File ".command.sh", line 59
      versions_by_process = {**yaml.load(f, Loader=yaml.BaseLoader), **versions_by_process}
                                                                                          ^
  TabError: inconsistent use of tabs and spaces in indentation

Work dir:
  /nfs7/BPP/Chang_Lab/paradarc/nf_brady_N120/scripts/nf-core-plantpathsurveil/work/27/8b6f3ed6a6cbe122d597022cb5c159

Command used and terminal output

No response

Relevant files

I ended up fixing the script by replacing the line that gave the error with:

versions_by_process = {**yaml.load(f, Loader=yaml.BaseLoader), **versions_by_process}

This actually worked, but I am not experienced with Python.
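
The replacement line reads the same as the failing one, so the difference was presumably in the invisible whitespace, which is consistent with the TabError. A small sketch for locating tab-indented lines in a script is shown below. The function name and the sample script are hypothetical.

```python
def lines_with_tab_indent(source):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


# Hypothetical script text: line 2 is indented with a tab, line 3 with spaces.
script = "with open(path) as f:\n\tdata = f.read()\n    print(data)\n"
print(lines_with_tab_indent(script))
```

Running this over the template file would show whether any other lines carry the same problem.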

System information

No response

DUMPSOFTWAREVERSIONS! error

This is an error with dumping software versions:

Oct-09 14:30:09.344 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS; work-dir=/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/a3/5c02e88440b42fdba1e00c2c65b520
  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)
Oct-09 14:30:09.365 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS'

Caused by:
  Process `NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS` terminated with an error exit status (1)

Command executed [/nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/./workflows/../modules/nf-core/custom/dumpsoftwareversions/templates/dumpsoftwareversions.py]:

  #!/usr/bin/env python
  
  import platform
  from textwrap import dedent
  
  import yaml
  
  
  def _make_versions_html(versions):
      html = [
          dedent(
              """\
              <style>
              #nf-core-versions tbody:nth-child(even) {
                  background-color: #f2f2f2;
              }
              </style>
              <table class="table" style="width:100%" id="nf-core-versions">
                  <thead>
                      <tr>
                          <th> Process Name </th>
                          <th> Software </th>
                          <th> Version  </th>
                      </tr>
                  </thead>
              """
          )
      ]
      for process, tmp_versions in sorted(versions.items()):
          html.append("<tbody>")
          for i, (tool, version) in enumerate(sorted(tmp_versions.items())):
              html.append(
                  dedent(
                      f"""\
                      <tr>
                          <td><samp>{process if (i == 0) else ''}</samp></td>
                          <td><samp>{tool}</samp></td>
                          <td><samp>{version}</samp></td>
                      </tr>
                      """
                  )
              )
          html.append("</tbody>")
      html.append("</table>")
      return "\n".join(html)
  
  
  versions_this_module = {}
  versions_this_module["NFCORE_PATHOGENSURVEILLANCE:PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS"] = {
      "python": platform.python_version(),
      "yaml": yaml.__version__,
  }
  
  versions_by_process = versions_this_module
  
  print("1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.
  for path in "1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/ver
      with open(path) as f:
          versions_by_process = yaml.load(f, Loader=yaml.BaseLoader) | versions_by_process
  
  # aggregate versions by the module name (derived from fully-qualified process name)
  versions_by_module = {}
  for process, process_versions in versions_by_process.items():
      module = process.split(":")[-1]
      try:
          if versions_by_module[module] != process_versions:
              raise AssertionError(
                  "We assume that software versions are the same between all modules. "
                  "If you see this error-message it means you discovered an edge-case "
                  "and should open an issue in nf-core/tools. "
              )
      except KeyError:
          versions_by_module[module] = process_versions
  
  versions_by_module["Workflow"] = {
      "Nextflow": "23.04.4",
      "nf-core/plantpathsurveil": "1.0dev",
  }
  
  versions_mqc = {
      "id": "software_versions",
      "section_name": "nf-core/plantpathsurveil Software Versions",
      "section_href": "https://github.com/nf-core/plantpathsurveil",
      "plot_type": "html",
      "description": "are collected at run time from the software output.",
      "data": _make_versions_html(versions_by_module),
  }
  
  with open("software_versions.yml", "w") as f:
      yaml.dump(versions_by_module, f, default_flow_style=False)
  with open("software_versions_mqc.yml", "w") as f:
      yaml.dump(versions_mqc, f, default_flow_style=False)
  
  with open("versions.yml", "w") as f:
      yaml.dump(versions_this_module, f, default_flow_style=False)

Command exit status:
  1

Command output:
  1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.yml 30/

Command error:
  1/versions.yml 2/versions.yml 3/versions.yml 4/versions.yml 5/versions.yml 6/versions.yml 7/versions.yml 8/versions.yml 9/versions.yml 10/versions.yml 11/versions.yml 12/versions.yml 13/versions.yml 14/versions.yml 15/versions.yml 16/versions.yml 17/versions.yml 18/versions.yml 19/versions.yml 20/versions.yml 21/versions.yml 22/versions.yml 23/versions.yml 24/versions.yml 25/versions.yml 26/versions.yml 27/versions.yml 28/versions.yml 29/versions.yml 30/
  Traceback (most recent call last):
    File ".command.sh", line 58, in <module>
      with open(path) as f:
           ^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: '50/nfs1'

Work dir:
  /nfs1/BPP/Grunwald_Lab/home/paradarc/pathogensurveillance/work/a3/5c02e88440b42fdba1e00c2c65b520

Ensure that there is an error when no report group is defined

If no intelligible error is given:

Tasks

If SPAdes or Bakta fails for a small number of samples, is there a way to proceed without these samples?

Description of the bug

I have encountered this twice with larger bacterial datasets. For whatever reason, whether the raw read quality is poor or SPAdes cannot produce a decent assembly, Bakta then fails (not surprisingly), and after a few retries the whole pipeline stops until I remove the samples in question.

Is there a way to proceed with the analysis even if SPAdes or Bakta fails for a small number of samples, while noting which samples could not go through the pipeline?

Command used and terminal output

No response

Relevant files

No response

System information

No response

Convention for handling sample IDs

  • convert to valid file names (a-zA-Z0-9_-)
  • check for duplicates and rename duplicates as needed with numeric suffix
  • write lookup table for users doing research (could be updated input metadata)
  • report should say which samples were renamed
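
The convention above can be sketched as follows. This is one possible implementation under stated assumptions: characters outside the allowed set are replaced with an underscore, and duplicates receive a numeric suffix starting at 2. Neither choice is specified in the list above, and the function name is hypothetical.

```python
import re


def make_safe_ids(sample_ids):
    """Return (new_ids, renamed): unique file-name-safe IDs and the (old, new) changes."""
    used = set()
    new_ids = []
    for original in sample_ids:
        # Assumption: characters outside a-zA-Z0-9_- are replaced with an underscore.
        base = re.sub(r"[^a-zA-Z0-9_-]", "_", original)
        candidate = base
        suffix = 1
        while candidate in used:
            # Assumption: duplicates get a numeric suffix, starting at 2.
            suffix += 1
            candidate = f"{base}_{suffix}"
        used.add(candidate)
        new_ids.append(candidate)
    renamed = [(old, new) for old, new in zip(sample_ids, new_ids) if old != new]
    return new_ids, renamed


new_ids, renamed = make_safe_ids(["22-301", "sample 1", "sample_1", "22-301"])
print(new_ids)
print(renamed)
```

The `renamed` pairs can serve both as the lookup table for users and as the list of renamed samples shown in the report.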
