Cecret

UPDATE : THIS README IS (albeit slowly) GETTING TURNED INTO A WIKI. YOU CAN CHECK OUR PROGRESS HERE: https://github.com/UPHL-BioNGS/Cecret/wiki

Named after the beautiful Cecret lake

Location: 40.570°N 111.622°W , Elevation: 9,875 feet (3,010 m), Hiking level: easy

(Image credit: Intermountain Healthcare)

Introduction

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This nextflow workflow, however, is flexible for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 added in contributions for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

The nextflow workflow was built to work on linux-based operating systems. Additional config options are needed for cloud batch usage.

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods will require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates will need to be removed. This has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if/when you run into issues.

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.

Cecret is also part of the staphb-toolkit.

Dependencies

Usage

This workflow does not run without input files, and there are multiple ways to specify which input files should be used.

# using singularity on paired-end reads in a directory called 'reads'
nextflow run UPHL-BioNGS/Cecret -profile singularity --reads <directory to reads>

# using docker on samples specified in SampleSheet.csv
nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet SampleSheet.csv

# using a config file containing all inputs
nextflow run UPHL-BioNGS/Cecret -c file.config

Results are roughly organized into 'params.outdir'/< analysis >/sample.result

A file summarizing all results is found in 'params.outdir'/cecret_results.csv and 'params.outdir'/cecret_results.txt.

Consensus sequences can be found in 'params.outdir'/consensus and end with *.consensus.fa.
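As a minimal sketch of working with these results, the function below concatenates every consensus sequence into a single multifasta. The function name `combine_consensus` and its arguments are illustrative and not part of Cecret; only the `consensus` directory and the `*.consensus.fa` suffix come from the description above.

```shell
# Hypothetical helper (not part of Cecret): gather all consensus sequences
# from <outdir>/consensus into one multifasta file.
combine_consensus() {
  local outdir="$1"      # the directory used as params.outdir
  local combined="$2"    # path of the multifasta to write
  local found=0
  : > "$combined"        # start with an empty output file
  for fasta in "$outdir"/consensus/*.consensus.fa; do
    [ -e "$fasta" ] || continue   # skip the literal glob when nothing matches
    cat "$fasta" >> "$combined"
    found=$((found + 1))
  done
  echo "$found"          # number of consensus files that were combined
}
```

For example, `combine_consensus <params.outdir> all_consensus.fasta` prints the number of files combined. Write the combined file somewhere outside the `consensus` directory so that it is not picked up on a later run.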

Getting files from directories

(can be adjusted with 'params.reads', 'params.single_reads', 'params.fastas', and 'params.nanopore')

Paired-end fastq.gz (ending with 'fastq', 'fastq.gz', 'fq', or 'fq.gz') reads as follows or designate directory with 'params.reads' or '--reads'

directory
└── reads
     └── *fastq.gz

WARNING : Sometimes nextflow does not catch every name of paired-end fastq files. This workflow is meant to be fairly agnostic, but if paired-end fastq files are not being found it might be worth renaming them to some sort of sample_1.fastq.gz format or using a sample sheet.

Single-end fastq.gz reads as follows or designate directory with 'params.single_reads' or '--single_reads'

directory
└── single_reads
     └── *fastq.gz

WARNING : single and paired-end reads cannot be in the same directory

Nanopore/ONT reads as follows or designate directory with 'params.nanopore' or '--nanopore'

directory
└── nanopore
     └── *fastq.gz

Fasta files (ending with 'fa', 'fasta', or 'fna') as follows or designate directory with 'params.fastas' or '--fastas'

directory
└── fastas
     └── *fasta

MultiFasta files (ending with 'fa', 'fasta', or 'fna') as follows or designate directory with 'params.multifastas' or '--multifastas'

directory
└── multifastas
     └── *fasta

WARNING : fastas and multifastas cannot be in the same directory. If no fasta preprocessing is necessary, put the single fastas in the multifastas directory.

Using a sample sheet

Cecret can also use a sample sheet for input with the sample name and reads separated by commas. The header must be sample,fastq_1,fastq_2. As a general rule, each row gives the identifier for the file(s), the file location(s), and the file type when the input is not paired-end fastq files.

Rows match files with their processing needs.

  • paired-end reads: sample,read1.fastq.gz,read2.fastq.gz
  • single-end reads: sample,sample.fastq.gz,single
  • nanopore reads : sample,sample.fastq.gz,nanopore
  • fasta files: sample,sample.fasta,fasta
  • multifasta files: multifasta,multifasta.fasta,multifasta

Example sample sheet:

sample,fastq_1,fastq_2
SRR13957125,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957125_2.fastq.gz
SRR13957170,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_1.fastq.gz,/home/eriny/sandbox/test_files/cecret/reads/SRR13957170_2.fastq.gz
SRR13957177S,/home/eriny/sandbox/test_files/cecret/single_reads/SRR13957177_1.fastq.gz,single
OQ255990.1,/home/eriny/sandbox/test_files/cecret/fastas/OQ255990.1.fasta,fasta
SRR22452244,/home/eriny/sandbox/test_files/cecret/nanopore/SRR22452244.fastq.gz,nanopore

Example usage with sample sheet using docker to manage containers

nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet SampleSheet.csv
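A sample sheet for paired-end reads can be generated rather than typed by hand. The sketch below assumes files are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, as in the example sample sheet; the function name `make_sample_sheet` is hypothetical and not part of Cecret. Other naming patterns would need a different suffix.

```shell
# Hypothetical helper (not part of Cecret): print a sample sheet for
# paired-end reads named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
make_sample_sheet() {
  local reads_dir="$1"
  echo "sample,fastq_1,fastq_2"          # header required by Cecret
  for r1 in "$reads_dir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue             # nothing matched the glob
    local prefix="${r1%_1.fastq.gz}"     # path without the read suffix
    local r2="${prefix}_2.fastq.gz"
    if [ -e "$r2" ]; then
      echo "$(basename "$prefix"),$r1,$r2"
    else
      echo "warning: no mate found for $r1" >&2
    fi
  done
}
```

Usage would look like `make_sample_sheet /full/path/to/reads > SampleSheet.csv`. Passing an absolute directory keeps the file locations absolute, as in the example sample sheet. Single-end, nanopore, and fasta rows still need to be added separately with their type in the third column.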

Full workflow

alt text

Determining primer and amplicon bedfiles

The default primer scheme of the 'Cecret' workflow is the 'V4' primer scheme developed by artic network for SARS-CoV-2. Releases prior to and including '2.2.20211221' used the 'V3' primer scheme as the default. As many public health laboratories are still using 'V3', the 'V3' files are still in this repo, but now the 'V4', 'V4.1' ('V4' with a spike-in of additional primers), and 'V5.3.2' are also included. The original primer and amplicon bedfiles can be found at artic's github repo.

Setting primers with a parameter on the command line (these can also be defined in a config file)

# using artic V3 primers
nextflow run UPHL-BioNGS/Cecret -profile singularity --primer_set 'ncov_V3'

# using artic V4 primers
nextflow run UPHL-BioNGS/Cecret -profile singularity --primer_set 'ncov_V4'

# using artic V4.1 primers
nextflow run UPHL-BioNGS/Cecret -profile singularity --primer_set 'ncov_V4.1'

# using artic V5.3.2 primers
nextflow run UPHL-BioNGS/Cecret -profile singularity --primer_set 'ncov_V5.3.2'

Some "Midnight" primers are also included and can be set with midnight_idt_V1, midnight_ont_V1, midnight_ont_V2, midnight_ont_V3.

It is still possible to set 'params.primer_bed' and 'params.amplicon_bed' via the command line or in a config file with the path to the corresponding file.

Using the included nextclade dataset

It has been requested by some of our more sophisticated colleagues to include a way to upload a nextclade dataset separately. We expect that this is mostly for cloud usage. To accommodate this, there is now a sars.zip file in data with a nextclade dataset. To use this included dataset, params.download_nextclade_dataset must be set to false either on the command line or in a config file.

nextflow run UPHL-BioNGS/Cecret -profile singularity --sample_sheet input.csv --download_nextclade_dataset false

This included dataset, however, will only be as current as Cecret's maintainers are able to upload it. There is a GitHub Action that should attempt to update the nextclade dataset every Tuesday, but this still has to be merged and go through testing. The end user can also create a nextclade dataset and then feed it into this workflow with params.predownloaded_nextclade_dataset.

To create the nextclade dataset with nextclade

nextclade dataset get --name sars-cov-2 --output-zip sars.zip

To use with Cecret

nextflow run UPHL-BioNGS/Cecret -profile singularity --sample_sheet input.csv --download_nextclade_dataset false --predownloaded_nextclade_dataset sars.zip

Or the corresponding params can be set in a config file.

Determining CPU usage

For the sake of simplicity, processes in this workflow are designated 1 CPU, a medium amount of CPUs (5), or the largest amount of CPUs (the number of CPUs of the environment launching the workflow if using the main workflow and a simple config file or 8 if using profiles and the config template). The medium amount of CPUs can be adjusted by the End User by adjusting 'params.medcpus', the largest amount can be adjusted with 'params.maxcpus', or the cpus can be specified for each process individually in a config file.

The End User can adjust this by specifying the maximum cpus that one process can take in the config file 'params.maxcpus = <new value>' or on the command line

nextflow run UPHL-BioNGS/Cecret -profile singularity --maxcpus <new value>

It is important to remember that nextflow will attempt to utilize all CPUs available, and this value is restricted to one process. As a specific example, the process 'bwa' will be allocated 'params.maxcpus'. If there are 48 CPUs available and 'params.maxcpus = 8', then 6 samples will be run simultaneously.
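The arithmetic in this example can be sketched in the shell. The function name `concurrent_processes` is hypothetical; it only divides the available CPUs by the per-process limit and does not query nextflow itself.

```shell
# Hypothetical helper: how many processes that each take 'maxcpus' CPUs
# can run at the same time on a machine with 'total_cpus' CPUs.
concurrent_processes() {
  local total_cpus="$1" maxcpus="$2"
  echo $(( total_cpus / maxcpus ))   # integer division, remainder is unused
}

concurrent_processes 48 8   # prints 6, matching the example above
```

On a Linux system, `concurrent_processes "$(nproc)" 8` gives the same estimate for the current machine.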

Determining depth for base calls

Sequencing has an intrinsic amount of error for every predicted base on a read. This error is reduced the more reads there are. As such, there is a minimum amount of depth that is required to call a base with ivar consensus, ivar variants, and bcftools variants. The main assumption of using this workflow is that the virus is clonal (i.e. only one infection represented in a sample) and that the libraries were created via PCR amplification. The default depth for calling bases or finding variants is set with 'params.minimum_depth' with the default value being 'params.minimum_depth = 100'. This parameter can be adjusted by the END USER in a config file or on the command line.

A corresponding parameter is 'params.mpileup_depth' (default of 'params.mpileup_depth = 8000'), which is the number of reads that samtools (used by ivar) or bcftools uses to put into memory for any given position. If the END USER is experiencing memory issues, this number may need to be decreased.
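To get a feel for how a given 'params.minimum_depth' would affect a sample, the number of positions at or above that depth can be counted from a per-position depth file. This sketch assumes a tab-delimited file with three columns (reference, position, depth), which is the usual `samtools depth` layout; whether the files under samtools_depth follow exactly that layout is an assumption, and the function name `count_positions_at_depth` is hypothetical.

```shell
# Hypothetical helper: count positions with depth >= a minimum in a
# three-column (reference, position, depth) tab-delimited file.
count_positions_at_depth() {
  local depth_file="$1"
  local min_depth="${2:-100}"   # 100 mirrors the default params.minimum_depth
  awk -F '\t' -v min="$min_depth" '
    # ignore any line whose third column is not a whole number (e.g. a header)
    $3 ~ /^[0-9]+$/ && $3 + 0 >= min + 0 { n++ }
    END { print n + 0 }
  ' "$depth_file"
}
```

Running it with two different thresholds, for example 100 and 10, shows how many additional positions would be called if the minimum depth were lowered.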

Determining if duplicates should be taken into account

For library preparation methods with baits followed by PCR amplification, it is recommended to remove duplicate reads. For most other methods, removing duplicates will generally not harm anything. To remove duplicates, set 'params.markdup' to true. This removes duplicate reads from the aligned sam file, which is before the primer trimming and after the filter processes. This will likely enable a lower minimum depth for variant calling (default is 100).

On the command line:

nextflow run UPHL-BioNGS/Cecret -profile singularity --markdup true --minimum_depth 10

In a config file:

params.markdup = true
params.minimum_depth = 10

Monkeypox

The defaults for Cecret continue to be for SARS-CoV-2, but there are growing demands for a workflow for Monkeypox Virus. As such, there are a few parameters that might benefit the End User.

Using the Monkeypox profile

There are three profiles for Monkeypox Virus sequencing : mpx, mpx_idt, and mpx_primalseq. The mpx profile has some defaults for metagenomic-type sequencing, mpx_idt is for libraries prepped with IDT's primers, and mpx_primalseq has been validated with Illumina library prep methods and sequencing platforms.

# metagenomic
nextflow run UPHL-BioNGS/Cecret -profile singularity,mpx

# using IDT's primers
nextflow run UPHL-BioNGS/Cecret -profile singularity,mpx_idt

# using Illumina library prep methods and sequencing platforms
nextflow run UPHL-BioNGS/Cecret -profile singularity,mpx_primalseq

Other library prep methods

There are amplicon-based, bait-based, and amplicon-bait hybrid library preparation methods that increase the portion of reads for a relevant organism. If there is a common preparation for the End User, please submit an issue, and we can create a profile or config file. Remember that the bedfiles for the primer schemes and amplicons MUST match the reference.
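One quick sanity check before running with a custom reference is to confirm that the first column of a bedfile names the same sequence as the reference fasta header. The sketch below is a hypothetical helper, not part of Cecret, and it checks only the sequence name. It does not verify that the coordinates are correct for that reference, and it assumes the bedfile has no header or comment lines.

```shell
# Hypothetical helper: succeed only if every bedfile row names the
# sequence found in the first header of the reference fasta.
bed_matches_reference() {
  local reference_fasta="$1" bedfile="$2"
  local ref_id
  # first word of the first fasta header, without the leading '>'
  ref_id=$(awk '/^>/ { sub(/^>/, "", $1); print $1; exit }' "$reference_fasta")
  # compare column 1 of each non-empty row as a string against that name
  awk -v ref="$ref_id" '
    NF > 0 && ($1 "") != (ref "") { bad = 1 }
    END { exit bad + 0 }
  ' "$bedfile"
}
```

It can be used in a conditional, for example `bed_matches_reference reference.fasta primer.bed || echo "bedfile does not match reference"`, where both file names are placeholders.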

Wastewater

This workflow has also been used with primer-amplified Illumina sequencing of wastewater. Patient samples conceptually are different than wastewater samples, but many of the bioinformatic steps are the same. The files from Freyja are likely the most significant for this analysis. Freyja uses the bam files after primer trimming has been completed to look for variants and their proportions to assign expected pangolin lineages.

Recommended parameter adjustments for wastewater

params.species           = 'sarscov2' //this is the default, but it is required for the subworkflow that involves freyja
params.bcftools_variants = false
params.ivar_variants     = false
params.pangolin          = false
params.nextclade         = false
params.vadr              = false

Pangolin, Nextclade, and any analysis that evaluates the consensus fasta are not as useful in the context of wastewater. There is currently not a way to remove consensus sequence generation from Cecret, but an option may be available in the future if there is "enough" demand.

Updating Cecret

nextflow pull UPHL-BioNGS/Cecret

Cecret has a weekly update schedule. Cecret's versions have three numbers : X.Y.Z. If the first number, X, changes, there has been a major modification. Params may have changed or subworkflows/channels may have been modified. If the second number, Y, changes, there has been a minor to moderate change. These are mainly for bug fixes or changing the defaults of params. If the last number, Z, has been modified, the workflow is basically the same; there have just been some updates in the containers pulled for the workflow. Most of these updates are to keep Freyja, NextClade, and Pangolin current for SARS-CoV-2 analysis.
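The version scheme above can be turned into a small check that reports what kind of change separates two releases. This is only a sketch of the X.Y.Z rule as described here; the function name `version_change` and its output labels are hypothetical.

```shell
# Hypothetical helper: report which part of an X.Y.Z version differs.
version_change() {
  local old="$1" new="$2"
  local old_x="${old%%.*}" new_x="${new%%.*}"             # X: text before the first dot
  local old_rest="${old#*.}" new_rest="${new#*.}"         # Y.Z
  local old_y="${old_rest%%.*}" new_y="${new_rest%%.*}"   # Y
  local old_z="${old_rest#*.}" new_z="${new_rest#*.}"     # Z
  if   [ "$old_x" != "$new_x" ]; then echo "major"             # params or subworkflows may differ
  elif [ "$old_y" != "$new_y" ]; then echo "minor"             # bug fixes or changed defaults
  elif [ "$old_z" != "$new_z" ]; then echo "container update"  # same workflow, newer containers
  else echo "none"
  fi
}
```

A "major" result is the signal to recheck existing config files against the current params before rerunning.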

Quality Assessment

The quality of a sequencing run is very important. As such, many values are recorded so that the End User can assess the quality of the results produced from a sequencing run.

  • fastqc_raw_reads_1 and fastqc_raw_reads_2 are the number of reads prior to cleaning by either seqyclean or fastp.
  • seqyclean_Perc_Kept (params.cleaner = 'seqyclean') or fastp_pct_surviving (params.cleaner = 'fastp') indicate how many reads remain after removal of low-quality reads (more = better).
  • num_N is the number of uncalled bases in the generated consensus sequence (less = better).
  • num_total is the total number of called bases in the generated consensus sequence (more = better). As many consensus sequences are generated with this workflow via amplicon sequencing, the start and end of the reference often have little coverage. This means that the number of bases in the consensus sequence is less than the length of the reference sequence.
  • num_pos_${params.minimum_depth}X (which is num_pos_100X by default) is the number of positions for which there is sufficient depth to call variants (more = better). Any sequence below this value will be an N.
  • aci_num_failed_amplicons uses the amplicon file to give a rough estimate as to which primer pairs are not getting enough coverage (less = better).
  • samtools_num_failed_amplicons uses the primer file to detect primer pairs and estimates coverages based on this (less = better).

More information on evaluating amplicon/primer failure can be found in the FAQ under the question 'Is there a way to determine if certain amplicons are failing?'

Kraken2 is optional for this workflow, but can provide additional quality assessment metrics:

  • top_organism is the most common organism identified in the reads.
  • percent_reads_top_organism is the percentage of reads assigned that organism (more = better).
  • %_human_reads is the percentage of human reads (less = better).
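These metrics are collected in cecret_results.csv, so samples that need a closer look can be pulled out with a short filter. The sketch below lists rows whose num_N exceeds a chosen threshold. It assumes a plain comma-delimited file with a header row, no quoted commas, and a sample identifier in the first column; the function name `flag_high_n` is hypothetical, and the num_N column is located by name from the header.

```shell
# Hypothetical helper: print "<first column>,<num_N>" for every row of a
# results CSV whose num_N is above the given threshold.
flag_high_n() {
  local results_csv="$1" max_n="$2"
  awk -F ',' -v max="$max_n" '
    NR == 1 {
      # find the num_N column by its header name
      for (i = 1; i <= NF; i++) if ($i == "num_N") col = i
      if (!col) { print "num_N column not found" > "/dev/stderr"; exit 1 }
      next
    }
    $col + 0 > max + 0 { print $1 "," $col }
  ' "$results_csv"
}
```

The threshold is a judgment call that depends on the organism and the library prep, so no default is suggested here.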

Optional toggles:

Using fastp to clean reads instead of seqyclean

nextflow run UPHL-BioNGS/Cecret -profile singularity --cleaner fastp

Or set params.cleaner = 'fastp' in a config file

Using samtools to trim amplicons instead of ivar

nextflow run UPHL-BioNGS/Cecret -profile singularity --trimmer samtools

Or set params.trimmer = 'samtools' in a config file

Skipping primer trimming completely

nextflow run UPHL-BioNGS/Cecret -profile singularity --trimmer none

Or set params.trimmer = 'none' in a config file

Using minimap2 to align reads instead of bwa

nextflow run UPHL-BioNGS/Cecret -profile singularity --aligner minimap2

Or set params.aligner = 'minimap2' in a config file

Determining relatedness

To create a multiple sequence alignment and corresponding phylogenetic tree and SNP matrix, set params.relatedness = true or

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true

Using nextclade for multiple sequence alignment instead of mafft

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --msa nextclade

Or set params.msa = 'nextclade' and params.relatedness = true in a config file

The resulting trees can be visualized with itol or ggtree.

Classify reads with kraken2

To classify reads with kraken2 to identify reads from human or the organism of choice

Step 1. Get a kraken2 database

mkdir kraken2_db
cd kraken2_db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
tar -zxvf k2_standard_08gb_20230605.tar.gz

Step 2. Set the parameters accordingly

params.kraken2 = true
params.kraken2_db = 'kraken2_db'

The main components of Cecret are:

  • aci - for depth estimation over amplicons (optional, set params.aci = true)
  • artic network - for aligning and consensus creation of nanopore reads
  • bbnorm - for normalizing reads (optional, set params.bbnorm = true)
  • bcftools - for variants
  • bwa - for aligning reads to the reference
  • fastp - for cleaning reads ; (optional, set params.cleaner = 'fastp')
  • fastqc - for QC metrics
  • freyja - for multiple SARS-CoV-2 lineage classifications
  • heatcluster - for visualizing SNP matrices generated via SNP dists
  • iqtree2 - for phylogenetic tree generation (optional, set params.relatedness = true)
  • igv-reports - visualizing SNPs (optional, set params.igv_reports = true)
  • ivar - calling variants and creating a consensus fasta; default primer trimmer
  • kraken2 - for read classification
  • mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
  • minimap2 - an alternative to bwa (optional, set params.aligner = 'minimap2')
  • multiqc - summary of results
  • nextclade - for SARS-CoV-2 clade classification (optional: aligned fasta can be used from this analysis when relatedness is set to "true" and msa is set to "nextclade")
  • pangolin - for SARS-CoV-2 lineage classification
  • pango collapse - for SARS-CoV-2 lineage tracing
  • phytreeviz - for visualizing phylogenetic trees
  • samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
  • seqyclean - for cleaning reads
  • snp-dists - for relatedness determination (optional, relatedness must be set to "true")
  • vadr - for annotating fastas like NCBI

Turning off unneeded processes

It came to my attention that some processes (like bcftools) do not work consistently. Also, they might take longer than wanted and might not even be needed for the end user. Here are the processes that can be turned off, with their default values:

params.bcftools_variants = true           # vcf of variants
params.fastqc = true                      # qc on the sequencing reads
params.ivar_variants = true               # itemize the variants identified by ivar
params.samtools_stats = true              # stats about the bam files
params.samtools_coverage = true           # stats about the bam files
params.samtools_depth = true              # stats about the bam files
params.samtools_flagstat = true           # stats about the bam files
params.samtools_ampliconstats = true      # stats about the amplicons
params.samtools_plot_ampliconstats = true # images related to amplicon performance
params.kraken2 = false                    # used to classify reads and needs a corresponding params.kraken2_db and organism if not SARS-CoV-2
params.aci = false                        # coverage approximation of amplicons
params.igv_reports = false                # SNP IGV images
params.nextclade = true                   # SARS-CoV-2 clade determination
params.pangolin = true                    # SARS-CoV-2 lineage determination
params.pango_collapse = true              # SARS-CoV-2 lineage tracing
params.freyja = true                      # multiple SARS-CoV-2 lineage determination
params.vadr = false                       # NCBI fasta QC
params.relatedness = false                # create multiple sequence alignments with input fastq and fasta files
params.snpdists = true                    # creates snp matrix from mafft multiple sequence alignment
params.iqtree2 = true                     # creates phylogenetic tree from mafft multiple sequence alignment
params.bamsnap = false                    # has been removed
params.rename = false                     # needs a corresponding sample file and will rename files for GISAID and NCBI submission
params.filter = false                     # takes the aligned reads and turns them back into fastq.gz files
params.multiqc = true                     # aggregates data into single report

Final file structure

Final File Tree after running cecret.nf
cecret                                # results from this workflow
├── aci
│   ├── amplicon_depth.csv
│   ├── amplicon_depth_mqc.png
│   └── amplicon_depth.png
├── aligned                           # aligned (with aligner) but untrimmed bam files with indexes
│   ├── SRR13957125.sorted.bam
│   ├── SRR13957125.sorted.bam.bai
│   ├── SRR13957170.sorted.bam
│   ├── SRR13957170.sorted.bam.bai
│   ├── SRR13957177.sorted.bam
│   └── SRR13957177.sorted.bam.bai
├── bcftools_variants                 # set to false by default; VCF files of variants identified
│   ├── SRR13957125.vcf
│   ├── SRR13957170.vcf
│   └── SRR13957177.vcf
├── cecret_results.csv                # comma-delimited summary of results
├── cecret_results.txt                # tab-delimited summary of results
├── consensus                         # the likely reason you are running this workflow
│   ├── SRR13957125.consensus.fa
│   ├── SRR13957170.consensus.fa
│   └── SRR13957177.consensus.fa
├── dataset                           # generated by nextclade
│   ├── genemap.gff
│   ├── primers.csv
│   ├── qc.json
│   ├── reference.fasta
│   ├── sequences.fasta
│   ├── tag.json
│   ├── tree.json
│   └── virus_properties.json
├── fasta_prep                        # optional for inputted fastas
│   ├── SRR13957125.test.fa
│   ├── SRR13957170.test.fa
│   └── SRR13957177.test.fa
├── fastp                             # optional tools for cleaning reads when 'params.cleaner = fastp'
│   ├── SRR13957125_clean_PE1.fastq.gz
│   ├── SRR13957125_clean_PE2.fastq.gz
│   ├── SRR13957125_fastp.html
│   ├── SRR13957125_fastp.json
│   ├── SRR13957170_clean_PE1.fastq.gz
│   ├── SRR13957170_clean_PE2.fastq.gz
│   ├── SRR13957170_fastp.html
│   ├── SRR13957170_fastp.json
│   ├── SRR13957177_clean_PE1.fastq.gz
│   ├── SRR13957177_clean_PE2.fastq.gz
│   ├── SRR13957177_fastp.html
│   └── SRR13957177_fastp.json
├── fastqc                            # QC metrics for each fastq file
│   ├── SRR13957125_1_fastqc.html
│   ├── SRR13957125_1_fastqc.zip
│   ├── SRR13957125_2_fastqc.html
│   ├── SRR13957125_2_fastqc.zip
│   ├── SRR13957170_1_fastqc.html
│   ├── SRR13957170_1_fastqc.zip
│   ├── SRR13957170_2_fastqc.html
│   ├── SRR13957170_2_fastqc.zip
│   ├── SRR13957177_1_fastqc.html
│   ├── SRR13957177_1_fastqc.zip
│   ├── SRR13957177_2_fastqc.html
│   └── SRR13957177_2_fastqc.zip
├── filter                           # fastq.gz files from reads that were aligned to the reference genome
│   ├── SRR13957125_filtered_R1.fastq.gz
│   ├── SRR13957125_filtered_R2.fastq.gz
│   ├── SRR13957125_filtered_unpaired.fastq.gz
│   ├── SRR13957170_filtered_R1.fastq.gz
│   ├── SRR13957170_filtered_R2.fastq.gz
│   ├── SRR13957170_filtered_unpaired.fastq.gz
│   ├── SRR13957177_filtered_R1.fastq.gz
│   ├── SRR13957177_filtered_R2.fastq.gz
│   └── SRR13957177_filtered_unpaired.fastq.gz
├── freyja                          # finding co-lineages
│   ├── aggregated-freyja.png
│   ├── aggregated-freyja.tsv
│   ├── SRR13957125_boot.tsv_lineages.csv
│   ├── SRR13957125_boot.tsv_summarized.csv
│   ├── SRR13957125_demix.tsv
│   ├── SRR13957125_depths.tsv
│   ├── SRR13957125_variants.tsv
│   ├── SRR13957170_boot.tsv_lineages.csv
│   ├── SRR13957170_boot.tsv_summarized.csv
│   ├── SRR13957170_demix.tsv
│   ├── SRR13957170_depths.tsv
│   ├── SRR13957170_variants.tsv
│   ├── SRR13957177_boot.tsv_lineages.csv
│   ├── SRR13957177_boot.tsv_summarized.csv
│   ├── SRR13957177_demix.tsv
│   ├── SRR13957177_depths.tsv
│   └── SRR13957177_variants.tsv
├── iqtree2                          # phylogenetic tree that is generated with 'params.relatedness = true'
│   ├── iqtree2.iqtree
│   ├── iqtree2.log
│   ├── iqtree2.mldist
│   └── iqtree2.treefile
├── ivar_consensus
│   ├── SRR13957125.consensus.fa
│   ├── SRR13957125.consensus.qual.txt
│   ├── SRR13957125_NTC.consensus.fa
│   ├── SRR13957125_NTC.consensus.qual.txt
│   ├── SRR13957170.consensus.fa
│   ├── SRR13957170.consensus.qual.txt
│   ├── SRR13957177.consensus.fa
│   └── SRR13957177.consensus.qual.txt
├── ivar_trim                        # bam files after primers have been trimmed off the reads with ivar
│   ├── SRR13957125_ivar.log
│   ├── SRR13957125.primertrim.sorted.bam
│   ├── SRR13957125.primertrim.sorted.bam.bai
│   ├── SRR13957170_ivar.log
│   ├── SRR13957170.primertrim.sorted.bam
│   ├── SRR13957170.primertrim.sorted.bam.bai
│   ├── SRR13957177_ivar.log
│   ├── SRR13957177.primertrim.sorted.bam
│   └── SRR13957177.primertrim.sorted.bam.bai
├── ivar_variants                    # tsv and vcf files of variants identified in sample
│   ├── SRR13957125.ivar_variants.vcf
│   ├── SRR13957125.variants.tsv
│   ├── SRR13957170.ivar_variants.vcf
│   ├── SRR13957170.variants.tsv
│   ├── SRR13957177.ivar_variants.vcf
│   └── SRR13957177.variants.tsv
├── kraken2                          # kraken2 report of the organisms the reads may be from
│   ├── SRR13957125_kraken2_report.txt
│   ├── SRR13957170_kraken2_report.txt
│   └── SRR13957177_kraken2_report.txt
├── logs                             # divided log and err files for QC and troubleshooting pleasures
│   └── processes*
│       ├── sample.run_id.err
│       └── sample.run_id.log
├── mafft                            # multiple sequence alignment created when 'params.relatedness = true'
│   └── mafft_aligned.fasta
├── markdup
│   ├── SRR13957125.markdup.sorted.bam
│   ├── SRR13957125.markdup.sorted.bam.bai
│   ├── SRR13957125_markdupstats.txt
│   ├── SRR13957170.markdup.sorted.bam
│   ├── SRR13957170.markdup.sorted.bam.bai
│   ├── SRR13957170_markdupstats.txt
│   ├── SRR13957177.markdup.sorted.bam
│   ├── SRR13957177.markdup.sorted.bam.bai
│   └── SRR13957177_markdupstats.txt
├── multiqc                          # aggregates data into single report
│   ├── multiqc_data
│   │   ├── multiqc_citations.txt
│   │   ├── multiqc_data.json
│   │   ├── multiqc_fastqc.txt
│   │   ├── multiqc_general_stats.txt
│   │   ├── multiqc_ivar_primers.txt
│   │   ├── multiqc_ivar_summary.txt
│   │   ├── multiqc.log
│   │   ├── multiqc_samtools_flagstat.txt
│   │   ├── multiqc_samtools_stats.txt
│   │   ├── multiqc_seqyclean.txt
│   │   └── multiqc_sources.txt
│   └── multiqc_report.html
├── nextclade                        # nextclade reports
│   ├── combined.fasta
│   ├── nextclade.aligned.fasta
│   ├── nextclade.auspice.json
│   ├── nextclade.csv
│   ├── nextclade.errors.csv
│   ├── nextclade.gene.E.fasta
│   ├── nextclade.gene.M.fasta
│   ├── nextclade.gene.N.fasta
│   ├── nextclade.gene.ORF1a.fasta
│   ├── nextclade.gene.ORF1b.fasta
│   ├── nextclade.gene.ORF3a.fasta
│   ├── nextclade.gene.ORF6.fasta
│   ├── nextclade.gene.ORF7a.fasta
│   ├── nextclade.gene.ORF7b.fasta
│   ├── nextclade.gene.ORF8.fasta
│   ├── nextclade.gene.ORF9b.fasta
│   ├── nextclade.gene.S.fasta
│   ├── nextclade.insertions.csv
│   ├── nextclade.json
│   └── nextclade.tsv
├── pango_collapse
│   └── pango_collapse.csv
├── pangolin                         # pangolin results
│   ├── combined.fasta
│   └── lineage_report.csv
├── samtools_ampliconstats           # amplicon statistics and metrics as determined by samtools
│   ├── SRR13957125_ampliconstats.txt
│   ├── SRR13957170_ampliconstats.txt
│   └── SRR13957177_ampliconstats.txt
├── samtools_coverage                # coverage and metrics as determined by samtools
│   ├── SRR13957125.cov.aligned.hist
│   ├── SRR13957125.cov.aligned.txt
│   ├── SRR13957125.cov.trimmed.hist
│   ├── SRR13957125.cov.trimmed.txt
│   ├── SRR13957170.cov.aligned.hist
│   ├── SRR13957170.cov.aligned.txt
│   ├── SRR13957170.cov.trimmed.hist
│   ├── SRR13957170.cov.trimmed.txt
│   ├── SRR13957177.cov.aligned.hist
│   ├── SRR13957177.cov.aligned.txt
│   ├── SRR13957177.cov.trimmed.hist
│   └── SRR13957177.cov.trimmed.txt
├── samtools_depth                   # the number of reads
│   ├── SRR13957125.depth.aligned.txt
│   ├── SRR13957125.depth.trimmed.txt
│   ├── SRR13957170.depth.aligned.txt
│   ├── SRR13957170.depth.trimmed.txt
│   ├── SRR13957177.depth.aligned.txt
│   └── SRR13957177.depth.trimmed.txt
├── samtools_flagstat                # flag information
│   ├── SRR13957125.flagstat.aligned.txt
│   ├── SRR13957125.flagstat.trimmed.txt
│   ├── SRR13957125.flagstat.txt
│   ├── SRR13957170.flagstat.aligned.txt
│   ├── SRR13957170.flagstat.trimmed.txt
│   ├── SRR13957170.flagstat.txt
│   ├── SRR13957177.flagstat.aligned.txt
│   ├── SRR13957177.flagstat.trimmed.txt
│   └── SRR13957177.flagstat.txt
├── samtools_plot_ampliconstats      # plots of the ampliconstats for troubleshooting purposes
│   ├── SRR13957125
│   ├── SRR13957125-combined-amp.gp
│   ├── SRR13957125-combined-amp.png
│   ├── SRR13957125-combined-coverage-1.gp
│   ├── SRR13957125-combined-coverage-1.png
│   ├── SRR13957125-combined-depth.gp
│   ├── SRR13957125-combined-depth.png
│   ├── SRR13957125-combined-read-perc.gp
│   ├── SRR13957125-combined-read-perc.png
│   ├── SRR13957125-combined-reads.gp
│   ├── SRR13957125-combined-reads.png
│   ├── SRR13957125-combined-tcoord.gp
│   ├── SRR13957125-combined-tcoord.png
│   ├── SRR13957125-combined-tdepth.gp
│   ├── SRR13957125-combined-tdepth.png
│   ├── SRR13957125-heat-amp-1.gp
│   ├── SRR13957125-heat-amp-1.png
│   ├── SRR13957125-heat-coverage-1-1.gp
│   ├── SRR13957125-heat-coverage-1-1.png
│   ├── SRR13957125-heat-read-perc-1.gp
│   ├── SRR13957125-heat-read-perc-1.png
│   ├── SRR13957125-heat-read-perc-log-1.gp
│   ├── SRR13957125-heat-read-perc-log-1.png
│   ├── SRR13957125-heat-reads-1.gp
│   ├── SRR13957125-heat-reads-1.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-amp.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-amp.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-cov.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-cov.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-reads.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-reads.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tcoord.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tcoord.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tdepth.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tdepth.png
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tsize.gp
│   ├── SRR13957125-SRR13957125.primertrim.sorted-tsize.png
│   ├── SRR13957170
│   ├── SRR13957170-combined-amp.gp
│   ├── SRR13957170-combined-amp.png
│   ├── SRR13957170-combined-coverage-1.gp
│   ├── SRR13957170-combined-coverage-1.png
│   ├── SRR13957170-combined-depth.gp
│   ├── SRR13957170-combined-depth.png
│   ├── SRR13957170-combined-read-perc.gp
│   ├── SRR13957170-combined-read-perc.png
│   ├── SRR13957170-combined-reads.gp
│   ├── SRR13957170-combined-reads.png
│   ├── SRR13957170-combined-tdepth.gp
│   ├── SRR13957170-combined-tdepth.png
│   ├── SRR13957170-heat-amp-1.gp
│   ├── SRR13957170-heat-amp-1.png
│   ├── SRR13957170-heat-coverage-1-1.gp
│   ├── SRR13957170-heat-coverage-1-1.png
│   ├── SRR13957170-heat-read-perc-1.gp
│   ├── SRR13957170-heat-read-perc-1.png
│   ├── SRR13957170-heat-read-perc-log-1.gp
│   ├── SRR13957170-heat-read-perc-log-1.png
│   ├── SRR13957170-heat-reads-1.gp
│   ├── SRR13957170-heat-reads-1.png
│   ├── SRR13957170-SRR13957170.primertrim.sorted-amp.gp
│   ├── SRR13957170-SRR13957170.primertrim.sorted-amp.png
│   ├── SRR13957170-SRR13957170.primertrim.sorted-cov.gp
│   ├── SRR13957170-SRR13957170.primertrim.sorted-cov.png
│   ├── SRR13957170-SRR13957170.primertrim.sorted-reads.gp
│   ├── SRR13957170-SRR13957170.primertrim.sorted-reads.png
│   ├── SRR13957170-SRR13957170.primertrim.sorted-tdepth.gp
│   ├── SRR13957170-SRR13957170.primertrim.sorted-tdepth.png
│   ├── SRR13957177
│   ├── SRR13957177-combined-amp.gp
│   ├── SRR13957177-combined-amp.png
│   ├── SRR13957177-combined-coverage-1.gp
│   ├── SRR13957177-combined-coverage-1.png
│   ├── SRR13957177-combined-depth.gp
│   ├── SRR13957177-combined-depth.png
│   ├── SRR13957177-combined-read-perc.gp
│   ├── SRR13957177-combined-read-perc.png
│   ├── SRR13957177-combined-reads.gp
│   ├── SRR13957177-combined-reads.png
│   ├── SRR13957177-combined-tcoord.gp
│   ├── SRR13957177-combined-tcoord.png
│   ├── SRR13957177-combined-tdepth.gp
│   ├── SRR13957177-combined-tdepth.png
│   ├── SRR13957177-heat-amp-1.gp
│   ├── SRR13957177-heat-amp-1.png
│   ├── SRR13957177-heat-coverage-1-1.gp
│   ├── SRR13957177-heat-coverage-1-1.png
│   ├── SRR13957177-heat-read-perc-1.gp
│   ├── SRR13957177-heat-read-perc-1.png
│   ├── SRR13957177-heat-read-perc-log-1.gp
│   ├── SRR13957177-heat-read-perc-log-1.png
│   ├── SRR13957177-heat-reads-1.gp
│   ├── SRR13957177-heat-reads-1.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-amp.gp
│   ├── SRR13957177-SRR13957177.primertrim.sorted-amp.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-cov.gp
│   ├── SRR13957177-SRR13957177.primertrim.sorted-cov.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-reads.gp
│   ├── SRR13957177-SRR13957177.primertrim.sorted-reads.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-tcoord.gp
│   ├── SRR13957177-SRR13957177.primertrim.sorted-tcoord.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-tdepth.gp
│   ├── SRR13957177-SRR13957177.primertrim.sorted-tdepth.png
│   ├── SRR13957177-SRR13957177.primertrim.sorted-tsize.gp
│   └── SRR13957177-SRR13957177.primertrim.sorted-tsize.png
├── samtools_stats                   # stats as determined by samtools
│   ├── SRR13957125.stats.aligned.txt
│   ├── SRR13957125.stats.trimmed.txt
│   ├── SRR13957125.stats.txt
│   ├── SRR13957170.stats.aligned.txt
│   ├── SRR13957170.stats.trimmed.txt
│   ├── SRR13957170.stats.txt
│   ├── SRR13957177.stats.aligned.txt
│   ├── SRR13957177.stats.trimmed.txt
│   └── SRR13957177.stats.txt
├── seqyclean                        # reads that have had PhiX and adapters removed
│   ├── Combined_SummaryStatistics.tsv
│   ├── SRR13957125_clean_PE1.fastq.gz
│   ├── SRR13957125_clean_PE2.fastq.gz
│   ├── SRR13957125_clean_SummaryStatistics.tsv
│   ├── SRR13957125_clean_SummaryStatistics.txt
│   ├── SRR13957170_clean_PE1.fastq.gz
│   ├── SRR13957170_clean_PE2.fastq.gz
│   ├── SRR13957170_clean_SummaryStatistics.tsv
│   ├── SRR13957170_clean_SummaryStatistics.txt
│   ├── SRR13957177_clean_PE1.fastq.gz
│   ├── SRR13957177_clean_PE2.fastq.gz
│   ├── SRR13957177_clean_SummaryStatistics.tsv
│   └── SRR13957177_clean_SummaryStatistics.txt
├── snp-dists                        # SNP matrix created with 'params.relatedness = true'
│   └── snp-dists.txt
└── vadr                            # consensus file QC
    ├── combined.fasta
    ├── trimmed.fasta
    ├── vadr.vadr.alc
    ├── vadr.vadr.alt
    ├── vadr.vadr.alt.list
    ├── vadr.vadr.cmd
    ├── vadr.vadr.dcr
    ├── vadr.vadr.fail.fa
    ├── vadr.vadr.fail.list
    ├── vadr.vadr.fail.tbl
    ├── vadr.vadr.filelist
    ├── vadr.vadr.ftr
    ├── vadr.vadr.log
    ├── vadr.vadr.mdl
    ├── vadr.vadr.pass.fa
    ├── vadr.vadr.pass.list
    ├── vadr.vadr.pass.tbl
    ├── vadr.vadr.rpn
    ├── vadr.vadr.sda
    ├── vadr.vadr.seqstat
    ├── vadr.vadr.sgm
    ├── vadr.vadr.sqa
    └── vadr.vadr.sqc
reads                                # user supplied fastq files for analysis
single_reads                         # user supplied fastq files for analysis
fastas                               # user supplied fasta files for analysis
multifastas                          # user supplied multifasta files for analysis
work                                 # nextflow's working directories

Config files

A FILE THAT THE END USER CAN COPY AND EDIT IS FOUND AT configs/cecret_config_template.config

To get a copy of the config file, the End User can use the following command. This creates an edit_me.config file in the current directory.

nextflow run UPHL-BioNGS/Cecret --config_file true

This file contains all of the configurable parameters with their default values. Use '-c' to specify the edited config file.

Using config files

The End User should not have to change the files in the Cecret repository in order to get the analyses that they need. Nextflow has a wonderful system of config files, which are recommended. Please read nextflow's documentation about config files at https://www.nextflow.io/docs/latest/config.html

The general format of a config file is

params.<parameter that needs adjusting> = <what it needs to be adjusted to>

Then this config file is used with the workflow via the following

nextflow run UPHL-BioNGS/Cecret -c <path to custom config file>

In theory, the values specified in this config file will be used over the defaults set in the workflow.
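
As a minimal sketch of the format described above, the following writes a small config file and prints the command that would use it. The file name `my_cecret.config` and the values are illustrative assumptions; `params.reads`, `params.outdir`, and `params.vadr` are parameters mentioned elsewhere in this README.

```shell
# Write a small custom config file (file name and values are illustrative).
config_dir=$(mktemp -d)
config="$config_dir/my_cecret.config"

cat > "$config" <<'EOF'
params.reads  = 'Sequencing_reads/Raw'
params.outdir = 'cecret_results'
params.vadr   = false
EOF

# Show the file and the command that would use it.
cat "$config"
echo "nextflow run UPHL-BioNGS/Cecret -c $config"
```

Only the parameters that differ from the defaults need to appear in the file.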

If the End User is using some sort of cloud or HPC setup, it is highly recommended that this file is copied and edited appropriately. A limited list of parameters is listed below:

Input and output directories

  • params.reads = workflow.launchDir + '/reads'
  • params.single_reads = workflow.launchDir + '/single_reads'
  • params.fastas = workflow.launchDir + '/fastas'
  • params.multifastas = workflow.launchDir + '/multifastas'
  • params.outdir = workflow.launchDir + '/cecret'

Other useful nextflow options

  • To "resume" a workflow, use -resume with the nextflow command
  • To create a report, use -with-report with the nextflow command
  • To use nextflow tower, use -with-tower with the nextflow command (reports will not be available for download from nextflow tower using this method)

Frequently Asked Questions (aka FAQ)

What do I do if I encounter an error?

TELL US ABOUT IT!!!

Be sure to include the command that was used, what config file was used, and what the nextflow error was.

What is the MultiQC report?

The MultiQC report aggregates data across your samples into one file. Open the 'cecret/multiqc/multiqc_report.html' file with your preferred browser. There, tables and graphs are generated for 'General Statistics', 'Samtools stats', 'Samtools flagstats', 'FastQC', 'iVar', 'SeqyClean', 'Fastp', 'Pangolin', and 'Kraken2'.

Example fastqc graph

Example kraken2 graph

Example iVar graph

Example pangolin graph

What if I want to test the workflow?

In the history of this repository, there actually was an attempt to store fastq files here that the End User could use to test out this workflow. This made the repository very large and difficult to download.

There are several test profiles. These download fastq files from the ENA to use in the workflow. This does not always work due to local internet connectivity issues, but may work fine for everyone else.

nextflow run UPHL-BioNGS/Cecret -profile {docker or singularity},test

Another great resource is SARS-CoV-2 datasets, an effort of the CDC to provide a benchmark dataset for validating bioinformatic workflows. Fastq files from the nonvoivoc, voivoc, and failed projects were downloaded from the SRA, put through this workflow, and tested locally before releasing a new version.

The expected amount of time to run this workflow with 250 G RAM and 48 CPUs, 'params.maxcpus = 8', and 'params.medcpus = 4' is ~42 minutes. This corresponded with 25.8 CPU hours.

What if I just want to annotate some SARS-CoV-2 fastas with pangolin, freyja, nextclade and vadr?

# for a collection of fastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas>

# for a collection of fastas and multifastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas> --multifastas <directory with multifastas>

How do I compare a bunch of sequences? How do I create a phylogenetic tree?

The End User can run mafft, snpdists, and iqtree on a collection of fastas as well with

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas>

The End User can combine paired-end reads, single-end reads, and fastas into one analysis.

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas> --reads <directory with paired-end reads> --single_reads <directory with single-end reads>

Where is an example config file?

The End User is more than welcome to look at an example here. Just remove the comments for the parameters that need to be adjusted and specify with -c.

To get a copy of the config file, the End User can use the following command. This creates edit_me.config in the current directory.

nextflow run UPHL-BioNGS/Cecret --config_file true

At UPHL, our config file is small enough to be put as a profile option, but the text of the config file would be as follows:

singularity.enabled = true
singularity.autoMounts = true
params {
  reads = "Sequencing_reads/Raw"
  kraken2 = true
  kraken2_db = '/Volumes/IDGenomics_NAS/Data/kraken2_db/h+v'
  vadr = false
}

And then run with

nextflow run UPHL-BioNGS/Cecret -c <path to custom config file>

Is there a way to determine if certain amplicons are failing?

There are two ways to do this.

With ACI :

cecret/aci has two files: amplicon_depth.csv and amplicon_depth.png. There is a row for each sample in 'amplicon_depth.csv', and a column for each primer in the amplicon bedfile. The values are counts of reads that map only to the region specified in the amplicon bedfile; reads that extend outside that region are excluded. A boxplot of these values is visualized in amplicon_depth.png.

With samtools ampliconstats :

cecret/samtools_ampliconstats has a file for each sample.

Row number 126 (FDEPTH) has a column for each amplicon (also without a header). To get this row for all of the samples, grep the keyword "FDEPTH" from each sample.

grep "^FDEPTH" cecret/samtools_ampliconstats/* > samtools_ampliconstats_all.tsv
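
To go one step further and list only the amplicons that fall below a depth threshold, something like the following sketch could be used. It assumes each FDEPTH line is tab-delimited with the sample name in the second field and one depth per amplicon after that, in the same order as the amplicon bedfile. The two input files, their names, and their values are made up for illustration, and `min_depth=10` is an arbitrary choice.

```shell
# Toy input, made up for illustration: two samples with four amplicons each.
workdir=$(mktemp -d)
mkdir -p "$workdir/samtools_ampliconstats"
printf 'FDEPTH\tsampleA\t250.0\t8.5\t120.3\t0.0\n' > "$workdir/samtools_ampliconstats/sampleA_ampliconstats.txt"
printf 'FDEPTH\tsampleB\t300.1\t410.9\t95.0\t77.2\n' > "$workdir/samtools_ampliconstats/sampleB_ampliconstats.txt"

# Hypothetical threshold below which an amplicon is considered failing.
min_depth=10

# grep -h drops the file name prefix; awk then walks the depth columns
# (field 3 onward) and prints: sample, amplicon position, depth.
low_amplicons=$(grep -h "^FDEPTH" "$workdir"/samtools_ampliconstats/* |
  awk -F'\t' -v min="$min_depth" '{
    for (i = 3; i <= NF; i++)
      if ($i + 0 < min) print $2 "\t" (i - 2) "\t" $i
  }')

echo "$low_amplicons"
```

With real results, replace `"$workdir"/samtools_ampliconstats/*` with `cecret/samtools_ampliconstats/*`.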

There are corresponding images in cecret/samtools_plot_ampliconstats for each sample.

Sample samtools plot ampliconstats depth graph

What is the difference between params.amplicon_bed and params.primer_bed?

The primer bedfile is the file with the start and stop of each primer sequence.

$ head configs/artic_V3_nCoV-2019.primer.bed
MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	385	410	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	320	342	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	704	726	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	642	664	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	1004	1028	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	943	965	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	1312	1337	nCoV-2019_4_RIGHT	nCoV-2019_2	-
MN908947.3	1242	1264	nCoV-2019_5_LEFT	nCoV-2019_1	+
MN908947.3	1623	1651	nCoV-2019_5_RIGHT	nCoV-2019_1	-

The amplicon bedfile is the file with the start and stop of each intended amplicon.

$ head configs/artic_V3_nCoV-2019.insert.bed
MN908947.3	54	385	1	1	+
MN908947.3	342	704	2	2	+
MN908947.3	664	1004	3	1	+
MN908947.3	965	1312	4	2	+
MN908947.3	1264	1623	5	1	+
MN908947.3	1595	1942	6	2	+
MN908947.3	1897	2242	7	1	+
MN908947.3	2205	2568	8	2	+
MN908947.3	2529	2880	9	1	+
MN908947.3	2850	3183	10	2	+

Due to the many varieties of primer bedfiles, it is best if the End User supplies this file for custom primer sequences.
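
For primer bedfiles that follow the same naming pattern as the ARTIC V3 file shown above (`<scheme>_<amplicon number>_LEFT|RIGHT` in column 4 and `<scheme>_<pool>` in column 5), an amplicon bedfile can be derived with awk. This is a rough sketch under that naming assumption, not the method used to create the file in this repository: each amplicon starts where its LEFT primer ends and stops where its RIGHT primer starts. Alternate primers (for example names ending in `_alt`) are not handled specially and would overwrite the earlier coordinates.

```shell
workdir=$(mktemp -d)

# First four lines of the ARTIC V3 primer bedfile shown above (tab-delimited).
printf 'MN908947.3\t30\t54\tnCoV-2019_1_LEFT\tnCoV-2019_1\t+\n'    >  "$workdir/primer.bed"
printf 'MN908947.3\t385\t410\tnCoV-2019_1_RIGHT\tnCoV-2019_1\t-\n' >> "$workdir/primer.bed"
printf 'MN908947.3\t320\t342\tnCoV-2019_2_LEFT\tnCoV-2019_2\t+\n'  >> "$workdir/primer.bed"
printf 'MN908947.3\t704\t726\tnCoV-2019_2_RIGHT\tnCoV-2019_2\t-\n' >> "$workdir/primer.bed"

awk -F'\t' 'BEGIN { OFS = "\t" }
{
  split($4, name, "_")   # nCoV-2019_1_LEFT -> "nCoV-2019", "1", "LEFT"
  split($5, pool, "_")   # nCoV-2019_1      -> "nCoV-2019", "1"
  amp = name[2]
  if (!(amp in seen)) { seen[amp] = 1; order[++count] = amp }
  if (name[3] == "LEFT")  { chrom[amp] = $1; start[amp] = $3; pools[amp] = pool[2] }
  if (name[3] == "RIGHT") { stop[amp] = $2 }
}
END {
  # Print amplicons in the order they first appeared in the primer bedfile.
  for (i = 1; i <= count; i++) {
    a = order[i]
    print chrom[a], start[a], stop[a], a, pools[a], "+"
  }
}' "$workdir/primer.bed" > "$workdir/amplicon.bed"

cat "$workdir/amplicon.bed"
```

The two lines produced here match the first two lines of the insert bedfile shown above.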

What if I am using an amplicon-based library that is not SARS-CoV-2?

First of all, this is a great thing! Let us know if tools specific for your organism should be added to this workflow. There are already options for 'mpx' and 'other' species.

In a config file, change the following relevant parameters:

params.reference_genome
params.primer_bed
params.amplicon_bed #or set params.aci = false
params.gff_file #or set params.ivar_variants = false

And set

params.species = 'other'
params.pangolin = false
params.freyja = false
params.nextclade = false #or adjust nextclade_prep_options from '--name sars-cov-2' to the name of the relevant dataset
params.vadr = false #or configure the vadr container appropriately and params.vadr_reference

What if I need to filter out human reads or I only want reads that map to my reference?

Although not perfect, if 'params.filter = true', then only the reads that were mapped to the reference are returned. This should eliminate all human contamination (as long as human is not part of the supplied reference) and all "problematic" incidental findings.

This workflow has too many bells and whistles. I really only care about generating a consensus fasta. How do I get rid of all the extras?

Change the parameters in a config file and set most of them to false.

params.species = 'none'
params.fastqc = false
params.ivar_variants = false
params.samtools_stats = false
params.samtools_coverage = false
params.samtools_depth = false
params.samtools_flagstat = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.aci = false
params.pangolin = false
params.pango_collapse = false
params.freyja = false
params.nextclade = false
params.vadr = false
params.multiqc = false

And, yes, this means I added some bells and whistles so the End User could turn off the bells and whistles. /irony

Can I get images of my SNPs and indels?

No. Prior versions supported a tool called bamsnap, which had the potential to be great. Due to time constraints, this feature is no longer supported.

Where did the SAM files go?

Never fear, they are still in nextflow's work directory if the End User really needs them. They are no longer included in publishDir because of size issues. The BAM files are still included in publishDir, and most analyses for SAM files can be done with BAM files.

Where did the *err files go?

Personally, we liked having stderr saved to a file because some of the tools used in this workflow print to stderr instead of stdout. We have found, however, that this puts all the error text into a file, which a lot of new-to-nextflow users had a hard time finding. It was easier to assist and troubleshoot with End Users when stderr was printed normally.

What is in the works to get added to 'Cecret'? (These are waiting for either versions to stabilize or a docker container to be available.)

What processes take the longest?

Bedtools multicov was replaced by ACI due to processing times, but there are other processes that take longer.

Right now, the processes that take the most time are

  • ivar trim
  • freyja

cecret's People

Contributors

drb-s, erinyoung, fanninpm, k-florek, tives82

cecret's Issues

fastp version is not captured correctly

When using fastp, the version is still in the log file, but it doesn't make it to the summary file.

I think this is due to fastp printing the version to stderr.

Segmentation fault for iVar variants process

When running, iVar occasionally throws a segmentation fault, possibly related to samtools' memory usage.

Command exit status:
  139
Command output:
  (empty)
Command error:
  .command.sh: line 11:    37 Broken pipe             samtools mpileup -A -d 8000 -B -Q 0 --reference MN908947.3.fasta 091176-44249.primertrim.sorted.bam 2>> $err_file
          38 Segmentation fault      (core dumped) | ivar variants -p ivar_variants/091176-44249.variants -q 20 -t 0.6 -m 10 -r MN908947.3.fasta -g MN908947.3.gff 2>> $err_file >> $log_file

Giving the process additional memory after a retry seems to fix the issue:

memory {2.GB * task.attempt}
errorStrategy {'retry'}
maxRetries 2

I'm opening a pull request to add this to the process in case you want to include the changes to the workflow.

Update readme for kraken2 database

The initial kraken2 download in the readme is broken, BUT! there's a new WORKING one :

I already have it in a readme here and it would be simple to adjust the wording.

Process `summary` input file name collision

When both params.pangolin and params.vadr are set to false, nextflow complains:

Error executing process > 'summary (P120-S09-00846-4-S_S24)'

Caused by:
  Process `summary` input file name collision -- There are multiple input files for each of the following file names: Cecret.nf

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Possible offending lines:

Cecret/Cecret.nf

Lines 1134 to 1136 in c683a19

} else {
pangolin_file = Channel.fromPath(workflow.projectDir + "/Cecret.nf", type:'file')
}

Cecret/Cecret.nf

Lines 1233 to 1235 in c683a19

} else {
vadr_file = Channel.fromPath(workflow.projectDir + "/Cecret.nf", type:'file')
}

This was run with version 1.3.3 of the StaPH-B toolkit, which may not include an up-to-date workflow file.

Update Nextclade

Copied over from SLACK

Quick note for those upgrading Nextclade CLI to v1.10.*:
If you don’t use --input-dataset, you need to add a new arg --input-virus-properties
In v.1.10.0, you won’t get a helpful error message, this is fixed in patch release 1.10.1
Details in the thread replies here so as not to spam.

nextclade error

There is an error when I run the Cecret workflow using staphb-toolkit today. I have run this command on many datasets before and it worked fine.
I saw this issue about the nextclade update. However, when I check the errors, it seems like it is still using --output-dir instead of --output-all. I upgraded the staphb-toolkit before running the workflow, but it still doesn't work. I think this is probably because staphb-toolkit has not been updated. The latest release date I see for staphb-toolkit is April 25.

Here is my command:
staphb-wf cecret fastq_dir --output out --config cecret.config

nextflow error:

Error executing process > 'nextclade (Clade Determination)'

Caused by:
  Process `nextclade (Clade Determination)` terminated with an error exit status (2)

Command executed:

  mkdir -p nextclade dataset logs/nextclade
  log_file=logs/nextclade/nextclade.5cb932f9-fc35-4591-84a7-5fafd637a83e.log
  err_file=logs/nextclade/nextclade.5cb932f9-fc35-4591-84a7-5fafd637a83e.err

  date | tee -a $log_file $err_file > /dev/null
  nextclade --version >> $log_file
  nextclade_version=$(nextclade --version)

  nextclade dataset get --name sars-cov-2 --output-dir dataset

  for fasta in 1.consensus.fa 2.consensus.fa 3.consensus.fa 4.consensus.fa 5.consensus.fa
  do
    cat $fasta >> ultimate_fasta.fasta
  done

  nextclade        --input-fasta=ultimate_fasta.fasta       --input-dataset dataset       --output-json=nextclade/nextclade.json       --output-csv=nextclade/nextclade.csv       --output-tsv=nextclade/nextclade.tsv       --output-tree=nextclade/nextclade.auspice.json       --output-dir=nextclade       --output-basename=nextclade       2>> $err_file >> $log_file
  cp ultimate_fasta.fasta nextclade/combined.fasta

Command exit status:
  2

Command output:
  (empty)

Work dir:
  /storage/hpc/group/cov-sur/datasets/data/fastq/work/44/e4b578e9b613d8ea3fc2973cae69e8

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Test run ...

Hello @erinyoung -

I installed singularity, nextflow on my mac.
I git cloned this repo and have a test fastq file (single end).
I was hoping I can run the command,

nextflow Cecret/Cecret.nf

But, it fails with fastqc command not found.
I was assuming the script might pull necessary containers and thus no need to install dependencies on my laptop before running this script.
I see containers.config in configs directory.
If I want to run this script using singularity, how should I invoke the command or configure this software?
Thank you in advance.

Best,
Mahesh

creation of a parameter schema

All nf-core workflows include a json parameter schema file in the top level directory that outlines all of the parameters of the workflow. Here are some examples:
viralrecon
mycosnp
bactopia

nf-core includes a tool to help with creating the file if you are using nf-core schema builder.

StaPH-B Toolkit v2 will use this file to tell users which parameters can be supplied with each workflow. If there isn't a schema file the toolkit will just inform the user that help is not available for this workflow. Including this file for Cecret will be really helpful for users once I release v2 of the toolkit.

Issue setting maxcpus with v.2.0.2021115

I just updated to the newest version but I'm having issues with the cpu settings. I've been running the previous versions for a few months now (thanks for your hard work on this!) without issue on a machine with 6 cores + Docker, but now I'm getting errors such as:
Error executing process > 'bwa '

Caused by:
Process requirement exceed available CPUs -- req: 8; avail: 6

I've tried to override the params.maxcpus & params.medcpus and even hard coding -t or --threads to 6 throughout the Cecret.nf file for each process, but so far the error persists and the message starting the run displays:
The maximum number of CPUS used in this workflow is 8

Any ideas on what I should try next?

Error executing process > 'fastqc (12)'

I don't know what the problem is. I reinstalled fastqc from the link, but I can't fix it. Please help.

The complete output is below:

Error executing process > 'fastqc (12)'

Caused by:
Process fastqc (12) terminated with an error exit status (1)

Command executed:

mkdir -p fastqc logs/fastqc
log_file=logs/fastqc/12.427d29a5-7e5f-482e-82cd-7be02fedbeea.log
err_file=logs/fastqc/12.427d29a5-7e5f-482e-82cd-7be02fedbeea.err

# time stamp + capturing tool versions
date | tee -a $log_file $err_file > /dev/null
fastqc --version >> $log_file

fastqc --outdir fastqc --threads 1 12_S12_L001_R1_001.fastq.gz 12_S12_L001_R2_001.fastq.gz 2>> $err_file >> $log_file

zipped_fastq=($(ls fastqc/*fastqc.zip) "")

raw_1=$(unzip -p ${zipped_fastq[0]} */fastqc_data.txt | grep "Total Sequences" | awk '{ print $3 }' )
raw_2=NA
if [ -f "${zipped_fastq[1]}" ] ; then raw_2=$(unzip -p fastqc/*fastqc.zip */fastqc_data.txt | grep "Total Sequences" | awk '{ print $3 }' ) ; fi

if [ -z "$raw_1" ] ; then raw_1="0" ; fi
if [ -z "$raw_2" ] ; then raw_2="0" ; fi

Command exit status:
1

Command output:
(empty)

Command error:
WARNING: failed to set O_CLOEXEC flags on image
WARNING: failed to set O_CLOEXEC flags on image
ERROR : Failed to set securebits: Invalid argument
ERROR : Failed to set securebits: Invalid argument

Nextclade needs an update

The erin-dev branch actually has a fix for this.

The error goes something like this

  --2021-11-16 22:05:08--  https://raw.githubusercontent.com/nextstrain/nextclade/master/data/sars-cov-2/tree.json
  Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
  Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
  HTTP request sent, awaiting response... 404 Not Found
  2021-11-16 22:05:08 ERROR 404: Not Found.

NextStrain told everyone months ago that this would happen, so there is effort to get this corrected. We've had internal issues that have prevented us from releasing the next version of Cecret.

WARN: Unable to fetch attribute for file...

Are these warning messages a problem for the outgroup files or can I safely ignore them?

WARN: Unable to fetch attribute for file: /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/configs/NC_063383.1.gff - Hash is inferred from Git repository commit Id
WARN: Unable to fetch attribute for file: /app/becksts/.nextflow/assets/UPHL-BioNGS/Cecret/configs/NC_063383.1.gff - Hash is inferred from Git repository commit Id

[heads up] Nextclade v2.0

Today, Nextclade v2.0 was released, with a lot of improvements (especially in performance). However, the new version contains a couple of breaking changes that may break your workflow…

  • The input FASTA argument is now a positional argument, and it's now possible to specify more than one input FASTA file (no more need to cat them together!).
  • The --output-dir flag has now been superseded by the --output-all flag. (Also, the --output-{json,csv,tsv,tree} flags have become redundant if you only use the default names for those output files. Here is the file that specifies the CLI flags and the default output file names.)

Best of luck with the new version!

Error executing process > 'pangolin (SARS-CoV-2 lineage Determination)'

This was using the git cloned version of Cecret. Error from NextFlow:

Error executing process > 'pangolin (SARS-CoV-2 lineage Determination)'

Caused by:
Process pangolin (SARS-CoV-2 lineage Determination) terminated with an error exit status (2)

Command executed:

mkdir -p pangolin logs/pangolin
log_file=logs/pangolin/pangolin.11a368d3-9f2d-4abc-b735-f07ad370a00d.log
err_file=logs/pangolin/pangolin.11a368d3-9f2d-4abc-b735-f07ad370a00d.err

date | tee -a $log_file $err_file > /dev/null
pangolin --all-versions >> $log_file

for fasta in 2111180087.consensus.fa
do
cat $fasta >> ultimate_fasta.fasta
done

pangolin --threads 4 --outdir pangolin ultimate_fasta.fasta 2>> $err_file >> $log_file
cp ultimate_fasta.fasta pangolin/combined.fasta

Command exit status:
2

Command output:
(empty)

Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
usage: pangolin [options]
pangolin: error: unrecognized arguments: --all-versions

Work dir:
/home/mdubfx/cecret_test/Cecret/work/be/1e5b21170bcb2897537dc02b9f6f8b

Tip: when you have fixed the problem you can continue the execution adding the option -resume to the run command line

The StaPH-B version of Cecret is also crashing at the pangolin stage. Thanks!

Add collection dates for Freyja

Freyja can create a cool figure if collection dates are added for each sample.

Right now there's no way to incorporate this into Cecret.

bedtools multicov overlap threshold is too low

Cecret runs bedtools to calculate the read depth of each amplicon using this command:
bedtools multicov -bams !{bam} -bed amplicon.bed
This counts the number of alignments in the bam file that overlap each amplicon region defined in the bed file. Bedtools' default definition of "overlap" is 1bp, so when bedtools counts up the number of alignments overlapping amplicon 5, it reports the number of alignments for amplicons 4, 5, and 6, since amplicons 4 and 6 overlap 5. If you require an overlap of 50% (-f .5), then the overlapping alignments from amplicons 4 and 6 won't be included in the count.
This will bring the number of "failed amplicons" reported by bedtools more in line with the number reported by samtools.

Here's an example of bedtools multicov with the default 1bp overlap:
MN908947.3 30 410 nCoV-2019_1 7450
MN908947.3 320 726 nCoV-2019_2 13592
MN908947.3 642 1028 nCoV-2019_3 13859
MN908947.3 943 1337 nCoV-2019_4 10224
MN908947.3 1242 1651 nCoV-2019_5 7396
MN908947.3 1573 1964 nCoV-2019_6 8599
MN908947.3 1875 2269 nCoV-2019_7 6648
MN908947.3 2181 2592 nCoV-2019_8 4817
MN908947.3 2505 2904 nCoV-2019_9 6788
MN908947.3 2826 3210 nCoV-2019_10 10485

And here's the same thing but requiring an overlap of 50% (-f .5)
MN908947.3 30 410 nCoV-2019_1 383
MN908947.3 320 726 nCoV-2019_2 6930
MN908947.3 642 1028 nCoV-2019_3 5911
MN908947.3 943 1337 nCoV-2019_4 906
MN908947.3 1242 1651 nCoV-2019_5 3347
MN908947.3 1573 1964 nCoV-2019_6 3047
MN908947.3 1875 2269 nCoV-2019_7 2151
MN908947.3 2181 2592 nCoV-2019_8 1358
MN908947.3 2505 2904 nCoV-2019_9 1283
MN908947.3 2826 3210 nCoV-2019_10 4117

When an overlap of 50% is required, bedtools correctly reports 383 alignments for amplicon 1 and 6,930 for amplicon 2. With the default 1bp overlap, the number of amplicon 1 alignments is reported as 7,450, which is (approximately) the number of alignments to amplicons 1 and 2 combined (383 + 6,930 = 7,313).

All N's in MPX consensus

Our local MPX sequence had all N's in its consensus sequence, whereas a comparative SRR sequence did not. So a complete SNP matrix was not produced, though bcftools_variants ran successfully for both.

I therefore uncommented 3 lines in cecret.config to make sure ivar_consensus was working, but had the same results.

Do I need to use ivar_variants instead of bcftools_variants?

segmentation fault at ivar trim step

Hello @erinyoung

The pipeline fails at ivar trim step.
Here's the log:

WARNING: The BED file provided did not have the expected score column, but iVar will continue trimming

iVar uses the standard 6 column BED format as defined here - https://genome.ucsc.edu/FAQ/FAQformat.html#format1.
It requires the following columns delimited by a tab: chrom, chromStart, chromEnd, name, score, strand

WARNING: The BED file provided did not have the expected score column, but iVar will continue trimming

Found 218 primers in BED file
Amplicons detected:
Segmentation fault

Any ideas what might be the issue?

Feature Request : Amplicon bedfile

For people that use custom primers, an amplicon bedfile would be useful for bedtools. I created one for SARS-CoV-2 artic V3 primers, but it would be better if it was user supplied, with some documentation on how to create it locally.

what if I want the input as assembled fastas?

Hi,
Impressive binf tool! May I ask what shall I do if I just want the input to be the assembled fastas to get the analysis such as phylogeny, pangolin, vadr, etc.?
Regards,
Shaokang

Add option to bypass trimming

Actually, this entire workflow needs a facelift and needs to be simplified.

Cecret will be more useful if there's an option to bypass primer trimming.

Apple M1 & Docker Issue

I've run the 'nextflow run Cecret.nf -c configs/docker.config' command to test 3 samples and I have encountered an error.

N E X T F L O W ~ version 21.10.1
Launching Cecret.nf [hungry_faggin] - revision: 9cede0c1f7
Currently using the Cecret workflow for use with amplicon-based Illumina hybrid library prep on MiSeq

Author: Erin Young
email: [email protected]
Version: v.2.2.20211220

Fastq file found : 7093-MS-1_80
Fastq file found : 7093-MS-1_7
Fastq file found : 7093-MS-1_81
The maximum number of CPUS used in this workflow is 8
The files and directory for results is /Users/vestalg/Cecret-master 2/cecret
A table summarizing results will be created: /Users/vestalg/Cecret-master 2/cecret/summary.txt and /Users/vestalg/Cecret-master 2/cecret_run_results.txt

Reference Genome : /Users/vestalg/Cecret-master 2/configs/MN908947.3.fasta
GFF file for Reference Genome : /Users/vestalg/Cecret-master 2/configs/MN908947.3.gff
Primer BedFile : /Users/vestalg/Cecret-master 2/configs/artic_V3_nCoV-2019.bed
Amplicon BedFile : /Users/vestalg/Cecret-master 2/configs/nCoV-2019.insert.bed

executor > local (6)
[be/112020] process > fastqc (7093-MS-1_81) [ 0%] 0 of 3
[09/afc9fb] process > seqyclean (7093-MS-1_81) [ 0%] 0 of 3
executor > local (6)
[be/112020] process > fastqc (7093-MS-1_81) [ 0%] 0 of 3
[09/afc9fb] process > seqyclean (7093-MS-1_81) [ 25%] 1 of 4, failed: 1, r..
[- ] process > bwa -
[- ] process > sort -
executor > local (10)
[04/185650] process > fastqc (7093-MS-1_80) [ 40%] 2 of 5, failed: 2, r..
[e8/d25ab3] process > seqyclean (7093-MS-1_80) [ 50%] 3 of 6, failed: 3, r..
[- ] process > bwa -
[- ] process > sort -
[- ] process > filter -
[- ] process > ivar_trim -
[- ] process > ivar_variants -
[- ] process > ivar_consensus -
[- ] process > fasta_prep -
[- ] process > bcftools_variants -
[- ] process > bamsnap -
[- ] process > samtools_stats -
executor > local (12)
[35/a63224] process > fastqc (7093-MS-1_81) [ 50%] 3 of 6, failed: 3, r..
[ed/d3f88a] process > seqyclean (7093-MS-1_80) [ 50%] 3 of 6, failed: 3, r..
[- ] process > bwa -
[- ] process > sort -
[- ] process > filter -
[- ] process > ivar_trim -
[- ] process > ivar_variants -
[- ] process > ivar_consensus -
[- ] process > fasta_prep -
[- ] process > bcftools_variants -
[- ] process > bamsnap -
[- ] process > samtools_stats -
[- ] process > samtools_coverage -
[- ] process > samtools_flagstat -
[- ] process > samtools_depth -
[- ] process > kraken2 -
[- ] process > bedtools_multicov -
[- ] process > samtools_ampliconstats -
[- ] process > samtools_plot_ampliconstats -
[- ] process > pangolin -
[- ] process > nextclade -
[- ] process > vadr -
[- ] process > summary -
[- ] process > combine_results -
executor > local (12)
[04/185650] process > fastqc (7093-MS-1_80) [ 66%] 4 of 6, failed: 4, r..
[ed/d3f88a] process > seqyclean (7093-MS-1_80) [ 50%] 3 of 6, failed: 3, r..
[- ] process > bwa -
[- ] process > sort -
[- ] process > filter -
[- ] process > ivar_trim -
[- ] process > ivar_variants -
[- ] process > ivar_consensus -
[- ] process > fasta_prep -
[- ] process > bcftools_variants -
[- ] process > bamsnap -
[- ] process > samtools_stats -
[- ] process > samtools_coverage -
[- ] process > samtools_flagstat -
[- ] process > samtools_depth -
[- ] process > kraken2 -
[- ] process > bedtools_multicov -
[- ] process > samtools_ampliconstats -
[- ] process > samtools_plot_ampliconstats -
[- ] process > pangolin -
[- ] process > nextclade -
[- ] process > vadr -
[- ] process > summary -
[- ] process > combine_results -
[09/afc9fb] NOTE: Process seqyclean (7093-MS-1_81) terminated with an error exit status (127) -- Execution is retried (1)
[b3/7b5c47] NOTE: Process fastqc (7093-MS-1_7) terminated with an error exit status (127) -- Execution is retried (1)
[ae/30a22f] NOTE: Process fastqc (7093-MS-1_80) terminated with an error exit status (127) -- Execution is retried (1)
[af/4e7788] NOTE: Process seqyclean (7093-MS-1_7) terminated with an error exit status (127) -- Execution is retried (1)
[e8/d25ab3] NOTE: Process seqyclean (7093-MS-1_80) terminated with an error exit status (127) -- Execution is retried (1)
[be/112020] NOTE: Process fastqc (7093-MS-1_81) terminated with an error exit status (127) -- Execution is retried (1)
Error executing process > 'fastqc (7093-MS-1_80)'

Caused by:
Process fastqc (7093-MS-1_80) terminated with an error exit status (127)

Command executed:

mkdir -p fastqc logs/fastqc
log_file=logs/fastqc/7093-MS-1_80.f574859a-d7c7-4223-a402-3fd94fdd9e50.log
err_file=logs/fastqc/7093-MS-1_80.f574859a-d7c7-4223-a402-3fd94fdd9e50.err

# time stamp + capturing tool versions

date | tee -a $log_file $err_file > /dev/null
fastqc --version >> $log_file

fastqc --outdir fastqc --threads 1 7093-MS-1_80_S1_L005_R1_001.fastq.gz 7093-MS-1_80_S1_L005_R2_001.fastq.gz 2>> $err_file >> $log_file

zipped_fastq=($(ls fastqc/*fastqc.zip) "")

raw_1=$(unzip -p ${zipped_fastq[0]} */fastqc_data.txt | grep "Total Sequences" | awk '{ print $3 }' )
raw_2=NA
if [ -f "${zipped_fastq[1]}" ] ; then raw_2=$(unzip -p fastqc/*fastqc.zip */fastqc_data.txt | grep "Total Sequences" | awk '{ print $3 }' ) ; fi

if [ -z "$raw_1" ] ; then raw_1="0" ; fi
if [ -z "$raw_2" ] ; then raw_2="0" ; fi

Command exit status:
127

Command output:
(empty)

Command error:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
/bin/bash: line 0: export: `2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/FastQC/': not a valid identifier
.command.sh: line 2: mkdir: command not found

Work dir:
/Users/vestalg/Cecret-master 2/work/04/185650992b0ea86d57547adb49c938

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

Is this related to the Apple M1 chip I am using? I've attempted to use Option 1 and Option 2 to run from Docker and Singularity and I encounter an issue with the StaphB FastQC Docker container:

FATAL: Unable to pull docker://staphb/fastqc:latest: conveyor failed to get: no descriptor found for reference "5d67fad373325a597d0eb75d0986ad7dd78f8b2b67e78435d9176c434a8cec04"

Is this an issue with the Apple M1 chips?
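
Two things stand out in the log above and can be checked before launching: the platform warning (a linux/amd64 image on a linux/arm64/v8 host) and the `export: ... not a valid identifier` line, which suggests the space in the `Cecret-master 2` directory name may be splitting the PATH assignment. The second point is an inference from the log, not a confirmed diagnosis. The sketch below is a hypothetical pre-flight check; the function name `preflight` and its messages are assumptions and are not part of Cecret or Nextflow.

```python
import platform


def preflight(project_dir, machine=None):
    """Return human-readable warnings for a planned run directory and host CPU."""
    machine = machine or platform.machine()
    warnings = []
    # A space in the path can break unquoted shell assignments such as PATH exports.
    if any(ch.isspace() for ch in project_dir):
        warnings.append(
            f"'{project_dir}' contains whitespace; consider a path without spaces"
        )
    # Apple silicon reports arm64 (aarch64 inside Linux containers).
    if machine.lower() in ("arm64", "aarch64"):
        warnings.append(
            f"host CPU is {machine}; amd64-only images need emulation, "
            "for example docker's '--platform linux/amd64' option"
        )
    return warnings


# Values taken from the log above; pass machine=None to detect the current host.
for warning in preflight("/Users/vestalg/Cecret-master 2", machine="arm64"):
    print("WARNING:", warning)
```

Whether the containers used here run correctly under emulation is not something this check can tell you; it only flags the two conditions visible in the log.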

mpx:vadr (QC metrics) error

The last error I am seeing with Cecret is that vadr is terminating:

[e4/abdf4a] NOTE: Process mpx:vadr (QC metrics) terminated with an error exit status (255) -- Error is ignored

Add freyja

Freyja is a tool for estimating the relative abundance of SARS-CoV-2 lineages in mixed samples, such as co-occurring infections.

iqtree2 failure: nextalign.aligned.fasta.renamed has wrong outgroup header for iqtree2

iqtree2 fails because of a wrong header for the outgroup.

outgroup="-o NC_063383"
cat nextalign.aligned.fasta | sed 's/NC_063383.*/NC_063383/g' > nextalign.aligned.fasta.renamed

nextalign.aligned.fasta header is:

MPXV_USA_2022_MA001 in NC_063383 coordinates

So nextalign.aligned.fasta.renamed header is:

MPXV_USA_2022_MA001 in NC_063383

Instead of this:

NC_063383

Solution is to fix the sed command like this:
sed 's/^.*NC_063383.*/NC_063383/g'
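
The same header rewrite can be sketched in Python, which makes two details explicit: the whole header line is replaced, not only the text after the accession, and the leading `>` of the FASTA header is kept. This is a minimal illustration, not the workflow's code; the function name `rename_outgroup_header` is an assumption, and the example assumes standard FASTA headers that begin with `>`.

```python
import re


def rename_outgroup_header(fasta_text, outgroup):
    """Replace any header line that mentions the outgroup with '>outgroup'."""
    # re.escape keeps a '.' in an accession such as NC_063383.1 literal,
    # instead of matching any character.
    pattern = re.compile(r"^>.*" + re.escape(outgroup) + r".*$", flags=re.MULTILINE)
    # A function replacement avoids any special handling of backslashes.
    return pattern.sub(lambda _match: ">" + outgroup, fasta_text)


# Header text taken from the issue above; the sequences are placeholders.
aligned = (
    ">MPXV_USA_2022_MA001 in NC_063383 coordinates\n"
    "ACGT\n"
    ">sample1\n"
    "ACGT\n"
)
print(rename_outgroup_header(aligned, "NC_063383"))
```

Headers that do not mention the outgroup are left untouched, so the other samples in the alignment keep their names.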

iqtree2 failure for monkeypox - wrong outgroup header in nextalign.aligned.fasta

Running from UPHL-BioNGS/Cecret with the following execution error:

[24/8ac928] NOTE: Process msa:iqtree2 (Creating phylogenetic tree with iqtree) terminated with an error exit status (2) -- Execution is retried (1)
Error executing process > 'msa:iqtree2 (Creating phylogenetic tree with iqtree)'

Caused by:
Process msa:iqtree2 (Creating phylogenetic tree with iqtree) terminated with an error exit status (2)

Command executed:

mkdir -p iqtree2 logs/msa:iqtree2
log_file=logs/msa:iqtree2/msa:iqtree2.7be2fac2-15c3-4970-aa30-aadd189177a2.log
err_file=logs/msa:iqtree2/msa:iqtree2.7be2fac2-15c3-4970-aa30-aadd189177a2.err

date | tee -a $log_file $err_file > /dev/null
iqtree2 --version >> $log_file

if [ -n "NC_063383.1" ] && [ "NC_063383.1" != "null" ] && [ "nextalign" != "nextclade" ]
then
outgroup="-o NC_063383.1"
cat nextalign.aligned.fasta | sed 's/NC_063383.1.*/NC_063383.1/g' > nextalign.aligned.fasta.renamed
else
outgroup=""
mv nextalign.aligned.fasta nextalign.aligned.fasta.renamed
fi

# creating a tree

iqtree2 -ninit 2 -n 2 -me 0.05 -m GTR -o NC_063383.1 -nt AUTO -ntmax 8 -s nextalign.aligned.fasta.renamed -pre iqtree2/iqtree2 $outgroup >> $log_file 2>> $err_file

Command exit status:
2

Command output:
(empty)

Work dir:
/data/Sequence_analysis/Cecret/Analyses/monkeypox/iSeqs_Runs_220727/work/14/3f4bbbaca74412d37dbeb3cfe80aa3

==
The sequence header in nextalign.aligned.fasta was:

ref_in_coord Reference sequence in coord.fasta coordinates

It should have been:

NC_063383.1

So the sed command above didn't work:
sed 's/NC_063383.1.*/NC_063383.1/g'
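
Because the sed pattern finds nothing to rename when the reference header does not contain the accession, one defensive option is to confirm that the outgroup name is really present among the alignment's sequence IDs before passing `-o` to iqtree2. The sketch below illustrates that check only; the helper names `fasta_ids` and `outgroup_option` are assumptions and are not part of Cecret, and the example assumes FASTA headers that begin with `>`.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def outgroup_option(fasta_text, outgroup):
    """Return ['-o', outgroup] only if the outgroup is a sequence ID in the alignment."""
    return ["-o", outgroup] if outgroup in fasta_ids(fasta_text) else []


# Header text taken from the issue above; the sequences are placeholders.
aligned = (
    ">ref_in_coord Reference sequence in coord.fasta coordinates\n"
    "ACGT\n"
    ">sample1\n"
    "ACGT\n"
)
print(fasta_ids(aligned))
print(outgroup_option(aligned, "NC_063383.1"))
```

With the header reported in this issue, the check returns no outgroup option, so the tree would be built unrooted instead of failing on a missing outgroup name.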
