
blobtoolkit's People

Contributors

alxndrdiaz, gq1, muffato, priyanka-surana, sujaikumar, zb32

blobtoolkit's Issues

subworkflow: blobtools

Nextflow implementation of blobtools.smk, rules: run_blobtools_create.smk, run_blobtools_add.smk, add_summary_to_metadata.smk; nf-core modules:

  • blobtools2 (not implemented), requires: containers and module
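
A minimal sketch of how this subworkflow could be wired in DSL2, assuming local BLOBTOOLKIT_CREATEBLOBDIR and BLOBTOOLKIT_UPDATEBLOBDIR modules (module paths, channel shapes and output names below are assumptions, not the pipeline's actual code):

  // Sketch only: wire blobtools create/add as a DSL2 subworkflow.
  include { BLOBTOOLKIT_CREATEBLOBDIR } from '../../modules/local/blobtoolkit/createblobdir'
  include { BLOBTOOLKIT_UPDATEBLOBDIR } from '../../modules/local/blobtoolkit/updateblobdir'

  workflow BLOBTOOLS {
      take:
      fasta    // channel: [ val(meta), path(fasta) ]
      busco    // channel: [ val(meta), path(busco_table) ]
      hits     // channel: [ val(meta), path(blast_hits) ]

      main:
      ch_versions = Channel.empty()

      // blobtools create: initialise the BlobDir from the assembly
      BLOBTOOLKIT_CREATEBLOBDIR ( fasta )
      ch_versions = ch_versions.mix ( BLOBTOOLKIT_CREATEBLOBDIR.out.versions.first() )

      // blobtools add: import BUSCO results and similarity-search hits
      BLOBTOOLKIT_UPDATEBLOBDIR ( BLOBTOOLKIT_CREATEBLOBDIR.out.blobdir, busco, hits )
      ch_versions = ch_versions.mix ( BLOBTOOLKIT_UPDATEBLOBDIR.out.versions.first() )

      emit:
      blobdir  = BLOBTOOLKIT_UPDATEBLOBDIR.out.blobdir
      versions = ch_versions
  }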

Clean up docs

Description of feature

  1. Create a new folder called docs/decisions_records with a README.md
  2. docs/images - It should only have 2 files
    • sanger-tol-readmapping_logo.png – The logo for the pipeline
    • sanger-tol-readmapping_workflow.png – The workflow diagram
  3. docs/usage.md - Move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
  4. docs/output.md

Updates in place and broken Nextflow job cache

Description of feature

In version 0.2.0 of the pipeline, the job cache only functions up to the BLOBTOOLKIT_CREATEBLOBDIR process. Something downstream must be modifying an input file or parameter, breaking the cache mechanism.

My guess is that BLOBTOOLKIT_UPDATEBLOBDIR and others are updating the blobdir in place and therefore modifying its timestamp, which is one of the elements Nextflow considers when checking processes against the cache.

Even though the remaining processes don't take that long to rerun, we should make this cleaner.
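
One way to make it cleaner, sketched below under the assumption that the module copies the staged BlobDir and updates the copy (the input/output names and the blobtools add arguments are illustrative), is to never touch the staged input, so its timestamp and the upstream output it links to stay unchanged:

  // Sketch only: copy the staged BlobDir, update the copy, emit the copy.
  process BLOBTOOLKIT_UPDATEBLOBDIR {
      input:
      tuple val(meta), path(blobdir, stageAs: 'input_blobdir')
      tuple val(meta2), path(busco_table)

      output:
      tuple val(meta), path("${prefix}"), emit: blobdir
      path "versions.yml",                emit: versions

      script:
      prefix = task.ext.prefix ?: "${meta.id}"
      """
      # -L dereferences the staged symlink so we get a real, writable copy
      cp -rL input_blobdir ${prefix}
      blobtools add \\
          --busco ${busco_table} \\
          ${prefix}

      cat <<-END_VERSIONS > versions.yml
      "${task.process}":
          blobtoolkit: \$(btk --version | cut -d' ' -f2 | sed 's/v//')
      END_VERSIONS
      """
  }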

Clean up

Description of feature

To close this issue, systematically clean up the pipeline for future template updates.

Workflow diagram

Description of feature

To close this issue, create the workflow diagram and add it to README.md under the Pipeline Summary section. The actual .png file is stored in docs/images with the name sanger-tol-readmapping_workflow.png.

Separate BUSCO pipeline ?

Given that we need to run BUSCO in at least two places (genomenote and blobtoolkit) and that we may want to run just BUSCO on some assemblies, is it worth making a pipeline just for running BUSCO? It would probably have a single extra step to fetch lineages.

The alternative is running blobtoolkit with a set of ext.when conditions that essentially limit it to BUSCO.

Regardless of which pipeline has BUSCO, the resulting gene annotations should be processed the same way ensemblgenedownload does.
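
If we go the ext.when route, the restriction could live entirely in configuration, roughly like the sketch below (the params.busco_only flag and the process-name pattern are assumptions; nf-core modules honour ext.when via their when: block):

  // Sketch only: skip the non-BUSCO processes when a hypothetical
  // --busco_only flag is set.
  process {
      withName: '.*:(MINIMAP2_ALIGN|DIAMOND_BLASTX|DIAMOND_BLASTP|BLASTN_BLASTN|BLOBTOOLKIT_.*)' {
          ext.when = { !params.busco_only }
      }
  }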

nf-core pipeline: blobtoolkit

Based on the BlobToolkit Snakemake pipeline, convert each sub-pipeline into a Nextflow subworkflow (a rough skeleton of how these could be chained is sketched after the list):

  1. “minimap.smk - align reads to the genome assembly using minimap2”.

  2. “windowmasker.smk - identify and mask repetitive regions using Windowmasker. Masked sequences are used in all blast searches”.

  3. “chunk_stats.smk - calculate sequence statistics in 1kb windows for each contig”.

  4. “busco.smk - run BUSCO using specific and basal lineages. Count BUSCOs in 1kb windows for each contig”.

  5. “cov_stats - calculate coverage in 1kb windows using mosdepth”.

  6. “window_stats - aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb)”.

  7. “diamond_blastp.smk - Diamond blastp search of busco gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes”.

  8. “diamond.smk - Diamond blastx search of assembly contigs against the UniProt reference proteomes. Contigs are split into chunks to allow distribution-based taxrules. Contigs over 1Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk.”

  9. “blastn.smk - NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database”.

  10. “blobtools.smk - import analysis results into a BlobDir dataset”.

  11. “view.smk - BlobDir validation and static image generation”.
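
A rough skeleton of how the converted subworkflows could be chained, assuming one local subworkflow per Snakemake sub-pipeline (every name, channel and output below is a placeholder, not the pipeline's actual code):

  // Sketch only: one subworkflow per sub-pipeline, in the order listed above.
  include { MINIMAP_ALIGNMENT } from './subworkflows/local/minimap_alignment'
  include { WINDOWMASKER      } from './subworkflows/local/windowmasker'
  include { CHUNK_STATS       } from './subworkflows/local/chunk_stats'
  include { BUSCO_LINEAGES    } from './subworkflows/local/busco_lineages'
  include { COV_STATS         } from './subworkflows/local/cov_stats'
  include { WINDOW_STATS      } from './subworkflows/local/window_stats'
  include { DIAMOND_BLASTP    } from './subworkflows/local/diamond_blastp'
  include { DIAMOND_BLASTX    } from './subworkflows/local/diamond_blastx'
  include { BLASTN_NOHIT      } from './subworkflows/local/blastn_nohit'
  include { BLOBTOOLS         } from './subworkflows/local/blobtools'
  include { VIEW              } from './subworkflows/local/view'

  workflow BLOBTOOLKIT {
      take:
      fasta    // channel: [ val(meta), path(assembly) ]
      reads    // channel: [ val(meta), path(reads) ]

      main:
      MINIMAP_ALIGNMENT ( fasta, reads )                          //  1. minimap.smk
      WINDOWMASKER ( fasta )                                      //  2. windowmasker.smk
      CHUNK_STATS ( fasta )                                       //  3. chunk_stats.smk
      BUSCO_LINEAGES ( fasta )                                    //  4. busco.smk
      COV_STATS ( MINIMAP_ALIGNMENT.out.bam )                     //  5. cov_stats (mosdepth)
      WINDOW_STATS ( CHUNK_STATS.out.tsv, COV_STATS.out.tsv,
                     BUSCO_LINEAGES.out.counts )                  //  6. window_stats
      DIAMOND_BLASTP ( BUSCO_LINEAGES.out.proteins )              //  7. diamond_blastp.smk
      DIAMOND_BLASTX ( WINDOWMASKER.out.masked )                  //  8. diamond.smk
      BLASTN_NOHIT ( DIAMOND_BLASTX.out.nohit )                   //  9. blastn.smk
      BLOBTOOLS ( fasta,
                  BUSCO_LINEAGES.out.tables,
                  DIAMOND_BLASTX.out.hits.mix ( BLASTN_NOHIT.out.hits,
                                                DIAMOND_BLASTP.out.hits ) ) // 10. blobtools.smk
      VIEW ( BLOBTOOLS.out.blobdir )                              // 11. view.smk
  }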

Create entrez-direct nf-core module

The entrez-direct nf-core module is a dependency in BlobToolKitPipeline. "Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to construct multi-step queries. Selected records can then be retrieved in a variety of formats".
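
A minimal sketch of what such a module could look like, assuming all we need here is the assembly record for a given accession (the process name, inputs, outputs and query are placeholders; esearch and esummary are standard EDirect tools):

  // Sketch only: fetch an NCBI assembly record for an accession with EDirect.
  process ENTREZDIRECT_ESEARCH {
      tag "${meta.id}"

      input:
      tuple val(meta), val(accession)

      output:
      tuple val(meta), path("*.xml"), emit: xml

      script:
      """
      esearch -db assembly -query "${accession}" \\
          | esummary \\
          > ${accession}.esummary.xml
      """
  }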

summary.json missing from the output blobdir

Description of the bug

The pipeline runs BLOBTOOLKIT_SUMMARY, which generates a *.summary.json file, but that file is not copied to the output blobdir.
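
A possible fix, sketched as an nf-core-style modules.config entry (the selector, the output path, and the assumption that meta is available to the publishDir closure are all illustrative), is to publish the *.summary.json file into the same output directory as the blobdir:

  // Sketch only: publish the summary JSON alongside the BlobDir output.
  process {
      withName: '.*:BLOBTOOLKIT_SUMMARY' {
          publishDir = [
              path: { "${params.outdir}/blobtoolkit/${meta.id}" },
              mode: params.publish_dir_mode,
              pattern: '*.summary.json'
          ]
      }
  }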

Command used and terminal output

No response

Relevant files

No response

System information

Pipeline v0.3.0

subworkflow: view

Nextflow implementation of view.smk, rules: validate_dataset.smk, generate_images.smk, generate_summary.smk, checksum_files.smk; nf-core modules:

  • blobtools2 (not implemented), requires: containers and module
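
A minimal DSL2 sketch of this subworkflow, assuming one local module per rule (module names, paths and outputs are assumptions; checksum generation is omitted for brevity; BLOBTOOLKIT_SUMMARY matches the process name used elsewhere in the pipeline):

  // Sketch only: validation, image generation and summary generation.
  include { BLOBTOOLKIT_VALIDATE } from '../../modules/local/blobtoolkit/validate'
  include { BLOBTOOLKIT_IMAGES   } from '../../modules/local/blobtoolkit/images'
  include { BLOBTOOLKIT_SUMMARY  } from '../../modules/local/blobtoolkit/summary'

  workflow VIEW {
      take:
      blobdir    // channel: [ val(meta), path(blobdir) ]

      main:
      ch_versions = Channel.empty()

      BLOBTOOLKIT_VALIDATE ( blobdir )    // validate_dataset.smk
      BLOBTOOLKIT_IMAGES ( blobdir )      // generate_images.smk
      BLOBTOOLKIT_SUMMARY ( blobdir )     // generate_summary.smk

      ch_versions = ch_versions.mix ( BLOBTOOLKIT_SUMMARY.out.versions.first() )

      emit:
      images   = BLOBTOOLKIT_IMAGES.out.png
      summary  = BLOBTOOLKIT_SUMMARY.out.json
      versions = ch_versions
  }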

Make BUSCO annotations an optional input

Related to #101.

If we make a separate BUSCO pipeline, we should support blobtoolkit taking pre-computed BUSCO annotations.

It may also have some value if we want to run the pipeline on the same assembly in two phases, but what's the advantage of that versus using the native -resume functionality?
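
A sketch of how the optional input could look in the main workflow, assuming a hypothetical --busco parameter pointing at pre-computed BUSCO output directories, and where ch_fasta, ch_lineages and BUSCO_LINEAGES are placeholders for the usual assembly channel, lineage channel and BUSCO subworkflow:

  // Sketch only: --busco is a hypothetical parameter; names are placeholders.
  if (params.busco) {
      // Use pre-computed BUSCO results supplied by the user
      ch_busco = Channel.fromPath(params.busco, checkIfExists: true)
          .map { dir -> [ [ id: params.accession ], dir ] }
  } else {
      // Otherwise run BUSCO inside the pipeline as before
      BUSCO_LINEAGES ( ch_fasta, ch_lineages )
      ch_busco = BUSCO_LINEAGES.out.busco_dir
  }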

Unit testing - sample databases

For unit testing the pipeline, create taxon-restricted subsets of

  • uniprot diamond blastp database
  • ncbi nt blastn database
  • busco lineages

Put these in /lustre/scratch123/tol/resources and synchronise them with S3 (so others can access them too).

Create instructions for re-creating both the full databases and the sub-sampled databases.

error with FASTAWINDOWS process

Hi there,
I'm trying to run the Nextflow pipeline on my draft genome using the command:
nextflow run sanger-tol/blobtoolkit -r 0.2.0 -resume -profile docker --input Ahemp.csv --fasta Ahemp_final.fasta --yaml Ahemp.yaml --accession Ahemp --taxon "Acropora hemprichii" --taxdump /databases/taxdump --blastp /databases/uniprot/reference_proteomes.dmnd --blastn /databases/20230316-ncbi/nt/nt --blastx /databases/uniprot/reference_proteomes.dmnd

But it always fails with the following error. Any hints?

-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)'

Caused by:
  Missing output file(s) `*_window_stats*.tsv` expected by process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)`

Command executed:

  btk pipeline window-stats \
          --in Ahemp.tsv \
          --window 0.1 --window 0.01 --window 1 --window 100000 --window 1000000 \
          --out Ahemp_window_stats.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS":
      blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Missing BUSCO results

Description of the bug

The BUSCO page only shows the metazoa_odb10 results whereas there should be eukaryota, archaea, and bacteria as well.

All four are computed by the pipeline and make their way to the blobdir.

Command used and terminal output

No response

Relevant files

No response

System information

sanger-tol/blobtoolkit 0.2.0

Remove `parameters.md`

Description of feature

To close this issue:

  • Move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
  • Delete parameters.md
  • Delete the description of parameters.md from docs/README.md
  • Delete parameters from the "Documentation" section in README.md
