sanger-tol / blobtoolkit
Nextflow pipeline for BlobToolKit for the Sanger ToL production suite
Home Page: https://pipelines.tol.sanger.ac.uk/blobtoolkit
License: MIT License
Update the input config to fix the generated summary YAML.
Import the readmapping pipeline (minus the statistics subworkflow)
Nextflow implementation of blobtools.smk; rules: run_blobtools_create.smk, run_blobtools_add.smk, add_summary_to_metadata.smk; nf-core modules: blobtools2 (not implemented; requires containers and module).
- docs/decisions_records: with a README.md
- docs/images: it should only have 2 files
- docs/usage.md: move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
- docs/output.md
In version 0.2.0 of the pipeline, the job cache only functions up to the BLOBTOOLKIT_CREATEBLOBDIR process. Something downstream must be modifying an input file or parameter, breaking the cache mechanism. My guess is that BLOBTOOLKIT_UPDATEBLOBDIR and others update the blobdir in place, thereby modifying its timestamp, which is one of the elements Nextflow considers when checking processes against the cache. Even though the remaining processes don't take long to rerun, we should make this cleaner.
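A cleaner pattern (a minimal Python sketch of the idea, not the pipeline's actual code; blobdirs are plain directories of JSON files) would be for each update step to copy the staged blobdir and write only into the copy, leaving the staged input's files and timestamps untouched:

```python
import json
import shutil
from pathlib import Path

def update_blobdir(staged: Path, out: Path, extra_field: dict) -> Path:
    """Copy the staged (input) blobdir and write updates only into the copy,
    so the staged directory's files and timestamps never change and the
    cache hash Nextflow computes for the input stays stable."""
    shutil.copytree(staged, out)
    # Hypothetical update step: add one new field file to the copy.
    (out / "extra_field.json").write_text(json.dumps(extra_field))
    return out
```

Publishing `out` as the process output (instead of mutating `staged`) should let `-resume` skip every upstream task.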
Nextflow implementation of chunk_stats.smk; rules: get_chunked_stats.smk (imports minimap and windowmasker); nf-core modules: blast_windowmasker (not implemented), minimap2_index and minimap2_align.
To close this issue, systematically clean up the pipeline for future template updates.
To close this issue, create and update the README.md with the workflow diagram under the Pipeline Summary section. The actual .png file is stored in docs/images with the name sanger-tol-readmapping_workflow.png.
Build and upload Bioconda, Docker, and Singularity containers for blobtools2.
Organisms: ?
Given that we need to run BUSCO in at least two places (genomenote and blobtoolkit), and that we may want to run just BUSCO on some assemblies, is it worth making a pipeline just for running BUSCO? It would probably have a single extra step to fetch lineages.
The alternative is running blobtoolkit with a set of ext.when conditions that essentially limit it to BUSCO.
Regardless of which pipeline has BUSCO, the resulting gene annotations should be processed the same way ensemblgenedownload does.
Implement i/o with the automated_input subworkflow.
Write entrez-direct nf-core module: entrezdirect/esummary; nf-core modules issue: new module: entrezdirect/esearch #1831.
Nextflow implementation of blastn.smk; rules: extract_nohit_fasta.smk, chunk_nohit_fasta.smk, run_blastn.smk, unchunk_blastn.smk; nf-core modules:
Write a pysam nf-core module.
Nextflow implementation of diamond_blastp.smk; rules: extract_busco_genes.smk, run_diamond_blastp.smk; nf-core modules:
Nextflow implementation of cov_stats.smk; rules: run_mosdepth.smk, add_cov_to_tsv.smk (pastes tables); nf-core modules:
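The table-pasting step in add_cov_to_tsv.smk amounts to joining a coverage column onto the per-window stats table. A minimal Python illustration of that join (column names here are illustrative, not the pipeline's actual headers):

```python
import csv
import io

def paste_coverage(stats_tsv: str, cov_tsv: str) -> str:
    """Append a coverage column from cov_tsv onto stats_tsv, matching rows
    by (sequence, start). Column names here are illustrative only."""
    cov = {(r["sequence"], r["start"]): r["coverage"]
           for r in csv.DictReader(io.StringIO(cov_tsv), delimiter="\t")}
    reader = csv.DictReader(io.StringIO(stats_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["coverage"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Windows missing from the coverage table default to 0.
        row["coverage"] = cov.get((row["sequence"], row["start"]), "0")
        writer.writerow(row)
    return out.getvalue()
```

Keying on (sequence, start) rather than blindly pasting columns side by side avoids misaligned rows when one table is sorted differently.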
After the containers are available, implement blobtools2 as an nf-core module (or subworkflow).
Based on the BlobToolkit Snakemake pipeline, convert each sub-pipeline into Nextflow subworkflows:
- minimap.smk: align reads to the genome assembly using minimap2.
- windowmasker.smk: identify and mask repetitive regions using Windowmasker. Masked sequences are used in all blast searches.
- chunk_stats.smk: calculate sequence statistics in 1kb windows for each contig.
- busco.smk: run BUSCO using specific and basal lineages. Count BUSCOs in 1kb windows for each contig.
- cov_stats.smk: calculate coverage in 1kb windows using mosdepth.
- window_stats.smk: aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb).
- diamond_blastp.smk: Diamond blastp search of BUSCO gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes.
- diamond.smk: Diamond blastx search of assembly contigs against the UniProt reference proteomes. Contigs are split into chunks to allow distribution-based taxrules. Contigs over 1Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk.
- blastn.smk: NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database.
- blobtools.smk: import analysis results into a BlobDir dataset.
- view.smk: BlobDir validation and static image generation.
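The window_stats aggregation step above can be sketched in a few lines of Python. This is a simplified illustration only (it averages per-1kb values into fixed-length windows; the real btk tool computes several statistics and also handles proportional windows):

```python
def aggregate_windows(values, contig_length, window_size):
    """Aggregate per-1kb values (one per 1 kb bin of a contig) into
    windows of `window_size` bp, averaging the bins inside each window.
    A sketch of the window_stats idea, not the btk implementation."""
    bins_per_window = window_size // 1000
    windows = []
    for start in range(0, contig_length, window_size):
        i = start // 1000  # index of the first 1 kb bin in this window
        chunk = values[i:i + bins_per_window]
        if chunk:
            windows.append(sum(chunk) / len(chunk))
    return windows
```

For the fixed-proportion variants (10%, 1% of contig length), `window_size` would simply be derived from `contig_length` before calling the same routine.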
The entrez-direct nf-core module is a dependency in BlobToolKitPipeline. "Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to construct multi-step queries. Selected records can then be retrieved in a variety of formats".
The pipeline runs BLOBTOOLKIT_SUMMARY, which generates a *.summary.json file, but that file is not copied to the output blobdir.
Pipeline v0.3.0
Nextflow implementation of window_stats.smk; rules: get_window_stats.smk (runs btk pipeline).
Nextflow implementation of view.smk; rules: validate_dataset.smk, generate_images.smk, generate_summary.smk, checksum_files.smk; nf-core modules: blobtools2 (not implemented; requires containers and module).
Write documentation for the blobtoolkit pipeline.
Organisms: ?
Write goat/taxonsearch nf-core module; nf-core modules issue: nf-core/modules#1864.
Related to #101.
If we make a separate BUSCO pipeline, we should support blobtoolkit taking pre-computed BUSCO annotations.
It may also have some value if we want to run the pipeline on the same assembly in two phases, but what's the advantage of that versus using the native -resume functionality?
For unit testing the pipeline:
- Create taxon-restricted subsets of the databases
- Put these in /lustre/scratch123/tol/resources, synchronise with s3 (so others can access them too)
- Create instructions for re-creating the FULL databases and the sub-sampled databases
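Building a taxon-restricted subset hinges on selecting every taxid under a chosen root node. A hedged sketch of that selection (assuming a child-to-parent map as parsed from taxdump's nodes.dmp; the function name is hypothetical):

```python
def descendant_taxids(root, parent_of):
    """Return the set of taxids in the subtree rooted at `root`, given a
    child -> parent mapping (as could be parsed from taxdump's nodes.dmp).
    Sequences whose taxid falls in this set would be kept in the subset."""
    # Invert the child -> parent map into parent -> children.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, set()).add(child)
    # Depth-first walk from the root, collecting every descendant.
    subtree, stack = {root}, [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in subtree:
                subtree.add(child)
                stack.append(child)
    return subtree
```

Filtering the nt/UniProt databases down to records whose taxid is in this set would then give a small, taxon-consistent test fixture.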
Write blast/windowmasker nf-core module; nf-core modules issue: new module: blast/windowmasker #1792.
Write entrez-direct nf-core module: entrezdirect/esearch; nf-core modules issue: new module: entrezdirect/esearch #1771.
Hi there,
I'm trying to run the Nextflow pipeline on my draft genome using the command:

```
nextflow run sanger-tol/blobtoolkit -r 0.2.0 -resume -profile docker --input Ahemp.csv --fasta Ahemp_final.fasta --yaml Ahemp.yaml --accession Ahemp --taxon "Acropora hemprichii" --taxdump /databases/taxdump --blastp /databases/uniprot/reference_proteomes.dmnd --blastn /databases/20230316-ncbi/nt/nt --blastx /databases/uniprot/reference_proteomes.dmnd
```

But it always fails with the following error. Any hints?
```
-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)'

Caused by:
  Missing output file(s) `*_window_stats*.tsv` expected by process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)`

Command executed:
  btk pipeline window-stats \
      --in Ahemp.tsv \
      --window 0.1 --window 0.01 --window 1 --window 100000 --window 1000000 \
      --out Ahemp_window_stats.tsv

  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS":
      blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
```
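One thing worth noting when debugging this: the declared output glob does match the file name the command is asked to write, which is easy to confirm, so "Missing output file(s)" here most likely means `btk pipeline window-stats` exited 0 without ever creating the file (possibly because the input TSV was empty or malformed), not that the pattern is wrong:

```python
import fnmatch

# The process's declared output pattern and the name passed to --out.
pattern = "*_window_stats*.tsv"
expected = "Ahemp_window_stats.tsv"

# The glob matches the expected name, so the error points at a file
# that was never written, not at a mismatched pattern.
print(fnmatch.fnmatch(expected, pattern))  # True
```

Checking whether `Ahemp.tsv` in the listed work dir has any data rows would be a reasonable next step.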
The BUSCO page only shows the metazoa_odb10 results, whereas there should be eukaryota, archaea, and bacteria as well. All four are computed by the pipeline and make their way into the blobdir.
sanger-tol/blobtoolkit 0.2.0
To close this issue:
- Move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
- Remove parameters.md.
- Remove parameters.md from docs/README.md.
- Remove parameters from the "Documentation" section in README.md.
Nextflow implementation of diamond.smk; rules: chunk_fasta_by_busco.smk, run_diamond_blastx.smk, unchunk_blastx.smk; nf-core modules:
Write entrez-direct nf-core module: entrezdirect/xtract; nf-core modules issue: new module: entrezdirect/esearch #1832.