
blobtoolkit's People

Contributors

alxndrdiaz, gq1, muffato, priyanka-surana, sujaikumar, zb32

blobtoolkit's Issues

subworkflow: blobtools

Nextflow implementation of blobtools.smk, rules: run_blobtools_create.smk, run_blobtools_add.smk, add_summary_to_metadata.smk; nf-core modules:

  • blobtools2 (not implemented), requires: containers and module
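
A minimal sketch of how this subworkflow could be wired in DSL2, assuming local BLOBTOOLKIT_CREATEBLOBDIR and BLOBTOOLKIT_UPDATEBLOBDIR modules (module paths, channel shapes and output names below are assumptions, not the pipeline's actual code):

  // Sketch only: wire blobtools create/add as a DSL2 subworkflow.
  include { BLOBTOOLKIT_CREATEBLOBDIR } from '../../modules/local/blobtoolkit/createblobdir'
  include { BLOBTOOLKIT_UPDATEBLOBDIR } from '../../modules/local/blobtoolkit/updateblobdir'

  workflow BLOBTOOLS {
      take:
      fasta    // channel: [ val(meta), path(fasta) ]
      busco    // channel: [ val(meta), path(busco_table) ]
      hits     // channel: [ val(meta), path(blast_hits) ]

      main:
      ch_versions = Channel.empty()

      // blobtools create: initialise the BlobDir from the assembly
      BLOBTOOLKIT_CREATEBLOBDIR ( fasta )
      ch_versions = ch_versions.mix ( BLOBTOOLKIT_CREATEBLOBDIR.out.versions.first() )

      // blobtools add: import BUSCO results and similarity-search hits
      BLOBTOOLKIT_UPDATEBLOBDIR ( BLOBTOOLKIT_CREATEBLOBDIR.out.blobdir, busco, hits )
      ch_versions = ch_versions.mix ( BLOBTOOLKIT_UPDATEBLOBDIR.out.versions.first() )

      emit:
      blobdir  = BLOBTOOLKIT_UPDATEBLOBDIR.out.blobdir
      versions = ch_versions
  }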

Clean up docs

Description of feature

  1. Create a new folder called docs/decisions_records with a README.md
  2. docs/images - It should only have 2 files
    • sanger-tol-readmapping_logo.png – The logo for the pipeline
    • sanger-tol-readmapping_workflow.png – The workflow diagram
  3. docs/usage.md - Move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
  4. docs/output.md

Updates in place and broken Nextflow job cache

Description of feature

In version 0.2.0 of the pipeline, the job cache only functions up to the BLOBTOOLKIT_CREATEBLOBDIR process. Something downstream must be modifying an input file or parameter, breaking the cache mechanism.

My guess is that BLOBTOOLKIT_UPDATEBLOBDIR and others are updating the blobdir in place and therefore modifying its timestamp, which is one of the elements Nextflow considers when checking processes against the cache.

Even though the remaining processes don't take that long to rerun, we should make this cleaner.
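
One way to make it cleaner, sketched below under the assumption that the module copies the staged BlobDir and updates the copy (the input/output names and the blobtools add arguments are illustrative), is to never touch the staged input, so its timestamp and the upstream output it links to stay unchanged:

  // Sketch only: copy the staged BlobDir, update the copy, emit the copy.
  process BLOBTOOLKIT_UPDATEBLOBDIR {
      input:
      tuple val(meta), path(blobdir, stageAs: 'input_blobdir')
      tuple val(meta2), path(busco_table)

      output:
      tuple val(meta), path("${prefix}"), emit: blobdir
      path "versions.yml",                emit: versions

      script:
      prefix = task.ext.prefix ?: "${meta.id}"
      """
      # -L dereferences the staged symlink so we get a real, writable copy
      cp -rL input_blobdir ${prefix}
      blobtools add \\
          --busco ${busco_table} \\
          ${prefix}

      cat <<-END_VERSIONS > versions.yml
      "${task.process}":
          blobtoolkit: \$(btk --version | cut -d' ' -f2 | sed 's/v//')
      END_VERSIONS
      """
  }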

Clean up

Description of feature

To close this issue, systematically clean up the pipeline for future template updates.

Workflow diagram

Description of feature

To close this issue, create the workflow diagram and add it to README.md under the Pipeline Summary section. The actual .png file is stored in docs/images with the name sanger-tol-readmapping_workflow.png.

Separate BUSCO pipeline ?

Given that we need to run BUSCO in at least two places (genomenote and blobtoolkit) and that we may want to run just BUSCO on some assemblies, is it worth making a pipeline just for running BUSCO? It would probably have a single extra step to fetch lineages.

The alternative is running blobtoolkit with a set of ext.when conditions that essentially limit it to BUSCO.

Regardless of which pipeline has BUSCO, the resulting gene annotations should be processed the same way ensemblgenedownload does.
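
If we go the ext.when route, the restriction could live entirely in configuration, roughly like the sketch below (the params.busco_only flag and the process-name pattern are assumptions; nf-core modules honour ext.when via their when: block):

  // Sketch only: skip the non-BUSCO processes when a hypothetical
  // --busco_only flag is set.
  process {
      withName: '.*:(MINIMAP2_ALIGN|DIAMOND_BLASTX|DIAMOND_BLASTP|BLASTN_BLASTN|BLOBTOOLKIT_.*)' {
          ext.when = { !params.busco_only }
      }
  }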

nf-core pipeline: blobtoolkit

Based on the BlobToolkit Snakemake pipeline, convert each sub-pipeline into a Nextflow subworkflow (a rough skeleton of how these could be chained is sketched after the list):

  1. “minimap.smk - align reads to the genome assembly using minimap2”.

  2. “windowmasker.smk - identify and mask repetitive regions using Windowmasker. Masked sequences are used in all blast searches”.

  3. “chunk_stats.smk - calculate sequence statistics in 1kb windows for each contig”.

  4. “busco.smk - run BUSCO using specific and basal lineages. Count BUSCOs in 1kb windows for each contig”.

  5. “cov_stats - calculate coverage in 1kb windows using mosdepth”.

  6. “window_stats - aggregate 1kb values into windows of fixed proportion (10%, 1% of contig length) and fixed length (100kb, 1Mb)”.

  7. “diamond_blastp.smk - Diamond blastp search of busco gene models for basal lineages (archaea_odb10, bacteria_odb10 and eukaryota_odb10) against the UniProt reference proteomes”.

  8. “diamond.smk - Diamond blastx search of assembly contigs against the UniProt reference proteomes. Contigs are split into chunks to allow distribution-based taxrules. Contigs over 1Mb are subsampled by retaining only the most BUSCO-dense 100 kb region from each chunk.”

  9. “blastn.smk - NCBI blastn search of assembly contigs with no Diamond blastx match against the NCBI nt database”.

  10. “blobtools.smk - import analysis results into a BlobDir dataset”.

  11. “view.smk - BlobDir validation and static image generation”.
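
A rough skeleton of how the converted subworkflows could be chained, assuming one local subworkflow per Snakemake sub-pipeline (every name, channel and output below is a placeholder, not the pipeline's actual code):

  // Sketch only: one subworkflow per sub-pipeline, in the order listed above.
  include { MINIMAP_ALIGNMENT } from './subworkflows/local/minimap_alignment'
  include { WINDOWMASKER      } from './subworkflows/local/windowmasker'
  include { CHUNK_STATS       } from './subworkflows/local/chunk_stats'
  include { BUSCO_LINEAGES    } from './subworkflows/local/busco_lineages'
  include { COV_STATS         } from './subworkflows/local/cov_stats'
  include { WINDOW_STATS      } from './subworkflows/local/window_stats'
  include { DIAMOND_BLASTP    } from './subworkflows/local/diamond_blastp'
  include { DIAMOND_BLASTX    } from './subworkflows/local/diamond_blastx'
  include { BLASTN_NOHIT      } from './subworkflows/local/blastn_nohit'
  include { BLOBTOOLS         } from './subworkflows/local/blobtools'
  include { VIEW              } from './subworkflows/local/view'

  workflow BLOBTOOLKIT {
      take:
      fasta    // channel: [ val(meta), path(assembly) ]
      reads    // channel: [ val(meta), path(reads) ]

      main:
      MINIMAP_ALIGNMENT ( fasta, reads )                          //  1. minimap.smk
      WINDOWMASKER ( fasta )                                      //  2. windowmasker.smk
      CHUNK_STATS ( fasta )                                       //  3. chunk_stats.smk
      BUSCO_LINEAGES ( fasta )                                    //  4. busco.smk
      COV_STATS ( MINIMAP_ALIGNMENT.out.bam )                     //  5. cov_stats (mosdepth)
      WINDOW_STATS ( CHUNK_STATS.out.tsv, COV_STATS.out.tsv,
                     BUSCO_LINEAGES.out.counts )                  //  6. window_stats
      DIAMOND_BLASTP ( BUSCO_LINEAGES.out.proteins )              //  7. diamond_blastp.smk
      DIAMOND_BLASTX ( WINDOWMASKER.out.masked )                  //  8. diamond.smk
      BLASTN_NOHIT ( DIAMOND_BLASTX.out.nohit )                   //  9. blastn.smk
      BLOBTOOLS ( fasta,
                  BUSCO_LINEAGES.out.tables,
                  DIAMOND_BLASTX.out.hits.mix ( BLASTN_NOHIT.out.hits,
                                                DIAMOND_BLASTP.out.hits ) ) // 10. blobtools.smk
      VIEW ( BLOBTOOLS.out.blobdir )                              // 11. view.smk
  }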

Create entrez-direct nf-core module

The entrez-direct nf-core module is a dependency in BlobToolKitPipeline. "Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to construct multi-step queries. Selected records can then be retrieved in a variety of formats".
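
A minimal sketch of what such a module could look like, assuming all we need here is the assembly record for a given accession (the process name, inputs, outputs and query are placeholders; esearch and esummary are standard EDirect tools):

  // Sketch only: fetch an NCBI assembly record for an accession with EDirect.
  process ENTREZDIRECT_ESEARCH {
      tag "${meta.id}"

      input:
      tuple val(meta), val(accession)

      output:
      tuple val(meta), path("*.xml"), emit: xml

      script:
      """
      esearch -db assembly -query "${accession}" \\
          | esummary \\
          > ${accession}.esummary.xml
      """
  }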

summary.json missing from the output blobdir

Description of the bug

The pipeline runs BLOBTOOLKIT_SUMMARY, which generates a *.summary.json file, but that file is not copied to the output blobdir.
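
A possible fix, sketched as an nf-core-style modules.config entry (the selector, the output path, and the assumption that meta is available to the publishDir closure are all illustrative), is to publish the *.summary.json file into the same output directory as the blobdir:

  // Sketch only: publish the summary JSON alongside the BlobDir output.
  process {
      withName: '.*:BLOBTOOLKIT_SUMMARY' {
          publishDir = [
              path: { "${params.outdir}/blobtoolkit/${meta.id}" },
              mode: params.publish_dir_mode,
              pattern: '*.summary.json'
          ]
      }
  }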

Command used and terminal output

No response

Relevant files

No response

System information

Pipeline v0.3.0

subworkflow: view

Nextflow implementation of view.smk, rules: validate_dataset.smk, generate_images.smk, generate_summary.smk, checksum_files.smk; nf-core modules:

  • blobtools2 (not implemented), requires: containers and module
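
A minimal DSL2 sketch of this subworkflow, assuming one local module per rule (module names, paths and outputs are assumptions; checksum generation is omitted for brevity; BLOBTOOLKIT_SUMMARY matches the process name used elsewhere in the pipeline):

  // Sketch only: validation, image generation and summary generation.
  include { BLOBTOOLKIT_VALIDATE } from '../../modules/local/blobtoolkit/validate'
  include { BLOBTOOLKIT_IMAGES   } from '../../modules/local/blobtoolkit/images'
  include { BLOBTOOLKIT_SUMMARY  } from '../../modules/local/blobtoolkit/summary'

  workflow VIEW {
      take:
      blobdir    // channel: [ val(meta), path(blobdir) ]

      main:
      ch_versions = Channel.empty()

      BLOBTOOLKIT_VALIDATE ( blobdir )    // validate_dataset.smk
      BLOBTOOLKIT_IMAGES ( blobdir )      // generate_images.smk
      BLOBTOOLKIT_SUMMARY ( blobdir )     // generate_summary.smk

      ch_versions = ch_versions.mix ( BLOBTOOLKIT_SUMMARY.out.versions.first() )

      emit:
      images   = BLOBTOOLKIT_IMAGES.out.png
      summary  = BLOBTOOLKIT_SUMMARY.out.json
      versions = ch_versions
  }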

Make BUSCO annotations an optional input

Related to #101.

If we make a separate BUSCO pipeline, we should support blobtoolkit taking pre-computed BUSCO annotations.

It may also have some value if we want to run the pipeline on the same assembly in two phases, but what's the advantage of that versus using the native -resume functionality?
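
A sketch of how the optional input could look in the main workflow, assuming a hypothetical --busco parameter pointing at pre-computed BUSCO output directories, and where ch_fasta, ch_lineages and BUSCO_LINEAGES are placeholders for the usual assembly channel, lineage channel and BUSCO subworkflow:

  // Sketch only: --busco is a hypothetical parameter; names are placeholders.
  if (params.busco) {
      // Use pre-computed BUSCO results supplied by the user
      ch_busco = Channel.fromPath(params.busco, checkIfExists: true)
          .map { dir -> [ [ id: params.accession ], dir ] }
  } else {
      // Otherwise run BUSCO inside the pipeline as before
      BUSCO_LINEAGES ( ch_fasta, ch_lineages )
      ch_busco = BUSCO_LINEAGES.out.busco_dir
  }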

Unit testing - sample databases

For unit testing the pipeline, create taxon-restricted subsets of

  • uniprot diamond blastp database
  • ncbi nt blastn database
  • busco lineages

Put these in /lustre/scratch123/tol/resources and synchronise them with S3 (so others can access them too).

Create instructions for re-creating both the full databases and the sub-sampled databases.

error with FASTAWINDOWS process

Hi there,
I'm trying to run the Nextflow pipeline on my draft genome using the command:
nextflow run sanger-tol/blobtoolkit -r 0.2.0 -resume -profile docker --input Ahemp.csv --fasta Ahemp_final.fasta --yaml Ahemp.yaml --accession Ahemp --taxon "Acropora hemprichii" --taxdump /databases/taxdump --blastp /databases/uniprot/reference_proteomes.dmnd --blastn /databases/20230316-ncbi/nt/nt --blastx /databases/uniprot/reference_proteomes.dmnd

But it always fails with the following error. Any hints?

-[sanger-tol/blobtoolkit] Pipeline completed with errors-
ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)'

Caused by:
  Missing output file(s) `*_window_stats*.tsv` expected by process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (Ahemp)`

Command executed:

  btk pipeline window-stats \
          --in Ahemp.tsv \
          --window 0.1 --window 0.01 --window 1 --window 100000 --window 1000000 \
          --out Ahemp_window_stats.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS":
      blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /home/sharafa/Ahmep_genome/work/5f/0f95c92e19f521dc31245971f4fd9a

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Missing BUSCO results

Description of the bug

The BUSCO page only shows the metazoa_odb10 results whereas there should be eukaryota, archaea, and bacteria as well.

All four are computed by the pipeline and make their way to the blobdir.

Command used and terminal output

No response

Relevant files

No response

System information

sanger-tol/blobtoolkit 0.2.0

Remove `parameters.md`

Description of feature

To close this issue:

  • Move the config.yaml details from docs/parameters.md to docs/usage.md after the section about creating the databases.
  • Delete parameters.md
  • Delete the description of parameters.md from docs/README.md
  • Delete parameters from the "Documentation" section in README.md
