
ebi-metagenomics-cwl's Introduction


ebi-metagenomics-cwl

This repository contains the CWL description of the EBI Metagenomics pipeline. It is superseded by https://github.com/EBI-Metagenomics/pipeline-v5

Example workflow layout

The steps of the original pipeline are visualised on the website and can be found at https://www.ebi.ac.uk/metagenomics/pipelines/3.0 or https://www.ebi.ac.uk/metagenomics/pipelines/4.0

How to run our CWL files?

  1. Install the cwlref-runner as described here: https://github.com/common-workflow-language/cwltool

  2. Get a clone of this repository

  3. Install the command line tools on your local machine/cluster (e.g. FragGeneScan or InterProScan 5)

  4. Choose the command line tool or workflow you want to run

  5. Write a YAML job file for the selected command line tool or workflow

  6. Run the command line tool/workflow, specifying the path if the tools are not installed to /usr/bin or /usr/local/bin

  $ PATH=~/my/FragGeneScan:~/my/InterProScan:${PATH} cwltool \
      --preserve-environment PATH workflows/emg-pipeline-v3.cwl \
      workflows/emg-pipeline-v3-example-job.yaml
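
  The job file in step 5 is a plain YAML mapping of workflow input names to values. A minimal sketch is shown below; the input names and paths are hypothetical, so check the inputs section of the chosen .cwl file for the real parameter names:

  ```yaml
  # Hypothetical job file: map each workflow input to a value.
  # File inputs use class: File plus a path.
  forward_reads:
    class: File
    path: /data/sample_R1.fastq.gz
  reverse_reads:
    class: File
    path: /data/sample_R2.fastq.gz
  threads: 8
  ```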

ebi-metagenomics-cwl's People

Contributors

mr-c, mscheremetjew, rdfinn


ebi-metagenomics-cwl's Issues

Integrate QC

Integrate the QC workflow into:

  • v4 assembly
  • v4 reads

Chunking for FGS

Running on a large dataset, FGS has been running for 3 days. Ideally we would scatter/gather this process using chunking.
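
A scatter/gather setup would split the input sequences into fixed-size chunks, run FGS on each chunk in parallel, and concatenate the results. Here is a minimal Python sketch of the chunking half; the chunk size is arbitrary and this is not code from the pipeline:

```python
def chunk_fasta(records, chunk_size):
    """Group an iterable of (header, sequence) records into lists of at
    most chunk_size records, suitable for scattering over FGS jobs."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly short, chunk
        yield chunk

# Example: 10 records in chunks of 4 -> chunk sizes 4, 4, 2
records = [(">seq%d" % i, "ACGT") for i in range(10)]
print([len(c) for c in chunk_fasta(records, 4)])
```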

Taxonomic annotation is not consistent in metadata file

Hello,

I have noticed that taxonomic annotation is not consistent between genomes assigned to the same species representatives.

Below is my code:

library(tidyverse)

genomes_all_metadata <- read_tsv('https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0/genomes-all_metadata.tsv')

# Count genomes per (species representative, lineage) pair
genomes_all_metadata <- genomes_all_metadata %>%
  select(Species_rep, Lineage) %>%
  group_by(Species_rep, Lineage) %>%
  summarise(num_genomes = n())

# Keep species representatives associated with more than one lineage
genomes_all_metadata <- genomes_all_metadata %>%
  group_by(Species_rep) %>%
  filter(n() > 1)

For instance, genomes assigned to MGYG000002478 are sometimes classified as Phocaeicola dorei and sometimes as Bacteroides_B dorei.

Could you fix this?

Florian

containers

duplicate headers qiime sadness

Re-running v3 with the latest changes. This time the 16S path was processed first and our old friend re-appeared:

bfillings.uclust.UclustParseError: Query id QiimeExactMatch.ERR770958.210359-HWI-M02691:4:000000000-ABPDV:1:1114:3668:11849-1 hit multiple seeds. This can happen if --allhits is enabled in the call to uclust, which isn't supported by default. Call clusters_from_uc_file(lines, error_on_multiple_hits=False) to allow a query to cluster to multiple seeds.

Yeah, there are duplicate headers again 😞

[mcrusoe@ebi-cli-001 /]$ grep '^>' /hps/nobackup/production/metagenomics/CWL/mcrusoe/ebi-metagenomics-cwl/workflows/cwltool-cache/e7775b7a8443160feaf36df648e3735e/sequences.filtered.fasta  | sort -u | wc -l
715
[mcrusoe@ebi-cli-001 /]$ grep '^>' /hps/nobackup/production/metagenomics/CWL/mcrusoe/ebi-metagenomics-cwl/workflows/cwltool-cache/e7775b7a8443160feaf36df648e3735e/sequences.filtered.fasta  | wc -l
1307
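
The grep/sort/uniq check above can also be scripted. A small Python sketch (not part of the pipeline) that reports which FASTA headers are duplicated:

```python
from collections import Counter

def duplicate_headers(lines):
    """Return a dict of FASTA headers that occur more than once,
    mapped to their occurrence counts."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    return {header: n for header, n in counts.items() if n > 1}

# Example with an in-memory FASTA; in practice, pass open(path).
fasta = [">a\n", "ACGT\n", ">b\n", "GGGG\n", ">a\n", "TTTT\n"]
print(duplicate_headers(fasta))  # {'>a': 2}
```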

MetaSpades Assembly pipeline tweaks

Need to pull back the scaffolds, but we should work on the contigs. This is a minor tweak in the V4 assembly pipeline that will make it possible to swap between assemblers.

schema html parsing?

I was trying to use a simple workflow to test the package and ran into a problem with the schema not being read in correctly. It seems the HTML file is not parsed; rather, it is read in literally. Is this a known issue? Am I doing something incorrectly?

/P/cwl/venv/bin/cwltool 1.0.20180615183820
Resolved 'ebi-metagenomics-cwl/tools/trimmomatic.cwl' to 'file:///P/cwl/ebi-metagenomics-cwl/tools/trimmomatic.cwl'
https://schema.org/docs/!DOCTYPE html does not look like a valid URI, trying to serialize this will break.
https://schema.org/docs/html lang="en" does not look like a valid URI, trying to serialize this will break.
I'm sorry, I couldn't load this CWL file.
The error was: 
Traceback (most recent call last):
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/main.py", line 478, in main
    metadata, uri, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/load_tool.py", line 342, in make_tool
    tool = loadingContext.construct_tool_object(processobj, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/workflow.py", line 45, in default_make_tool
    return command_line_tool.CommandLineTool(toolpath_object, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 217, in __init__
    toolpath_object, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/process.py", line 559, in __init__
    validate_js_expressions(cast(CommentedMap, toolpath_object), self.doc_schema.names[toolpath_object["class"]], validate_js_options)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 181, in validate_js_expressions
    expressions = get_expressions(tool, schema)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 76, in get_expressions
    SourceLine(tool, schema_field.name)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 55, in get_expressions
    assert valid_schema is not None
AssertionError

Mini Workflow for sharing with MG-RAST

We need to construct a mini-cwl (and execute it) for a benchmarking experiment in collaboration with MG-RAST.

The tools are seq-prep and trimmomatic. We may need two CWL files that deal with paired-end and non-paired-end inputs.

production switchover checklist

  • chunking of outputs
  • scatter & chunking for other steps, perhaps
  • running on OpenLava
  • replication at another site
    • prepare by confirming that all needed scripts are copied out of the existing pipeline
  • run new dataset through CWL v3 inclusive of upload to website

toil with CWL on LSF status

  • Must keep --workdir on a non-shared filesystem like /tmp until DataBiosphere/toil#1573 is merged (might be better from a performance perspective anyway)
  • Make sure to specify --retries 1 or higher so that killed jobs get retried with at least the default memory (from --defaultMemory 10Gi or similar) automatically. Update: this does not work; hand-specify minimum memory and update those values as jobs fail.
  • Speaking of memory, add ResourceRequirements with fixed or dynamic ramMin to all tools.
  • test specifying ResourceRequirements at the Workflow and WorkflowStep levels
  • toil is experiencing a serialization bug, so don't use format with multiple options for inputs (for now) DataBiosphere/toil#1692
  • --preserve-environment takes a space separated list of environment variables to preserve, not a comma separated list as the docs previously reported DataBiosphere/toil#1689
  • Use the TOIL_LSF_ARGS to specify the queue in your runscript: export TOIL_LSF_ARGS="-q production-rh7" DataBiosphere/toil#1640
  • There's an error in enumerating jobs in Toil 3.7.0, fix is at DataBiosphere/toil#1690
  • Toil doesn't have an override for cwltool's strict filename check, so be sure to strip out offending characters such as :, example at 767cc8f DataBiosphere/toil#1782
  • like most cwltool based CWL executors, you'll be happier if you set a dedicated output directory via --outdir
  • the CWL output object is written to stdout, so redirect that to a file for posterity (example: cwltoil [โ€ฆ] | tee output)
  • --restart is handy for resuming a previous run, but (due to the lack of cache support while using the LSF batch system) any changes to the CWL descriptions will require a clean start
  • apparently Toil will "make up" resource requirements on its own (randomly?) for tools without those annotations, so better be safe lest cat get assigned 16 cores and 100 GiB of memory :-)
  • Toil runs testing against many batch systems (SLURM, Yarn, parasol, mesos, spark, GridEngine), but not LSF; we need to add setup instructions for spinning up an LSF cluster to https://github.com/BD2KGenomics/cgcloud/blob/master/jenkins/src/cgcloud/jenkins/toil_jenkins_slave.py
  • Review Globus toolkit's LSF code for inspiration: https://github.com/globus/globus-toolkit/blob/globus_6_branch/gram/jobmanager/lrms/lsf/source/lsf.pm
  • how to capture timestamps? they are output to stderr, but not in the on disk log
  • how to capture output from LSF?
  • Is it possible to run the housekeeping jobs on the launcher node and not via cluster submission? (CWLWorkflow, ResolveIndirect, CWLGather, CWLScatter, etc.. ) DataBiosphere/toil#1783
  • Restore usage of InitialWorkDirRequirement and confirm
  • write up the above lessons learned
  • we don't request space in /tmp even though Toil does write there
  • Migrate Toil's LSF code to use AbstractGridEngineBatchSystem DataBiosphere/toil#2043

Current working branch with the bulk of the above fixes merged: https://github.com/mr-c/toil/tree/issues/1666-fail-not-on-unsubmitted-jobs. The latest Toil release has all of the above-mentioned fixes merged.

Missing required parameter

After running:
cwltool emg-pipeline-v4-paired.cwl emg-pipeline-v4-paired-job.yaml
I get the error:
emg-pipeline-v4-paired.cwl:41:3: Missing required input parameter 'go_summary_config'
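
One way to get past this is to add the missing input to the job YAML. The path below is a placeholder; the actual GO summary config file depends on the pipeline's reference data:

```yaml
go_summary_config:
  class: File
  path: /path/to/go_summary-config.yaml
```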

Rfam library files

Hello,

First of all - thank you, @mr-c, for the response to my previous problem!

I've been further trying to run emg workflows, specifically emg-pipeline-v4-paired.cwl, on my machine. I have run into several more issues, most of which seem too trivial to formulate as formal issues. If there is a preferred channel of communication, please, let me know.

Some of the problems might reveal my unfamiliarity with CWL, so I apologise, and I'm working on it! Seems like it's really quite an elegant way to deal with patchworks of wildly different tools, commonly known as pipelines.

Here's what I had problems with so far:

  • 1

ISSUE:

workflows/emg-pipeline-v4-paired-job.yaml specifies Rfam libraries contained within directories: "other" (e.g. .../CWL/data/libraries/Rfam/other/Archaea_SRP.cm), "ribosomal" (e.g. .../CWL/data/libraries/Rfam/ribosomal/RF02542.cm).
I downloaded the Rfam database from ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.tar.gz, and it does not contain a corresponding directory structure.

SOLUTION:
I have not found any so far.

  • 2

ISSUE:

The MGRAST_base.py script used in tools/qc-stats.cwl is missing; I can't find it online.

  • 3

ISSUE: At step trim_quality_control when running workflows/emg-qc-paired.cwl:

[job trim_quality_control] /tmp/tmp688nyarp$ /bin/sh \
    -c \
    'java' 'org.usadellab.trimmomatic.Trimmomatic' 'PE' '-trimlog' 'trim.log' '-threads' '8' '-phred33' '/tmp/tmpvkjt1pef/stg827d3176-4f1b-4179-8404-4b46397fff43/merged_with_unmerged_reads' 'merged_with_unmerged_reads.trimmed.fastq' 'LEADING:3' 'TRAILING:3' 'SLIDINGWINDOW:4:15' 'MINLEN:100'
Error: Could not find or load main class org.usadellab.trimmomatic.Trimmomatic

EXPLANATION:
On my computer, a different version of Trimmomatic is installed as a bash executable, which then calls java.

SOLUTION:
change:
baseCommand: [ java, org.usadellab.trimmomatic.Trimmomatic ]
to:
baseCommand: [ trimmomatic ]

  • 4

ISSUE: Is the Trimmomatic output log file saved in a directory where it can be found?

Error collecting output for parameter 'output_log':
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: Traceback (most recent call last):
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: 
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3:   File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 707, in collect_output
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3:     raise WorkflowException("Did not find output file with glob pattern: '{}'".format(globpatterns))
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: 
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: cwltool.errors.WorkflowException: Did not find output file with glob pattern: '['trim.log']'

SOLUTION:
For now, I just commented out the output_log section of the outputs in trimmomatic.cwl (lines 221-233). I guess that might break something; I'm sure there's a better solution.
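
A gentler fix than deleting the output is to mark it optional, assuming the tool description is CWL v1.0, where `File?` means nullable. A missing trim.log then yields a null output instead of a WorkflowException:

```yaml
outputs:
  output_log:
    type: File?   # optional: absent trim.log no longer fails the run
    outputBinding:
      glob: trim.log
```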

v3 1st draft overview

CWL descriptions for each tool

  • seqprep #4
  • trimmomatic common-workflow-library/legacy#132
    Order of trimming steps can still be made more flexible
  • "biopython"

    Sequences < 100 nucleotides in length removed.

    • need to find the script
  • hmmer
  • FragGeneScan
    • how is this used at EBI?
    • updated at #3
  • InterProScan
  • "QIIME"

    "16S rRNA are annotated using the Greengenes reference database (default closed-reference OTU picking protocol with Greengenes 13.8 reference with reverse strand matching enabled)."

    • Which parts of QIIME are used and how?

ONT Pipeline

We need to start thinking about how we should run ONT data

  • QC using poretools
  • Functional assignments using nhmmer and pfam
  • The functional assignments look really good; need a Pfam2GO, or Pfam -> InterPro2GO
  • Kmer based taxonomy profiling
  • Standard SSU/LSU approach; may need to take the top hit? Does mapseq do anything different?

Comment about samtools in docker file

ebi-metagenomics-cwl/tools/infernal-Dockerfile

Is that comment obsolete in this Dockerfile? I think this just installs Infernal. Also, we need to pin a specific version. Is that version correct?

V3 missing bits and bobs

All files/directories can be found here /hps/nobackup/production/metagenomics/CWL/data/EMGv3_0/ERP009703/results/ERR770958_MERGED_FASTQ:

  • ERR770958_MERGED_FASTQ.fasta.submitted.count - I do not think that we have counts of the seq-prep-ed file

  • Missing QC stats; there should be 8 files. You are generating these

  • Missing summary files of counts ERR770958_MERGED_FASTQ_summary.ipr, ERR770958_MERGED_FASTQ_summary

  • taxonomy-summary files. I think we are generating the krona files? Missing 3:
    kingdom-counts.txt, krona.html, krona-input.txt

  • Missing the following files from the taxonomy steps

  1. ERR770958_MERGED_FASTQ_otu_table_hdf5.biom
  2. ERR770958_MERGED_FASTQ_qiime_assigned_taxonomy.txt
  3. uclust_ref_picked_otus
  • sequence categorisation step outputs in sequence-categorisation are missing. We are generating many of the fasta files.

  • tRNA sequences

  • stepChunkingAndCompression step is missing?

  • standalone ExpressionTool to re-write all the CWL outputs to match the Python workflow outputs

Add binning to assembly pipeline

Add simple binning to the assembly pipeline:

  • Install bbmap tool
  • Install MetaBat
  • See if we can get QC files from metaspades
  • map sequences to assembly, CWL implementation
  • metabat CWL implementation
  • Investigate MaxBins and MetaWatt
  • Should we use CheckM or something similar to assess quality?
