
ebi-metagenomics-cwl's Introduction


ebi-metagenomics-cwl

This repository contains the CWL description of the EBI Metagenomics pipeline. It is superseded by https://github.com/EBI-Metagenomics/pipeline-v5

Example workflow layout

The steps of the original pipeline are visualised on the website and can be found at https://www.ebi.ac.uk/metagenomics/pipelines/3.0 or https://www.ebi.ac.uk/metagenomics/pipelines/4.0

How to run our CWL files?

  1. Install the cwlref-runner as described here: https://github.com/common-workflow-language/cwltool

  2. Get a clone of this repository

  3. Install the command line tools on your local machine/cluster (e.g. FragGeneScan or InterProScan 5)

  4. Choose the command line tool or workflow you want to run

  5. Write a YAML job file for the selected command line tool or workflow

  6. Run the command line tool/workflow, specifying the path if the tools are not installed to /usr/bin or /usr/local/bin

  $ PATH=~/my/FragGeneScan:~/my/InterProScan:${PATH} cwltool \
      --preserve-environment PATH workflows/emg-pipeline-v3.cwl \
      workflows/emg-pipeline-v3-example-job.yaml
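
  The job file in step 5 is a plain YAML mapping of workflow input names to values. A minimal sketch is shown below; the input names and paths are hypothetical, so check the inputs section of the chosen .cwl file for the real parameter names:

  ```yaml
  # Hypothetical job file: map each workflow input to a value.
  # File inputs use class: File plus a path.
  forward_reads:
    class: File
    path: /data/sample_R1.fastq.gz
  reverse_reads:
    class: File
    path: /data/sample_R2.fastq.gz
  threads: 8
  ```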

ebi-metagenomics-cwl's People

Contributors

mr-c, mscheremetjew, rdfinn


ebi-metagenomics-cwl's Issues

Integrate QC

Integrate the QC workflow into:

  • v4 assembly
  • v4 reads

Chunking for FGS

Running on a large dataset, FGS has been running for 3 days. Ideally we would scatter/gather this process using chunking.
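
A scatter/gather setup would split the input sequences into fixed-size chunks, run FGS on each chunk in parallel, and concatenate the results. Here is a minimal Python sketch of the chunking half; the chunk size is arbitrary and this is not code from the pipeline:

```python
def chunk_fasta(records, chunk_size):
    """Group an iterable of (header, sequence) records into lists of at
    most chunk_size records, suitable for scattering over FGS jobs."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly short, chunk
        yield chunk

# Example: 10 records in chunks of 4 -> chunk sizes 4, 4, 2
records = [(">seq%d" % i, "ACGT") for i in range(10)]
print([len(c) for c in chunk_fasta(records, 4)])
```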

Taxonomic annotation is not consistent in metadata file

Hello,

I have noticed that taxonomic annotation is not consistent between genomes assigned to the same species representatives.

Below is my code:

library(tidyverse)

genomes_all_metadata <- read_tsv('https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0/genomes-all_metadata.tsv')

# Count genomes per (species representative, lineage) pair
genomes_all_metadata <- genomes_all_metadata %>%
  select(Species_rep, Lineage) %>%
  group_by(Species_rep, Lineage) %>%
  summarise(num_genomes = n())

# Keep species representatives associated with more than one lineage
genomes_all_metadata <- genomes_all_metadata %>%
  group_by(Species_rep) %>%
  filter(n() > 1)

For instance, genomes assigned to MGYG000002478 are sometimes classified as Phocaeicola dorei and sometimes as Bacteroides_B dorei.

Could you fix this?

Florian

containers

duplicate headers qiime sadness

Re-running v3 with the latest changes. This time the 16S path was processed first and our old friend re-appeared:

bfillings.uclust.UclustParseError: Query id QiimeExactMatch.ERR770958.210359-HWI-M02691:4:000000000-ABPDV:1:1114:3668:11849-1 hit multiple seeds. This can happen if --allhits is enabled in the call to uclust, which isn't supported by default. Call clusters_from_uc_file(lines, error_on_multiple_hits=False) to allow a query to cluster to multiple seeds.

Yeah, there are duplicate headers again 😞

[mcrusoe@ebi-cli-001 /]$ grep '^>' /hps/nobackup/production/metagenomics/CWL/mcrusoe/ebi-metagenomics-cwl/workflows/cwltool-cache/e7775b7a8443160feaf36df648e3735e/sequences.filtered.fasta  | sort -u | wc -l
715
[mcrusoe@ebi-cli-001 /]$ grep '^>' /hps/nobackup/production/metagenomics/CWL/mcrusoe/ebi-metagenomics-cwl/workflows/cwltool-cache/e7775b7a8443160feaf36df648e3735e/sequences.filtered.fasta  | wc -l
1307
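
The grep/sort/uniq check above can also be scripted. A small Python sketch (not part of the pipeline) that reports which FASTA headers are duplicated:

```python
from collections import Counter

def duplicate_headers(lines):
    """Return a dict of FASTA headers that occur more than once,
    mapped to their occurrence counts."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    return {header: n for header, n in counts.items() if n > 1}

# Example with an in-memory FASTA; in practice, pass open(path).
fasta = [">a\n", "ACGT\n", ">b\n", "GGGG\n", ">a\n", "TTTT\n"]
print(duplicate_headers(fasta))  # {'>a': 2}
```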

MetaSpades Assembly pipeline tweaks

Need to pull back the scaffolds, but we should work on the contigs. This is a minor tweak in the V4 assembly pipeline that will make it possible to swap between assemblers.

schema html parsing?

I was trying to use a simple workflow to test the package and ran into a problem with the schema not being read in correctly. It seems the HTML file is not parsed; rather, it is read in literally. Is this a known issue? Am I doing something incorrectly?

/P/cwl/venv/bin/cwltool 1.0.20180615183820
Resolved 'ebi-metagenomics-cwl/tools/trimmomatic.cwl' to 'file:///P/cwl/ebi-metagenomics-cwl/tools/trimmomatic.cwl'
https://schema.org/docs/!DOCTYPE html does not look like a valid URI, trying to serialize this will break.
https://schema.org/docs/html lang="en" does not look like a valid URI, trying to serialize this will break.
I'm sorry, I couldn't load this CWL file.
The error was: 
Traceback (most recent call last):
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/main.py", line 478, in main
    metadata, uri, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/load_tool.py", line 342, in make_tool
    tool = loadingContext.construct_tool_object(processobj, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/workflow.py", line 45, in default_make_tool
    return command_line_tool.CommandLineTool(toolpath_object, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 217, in __init__
    toolpath_object, loadingContext)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/process.py", line 559, in __init__
    validate_js_expressions(cast(CommentedMap, toolpath_object), self.doc_schema.names[toolpath_object["class"]], validate_js_options)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 181, in validate_js_expressions
    expressions = get_expressions(tool, schema)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 76, in get_expressions
    SourceLine(tool, schema_field.name)
  File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/validate_js.py", line 55, in get_expressions
    assert valid_schema is not None
AssertionError

Mini Workflow for sharing with MG-RAST

We need to construct a mini-cwl (and execute it) for a benchmarking experiment in collaboration with MG-RAST.

The tools are seq-prep and trimmomatic. We may need two CWL files that deal with paired-end and non-paired-end inputs.

production switchover checklist

  • chunking of outputs
  • scatter & chunking for other steps, perhaps
  • running on OpenLava
  • replication at another site
    • prepare by confirming that all needed scripts are copied out of the existing pipeline
  • run new dataset through CWL v3 inclusive of upload to website

toil with CWL on LSF status

  • Must keep --workdir on a non-shared filesystem like /tmp until DataBiosphere/toil#1573 is merged (might be better from a performance perspective anyway)
  • Make sure to specify --retries 1 or higher so that killed jobs get retried with at least the default memory (from --defaultMemory 10Gi or similar) automatically. Update: this does not work; hand-specify minimum memory and update those values as jobs fail.
  • Speaking of memory, add ResourceRequirements with fixed or dynamic ramMin to all tools.
  • test specifying ResourceRequirements at the Workflow and WorkflowStep levels
  • toil is experiencing a serialization bug, so don't use format with multiple options for inputs (for now) DataBiosphere/toil#1692
  • --preserve-environment takes a space separated list of environment variables to preserve, not a comma separated list as the docs previously reported DataBiosphere/toil#1689
  • Use the TOIL_LSF_ARGS to specify the queue in your runscript: export TOIL_LSF_ARGS="-q production-rh7" DataBiosphere/toil#1640
  • There's an error in enumerating jobs in Toil 3.7.0, fix is at DataBiosphere/toil#1690
  • Toil doesn't have an override for cwltool's strict filename check, so be sure to strip out offending characters such as :, example at 767cc8f DataBiosphere/toil#1782
  • like most cwltool based CWL executors, you'll be happier if you set a dedicated output directory via --outdir
  • the CWL output object is written to stdout, so redirect that to a file for posterity (example: cwltoil [โ€ฆ] | tee output)
  • --restart is handy for resuming a previous run, but (due to the lack of cache support while using the LSF batch system) any changes to the CWL descriptions will require a clean start
  • apparently Toil will "make up" resource requirements on its own (randomly?) for tools without those annotations, so better be safe lest cat get assigned 16 cores and 100 GiB of memory :-)
  • Toil runs testing against many batch systems (SLURM, Yarn, parasol, mesos, spark, GridEngine), but not LSF; we need to add setup instructions for spinning up an LSF cluster to https://github.com/BD2KGenomics/cgcloud/blob/master/jenkins/src/cgcloud/jenkins/toil_jenkins_slave.py
  • Review Globus toolkit's LSF code for inspiration: https://github.com/globus/globus-toolkit/blob/globus_6_branch/gram/jobmanager/lrms/lsf/source/lsf.pm
  • how to capture timestamps? they are output to stderr, but not in the on disk log
  • how to capture output from LSF?
  • Is it possible to run the housekeeping jobs on the launcher node and not via cluster submission? (CWLWorkflow, ResolveIndirect, CWLGather, CWLScatter, etc.. ) DataBiosphere/toil#1783
  • Restore usage of InitialWorkDirRequirement and confirm
  • write up the above lessons learned
  • we don't request space in /tmp even though Toil does write there
  • Migrate Toil's LSF code to use AbstractGridEngineBatchSystem DataBiosphere/toil#2043

Current working branch with the bulk of the above fixes merged: https://github.com/mr-c/toil/tree/issues/1666-fail-not-on-unsubmitted-jobs. The latest Toil release has all of the above-mentioned fixes merged.

Missing required parameter

After running:
cwltool emg-pipeline-v4-paired.cwl emg-pipeline-v4-paired-job.yaml
I get the error:
emg-pipeline-v4-paired.cwl:41:3: Missing required input parameter 'go_summary_config'
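
One way to get past this is to add the missing input to the job YAML. The path below is a placeholder; the actual GO summary config file depends on the pipeline's reference data:

```yaml
go_summary_config:
  class: File
  path: /path/to/go_summary-config.yaml
```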

Rfam library files

Hello,

First of all - thank you, @mr-c, for the response to my previous problem!

I've been further trying to run emg workflows, specifically emg-pipeline-v4-paired.cwl, on my machine. I have run into several more issues, most of which seem too trivial to formulate as formal issues. If there is a preferred channel of communication, please, let me know.

Some of the problems might reveal my unfamiliarity with CWL, so I apologise, and I'm working on it! Seems like it's really quite an elegant way to deal with patchworks of wildly different tools, commonly known as pipelines.

Here's what I had problems with so far:

  • 1

ISSUE:

workflows/emg-pipeline-v4-paired-job.yaml specifies Rfam libraries contained within directories: "other" (e.g. .../CWL/data/libraries/Rfam/other/Archaea_SRP.cm), "ribosomal" (e.g. .../CWL/data/libraries/Rfam/ribosomal/RF02542.cm).
I downloaded the Rfam database from ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.tar.gz, and it does not contain a corresponding directory structure.

SOLUTION:
I have not found any so far.

  • 2

ISSUE:

The MGRAST_base.py script used in tools/qc-stats.cwl is missing; I can't find it online.

  • 3

ISSUE: At step trim_quality_control when running workflows/emg-qc-paired.cwl:

[job trim_quality_control] /tmp/tmp688nyarp$ /bin/sh \
    -c \
    'java' 'org.usadellab.trimmomatic.Trimmomatic' 'PE' '-trimlog' 'trim.log' '-threads' '8' '-phred33' '/tmp/tmpvkjt1pef/stg827d3176-4f1b-4179-8404-4b46397fff43/merged_with_unmerged_reads' 'merged_with_unmerged_reads.trimmed.fastq' 'LEADING:3' 'TRAILING:3' 'SLIDINGWINDOW:4:15' 'MINLEN:100'
Error: Could not find or load main class org.usadellab.trimmomatic.Trimmomatic

EXPLANATION:
On my computer, a different version of Trimmomatic is installed as a bash executable, which then calls java.

SOLUTION:
change:
baseCommand: [ java, org.usadellab.trimmomatic.Trimmomatic ]
to:
baseCommand: [ trimmomatic ]

  • 4

ISSUE: Is the Trimmomatic output log file saved in a directory where it can be found?

Error collecting output for parameter 'output_log':
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: Traceback (most recent call last):
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: 
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3:   File "/P/cwl/venv/lib/python3.6/site-packages/cwltool/command_line_tool.py", line 707, in collect_output
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3:     raise WorkflowException("Did not find output file with glob pattern: '{}'".format(globpatterns))
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: 
ebi-metagenomics-cwl/tools/trimmomatic.cwl:221:3: cwltool.errors.WorkflowException: Did not find output file with glob pattern: '['trim.log']'

SOLUTION:
For now, I just commented out the output_log section of the outputs in trimmomatic.cwl (lines 221-233). I guess that might break something; I'm sure there's a better solution.
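
A gentler fix than deleting the output is to mark it optional, assuming the tool description is CWL v1.0, where `File?` means nullable. A missing trim.log then yields a null output instead of a WorkflowException:

```yaml
outputs:
  output_log:
    type: File?   # optional: absent trim.log no longer fails the run
    outputBinding:
      glob: trim.log
```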

v3 1st draft overview

CWL descriptions for each tool

  • seqprep #4
  • trimmomatic common-workflow-library/legacy#132
    Order of trimming steps can still be made more flexible
  • "biopython"

    Sequences < 100 nucleotides in length removed.

    • need to find the script
  • hmmer
  • FragGeneScan
    • how is this used at EBI?
    • updated at #3
  • InterProScan
  • "QIIME"

    "16S rRNA are annotated using the Greengenes reference database (default closed-reference OTU picking protocol with Greengenes 13.8 reference with reverse strand matching enabled)."

    • Which parts of QIIME are used and how?

ONT Pipeline

We need to start thinking about how we should run ONT data

  • QC using poretools
  • Functional assignments using nhmmer and pfam
  • The functional assignments look really good; need a Pfam2GO, or Pfam -> InterPro2GO
  • Kmer based taxonomy profiling
  • Standard SSU/LSU approach; may need to take the top hit? Does mapseq do anything different?

Comment about samtools in docker file

ebi-metagenomics-cwl/tools/infernal-Dockerfile

Is that comment obsolete in this Dockerfile? I think this just installs Infernal. Also, we need to pin a specific version. Is that version correct?

V3 missing bits and bobs

All files/directories can be found here /hps/nobackup/production/metagenomics/CWL/data/EMGv3_0/ERP009703/results/ERR770958_MERGED_FASTQ:

  • ERR770958_MERGED_FASTQ.fasta.submitted.count - I do not think that we have counts of the seq-prep-ed file

  • Missing QC stats; there should be 8 files. You are generating these

  • Missing summary files of counts ERR770958_MERGED_FASTQ_summary.ipr, ERR770958_MERGED_FASTQ_summary

  • taxonomy-summary files. I think we are generating the krona files? Missing 3:
    kingdom-counts.txt, krona.html, krona-input.txt

  • Missing the following files from the taxonomy steps

  1. ERR770958_MERGED_FASTQ_otu_table_hdf5.biom
  2. ERR770958_MERGED_FASTQ_qiime_assigned_taxonomy.txt
  3. uclust_ref_picked_otus
  • sequence categorisation step outputs in sequence-categorisation are missing. We are generating many of the fasta files.

  • tRNA sequences

  • stepChunkingAndCompression step is missing?

  • standalone ExpressionTool to re-write all the CWL outputs to match the Python workflow outputs

Add binning to assembly pipeline

Add simple binning to the assembly pipeline:

  • Install bbmap tool
  • Install MetaBat
  • See if we can get QC files from metaspades
  • map sequences to assembly, CWL implementation
  • metabat CWL implementation
  • Investigate MaxBins and MetaWatt
  • Should we use CheckM or something similar to assess quality?
