amp-pd / amp-pd-workflows

Pipeline definitions used to construct AMP-PD data products

License: BSD 3-Clause "New" or "Revised" License

WDL 100.00%

amp-pd-workflows's Introduction

amp-pd-workflows

Pipeline definitions used to construct AMP-PD data products

amp-pd-workflows's People

wnojopra mbookman sterding bwh-bioinformatics-hub jboktor

amp-pd-workflows's Issues

Add GatherVCFS workflow

A workflow was run in Terra on the annotated VCFs that produced per-chromosome VCFs using GATK GatherVcfsCloud.

Add to this repository:
README
WDL
inputs.json

Add collect-rna-seq-metrics workflow

For AMP PD RNA data, a collect-rna-seq-metrics workflow was run.

Add Salmon workflow

For AMP PD RNA data, a salmon workflow was run in Terra.
Add to this repository:

README describing the setup of the index file
WDL
inputs.json

Add workflow for turning a file list into a samples entity

AMP PD workflows in Terra are driven by Terra entity records.
Inputs are read from the Terra entity, and output paths are written back to the entity.

We have a workflow that starts from a list of files:

SAMPLE1    file1
SAMPLE1    file2

and turns those into sample entities with arrays of files for each sample.

Add that workflow here.
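
As a rough illustration of the grouping step described above (not the actual AMP PD workflow), the following Python sketch collects a file list into one record per sample. It assumes each line holds a sample ID and a file path separated by whitespace, and that the result is loaded into Terra as a TSV with an `entity:sample_id` header and a JSON array in the file column. The function names, the `files` column name, and the example bucket paths are all hypothetical.

```python
import json
from collections import defaultdict


def group_files_by_sample(lines):
    """Map each sample ID to the ordered list of files listed for it.

    Assumes one "<sample_id> <file_path>" pair per line, separated by
    whitespace (a tab or spaces). Blank lines are ignored.
    """
    files_by_sample = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError("expected '<sample_id> <file_path>', got: " + line)
        sample_id, file_path = parts
        files_by_sample[sample_id].append(file_path)
    return dict(files_by_sample)


def to_entity_tsv(files_by_sample, column="files"):
    """Render the grouping as a TSV with one row per sample.

    The "entity:sample_id" header and the JSON-array cell encoding are
    assumptions about the Terra load format; "files" is a made-up column name.
    """
    rows = ["entity:sample_id\t" + column]
    for sample_id in sorted(files_by_sample):
        rows.append(sample_id + "\t" + json.dumps(files_by_sample[sample_id]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    listing = [
        "SAMPLE1\tgs://example-bucket/SAMPLE1_R1.fastq.gz",
        "SAMPLE1\tgs://example-bucket/SAMPLE1_R2.fastq.gz",
    ]
    print(to_entity_tsv(group_files_by_sample(listing)), end="")
```

A WDL task could wrap similar logic; the sketch only shows the data transformation itself.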

Add featureCounts workflow

For AMP PD RNA data, a featureCounts workflow was run in Terra on the BAMs.
Add to this repository:

README describing the setup of the index file
WDL
inputs.json

Add STAR workflow

For AMP PD RNA data, a STAR workflow was run in Terra.
Add to this repository:

README describing the setup of the index file
WDL
inputs.json

Add WGS joint genotyping workflow

For AMP PD WGS data, after single sample processing, a version of the Broad's joint discovery workflow was run on Terra.

The original source is the joint-discovery-gatk4.wdl from:

https://github.com/gatk-workflows/gatk4-germline-snps-indels

The version for AMP PD was modified with very small operational (not scientific) changes:

Add support for WDL maxRetries (to retry on Pipelines API error 10)
Add support for requester pays buckets (GenomicsDBImport ... --gcs-project-for-requester-pays)
Increase memory for the VariantRecalibrator step
Fix path to gsutil in the GatherMetrics step (https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/41)
Change the merge_count to produce fewer intervals, getting under the limit of 50,000 tasks in a workflow (https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/40)

Add WGS single sample workflow

For single sample WGS processing, AMP PD has used a CCDG-compliant workflow from the Broad Institute here:

https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels

Add to this repository the specific version used, including:

README
WDL (PairedEndSingleSampleWf.wdl)
inputs.json

Add samtools workflow

For AMP PD RNA data, a samtools workflow was run in Terra on the STAR BAMs.
Add to this repository:

README describing the setup of the index file
WDL
inputs.json

FR: review workflow definitions/inputs for explicit zones

In the RNASeq workflows, we have default inputs like:

  "RNAAlignment.runtime_zones": "us-central1-a us-central1-b",

The reason we explicitly set the zones is that AMP PD data is in us-central1, including reference files:

  "RNAAlignment.star_index": "gs://amp-pd-transcriptomics/inputs/reference/star/STAR_genome_GencodeV29_oh125.tar.gz",

  "RNACollectMultipleMetrics.reference_sequence_file": "gs://amp-pd-transcriptomics/inputs/reference/gencode/GRCh38.primary_assembly.genome.fa.gz",

  "RNACollectRnaSeqMetrics.ref_flat_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.Reflat_Picard.txt",
  "RNACollectRnaSeqMetrics.ribosomal_intervals_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.RibosomalLocations.txt",

  "RNAQuantification.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",
  "RNAQuantification.salmon_index": "gs://amp-pd-transcriptomics/inputs/reference/salmon/Homo_sapiens.gencode.v29.all.salmon0.11.3_gencodeOption.index.tar.gz",

  "RNASummarization.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",

This is all good in the context of AMP PD. We want to locate compute in the same region as the data in order to avoid network egress charges.

However, if someone uses these workflows, they may not notice the hard-coded zones. Those zones might be a mismatch for the location of their data and could result in unnecessary egress costs.

We have a couple of options here:

remove the hard-coded zones
highlight and explain the hard-coded zones

In this particular case, I am actually inclined to do the latter. I think that having the hard-coded zones makes sense in this context. If we didn't have the reference files, I think I would suggest the former.

So we should add some verbiage to the rna/README.md in this repo as well as the workspace description in Terra. The language should be something like:

Overview

The workflows in this repository...

Regional considerations

The *inputs.json files for the workflows in this repository explicitly set the location of compute VMs to zones in the GCP us-central1 region. For example:

   "RNAAlignment.runtime_zones": "us-central1-a us-central1-b",

The reason for this is that AMP PD data is stored in us-central1, including both the RNASeq sample input files and the reference files (such as the gencode assembly files in the bucket gs://amp-pd-transcriptomics). When we processed AMP PD data, we wanted to locate compute in the same region as the data in order to avoid network egress charges.

If you use these workflows (in Terra or otherwise on Google Cloud Platform), be sure to set the compute zones to the same location as your input files. If you want to use the same reference files as were used in these workflows, and your data is NOT in a us-central1 regional or US multi-region bucket, you should copy the reference file(s) to a bucket in the same region as your data, at least for the duration that you run the workflows.

Detailed steps

1- Upload a list of samples ...
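
To make the hard-coded zones harder to overlook, a small check along the following lines could be run against an inputs.json before launching a workflow. This is only a sketch: it assumes that every zone setting uses a key ending in `.runtime_zones` with a space-separated string value (as in the `RNAAlignment.runtime_zones` example above), and that GCP zone names take the form `<region>-<letter>`. The function names are hypothetical and not part of this repository.

```python
import json


def zone_to_region(zone):
    """Drop the trailing zone letter, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def find_zone_mismatches(inputs, data_region, suffix="runtime_zones"):
    """Return {input key: [zones outside data_region]} for zone settings.

    Only keys ending in '.<suffix>' with string values are inspected; the
    shared suffix across workflows is an assumption based on the
    RNAAlignment example.
    """
    mismatches = {}
    for key, value in inputs.items():
        if not key.endswith("." + suffix) or not isinstance(value, str):
            continue
        outside = [z for z in value.split() if zone_to_region(z) != data_region]
        if outside:
            mismatches[key] = outside
    return mismatches


if __name__ == "__main__":
    inputs = json.loads(
        '{"RNAAlignment.runtime_zones": "us-central1-a us-central1-b"}'
    )
    # The data region is supplied by the user; 'europe-west1' is just an example.
    for key, zones in find_zone_mismatches(inputs, "europe-west1").items():
        print(key + " uses zones outside the data region: " + ", ".join(zones))
```

The check only compares zone names against a region the user supplies; it does not look up where a bucket actually lives, so the data region still has to be known in advance.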

Add collect-multiple-metrics workflow

For AMP PD RNA data, a collect-multiple-metrics workflow was run.
