amp-pd-workflows's Introduction

amp-pd-workflows

Pipeline definitions used to construct AMP-PD data products

amp-pd-workflows's People

Contributors

wnojopra, mbookman, dvismer-technome, ha-duong-technome

amp-pd-workflows's Issues

Add GatherVCFS workflow

A workflow was run in Terra on the annotated VCFs; it produced per-chromosome VCFs using GATK GatherVcfsCloud.

Add to this repository:

  • README
  • WDL
  • inputs.json

Add Salmon workflow

For AMP PD RNA data, a salmon workflow was run in Terra.
Add to this repository:

  • README describing the setup of the index file
  • WDL
  • inputs.json

Add workflow for turning a file list into a samples entity

AMP PD workflows in Terra are driven by Terra entity records.
Inputs are read from the Terra entity, and output paths are written back to the entity.

We have a workflow that starts from a list of files:

SAMPLE1file1
SAMPLE1file2

and turns those into sample entities with arrays of files for each sample.

Add that workflow here.
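
The grouping step described above can be sketched in Python. This is only an illustration, not the workflow that was actually run: it assumes each input line holds a sample ID and a file path separated by a tab, the function names `group_files_by_sample` and `to_entity_tsv` are hypothetical, and the `entity:sample_id` header plus the JSON-list encoding of the file array are assumptions about the load-file format that should be checked against the Terra documentation.

```python
import json
from collections import defaultdict


def group_files_by_sample(lines):
    """Group "sample<TAB>file" rows into {sample_id: [file, ...]}.

    Assumption: one tab separates the sample ID from the file path.
    File order within a sample follows the order of the input lines.
    """
    samples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample_id, file_path = line.split("\t", 1)
        samples[sample_id].append(file_path)
    return dict(samples)


def to_entity_tsv(samples, column="files"):
    """Render one row per sample, with the file list encoded as a JSON array.

    Assumption: the "entity:sample_id" header and JSON array encoding
    match what the target workspace expects.
    """
    rows = ["entity:sample_id\t" + column]
    for sample_id, files in samples.items():
        rows.append(sample_id + "\t" + json.dumps(files))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    file_list = ["SAMPLE1\tfile1", "SAMPLE1\tfile2"]
    print(to_entity_tsv(group_files_by_sample(file_list)), end="")
```

Keeping the grouping separate from the rendering makes it possible to reuse the grouped dictionary for other output formats if the entity layout differs.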

Add featureCounts workflow

For AMP PD RNA data, a featureCounts workflow was run in Terra on the BAMs.
Add to this repository:

  • README describing the setup of the index file
  • WDL
  • inputs.json

Add STAR workflow

For AMP PD RNA data, a STAR workflow was run in Terra.
Add to this repository:

  • README describing the setup of the index file
  • WDL
  • inputs.json

Add WGS joint genotyping workflow

For AMP PD WGS data, after single sample processing, a version of the Broad's joint discovery workflow was run on Terra.

The original source is the joint-discovery-gatk4.wdl from:

https://github.com/gatk-workflows/gatk4-germline-snps-indels

The version for AMP PD was modified with very small operational (not scientific) changes:

Add samtools workflow

For AMP PD RNA data, a samtools workflow was run in Terra on the STAR BAMs.
Add to this repository:

  • README describing the setup of the index file
  • WDL
  • inputs.json

FR: review workflow definitions/inputs for explicit zones

In the RNASeq workflows, we have default inputs like:

  "RNAAlignment.runtime_zones": "us-central1-a us-central1-b",

The reason we explicitly set the zones is that AMP PD data is in us-central1, including reference files:

  "RNAAlignment.star_index": "gs://amp-pd-transcriptomics/inputs/reference/star/STAR_genome_GencodeV29_oh125.tar.gz",

  "RNACollectMultipleMetrics.reference_sequence_file": "gs://amp-pd-transcriptomics/inputs/reference/gencode/GRCh38.primary_assembly.genome.fa.gz",

  "RNACollectRnaSeqMetrics.ref_flat_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.Reflat_Picard.txt",
  "RNACollectRnaSeqMetrics.ribosomal_intervals_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.RibosomalLocations.txt",

  "RNAQuantification.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",
  "RNAQuantification.salmon_index": "gs://amp-pd-transcriptomics/inputs/reference/salmon/Homo_sapiens.gencode.v29.all.salmon0.11.3_gencodeOption.index.tar.gz",

  "RNASummarization.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",

This is all good in the context of AMP PD. We want to locate compute in the same region as the data in order to avoid network egress charges.

However, if someone uses these workflows, they may not notice the hard-coded zones. Those zones might be a mismatch for the location of their data and could result in unnecessary egress costs.

We have a couple of options here:

  1. remove the hard-coded zones
  2. highlight and explain the hard-coded zones

In this particular case, I am actually inclined to do the latter. I think that having the hard-coded zones makes sense in this context. If we didn't have the reference files, I think I would suggest the former.

So we should add some verbiage to the rna/README.md in this repo as well as to the workspace description in Terra. The language should be something like:

Overview

The workflows in this repository...

Regional considerations

The *inputs.json files for the workflows in this repository explicitly set the location of compute VMs to zones in the GCP us-central1 region. For example:

   "RNAAlignment.runtime_zones": "us-central1-a us-central1-b",

The reason for this is that AMP PD data is stored in us-central1, including both the RNASeq sample input files and the reference files (such as the gencode assembly files in the bucket gs://amp-pd-transcriptomics). When we processed AMP PD data, we wanted to locate compute in the same region as the data in order to avoid network egress charges.

If you use these workflows (in Terra or otherwise on Google Cloud Platform), be sure to set the compute zones to the same location as your input files. If you want to use the same reference files that were used in these workflows, and your data is NOT in a us-central1 regional or US multi-region bucket, you should copy the reference file(s) to a bucket in the same region as your data, at least for as long as you are running the workflows.

Detailed steps

1- Upload a list of samples ...
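
To complement the README note proposed above, a small check could flag zone settings that do not match the region of a user's data. The following Python sketch is an illustration only: the function name `find_zone_mismatches` is hypothetical, and it assumes zone settings live under keys ending in `.runtime_zones` (as in the `RNAAlignment.runtime_zones` example) and hold a space-separated list of zones named `<region>-<letter>`.

```python
import json


def find_zone_mismatches(inputs, expected_region):
    """Return {key: [zone, ...]} for zones that fall outside expected_region.

    Assumptions: zone settings use keys ending in ".runtime_zones" and
    values are space-separated zone names such as "us-central1-a".
    """
    mismatches = {}
    for key, value in inputs.items():
        if not key.endswith(".runtime_zones"):
            continue
        zones = str(value).split()
        # A zone name is its region plus a trailing "-<letter>" suffix.
        bad = [z for z in zones if z.rsplit("-", 1)[0] != expected_region]
        if bad:
            mismatches[key] = bad
    return mismatches


if __name__ == "__main__":
    example = json.loads(
        '{"RNAAlignment.runtime_zones": "us-central1-a us-central1-b"}'
    )
    # Compare the configured zones against the region that holds your data.
    print(find_zone_mismatches(example, "us-central1"))
    print(find_zone_mismatches(example, "europe-west1"))
```

In practice the dictionary would be loaded from one of the inputs.json files with `json.load`, and an empty result would mean every configured zone sits in the expected region.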
