Pipeline definitions used to construct AMP-PD data products
amp-pd / amp-pd-workflows
License: BSD 3-Clause "New" or "Revised" License
A workflow was run in Terra on the annotated VCFs that produced per-chromosome VCFs using GATK GatherVcfsCloud.
Add to this repository:
README
WDL
inputs.json
For AMP PD RNA data, a collect-rna-seq-metrics workflow was run.
For AMP PD RNA data, a salmon workflow was run in Terra.
Add to this repository:
AMP PD workflows in Terra are driven by Terra entity records.
Inputs are read from the Terra entity, and output paths are written back to the entity.
We have a workflow that starts from a list of files:
SAMPLE1file1
SAMPLE1file2
and turns those into sample entities with arrays of files for each sample.
Add that workflow here.
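As a rough illustration of the grouping step described above (not the actual AMP PD workflow, which is not shown here), the sketch below assumes the file list arrives as two columns, a sample ID and a file name. The function name `group_files_by_sample` and the record keys `sample_id` and `files` are hypothetical.

```python
from collections import defaultdict


def group_files_by_sample(rows):
    """Group (sample_id, file_name) pairs into one record per sample.

    The two-column input shape is an assumption made for this sketch.
    """
    files_by_sample = defaultdict(list)
    for sample_id, file_name in rows:
        files_by_sample[sample_id].append(file_name)
    # One entity-like record per sample; files keep their input order.
    return [
        {"sample_id": sample_id, "files": files}
        for sample_id, files in sorted(files_by_sample.items())
    ]


rows = [
    ("SAMPLE1", "file1"),
    ("SAMPLE1", "file2"),
    ("SAMPLE2", "file1"),
]
entities = group_files_by_sample(rows)
```

Each resulting record carries an array of files for its sample, which is the shape a per-sample entity would need.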
For AMP PD RNA data, a featureCounts workflow was run in Terra on the BAMs.
Add to this repository:
For AMP PD RNA data, a STAR workflow was run in Terra.
Add to this repository:
For AMP PD WGS data, after single sample processing, a version of the Broad's joint discovery workflow was run on Terra.
The original source is the joint-discovery-gatk4.wdl from:
https://github.com/gatk-workflows/gatk4-germline-snps-indels
The version for AMP PD was modified with very small operational (not scientific) changes:
maxretries (to retry on Pipelines API error 10)
GenomicsDBImport ... --gcs-project-for-requester-pays
VariantRecalibrator step
gsutil in the GatherMetrics step (https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/41)
merge_count to produce fewer intervals, getting under the limit of 50,000 tasks in a workflow (https://github.com/gatk-workflows/gatk4-germline-snps-indels/issues/40)
For single sample WGS processing, AMP PD has used a CCDG-compliant workflow from the Broad Institute here:
https://github.com/gatk-workflows/broad-prod-wgs-germline-snps-indels
Add to this repository, the specific version used, including:
For AMP PD RNA data, a samtools workflow was run in Terra on the STAR BAMs.
Add to this repository:
In the RNASeq workflows, we have default inputs like:
"RNAAlignment.runtime_zones": "us-central1-a us-central1-b",
We explicitly set the zones because AMP PD data is in us-central1, including reference files:
"RNAAlignment.star_index": "gs://amp-pd-transcriptomics/inputs/reference/star/STAR_genome_GencodeV29_oh125.tar.gz",
"RNACollectMultipleMetrics.reference_sequence_file": "gs://amp-pd-transcriptomics/inputs/reference/gencode/GRCh38.primary_assembly.genome.fa.gz",
"RNACollectRnaSeqMetrics.ref_flat_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.Reflat_Picard.txt",
"RNACollectRnaSeqMetrics.ribosomal_intervals_file": "gs://amp-pd-transcriptomics/inputs/reference/picard/gencode.v29.primary_assembly.annotation.RibosomalLocations.txt",
"RNAQuantification.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",
"RNAQuantification.salmon_index": "gs://amp-pd-transcriptomics/inputs/reference/salmon/Homo_sapiens.gencode.v29.all.salmon0.11.3_gencodeOption.index.tar.gz",
"RNASummarization.gene_map": "gs://amp-pd-transcriptomics/inputs/reference/gencode/gencode.v29.primary_assembly.annotation.gtf.gz",
This is all good in the context of AMP PD. We want to locate compute in the same region as the data in order to avoid network egress charges.
However, if someone uses these workflows, they may not notice the hard-coded zones. Those zones might be a mismatch for the location of their data and could result in unnecessary egress costs.
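One way to make the hard-coded locations easier to spot is to scan an inputs file for zone settings and bucket names. The sketch below is only illustrative: `summarize_locations` is a hypothetical helper, and it assumes that zone inputs are string values whose keys end in `.runtime_zones` (only `RNAAlignment.runtime_zones` is shown in this issue).

```python
import json


def summarize_locations(inputs):
    """Return (zone settings, bucket names) found in an inputs mapping."""
    zones = {}
    buckets = set()
    for key, value in inputs.items():
        if not isinstance(value, str):
            continue
        if key.endswith(".runtime_zones"):
            # Zones are given as one space-separated string.
            zones[key] = value.split()
        elif value.startswith("gs://"):
            # The bucket is the first path component after the scheme.
            buckets.add(value[len("gs://"):].split("/", 1)[0])
    return zones, sorted(buckets)


inputs = json.loads("""
{
  "RNAAlignment.runtime_zones": "us-central1-a us-central1-b",
  "RNAAlignment.star_index": "gs://amp-pd-transcriptomics/inputs/reference/star/STAR_genome_GencodeV29_oh125.tar.gz"
}
""")
zones, buckets = summarize_locations(inputs)
```

The resulting zone list and bucket list can then be compared by hand against where the buckets are actually located.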
We have a couple of options here:
In this particular case, I am actually inclined to do the latter. I think that having the hard-coded zones makes sense in this context. If we didn't have the reference files, I think I would suggest the former.
So we should add some verbiage to the rna/README.md in this repo as well as the workspace description in Terra. The language should be something like:
The workflows in this repository...
The *inputs.json files for the workflows in this repository explicitly set the location of compute VMs to zones in the GCP us-central1 region. For example:
"RNAAlignment.runtime_zones": "us-central1-a us-central1-b",
The reason for this is that AMP PD data is stored in us-central1, including both the RNASeq sample input files and the reference files (such as the gencode assembly files in the bucket gs://amp-pd-transcriptomics). When we processed AMP PD data, we wanted to locate compute in the same region as the data in order to avoid network egress charges.
If you use these workflows (in Terra or otherwise on Google Cloud Platform), be sure to set the compute zones to the same location as your input files. If you want to use the same reference files as were used in these workflows, and your data is NOT in a us-central1 regional or US multi-region bucket, you should copy the reference file(s) to a bucket in the same region as your data, at least for the duration that you run the workflows.
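A quick check along these lines might look like the following sketch. It assumes you already know the region of your data bucket and pass it in yourself (a multi-region bucket has no single region to compare against), and that zone inputs use keys ending in `.runtime_zones`. The helper names `zone_to_region` and `mismatched_zones` are hypothetical.

```python
def zone_to_region(zone):
    """Derive the region from a GCP zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    return zone.rsplit("-", 1)[0]


def mismatched_zones(inputs, data_region):
    """Return zone settings that fall outside the region holding the data."""
    problems = {}
    for key, value in inputs.items():
        if isinstance(value, str) and key.endswith(".runtime_zones"):
            outside = [z for z in value.split() if zone_to_region(z) != data_region]
            if outside:
                problems[key] = outside
    return problems


inputs = {"RNAAlignment.runtime_zones": "us-central1-a us-central1-b"}
same_region = mismatched_zones(inputs, "us-central1")
other_region = mismatched_zones(inputs, "europe-west1")
```

An empty result means every configured zone sits in the data's region; any entries returned are the zones to change before running the workflows.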
1- Upload a list of samples ...
For AMP PD RNA data, a collect-multiple-metrics workflow was run.