Giter VIP home page Giter VIP logo

singletons-bash's Introduction

Singletons-bash

Bash scripts needed for singleton scores and stats calculation

This bash version of the pipeline is based on the work presented in:

Mezzavilla M, Cocca M, Guidolin F, Gasparini P. A population-based approach for gene prioritization in understanding complex traits. Hum Genet. 2020;139(5):647-655. doi:10.1007/s00439-020-02152-4


DISCLAIMER

This is the very first version of the pipeline, meant to be shared quickly while we work on a more polished and portable Snakemake version.

There are some hard coded parameters you should modify if you plan to run the scripts on a cluster not based on the SGE queue system:

  • in the folder there is a ja_runner_par.sh script to be used on LSB cluster: you can use this script to modify the job array submission syntax in the VPP_GR2018_pipeline.sh script (lines 78 and 82)
  • you should modify job submission syntax according to your cluster on lines 86 and 91 in VPP_GR2018_pipeline.sh

In case you need any help, for code related problems, please open an issue in the repository or drop a line to:

  • massimiliano [dot] cocca [at] burlo [dot] trieste [dot] .it

In case you need any help, for method related issues, please drop a line to:

  • massimo [dot] mezzavilla [at] burlo [dot] trieste [dot] .it

Requirements

  • mysql (to retrieve information from genome-mysql.cse.ucsc.edu)
  • Python <= 3.6
  • bedtools
  • vcftools
  • bcftools

Pipeline steps:

  1. Provide genomic region input

    • Get gene info from GENCODE data (if requested)
    • Provide a genomic region using chr:start-end informations
  2. Calculate singletons for the study population

    • Computationally intensive step
    • This step will be performed in any case, and once per population, so the data can be reused
  3. Calculate singleton stats

    • This step will allow two option:
      1. Genomic region stats

        • Singleton density for each gene across all samples
        • Singleton count for each gene across all samples
        • Singleton scores (SSC and DSC)
      2. Sample level stats (in development )

        • Genome wide singleton density for each sample
        • Singleton density for each sample in the provided genomic region
        • Singleton count for each sample in the provided genomic region

Input files

  • Multisample vcf files for the study population
  • Bed files for genomic regions comprising whole gene regions and coding regions only (examples provided to extract the correct data format from GENCODE data)

Sample command

Generate genomic regions bed files for b38 using GENCODE v35 data:

  • Extract only genes annotated as protein coding:
zcat [..]/hg38/gencode.v35.annotation.gff3.gz| grep -E -v "^#"| awk '$3=="gene"'| fgrep "protein_coding" > [..]/hg38/gencode.v35.annotation.gff3.GENE.bed
zcat [..]/hg38/gencode.v35.annotation.gff3.gz| grep -E -v "^#"| awk '$3=="CDS"'| fgrep "transcript_type=protein_coding" > [..]/hg38/gencode.v35.annotation.gff3.CDS.bed
  • Generate the final formatted bed files:
cut -f 1,4,5,9 [..]/hg38/gencode.v35.annotation.gff3.GENE.bed | awk 'BEGIN{OFS="\t"}{split($4,a,";");split(a[2],g,"=");split(a[4],n,"="); print $1,$2,$3,n[2],g[2]}' > [..]/hg38/gencode.v35.annotation_GENES.bed
cut -f 1,4,5,9 [..]/hg38/gencode.v35.annotation.gff3.CDS.bed | awk 'BEGIN{OFS="\t"}{split($4,a,";");split(a[3],g,"=");split(a[6],n,"=");split(a[2],p,"=");split(a[4],t,"="); print $1,$2,$3,n[2],g[2],p[2],t[2]}' > [..]/hg38/gencode.v35.annotation_CDS.bed
  • Split bed files by chromosome for parallelization:
ref_seq="hg38/gencode.v35.annotation_GENES.bed"
cds_uniq="hg38/gencode.v35.annotation_CDS.bed"
outfolder="hg38/CHR"
mkdir -p ${outfolder}
for chr in {1..22} X Y M
do
# chr=1
    awk -v c=${chr} '$1=="chr"c' ${ref_seq} > ${outfolder}/gencode.v${gencode_v}.${chr}.annotation_GENES.bed
    cut -f 1,5 ${outfolder}/gencode.v${gencode_v}.${chr}.annotation_GENES.bed > ${outfolder}/gencode.v${gencode_v}.${chr}.GENES.bed
    awk -v c=${chr} '$1=="chr"c' ${cds_uniq} > ${outfolder}/gencode.v${gencode_v}.${chr}.annotation_CDS.bed
done
  • Run the singleton scores calculation for our study population:
pop="TEST_POP"
genome_build="GRCh38"
for chr in {1..22} X
do
win_size=50000
out_folder=[..]/SINGLETONS/20201021/${pop}/${chr}
pop_vcf=[base-population-folder]/${pop}/${chr}.vcf.gz

#the code is actually written to run on an SGE cluster, but can be modified to run on all sort of clusters
m=5G #memory requiremente for job submission on a queue system
q="fast" #queue to submit the job

[your-script-folder]/VPP_GR2018_pipeline.sh -S ${chr} ${win_size} ${out_folder} ${pop_vcf} ${genome_build} ${m} ${q}

done
  • Calculate singleton scores
pop="TEST_POP"

for chr in {1..22} X   
do
#list of uniq gene ids, by chr that we generated earlier
ref_seq=../hg38/CHR/gencode.v35.${chr}.annotation_GENES.bed
cds_uniq=../hg38/CHR/gencode.v35.${chr}.annotation_CDS.bed

sing_file=[base-population-folder]/${pop}/${chr}/singletons/${chr}.ALL.samples.singletons #this file come from the previous step
out_path=[base-output-path]/SINGLETONS/20201021/${pop}/singleton_score
outfile=${out_path}/${pop}_SING_SCORE.${chr}.txt
log_folder=${out_path}/logs

job_name=s_score_${pop}_${chr}
#the code is actually written to run on an SGE cluster, but can be modified to run on all sort of clusters
qu="fast"

mkdir -p ${log_folder}

echo "[your-script-folder]/singletons_score.sh ${ref_seq} ${cds_uniq} ${sing_file} ${outfile} POP"|qsub -o ${log_folder}/\$JOB_ID_${pop}_${chr}.log -e ${log_folder}/\$JOB_ID_${pop}_${chr}.e -V -N ${job_name} -l h_vmem=4G -q ${qu}

done

singletons-bash's People

Contributors

macsx82 avatar

Watchers

 avatar James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.