Giter VIP home page Giter VIP logo

gene_expression_prediction_1's Introduction

Prediction of gene expression

This script is designed to predict allele-specific expressian and total gene expression using allelic fold change (aFC). See the manuscript for method description.

More information on estimating aFCs, could be found here.

Inputs

aFC file

In the data folder there is a file named aFC_Whole_Blood.txt. This is a tab delimited file providing allelic fold change for each eQTL. In this file for each variant in columnn "variant_id" we have the following information:

  • gene_id : gene associated to that variant

  • rank : rank of that variant in eQTL calling process

  • CHROM : the chromosome of the variant

  • POS : position of that variant

  • REF : The Reference allele

  • ALT : The Alternative allele

  • log2_aFC : The effect size of the variant

VCF file

A phased vcf file is required to extract the genotypes of individuals. A sample VCF file named "vcf_sample.txt" could be found in the data folder.

Genotype, encoded as allele values separated by either of / or |. " /" means genotype unphased and "|" means genotype phased. The allele values are 0 for the reference allele (what is in the REF field), 1 for the allele listed in ALT. For diploid calls examples could be 0/1, 1|0.

The REF and ALT information should match the REF and ALT information in aFC_Whole_Blood.txt

If a tabix index is generated for the vcf file, in order to extract the subset of the vcf file you can use the following command in which "IDs.txt" repesents the required variants. A sample of this file is available in the data folder.

for line in $(cat ~/data/IDs.txt)
    do
    id=`echo ${line} | sed 's/\"/\t/g'`
    chr=`echo ${id} | sed 's/_/\t/g' | cut -f 1`
    pos=`echo ${id} | sed 's/_/\t/g' | cut -f 2`
    echo ${id}
    echo ${chr}
    echo ${pos}
    tabix vcf_file ${chr}:${pos}-${pos} | grep -w $id >> vcf_sample.txt
    done

Here is an example to predict the expression of a specific gene for an individual using these inputs

# aFC file
aFC_file = "data/aFC_Whole_Blood.txt"

#VCF sample file
VCF_file = "data/sample_vcf.txt"


# Name of gene that we want to predict its expression
gene_id = "gene_id"

# Name of individual that we want to predict its expression
individual_id = "individual_id"

#########################

expression_prediction_gene_individual<-function(aFC_file,vcf_file,gene_id,individual_id){
    # read aFC file
    aFC_df=read.table(aFC_file, header=TRUE, sep="\t")

    #read VCF file 
    genotype_info= read.table(vcf_file, header=TRUE, sep = "\t")
   

    # get the afc vector for a specific gene 
    # the function definition is available in R folder
    aFC_vector<-aFC_gene_vector(gene_id,aFC_df)
   

    # get the afc vector for a specific gene 
    # the function definition is available in R folder
    variant_vector<-variant_gene_vector(gene_id,aFC_df)


    #null genotypes
    genotype_h1 = c()
    genotype_h2 = c()

# assuming variant and individual_id exist in vcf file
    for (variant in variant_vector){
        
        genotype<-genotype_info[genotype_info$ID == variant,individual_id]      
        genotype_split_full <-unlist(strsplit(as.character(genotype),":"))[1]
        genotype_split<-unlist(strsplit(as.character(genotype),"|"))
        genotype_h1<-c(genotype_h1,genotype_split[1])
        genotype_h2<-c(genotype_h2,genotype_split[3])
    }

    
    #get the expression output vector for two haplotypes
    # the function definition is available in R folder
    #output[1] : log expression for haplotype 1 in log2 scale
    #output[2] : gene expression for haplotype 2 in log2 scale
    #output[3] : total expression in log2 scale
    result<-gene_expression_estimation(as.numeric(genotype_h1),as.numeric(genotype_h2),aFC_vector)
    
    return(result)
    }

Resources

R/gene_expression_estimation_functions.ipynb

This notebook provides functions in R used in the above example that will read the aFC file and predict the gene expression.

R/gene_expression_lookupTable.R

This R script gets a sorted aFC file (sorted based on gene_id), counts the number of variants for each gene and produces lookup tables representing expression values for all genotypes. The lookup tables are generated based on the number of the eQTLs for each gene, thus "lookupTable_variantNo_2" includes genes with two eQTLS. To run the script use the following command:

    Rscript gene_expression_lookupTable.R data\aFC_Whole_Blood.txt

python/ASE_prediction.ipynb

This script uses the lookup tables to predict expression for each haplotype, reading the individual genotypes from vcf file.

gene_expression_prediction_1's People

Contributors

navaehsan avatar

Stargazers

qianche avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.