labrazil / coetzee_seq_analysis Goto Github PK

Coetzee Lab Data Analysis

R 100.00%

coetzee_seq_analysis's Introduction

New URL: http://coetzeeseq.usc.edu/publication/Coetzee_SG_et_al_2012/

Abstract/Summary:
Single nucleotide polymorphisms (SNPs) are increasingly used to tag genetic loci
associated with phenotypes such as risk of complex diseases. Technically, this
is done genome-wide without prior restriction or knowledge of biological
feasibility in scans referred to as genome-wide association studies (GWAS).
Depending on the linkage-disequilibrium (LD) structure at a particular locus,
such tagSNPs may be surrogates for many thousands of other SNPs, and it is
difficult to distinguish those that may play a functional role in the phenotype
from those simply genetically linked. Because a large proportion of tagSNPs have
been identified within non-coding regions of the genome, distinguishing
functional from non-functional SNPs has been an even greater challenge. A
strategy was recently proposed that prioritizes surrogate SNPs based on
non-coding chromatin and epigenomic mapping techniques that have become feasible
with the advent of massively parallel sequencing. Here we introduce an
R/Bioconductor software package that enables the identification of candidate
functional SNPs by integrating information from tagSNP locations, lists of
linked SNPs from the 1000 genomes project, and locations of chromatin features
which may have functional significance. Availability: FunciSNP is available from
Bioconductor (bioconductor.org).

##
May 17, 2012 - Repo made public.

##
May 13, 2012 - FunciSNP manuscript accepted in Nucleic Acids Research.
doi: 10.1093/nar/gks542

##
Feb 3, 2012 - Reorganized folder project. Added new versions of FunciSNP.

new version of TSS.human.GRCh37
from genomebrowser on 2012-03-26
filtered only rows with "hg19.knonwCanonical.transcript" != "na"
no duplicated ensembl id's allowed no "na" ensembl id's allowed

##
September 27, 2011
This project contains all the current working in house scripts for the Coetzee Lab.

##
Each folder represents one in house script. View README within is folder for more information.

coetzee_seq_analysis's People

Contributors

Stargazers

Watchers

Forkers

kcvavra

coetzee_seq_analysis's Issues

same candidate SNP and risk tag SNP, but different LOD and r.2

rs10087810, candidate SNP and risk tag SNP rs13281615.
For first row, LOD is 96.14, second row's LOD is 97.19.
For first row, r.2 is 0.873, second row's r.2 is 0.874.

This kind of issue was arisen last time, but I see more recently both in LOD and r.2

add .R file, and generalize

Perhaps we should think about doing some of the R work in bash, I'll have to look more closely

one candidate SNP found from 2 different risk SNPs give different biofeatures

rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K27Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me1_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me2_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K9Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_HMMEnhancer_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K27Ac_hg19 8:127887851-128887852 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K4me2_hg19 8:127887851-128887852 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K9Ac_hg19 8:127887851-128887852 EUR EUR

rs10087810, candidate SNP is within biofeature of HMMEnhancer and H3K4me1.
However, these two biofeatures were only detected in rs13281615 risk tag SNP results, not in rs1562430 risk tag SNP result.

bed format check earlier

We need to check bed format before running script. When I feed 64 bed files, it didn't output an error for the 60th bed file which didn't have the required column numbers.

I'd like eyes on the newest version of the script

There have been substantial changes to the layout and function of this script
start command with -h option to see how to set up a run. That information
has been copied here:
COMMAND LINE OPTIONS FOR FuncSNPi:
-c : max cpus availible (ex. -c 8)
-g : genome build to use (ex. -g hg19)
-r : location of risk SNP file (ex. -r ~/riskSNP.txt)
-b : location of of biological peaks file, in bed format (ex. -b ~/biopeaks.bed)

            riskSNP file must be in the following tab deliminated format
            <region>                <risk SNP id>   <ethnic group>
            10:122837334-123837335  rs2981579       EUR             <- example

script will not die gracefully

to kill the file when it is running in parallel, find the PID by running
ps aux | grep | grep
and then run
kill
If you just kill the subroutines that we call like java etc, the program will continue running.

inconsistent Rsquared values

Rsquared values change depending upon biological features:

I analyzed BCa candidate functional SNPs results from FuncSNPi, and realized that there are some errors (?).

For instance,
r^2 value of rs999737 and rs79071597 do not match in spite of the fact that same ethnicity groups were used.
The only difference is a biofeature, so theoretically, r^2, D, LOD values from haploview data should not be changed.
Please check the script to make it 100% correct!

L1 L2 D. LOD r.2 CIlow CIhi Dist T.int tagSNP bioFeature location ethnicity genome1000 TagTRUE1 TagTRUE2
rs999737 rs79071597 0.908 50.97 0.639 0.84 0.95 9872 62.05 rs999737 Galaxy13-[MDAMB231-vs-Input_triangel_subpeaks 14:68534681-69534682 EUR EUR TRUE FALSE
rs999737 rs79071597 0.91 51.96 0.644 0.84 0.95 9872 64.26 rs999737 Galaxy9-[HMEC-vs-Input_triangle_subpeaks 14:68534681-69534682 EUR EUR TRUE FALSE

Shorten argument requirement.

Currently a handful of arguments are needed. I think this can be shorten down. For example, instead of listing path to to files and then name of files, we can merge into one and then in the script we parse it.

size of region can be inconsistent

I just realized a situation where calculating the size will be a problem if we base it on the first column. If a tagSNP is near the edge of the chromosome, and we intended to get 1MB, but because the tagSNP is less than 1MB from the start or end of the chromosome, then the calculated value $SIZE will be different. We need to figure out another way to represent this or just remove this feature for now...

check ftp site instead of home directory

Currently we are using the downloaded 1000genomes db but when we publish, we will have to use the ftp site. We need to implement a check. Because there are two mirror sites, it will be important to run the analysis and check which site is up and depending on the users location, direct the link to either NCBI or Ensembl.

reduce.by.features=FALSE <- not work in FunciSNP(ver.0.1.7)

FuncySNP(snp.regions.file="BCa_risk_tagSNP_rs13281615_Feb_24_2012", reduce.by.features=FALSE, par.threads=1, bio.features.TSS=FALSE, search.window=1000000)

----> made an error

canning tabix file for rs13281615
creating snpMatrix for rs13281615
Error in IRanges::unlist(GRangesList(lapply(as.character(bio.features.file), :
error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in IRanges:::newCompressedList("GRangesList", unlistData, end = end, :
'unlistData' not of class GRanges

R » bca<- FuncySNP(snp.regions.file="BCa_risk_tagSNP_rs13281615_Feb_24_2012", reduce.by.features=FALSE, bio.features.loc="bed_files/", par.threads=1, bio.features.TSS=FALSE, search.window=1000000)

-----> made an error

Error in is.data.frame(x) : could not find function "ALL.geno"

=====> reduce.by.feautres=FALSE tool does not seem to work.

Make haploview LD block and haplotype

Using the new version of FunciSNP (R), I would like to make a LD block and check the haplotype.
Could you include this function in the new FunciSNP R package?

Only plot those with a specified cutoff

Since Haploview takes a long time to generate svg, we need to first determine the selected regions by a cutoff and then only plot those. More reason why we need to build in R.

Make loop for all 4 ethnicities.

Currently the script has a section for each ethnic group (eur, afr, asn, all). Need to clean up and merge into one loop.

need to characterize biological features with r.2 cutoff

Currently we are taking only the overlapping SNP and not the biological peaks that overlap SNP. When we identify SNP that pass a r.2 value, we should also retain information about the associated peak. This way we can classify and characterize all biological features with the associated tagSNP and it's surrogate. We will need to know the distance to the nearest gene, if it's intergenic, gene body, etc. (basically run annotatePeaks on them). Then we can plot the values. Another reason we need to create this in R.

ERROR- one SNP in one biofeature per one risk tagSNP?

I am running the newest version of FunciSNP(source the .R file from gitfolder), and error occurred... T,T

The new FunciSNP folder is here,
/media/bigboy/suhn/New_FunciSNP/

bed files are in the sub-folder, bed_files/
/media/bigboy/suhn/New_FunciSNP/bed_files

Basically, error occurred after scanning all biofeatures in all ethnic groups for the first risk tag SNP.
I guess when the second risk tag SNP, rs2981579 is used to scan with first biofeature, Enhancer_HMEC_CTCF_hg19.bed, the error seems to occur due to data lines?

This is the error message.

Pulling In Variants File for SNP: rs2981579
trying ncbi as 1000 genomes server
OK using ncbi : ftp://ftp-trace.ncbi.nih.gov/1000genomes/
[get_local_version] downloading the index file...
subsetting Variants File for racial/ethnic group: AFR
time: 12:02:42 Risk SNP: rs2981579, AFR (2/56)

(1/13) Enhancer_HMEC_CTCF_hg19.bed
Error in $<-.data.frame(*tmp*, "loc.feature", value = c("chr10:123813382-123814049", :
replacement has 177 rows, data has 173
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Comments.

Need to include more comments for documentations

Implement: Use HOMER to annotate surrogate SNPs

I was thinking, we should include a pipeline to feed in the newly identified SNPs from FunciSNP into HOMER so we can annotate it's location and distance to interesting genes and other features. I'm currently doing it ad-hoc...you'll have to study it carefully (i ran it three different times and I'm merging...).

After running FuniSNP, I pooled the data from p0, p1, p2
dat <- read.delim(file="p0/FunciSNP_tables/risk.snp.p0.txt",sep="\t")
tt <- read.delim(file="p1/risk.snp.p1.txt",sep="\t")
ts <- read.delim(file="p2/risk.snp.p2.txt",sep="\t")
dat <- rbind(dat, tt, ts)

Then I want to annotate the peaks overlapping the SNPs with Rsquared>0.5
d <- unique(dat[,"loc.feature"])
d <- as.data.frame(d)
d$name <- d[,1]
write.table(d, file="loc_feature_rquared_0.5.txt",sep="\t",row.names=F,quote=F, col.names=F)

Then using shell, i replaced ':' & "-" with tabs (sed s/:/\t/ file > tmp)
Then using bed2pos.pl, I converted this to pos file so homer would work.
Then ran homer:
annotatePeaks.pl loc_feature_rquared_0.5.pos hg19 loc_feature_rquared_0.5_outputHomer.txt

Then, reload the dat file above
dat <- read.delim(file="p0/FunciSNP_tables/risk.snp.p0.txt",sep="\t")
tt <- read.delim(file="p1/risk.snp.p1.txt",sep="\t")
ts <- read.delim(file="p2/risk.snp.p2.txt",sep="\t")
dat <- rbind(dat, tt, ts)

Load HOMER output
d <- read.delim(file="loc_feature_rquared_0.5.outputHomer.txt",sep="\t")

Merge
dat.m <- merge(dat, d, by.x="loc.feature", by.y="PeakID")