labrazil / coetzee_seq_analysis Goto Github PK
View Code? Open in Web Editor NEWCoetzee Lab Data Analysis
Home Page: http://coetzeeseq.usc.edu
Coetzee Lab Data Analysis
Home Page: http://coetzeeseq.usc.edu
New URL: http://coetzeeseq.usc.edu/publication/Coetzee_SG_et_al_2012/ Abstract/Summary: Single nucleotide polymorphisms (SNPs) are increasingly used to tag genetic loci associated with phenotypes such as risk of complex diseases. Technically, this is done genome-wide without prior restriction or knowledge of biological feasibility in scans referred to as genome-wide association studies (GWAS). Depending on the linkage-disequilibrium (LD) structure at a particular locus, such tagSNPs may be surrogates for many thousands of other SNPs, and it is difficult to distinguish those that may play a functional role in the phenotype from those simply genetically linked. Because a large proportion of tagSNPs have been identified within non-coding regions of the genome, distinguishing functional from non-functional SNPs has been an even greater challenge. A strategy was recently proposed that prioritizes surrogate SNPs based on non-coding chromatin and epigenomic mapping techniques that have become feasible with the advent of massively parallel sequencing. Here we introduce an R/Bioconductor software package that enables the identification of candidate functional SNPs by integrating information from tagSNP locations, lists of linked SNPs from the 1000 genomes project, and locations of chromatin features which may have functional significance. Availability: FunciSNP is available from Bioconductor (bioconductor.org). ## May 17, 2012 - Repo made public. ## May 13, 2012 - FunciSNP manuscript accepted in Nucleic Acids Research. doi: 10.1093/nar/gks542 ## Feb 3, 2012 - Reorganized folder project. Added new versions of FunciSNP. new version of TSS.human.GRCh37 from genomebrowser on 2012-03-26 filtered only rows with "hg19.knonwCanonical.transcript" != "na" no duplicated ensembl id's allowed no "na" ensembl id's allowed ## September 27, 2011 This project contains all the current working in house scripts for the Coetzee Lab. ## Each folder represents one in house script. View README within is folder for more information.
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K27Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me1_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me2_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K9Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615
rs10087810, candidate SNP and risk tag SNP rs13281615.
For first row, LOD is 96.14, second row's LOD is 97.19.
For first row, r.2 is 0.873, second row's r.2 is 0.874.
This kind of issue was arisen last time, but I see more recently both in LOD and r.2
Perhaps we should think about doing some of the R work in bash, I'll have to look more closely
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K27Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me1_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K4me2_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 96.14 0.873 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_H3K9Ac_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs13281615 0.955 97.19 0.874 0.91 0.98 2888 - rs13281615 Enhancer_HMEC_HMMEnhancer_hg19 8:127855617-128855618 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K27Ac_hg19 8:127887851-128887852 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K4me2_hg19 8:127887851-128887852 EUR EUR
rs10087810 rs1562430 0.913 45.62 0.504 0.84 0.96 35122 - rs1562430 Enhancer_HMEC_H3K9Ac_hg19 8:127887851-128887852 EUR EUR
rs10087810, candidate SNP is within biofeature of HMMEnhancer and H3K4me1.
However, these two biofeatures were only detected in rs13281615 risk tag SNP results, not in rs1562430 risk tag SNP result.
We need to check bed format before running script. When I feed 64 bed files, it didn't output an error for the 60th bed file which didn't have the required column numbers.
There have been substantial changes to the layout and function of this script
start command with -h option to see how to set up a run. That information
has been copied here:
COMMAND LINE OPTIONS FOR FuncSNPi:
-c : max cpus availible (ex. -c 8)
-g : genome build to use (ex. -g hg19)
-r : location of risk SNP file (ex. -r ~/riskSNP.txt)
-b : location of of biological peaks file, in bed format (ex. -b ~/biopeaks.bed)
riskSNP file must be in the following tab deliminated format
<region> <risk SNP id> <ethnic group>
10:122837334-123837335 rs2981579 EUR <- example
to kill the file when it is running in parallel, find the PID by running
ps aux | grep | grep
and then run
kill
If you just kill the subroutines that we call like java etc, the program will continue running.
Rsquared values change depending upon biological features:
I analyzed BCa candidate functional SNPs results from FuncSNPi, and realized that there are some errors (?).
For instance,
r^2 value of rs999737 and rs79071597 do not match in spite of the fact that same ethnicity groups were used.
The only difference is a biofeature, so theoretically, r^2, D, LOD values from haploview data should not be changed.
Please check the script to make it 100% correct!
L1 L2 D. LOD r.2 CIlow CIhi Dist T.int tagSNP bioFeature location ethnicity genome1000 TagTRUE1 TagTRUE2
rs999737 rs79071597 0.908 50.97 0.639 0.84 0.95 9872 62.05 rs999737 Galaxy13-[MDAMB231-vs-Input_triangel_subpeaks 14:68534681-69534682 EUR EUR TRUE FALSE
rs999737 rs79071597 0.91 51.96 0.644 0.84 0.95 9872 64.26 rs999737 Galaxy9-[HMEC-vs-Input_triangle_subpeaks 14:68534681-69534682 EUR EUR TRUE FALSE
Currently a handful of arguments are needed. I think this can be shorten down. For example, instead of listing path to to files and then name of files, we can merge into one and then in the script we parse it.
I just realized a situation where calculating the size will be a problem if we base it on the first column. If a tagSNP is near the edge of the chromosome, and we intended to get 1MB, but because the tagSNP is less than 1MB from the start or end of the chromosome, then the calculated value $SIZE will be different. We need to figure out another way to represent this or just remove this feature for now...
Currently we are using the downloaded 1000genomes db but when we publish, we will have to use the ftp site. We need to implement a check. Because there are two mirror sites, it will be important to run the analysis and check which site is up and depending on the users location, direct the link to either NCBI or Ensembl.
FuncySNP(snp.regions.file="BCa_risk_tagSNP_rs13281615_Feb_24_2012", reduce.by.features=FALSE, par.threads=1, bio.features.TSS=FALSE, search.window=1000000)
----> made an error
canning tabix file for rs13281615
creating snpMatrix for rs13281615
Error in IRanges::unlist(GRangesList(lapply(as.character(bio.features.file), :
error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in IRanges:::newCompressedList("GRangesList", unlistData, end = end, :
'unlistData' not of class GRanges
R » bca<- FuncySNP(snp.regions.file="BCa_risk_tagSNP_rs13281615_Feb_24_2012", reduce.by.features=FALSE, bio.features.loc="bed_files/", par.threads=1, bio.features.TSS=FALSE, search.window=1000000)
-----> made an error
Error in is.data.frame(x) : could not find function "ALL.geno"
=====> reduce.by.feautres=FALSE tool does not seem to work.
Using the new version of FunciSNP (R), I would like to make a LD block and check the haplotype.
Could you include this function in the new FunciSNP R package?
Since Haploview takes a long time to generate svg, we need to first determine the selected regions by a cutoff and then only plot those. More reason why we need to build in R.
Currently the script has a section for each ethnic group (eur, afr, asn, all). Need to clean up and merge into one loop.
Currently we are taking only the overlapping SNP and not the biological peaks that overlap SNP. When we identify SNP that pass a r.2 value, we should also retain information about the associated peak. This way we can classify and characterize all biological features with the associated tagSNP and it's surrogate. We will need to know the distance to the nearest gene, if it's intergenic, gene body, etc. (basically run annotatePeaks on them). Then we can plot the values. Another reason we need to create this in R.
I am running the newest version of FunciSNP(source the .R file from gitfolder), and error occurred... T,T
The new FunciSNP folder is here,
/media/bigboy/suhn/New_FunciSNP/
bed files are in the sub-folder, bed_files/
/media/bigboy/suhn/New_FunciSNP/bed_files
Basically, error occurred after scanning all biofeatures in all ethnic groups for the first risk tag SNP.
I guess when the second risk tag SNP, rs2981579 is used to scan with first biofeature, Enhancer_HMEC_CTCF_hg19.bed, the error seems to occur due to data lines?
This is the error message.
Pulling In Variants File for SNP: rs2981579
trying ncbi as 1000 genomes server
OK using ncbi : ftp://ftp-trace.ncbi.nih.gov/1000genomes/
[get_local_version] downloading the index file...
subsetting Variants File for racial/ethnic group: AFR
time: 12:02:42 Risk SNP: rs2981579, AFR (2/56)
(1/13) Enhancer_HMEC_CTCF_hg19.bed
Error in$<-.data.frame
(*tmp*
, "loc.feature", value = c("chr10:123813382-123814049", :
replacement has 177 rows, data has 173
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Need to include more comments for documentations
I was thinking, we should include a pipeline to feed in the newly identified SNPs from FunciSNP into HOMER so we can annotate it's location and distance to interesting genes and other features. I'm currently doing it ad-hoc...you'll have to study it carefully (i ran it three different times and I'm merging...).
After running FuniSNP, I pooled the data from p0, p1, p2
dat <- read.delim(file="p0/FunciSNP_tables/risk.snp.p0.txt",sep="\t")
tt <- read.delim(file="p1/risk.snp.p1.txt",sep="\t")
ts <- read.delim(file="p2/risk.snp.p2.txt",sep="\t")
dat <- rbind(dat, tt, ts)
Then I want to annotate the peaks overlapping the SNPs with Rsquared>0.5
d <- unique(dat[,"loc.feature"])
d <- as.data.frame(d)
d$name <- d[,1]
write.table(d, file="loc_feature_rquared_0.5.txt",sep="\t",row.names=F,quote=F, col.names=F)
Then using shell, i replaced ':' & "-" with tabs (sed s/:/\t/ file > tmp)
Then using bed2pos.pl, I converted this to pos file so homer would work.
Then ran homer:
annotatePeaks.pl loc_feature_rquared_0.5.pos hg19 loc_feature_rquared_0.5_outputHomer.txt
Then, reload the dat file above
dat <- read.delim(file="p0/FunciSNP_tables/risk.snp.p0.txt",sep="\t")
tt <- read.delim(file="p1/risk.snp.p1.txt",sep="\t")
ts <- read.delim(file="p2/risk.snp.p2.txt",sep="\t")
dat <- rbind(dat, tt, ts)
Load HOMER output
d <- read.delim(file="loc_feature_rquared_0.5.outputHomer.txt",sep="\t")
Merge
dat.m <- merge(dat, d, by.x="loc.feature", by.y="PeakID")
We should have the script generate a log file, similar to what HOMER does so we know which command we are running each time.
Currently a handful of arguments are needed. I think this can be shorten down. For example, instead of listing path to to files and then name of files, we can merge into one and then in the script we parse it.
When users input bed files for biological data, we need to implement a test to confirm. It must contain 4 columns only. Chrom, start, end, name. Otherwise overlap feature won't work
I think one of the lagging steps is overlapping biofeatures with SNPs. I think it can speed up if we sort the bed files before doing the overlap.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.