jrflab / modules
MSKCC Reis-Filho Lab pipeline thingy
Home Page: https://www.mskcc.org/research-areas/labs/jorge-reis-filho/overview
transcripts with > 1 snv in a vcf are missing annotations: only first snv is annotated
For the pies, please take out the grey background, axes and labels.
For the bars, the panels should be on a single row instead of over 2 rows, take out the grey background, larger font on the y-axis and the top title bars.
Thanks.
likewise, mm10 is actually GRCm38
Usually we want only the headers:
TUMOR_SAMPLE GENE AA EFFECT TUMOR_MAF NORMAL_MAF CHASM Uterus Mut Tastor FATHMM Cancer Gene Census Kandoth Lawrence haploinsufficiency Cancer Cell Fraction ABSOLUTE IMPACT
It would be nice to have one sheet that only has those headers (summary view) and one with all the data for both high_moderate and low_modifier. That is four sheets in total.
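A minimal sketch of the four-sheet layout, assuming the inputs are tab-delimited tables. The names `SUMMARY_COLUMNS`, `summary_view` and `plan_sheets` are hypothetical, the column list is only the unambiguous subset of the headers above, and the demo row is made up for illustration. Writing the result to a workbook is left out.

```python
import csv
import io

# Assumed subset of the summary headers (extend with the remaining ones).
SUMMARY_COLUMNS = ["TUMOR_SAMPLE", "GENE", "AA", "EFFECT", "TUMOR_MAF", "NORMAL_MAF"]


def read_table(handle):
    """Read a tab-delimited table into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def summary_view(rows, columns=SUMMARY_COLUMNS):
    """Keep only the summary columns, in order; absent columns become ''."""
    return [{c: row.get(c, "") for c in columns} for row in rows]


def plan_sheets(tables):
    """Map sheet name -> rows: one summary and one full sheet per table."""
    sheets = {}
    for name, rows in tables.items():
        sheets[name + "_summary"] = summary_view(rows)
        sheets[name + "_all"] = rows
    return sheets


# Made-up demo table with one extra column that the summary should drop.
demo = (
    "TUMOR_SAMPLE\tGENE\tAA\tEFFECT\tTUMOR_MAF\tNORMAL_MAF\tEXTRA\n"
    "T1\tTP53\tp.R175H\tMISSENSE\t0.42\t0.0\tx\n"
)
tables = {
    "high_moderate": read_table(io.StringIO(demo)),
    "low_modifier": read_table(io.StringIO(demo)),
}
sheets = plan_sheets(tables)
print(sorted(sheets))
```

Keeping the plan as a plain dict means the same structure can be handed to whichever Excel writer the script already uses.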
defuse results do not match oncofuse results in the oncofuse.txt tables
They seem to contain the same values, so it is not super important. Would be good to fix though.
The bed files have been made for the annotations. They just need to be added to the vcf tools file
modules/clonality/absoluteSeq.mk: genes <- c(snvs[['EFF....GENE']], indels[['EFF....GENE']])
After each run, we check whether or not the file size is the same on each node. This however requires the main process to ssh to every node. We should therefore check before running the pipeline that this is possible.
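A pre-flight check along these lines could be sketched as follows. This is only an illustration: `node_reachable` and `unreachable_nodes` are hypothetical names, and the node list is assumed to come from the pipeline configuration. `BatchMode=yes` makes ssh fail instead of prompting for a password, so the check cannot hang on input.

```python
import subprocess


def node_reachable(node, run=subprocess.run, timeout=10):
    """Return True if a non-interactive ssh to `node` can run `true`."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", node, "true"]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection took too long
        return False
    return result.returncode == 0


def unreachable_nodes(nodes, run=subprocess.run):
    """Return the nodes that cannot be reached, preserving input order."""
    return [n for n in nodes if not node_reachable(n, run=run)]
```

The pipeline could call `unreachable_nodes` once at startup and abort with the list of failing nodes if it is non-empty. The `run` parameter exists only so the check can be exercised without a real cluster.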
due to the strand lookup on the ensembl database
This is the output:
[ngk1@saba2 log]$ pwd
/home/ngk1/share/projects/nendo_amyo_btseq/log
[ngk1@saba2 log]$ less mutation_summary.2015-08-05.25.log
make[1]: Entering directory '/ifs/e63data/reis-filho/projects/nendo_amyo_btseq'
mkdir -p -m 775 log/mutation_summary.2015-08-05.25/excel excel; umask 002; set -o pipefail; source /home/debruiji/share/usr/anaconda-envs/anaconda-2.7/bin/activate /home/debruiji/share/usr/anaconda-envs/anaconda-2.7; \
python modules/scripts/mutation_summary_excel.py alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.high_moderate.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.low_modifier.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.synonymous.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.nonsynonymous.txt alltables/allTN.strelka_varscan_indels.tabigh_moderate.txt alltables/allTN.strelka_varscan_indels.tab.low_modifier.txt alltables/allTN.strelka_varscan_indels.tab.synonymous.txt alltables/allTN.strelka_varscan_indels.tab.nonsynonymous.txt excel/mutation_summary.xlsx
discarding /home/debruiji/anaconda/bin from PATH
prepending /home/debruiji/share/usr/anaconda-envs/anaconda-2.7/bin to PATH
/home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/hashtable.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
Traceback (most recent call last):
File "modules/scripts/mutation_summary_excel.py", line 6, in <module>
import pandas as pd
File "/home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/__init__.py", line 6, in <module>
from . import hashtable, tslib, lib
ImportError: /home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/hashtable.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
modules/excel/mutationSummary.mk:24: recipe for target 'excel/mutation_summary.xlsx' failed
make[1]: *** [excel/mutation_summary.xlsx] Error 1
make[1]: Leaving directory '/ifs/e63data/reis-filho/projects/nendo_amyo_btseq'
#### turn segmented copy number data to gene-based copy number with findOverlaps
## define HomDel as TCN=0, loss as TCN<ploidy, gain as TCN>ploidy, amp as TCN>=ploidy+4
## where ploidy= mode of TCN
### some variant of the below, also need one for the breast panel, IMPACT310 and exome
library(GenomicRanges)  # provides GRanges, IRanges, findOverlaps

genes <- read.delim("/home/ngk1/share/reference/IMPACT410_genes_for_copynumber.txt", as.is=T)
genesGR <- GRanges(seqnames=genes$chromosome,
    ranges=IRanges(as.numeric(genes$start_position), as.numeric(genes$end_position)),
    mcols=genes[,c("order", "Cyt", "hgnc_symbol")])
facets_files <- dir("facets", pattern="txt", full=T)
# one column of gain/loss calls per facets file
mm <- do.call("cbind", lapply(facets_files, function(f) {
    tab <- read.delim(f, as.is=T)
    tab$chrom[which(tab$chrom==23)] <- "X"
    tabGR <- GRanges(seqnames=tab$chrom,
        ranges=IRanges(as.numeric(tab$loc.start), as.numeric(tab$loc.end)),
        mcols=tab[,-c(1:4)])
    # overlap segments (query) with genes (subject)
    fo <- findOverlaps(tabGR, genesGR)
    rr <- ranges(fo, ranges(tabGR), ranges(genesGR))
    df <- cbind(as.data.frame(fo), as.data.frame(rr))
    df <- cbind(df, mcols(genesGR)[df$subjectHits,], mcols(tabGR)[df$queryHits,])
    # when genes span multiple segments, keep the segment with the largest |cnlr.median|
    # (409 is hard-coded here, presumably the number of rows in the gene table)
    oo <- tapply(df$mcols.cnlr.median, df$subjectHits, function(x){which.max(abs(x))})
    oo <- oo[match(1:409, names(oo))]
    oo[which(is.na(oo))] <- 1
    df <- df[unlist(lapply(1:409, function(x) { which(df$mcols.order==x)[oo[which(names(oo)==x)]]})),]
    # ploidy = mode of TCN
    ploidy <- table(df$mcols.tcn)
    ploidy <- as.numeric(names(ploidy)[which.max(ploidy)])
    # -2 HomDel, -1 loss, 0 neutral, 1 gain, 2 amp
    df$GL <- 0
    df$GL[which(df$mcols.tcn<ploidy)] <- -1
    df$GL[which(df$mcols.tcn==0)] <- -2
    df$GL[which(df$mcols.tcn>ploidy)] <- 1
    df$GL[which(df$mcols.tcn>=ploidy+4)] <- 2
    df <- df[match(genes$order, df$mcols.order),]
    df$GL
}))
colnames(mm) <- facets_files
mm <- cbind(genes, mm)
write.table(mm, file="GL.txt", sep="\t", row.names=F, na="", quote=F)
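The calling rule in the snippet above (HomDel, loss, gain, amp relative to the modal TCN) can be checked in isolation with a small sketch like this one. `modal_ploidy` and `call_gl` are hypothetical names, the tie-breaking rule for the mode (smallest value wins) is an assumption, and the TCN values are made up.

```python
from collections import Counter


def modal_ploidy(tcn_values):
    """Most frequent total copy number; ties go to the smallest value (assumption)."""
    counts = Counter(tcn_values)
    best = max(counts.values())
    return min(v for v, n in counts.items() if n == best)


def call_gl(tcn, ploidy):
    """Return -2 HomDel, -1 loss, 0 neutral, 1 gain, 2 amp for one gene."""
    if tcn == 0:
        return -2
    if tcn >= ploidy + 4:
        return 2
    if tcn > ploidy:
        return 1
    if tcn < ploidy:
        return -1
    return 0


# Made-up per-gene TCN values for one sample.
tcn = [2, 2, 2, 1, 0, 3, 6]
ploidy = modal_ploidy(tcn)
calls = [call_gl(t, ploidy) for t in tcn]
print(ploidy, calls)
```

Testing the HomDel and amp conditions first gives the same result as the sequential overwrites in the R version, where the later assignments take precedence.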
We need snpEff annotation (with mouse genome), but don't need dbNSFP, CHASM, FATHMM and the rest. Something choked (I think CHASM) because of the different chromosome sizes.
Please. This column is very useful for troubleshooting.
The raw file sheets are created from a groupby object. When that object is
written to an excel sheet, cells with the same value are merged. It looks nice
visually in the sheet but it makes the sheet unparseable, so for the raw data
sheets it is probably better to turn the groupby object into a regular dataframe
before writing to excel.
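A minimal pandas sketch of that conversion, assuming the grouped result is indexed by the group keys (a MultiIndex), which is what produces the merged cells on export. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up raw data with repeated group keys.
df = pd.DataFrame({
    "TUMOR_SAMPLE": ["T1", "T1", "T2"],
    "GENE": ["TP53", "TP53", "KRAS"],
    "count": [1, 2, 3],
})

# Aggregating a groupby yields a result indexed by the group keys.
grouped = df.groupby(["TUMOR_SAMPLE", "GENE"])["count"].sum()

# reset_index() moves the keys back into ordinary columns, one value per row.
flat = grouped.reset_index()
print(flat)

# The flat frame can then be written without its index, e.g.
# flat.to_excel("raw.xlsx", index=False)   # not executed here
```

An alternative that keeps the index is passing `merge_cells=False` to `to_excel`, which writes the repeated key values instead of merging them.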
We should create some simple test data sets that can show that all modules are working. Preferably a very tiny one that runs only a couple of minutes.
oncofuse results are not matched correctly to the defuse results. see /home/ngk1/share/data/pancreas_sem/rnaseq/defuse/alltables/
Strelka names the tumor and normal samples in the vcf as TUMOR and NORMAL, so they are not picked up correctly by VariantFiltration. In addition, this step was not supposed to be run in the first place (a string with a space != empty string).
It would be nice to have TUMOR_MAF and NORMAL_MAF in the alltables so we don't have to compute them every time during post processing.
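For reference, the per-sample value could be computed as below. This is a sketch under the assumption that MAF here means the mutant allele fraction, alt reads / (ref reads + alt reads), taken from per-sample allelic depths. The function name `maf` and the depth values are hypothetical.

```python
def maf(ref_depth, alt_depth):
    """Mutant allele fraction from allelic depths; None when there is no coverage."""
    total = ref_depth + alt_depth
    if total == 0:
        return None
    return alt_depth / total


# Made-up depths for one variant in a tumor/normal pair.
tumor_maf = maf(ref_depth=30, alt_depth=10)
normal_maf = maf(ref_depth=40, alt_depth=0)
print(tumor_maf, normal_maf)
```

Returning `None` for zero coverage keeps uncovered positions distinguishable from a true fraction of 0.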
In addition to what's already there, please add the following to the report. This is basically trying to solve the signatures using the NMF method.
require(NMF)
require(lsa)
require(reshape2)
require(plyr)
require(ggplot2)
require(gplots)
alexandrov <- read.delim(opt$alexandrovData)
rownames(alexandrov) <- with(alexandrov, gsub(">", ".", paste(Trinucleotide, Substitution.Type, sep=".")))
alexandrov.matrix <- data.matrix(alexandrov[,4:ncol(alexandrov)])
solveNMF <- function(x, inmatrix){
    coef <- fcnnls(x, inmatrix[rownames(x),]) # reorder the rownames of the in matrix
    colsum <- apply(coef$x, 2, sum)
    coef_x_scaled <- scale(coef$x, center=F, scale=colsum)
    return(coef_x_scaled)
}
### this spectra_mat should be similar to the "X" that goes into plotMutBarplot, example attached.
mutcounts.nmf <- solveNMF(alexandrov.matrix, spectra_mat)
# then perhaps a heatmap or something showing the results of mutcounts.nmf, also write it out to a text file.
need to ensure that all the nodes in the jrf cluster have access to the complete file before processing can continue
Currently we only support running CHASM with one classifier e.g. Breast. It would be nice to support multiple, maybe by supplying a space separated list of classifiers in the Makefile
perhaps create a top-level variant caller makefile that includes several callers and merges them
For the absolute vast majority of projects, only the novel mutations are reported (GMAF>0.05). mutsig_report is using the everything.vcf. To be consistent, this should filter the input mutations to everything.novel.vcf.
certain lines contain what should be two lines
this would allow variables to be set by default without being set at the command line during debugging