jrflab / modules

MSKCC Reis-Filho Lab pipeline thingy

Home Page: https://www.mskcc.org/research-areas/labs/jorge-reis-filho/overview

Makefile 17.22% R 18.52% Perl 2.24% Python 9.63% Shell 0.25% PHP 0.03% Jupyter Notebook 51.31% HTML 0.78% C++ 0.03%

modules's People

bioinfowangm inodb merckey hxrts hrk2109 ipstone sasi-arunachalam selenicp starfish001 haoxianglin giovianco shicheng-guo

modules's Issues

Add number of mutations to pie chart in mutsig_report

qsub.pl doesn't check exit codes before checking for non-zero/sync'd file size
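The check order this issue asks for can be sketched as follows. This is only an illustration in Python, not the logic of qsub.pl itself; the function name `job_succeeded` is hypothetical. The point is that the exit code is tested before the file size, because a failed job can still leave a non-empty partial file.

```python
import os


def job_succeeded(exit_code, output_path):
    """Return True only if the job exited cleanly and left a non-empty output file."""
    # Test the exit code first: a failed job may still have written partial output.
    if exit_code != 0:
        return False
    # Only a clean exit makes the file-size check meaningful.
    return os.path.isfile(output_path) and os.path.getsize(output_path) > 0
```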

fathmm annotation bug

transcripts with > 1 snv in a vcf are missing annotations: only first snv is annotated

mutsig_report: run on high/moderate/low

integrate_rnaseq: empty result files cause error in oncofuse

qsubClient fails to connect occasionally: should retry several times
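A retry loop of the kind requested might look like the sketch below. It is written in Python for illustration only; `connect_with_retry`, its parameters and the linear back-off are assumptions, not part of qsubClient.

```python
import time


def connect_with_retry(connect, attempts=5, delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as error:  # connection errors are subclasses of OSError
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    # Every attempt failed: surface the last connection error to the caller.
    raise last_error
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.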

facets: option for two critical values to better estimate diploid log ratio

For the pies, please take out the grey background, axes and labels.
For the bars, please put the panels on a single row instead of two, take out the grey background, and use a larger font for the y-axis and the top title bars.
Thanks.

missing X chromosome from non-targeted sequencing

merge modules and scripts

hg19 should actually be b37 (or GRCh37)

likewise, mm10 is actually GRCm38

Output excel version of alltables

Usually we want only the headers:

TUMOR_SAMPLE    GENE    AA  EFFECT  TUMOR_MAF   NORMAL_MAF  CHASM  Uterus   Mut Tastor  FATHMM  Cancer Gene Census  Kandoth Lawrence    haploinsufficiency  Cancer Cell Fraction ABSOLUTE   IMPACT

It would be nice to have one sheet that only has those headers (summary view) and one with all the data for both high_moderate and low_modifier. That is four sheets in total.
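The four-sheet layout described above can be sketched like this. It is a standard-library illustration only: the sheet names, the function names and the shortened `SUMMARY_COLUMNS` list are assumptions, and writing the actual workbook is left to whatever Excel writer the pipeline already uses.

```python
# Shortened, illustrative subset of the summary headers quoted in the issue.
SUMMARY_COLUMNS = ["TUMOR_SAMPLE", "GENE", "AA", "EFFECT", "TUMOR_MAF", "NORMAL_MAF"]


def summary_view(rows, columns=SUMMARY_COLUMNS):
    """Keep only the summary columns, in order; absent columns become empty strings."""
    return [{column: row.get(column, "") for column in columns} for row in rows]


def plan_sheets(tables):
    """Map each table (e.g. 'high_moderate') to a summary sheet and a full sheet."""
    sheets = {}
    for name, rows in tables.items():
        sheets[name + "_summary"] = summary_view(rows)  # headers-only view
        sheets[name + "_all"] = rows                    # every column, untouched
    return sheets
```

With `high_moderate` and `low_modifier` as input tables this yields the four sheets asked for.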

defuse oncofuse results not merging properly

defuse results do not match oncofuse results in the oncofuse.txt tables

USE_CLUSTER=false doesn't work

slackbot not working for some users

varscan_indels in tables/ has duplicate REF columns

They seem to contain the same values, so it is not super important. It would be good to fix though.

cosmic IDs overwrite dbsnp IDs
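One way to avoid the overwrite is to append rather than replace, since the VCF ID column may hold several identifiers separated by semicolons, with "." meaning missing. The helper below is a hypothetical sketch of that merge, not the pipeline's annotation code.

```python
def merge_vcf_ids(existing, new):
    """Combine two VCF ID values, keeping order and dropping duplicates and '.'."""
    merged = []
    for value in (existing, new):
        for identifier in value.split(";"):
            # Skip empty pieces and the '.' placeholder used for a missing ID.
            if identifier and identifier != "." and identifier not in merged:
                merged.append(identifier)
    return ";".join(merged) if merged else "."
```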

Add gene annotations for Kandoth, Lawrence, Cancer Gene Census

The bed files for the annotations have been made. They just need to be added to the vcf tools file.

Absolute likely fails with new ANN field

modules/clonality/absoluteSeq.mk:       genes <- c(snvs[['EFF....GENE']], indels[['EFF....GENE']])

Check if one can ssh to all nodes in cluster

After each run, we check whether or not the file size is the same on each node. This however requires the main process to ssh to every node. We should therefore check before running the pipeline that this is possible.
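A pre-flight check along these lines could be sketched as below. It is an illustration in Python, not pipeline code: `unreachable_nodes` is a hypothetical name and the node list is assumed to come from the cluster configuration. `BatchMode=yes` makes ssh fail rather than prompt for a password, so the check cannot hang on input.

```python
import subprocess


def unreachable_nodes(nodes, run=subprocess.run, timeout=10):
    """Return the nodes on which a trivial ssh command does not succeed."""
    failed = []
    for node in nodes:
        command = [
            "ssh",
            "-o", "BatchMode=yes",                # never prompt for a password
            "-o", f"ConnectTimeout={timeout}",    # give up on unresponsive hosts
            node,
            "true",                               # no-op remote command
        ]
        try:
            result = run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            reachable = result.returncode == 0
        except OSError:  # for example, the ssh binary is not installed
            reachable = False
        if not reachable:
            failed.append(node)
    return failed
```

The pipeline would refuse to start if the returned list is not empty.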

integrate oncofuse R script gives duplicate rows

This is due to the strand lookup on the Ensembl database.

mutsig_report: normalised count pie charts

fpfilter for varscan output

https://github.com/ckandoth/variant-filter

mutation_summary's environment is not working for other users

This is the output:

[ngk1@saba2 log]$ pwd
/home/ngk1/share/projects/nendo_amyo_btseq/log
[ngk1@saba2 log]$ less mutation_summary.2015-08-05.25.log
make[1]: Entering directory '/ifs/e63data/reis-filho/projects/nendo_amyo_btseq'
mkdir -p -m 775 log/mutation_summary.2015-08-05.25/excel excel; umask 002; set -o pipefail;  source /home/debruiji/share/usr/anaconda-envs/anaconda-2.7/bin/activate /home/debruiji/share/usr/anaconda-envs/anaconda-2.7; \
python modules/scripts/mutation_summary_excel.py alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.high_moderate.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.low_modifier.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.synonymous.txt alltables/allTN.mutect.dp_ft.som_ad_ft.target_ft.pass.dbsnp.cosmic.nsfp.eff.gene_ann.cn_reg.chasm.fathmm.tab.nonsynonymous.txt alltables/allTN.strelka_varscan_indels.tab.high_moderate.txt alltables/allTN.strelka_varscan_indels.tab.low_modifier.txt alltables/allTN.strelka_varscan_indels.tab.synonymous.txt alltables/allTN.strelka_varscan_indels.tab.nonsynonymous.txt excel/mutation_summary.xlsx
discarding /home/debruiji/anaconda/bin from PATH
prepending /home/debruiji/share/usr/anaconda-envs/anaconda-2.7/bin to PATH
/home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/hashtable.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
Traceback (most recent call last):
  File "modules/scripts/mutation_summary_excel.py", line 6, in <module>
    import pandas as pd
  File "/home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
ImportError: /home/ngk1/share/usr/lib/python/pandas-0.12.0-py2.7-linux-x86_64.egg/pandas/hashtable.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
modules/excel/mutationSummary.mk:24: recipe for target 'excel/mutation_summary.xlsx' failed
make[1]: *** [excel/mutation_summary.xlsx] Error 1
make[1]: Leaving directory '/ifs/e63data/reis-filho/projects/nendo_amyo_btseq'
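In the log above, pandas is imported from /home/ngk1/share/usr/lib/python even though the activated environment lives under /home/debruiji/share/usr/anaconda-envs/anaconda-2.7. The undefined PyUnicodeUCS2 symbol means the extension was built for a Python with a different unicode width than the interpreter loading it, which suggests that a user-level search path (possibly PYTHONPATH, though the log does not confirm this) is shadowing the environment. A small diagnostic for spotting such entries could look like this; `foreign_paths` is a hypothetical helper, and the second path in the usage note is illustrative.

```python
import os


def foreign_paths(paths, prefix):
    """Return the search-path entries that do not live under the environment prefix."""
    prefix = os.path.abspath(prefix)
    foreign = []
    for entry in paths:
        if not entry:
            continue  # skip empty entries, e.g. from an unset PYTHONPATH
        resolved = os.path.abspath(entry)
        inside = resolved == prefix or resolved.startswith(prefix + os.sep)
        if not inside:
            foreign.append(entry)
    return foreign
```

For example, `foreign_paths(os.environ.get("PYTHONPATH", "").split(os.pathsep), sys.prefix)` lists the PYTHONPATH entries that point outside the running environment (this call also needs `import sys`).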

Summarise FACETS results to gene-level results

#### Turn segmented copy number data into gene-based copy number with findOverlaps
## Define HomDel as TCN=0, loss as TCN<ploidy, gain as TCN>ploidy, amp as TCN>=ploidy+4,
## where ploidy = mode of TCN
### Some variant of the below; we also need one for the breast panel, IMPACT310 and exome

library(GenomicRanges)  # provides GRanges, IRanges, findOverlaps and mcols

genes <- read.delim("/home/ngk1/share/reference/IMPACT410_genes_for_copynumber.txt", as.is=TRUE)
# replaces the hard-coded 409; assumes genes$order runs 1..n_genes in row order
n_genes <- nrow(genes)

# columns passed via mcols= are stored with an "mcols." prefix (mcols.order, mcols.Cyt, ...)
genesGR <- GRanges(seqnames=genes$chromosome,
        ranges=IRanges(as.numeric(genes$start_position), as.numeric(genes$end_position)),
        mcols=genes[,c("order", "Cyt", "hgnc_symbol")])

facets_files <- dir("facets", pattern="txt", full.names=TRUE)

# one column of gene-level calls per FACETS file
mm <- do.call("cbind", lapply(facets_files, function(f) {
    tab <- read.delim(f, as.is=TRUE)
    # FACETS codes chromosome X as 23
    tab$chrom[which(tab$chrom==23)] <- "X"

    tabGR <- GRanges(seqnames=tab$chrom,
        ranges=IRanges(as.numeric(tab$loc.start), as.numeric(tab$loc.end)),
        mcols=tab[,-c(1:4)])

    # pair every segment (query) with every gene (subject) it overlaps
    fo <- findOverlaps(tabGR, genesGR)
    rr <- ranges(fo, ranges(tabGR), ranges(genesGR))
    df <- cbind(as.data.frame(fo), as.data.frame(rr))

    df <- cbind(df, mcols(genesGR)[df$subjectHits,], mcols(tabGR)[df$queryHits,])

    # when a gene spans multiple segments, keep the segment with the largest absolute log ratio
    oo <- tapply(df$mcols.cnlr.median, df$subjectHits, function(x){which.max(abs(x))})
    oo <- oo[match(seq_len(n_genes), names(oo))]
    oo[which(is.na(oo))] <- 1

    df <- df[unlist(lapply(seq_len(n_genes), function(x) { which(df$mcols.order==x)[oo[which(names(oo)==x)]]})),]

    # ploidy = most frequent total copy number across genes
    ploidy <- table(df$mcols.tcn)
    ploidy <- as.numeric(names(ploidy)[which.max(ploidy)])

    # -2 HomDel, -1 loss, 0 neutral, 1 gain, 2 amp; later assignments override earlier ones
    df$GL <- 0
    df$GL[which(df$mcols.tcn<ploidy)] <- -1
    df$GL[which(df$mcols.tcn==0)] <- -2
    df$GL[which(df$mcols.tcn>ploidy)] <- 1
    df$GL[which(df$mcols.tcn>=ploidy+4)] <- 2

    # back to the gene order of the input table; genes without a segment become NA
    df <- df[match(genes$order, df$mcols.order),]
    df$GL
}))
colnames(mm) <- facets_files
mm <- cbind(genes, mm)
write.table(mm, file="GL.txt", sep="\t", row.names=FALSE, na="", quote=FALSE)
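The calling rule stated in the comments (HomDel as TCN=0, loss as TCN<ploidy, gain as TCN>ploidy, amp as TCN>=ploidy+4, ploidy as the mode of TCN) can be isolated from the overlap logic. The Python sketch below restates only that rule; the function names are hypothetical, and ties for the mode are resolved to the smallest value, which is assumed to match what `which.max` on a sorted `table` does in the R code.

```python
from collections import Counter


def modal_ploidy(tcn_values):
    """Most frequent total copy number; ties resolve to the smallest value."""
    counts = Counter(tcn_values)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)


def gene_call(tcn, ploidy):
    """-2 HomDel, -1 loss, 0 neutral, 1 gain, 2 amp."""
    if tcn == 0:
        return -2
    if tcn >= ploidy + 4:
        return 2
    if tcn < ploidy:
        return -1
    if tcn > ploidy:
        return 1
    return 0


def gene_calls(tcn_values):
    """Classify every gene of one sample relative to that sample's modal ploidy."""
    ploidy = modal_ploidy(tcn_values)
    return [gene_call(tcn, ploidy) for tcn in tcn_values]
```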

lose annotations when going from opl_tab to tab

Turn off annotation for mouse mutation calls

We need snpEff annotation (with mouse genome), but don't need dbNSFP, CHASM, FATHMM and the rest. Something choked (I think CHASM) because of the different chromosome sizes.

Include dbNSFP_ExAC_Adj_AF in mutation_summary

Please. This column is very useful for troubleshooting.

Excel output raw file sheet unparseable with pandas

The raw file sheets are created from a groupby object. When that object is
output to an excel sheet, cells with the same value are merged. This looks nice
in the sheet but makes it unparseable, so for the raw data
sheets it is probably better to convert the groupby object to a regular dataframe
before outputting to excel.
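The idea of flattening grouped data before writing it can be shown without pandas. In the sketch below, which uses hypothetical names and plain dictionaries, every output row repeats its group key, so no cell needs to be merged and each row can be parsed on its own. In pandas itself the usual routes would be `reset_index()` on the grouped result or `to_excel(..., merge_cells=False)`; which one fits mutation_summary_excel.py is not something the issue settles.

```python
def flatten_groups(groups, key_name):
    """Turn {group key: [row dicts]} into flat rows that each carry their key."""
    flat = []
    for key, rows in groups.items():
        for row in rows:
            # Repeat the key on every row instead of sharing one merged cell.
            flat.append({key_name: key, **row})
    return flat
```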

add log file to slackbot error message

facets: increment cval if emcncf fails
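The control flow being requested, retrying the fit with a larger critical value each time it fails, might be sketched as follows. This is Python for illustration only: `fit` stands in for whatever wraps the FACETS run, and the step and upper limit are placeholder values, not recommendations.

```python
def fit_with_increasing_cval(fit, start_cval, step=50, max_cval=500):
    """Retry fit(cval) with a growing cval; return (cval, result) on first success."""
    cval = start_cval
    while cval <= max_cval:
        try:
            return cval, fit(cval)
        except RuntimeError:
            cval += step  # the fit failed at this cval, so try a coarser segmentation
    raise RuntimeError(f"fit failed for every cval from {start_cval} up to {max_cval}")
```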

facets cncf plot is missing chr X

qsub.pl takes up too much memory

Test data

We should create some simple test data sets that can show that all modules are working. Preferably a very tiny one that runs in only a couple of minutes.

incorrect merging of defuse and oncofuse

Oncofuse results are not matched correctly to the defuse results. See /home/ngk1/share/data/pancreas_sem/rnaseq/defuse/alltables/

Annotation of copy-number-regulated genes

specify facets options

strelka workflow incorrectly applies somatic depth filter

Strelka names the tumor and normal samples in the vcf TUMOR and NORMAL, so they are not picked up correctly by VariantFiltration. In addition, this filter was not supposed to be run in the first place; it ran because a string containing a space is not equal to the empty string.

Add TUMOR_MAF and NORMAL_MAF to vcf files

It would be nice to have TUMOR_MAF and NORMAL_MAF in the alltables so we don't have to compute them every time during post processing.
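As a rough sketch of the per-sample calculation, assuming the MAF here is the alternate allele fraction derived from the VCF `AD` value (reference depth first, then alternate depth), the computation could look like this. The function name is hypothetical, and only the first alternate allele is considered.

```python
def allele_fraction(ad_value):
    """Alt allele fraction from an AD value such as '90,10'; None if undefined."""
    parts = ad_value.split(",")
    if len(parts) < 2 or "." in parts:
        return None  # missing or incomplete depth information
    ref_depth, alt_depth = int(parts[0]), int(parts[1])
    total = ref_depth + alt_depth
    return alt_depth / total if total > 0 else None
```

Applied to the tumor and normal sample columns in turn, this would give the TUMOR_MAF and NORMAL_MAF values.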

Additions to mutsig_report

In addition to what's already there, please add the following to the report. This is basically trying to solve the signatures using the NMF method.

require(NMF)
require(lsa)
require(reshape2)
require(plyr)
require(ggplot2)
require(gplots)

alexandrov <- read.delim(opt$alexandrovData)
rownames(alexandrov) <- with(alexandrov, gsub(">", ".", paste(Trinucleotide, Substitution.Type, sep=".")))
alexandrov.matrix <- data.matrix(alexandrov[,4:ncol(alexandrov)])

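# Fit each sample's spectrum as a non-negative combination of the signatures
solveNMF <- function(x, inmatrix){
    # x: signatures (contexts x signatures); inmatrix: observed spectra (contexts x samples)
    coef <- fcnnls(x, inmatrix[rownames(x),]) # reorder the rows of inmatrix to match x
    colsum <- colSums(coef$x)
    # scale each sample's coefficients so that they sum to 1
    coef_x_scaled <- scale(coef$x, center=FALSE, scale=colsum)
    return(coef_x_scaled)
}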

mutcounts.nmf <- solveNMF(alexandrov.matrix, spectra_mat) ### this spectra_mat should be similar to the "X" that goes into plotMutBarplot, example attached.

# then perhaps a heatmap or something showing the results of mutcounts.nmf, also write it out to a text file.

sig_counts_matrix.txt