The gemprep from systemsgenetics

Nextflow workflow fails to output ks-results

The Nextflow workflow enables the ks-test but does not output a log file. To get a log file from the results, you currently need to run the normalize-frankenstein.pbs script. Could an option be added to nextflow.config to produce this log file?

Add UMAP

UMAP is a dimensionality reduction and visualization method similar to t-SNE; its source is available on GitHub. It would be useful to have this method as an option in visualize.py.

Verify quantile normalization results between R and python

The only reason normalize.R is still around is that we have not been able to match the output of normalize.quantiles() in normalize.py. The scikit-learn implementation of quantile normalization can't handle missing values (NaNs), so we wrote our own implementation, since the algorithm is simple. However, it doesn't match the R implementation. I've run some experiments to try to figure out the cause, so I will document my findings here.
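For reference, here is a minimal sketch of NaN-aware quantile normalization in NumPy. It is not the code in normalize.py, and it is not claimed to reproduce normalize.quantiles(). The function name `quantile_normalize`, the interpolation onto a common quantile grid, and the positional tie-breaking are all assumptions made for illustration. It also assumes every column has at least one non-NaN value.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of X, leaving NaNs in place.

    Illustrative sketch only: columns with different numbers of valid
    values are interpolated onto a common quantile grid (an assumption).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    grid = np.linspace(0.0, 1.0, n_rows)

    # Reference distribution: average of each column's sorted valid values,
    # interpolated onto the common grid.
    ref = np.zeros(n_rows)
    for j in range(n_cols):
        valid = np.sort(X[~np.isnan(X[:, j]), j])
        pos = np.linspace(0.0, 1.0, valid.size)
        ref += np.interp(grid, pos, valid)
    ref /= n_cols

    # Map each valid value to the reference value at its rank-based quantile.
    out = np.full_like(X, np.nan)
    for j in range(n_cols):
        mask = ~np.isnan(X[:, j])
        col = X[mask, j]
        # Ordinal ranks 0..m-1; ties are broken by position, not averaged.
        ranks = np.argsort(np.argsort(col, kind="stable"), kind="stable")
        pos = ranks / max(col.size - 1, 1)
        out[mask, j] = np.interp(pos, grid, ref)
    return out
```

Tie handling and the treatment of columns with differing numbers of valid values are two places where implementations of this algorithm can reasonably differ, so they are natural candidates to check when two implementations disagree.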

I've been working with the old TCGA (5 cancer) matrix. First I applied a log2 transform to the matrix, then I applied quantile normalization using R and Python separately, producing two normalized matrices. I wrote a script to quantify the difference in expression values between two matrices, and used it to compare the two quantile implementations.

$ python bin/validate.py TCGA.fpkm.0.r.txt TCGA.fpkm.0.py.txt 
Loaded TCGA.fpkm.0.r.txt (73599, 2016)
Loaded TCGA.fpkm.0.py.txt (73599, 2016)
warning: column names do not match
number of mismatched nans: 0
min error:     0.000000
avg error:     0.535444
max error:    18.932696

So at a glance, the two results are slightly different. Expression values range from about 0 to 20, so the average error is small but noticeable. The columns do actually match; it's just that R for some reason replaces '-' with '.' in the column names. Interestingly, the positions of the NaNs in the two matrices are identical.
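A comparison of this kind can be sketched as follows. This is not the actual bin/validate.py; the names `compare_matrices` and `normalize_names` are hypothetical, and the sketch assumes the two matrices share at least one non-NaN position. The name rewrite mirrors the '-' to '.' substitution described above, which is probably R's default `check.names` behavior when reading tables.

```python
import numpy as np

def normalize_names(names):
    """Rewrite '-' as '.' so Python-side names match R-style column names."""
    return [name.replace("-", ".") for name in names]

def compare_matrices(a, b):
    """Summarize element-wise absolute differences between two matrices."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")

    nan_a, nan_b = np.isnan(a), np.isnan(b)
    both_valid = ~nan_a & ~nan_b

    # Absolute error over positions where both matrices have a value.
    err = np.abs(a[both_valid] - b[both_valid])
    return {
        "mismatched_nans": int(np.sum(nan_a != nan_b)),
        "num_errors": int(np.count_nonzero(err)),
        "min_error": float(err.min()),
        "avg_error": float(err.mean()),
        "std_error": float(err.std()),
        "max_error": float(err.max()),
    }
```

Comparing names after `normalize_names` would avoid the spurious "column names do not match" warning when the only difference is the separator character.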

I thought the difference might arise from how our implementation handles the NaNs, so I re-ran this test with matrices that don't have NaNs. To do that, I applied a log2(1 + x) transform instead of log2(x). Since raw expression values range from 0 to infinity, log2(1 + x) will never return a NaN in this case.
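The effect of the two transforms on zero-valued entries can be illustrated with a small example. Note that NumPy itself returns -inf rather than NaN for log2(0); the sketch assumes non-finite results are then marked as missing, which may or may not be what the actual pipeline does.

```python
import numpy as np

# Toy expression values, including a zero.
x = np.array([0.0, 1.0, 3.0])

# Plain log2: the zero entry becomes -inf (warning suppressed here).
with np.errstate(divide="ignore"):
    plain = np.log2(x)

# Assumption: non-finite values are treated as missing (NaN).
masked = np.where(np.isfinite(plain), plain, np.nan)

# Shifted log2: every non-negative input maps to a finite value.
shifted = np.log2(1.0 + x)
```

With the shifted transform, zeros map to 0 instead of being dropped, so the normalized matrices contain no missing values.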

$ python bin/validate.py TCGA.fpkm.1.r.txt TCGA.fpkm.1.py.txt 
Loaded TCGA.fpkm.1.r.txt (73599, 2016)
Loaded TCGA.fpkm.1.py.txt (73599, 2016)
warning: column names do not match
number of mismatched nans: 0
number of errors: 56823902
min error:     0.000000
avg error:     0.020964 +/-     0.147583
max error:     5.201935

Much better, but still slightly off. I really need to find the source code for the R implementation in order to know exactly how they do it. I found the R library but the implementation itself is written in C and I couldn't find the C source files.

Also need to post density plots here when I get the chance.
