
This project forked from batzogloulabsu/simlr


Single-cell Interpretation via Multi-kernel Learning (R code in master branch, Matlab code in matlab branch), implementations of the method published in http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4207.html. Also see http://biorxiv.org/content/early/2017/03/21/118901 for a description of the software.

License: GNU General Public License v3.0



SIMLR (Single-cell Interpretation via Multi-kernel LeaRning)

OVERVIEW

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. SIMLR is capable of separating known subpopulations more accurately in single-cell data sets than do existing dimension reduction methods. Additionally, SIMLR demonstrates high sensitivity and accuracy on high-throughput peripheral blood mononuclear cells (PBMC) data sets generated by the GemCode single-cell technology from 10x Genomics.

SIMLR

SIMLR offers three main unique advantages over previous methods: (1) it learns a distance metric that best fits the structure of the data via combining multiple kernels. This is important because the diverse statistical characteristics due to large noise and dropout effect of single-cell data produced today do not easily fit specific statistical assumptions made by standard dimension reduction algorithms. The adoption of multiple kernel representations provides a better fit to the true underlying statistical distribution of the specific input scRNA-seq data set; (2) SIMLR addresses the challenge of high levels of dropout events that can significantly weaken cell-to-cell similarities even under an appropriate distance metric, by employing graph diffusion, which improves weak similarity measures that are likely to result from noise or dropout events; (3) in contrast to some previous analyses that pre-select gene subsets of known function, SIMLR is unsupervised, thus allowing de novo discovery from the data. We empirically demonstrate that SIMLR produces more reliable clusters than commonly used linear methods, such as principal component analysis (PCA), and nonlinear methods, such as t-distributed stochastic neighbor embedding (t-SNE), and we use SIMLR to provide 2-D and 3-D visualizations that assist with the interpretation of single-cell data derived from several diverse technologies and biological samples.
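The multiple-kernel idea in point (1) can be illustrated with a small base R sketch. It is not the SIMLR implementation: Gaussian kernels, the bandwidth grid `sigmas`, the toy matrix `X` and the uniform weights `w` are all assumptions made for illustration, whereas SIMLR learns the kernel weights from the data.

```r
# Toy expression matrix: 6 cells (rows) x 4 genes (columns), two obvious groups.
set.seed(1)
X <- rbind(matrix(rnorm(12, mean = 0), nrow = 3),
           matrix(rnorm(12, mean = 5), nrow = 3))

# Pairwise squared Euclidean distances between cells.
D2 <- as.matrix(dist(X))^2

# One Gaussian kernel per bandwidth (the grid is an illustrative assumption).
sigmas <- c(0.5, 1, 2, 4)
kernels <- lapply(sigmas, function(s) exp(-D2 / (2 * s^2)))

# Uniform weights stand in for the kernel weights that SIMLR learns.
w <- rep(1 / length(kernels), length(kernels))

# Combined similarity: weighted sum of the individual kernels.
S <- Reduce(`+`, Map(function(K, wk) wk * K, kernels, w))

print(round(S, 3))
```

Narrow bandwidths only connect very close cells, while wide ones also connect distant cells. Combining several bandwidths avoids committing to a single scale in advance.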

Furthermore, here we also provide an implementation of SIMLR (see SIMLR large scale) capable of handling large scale datasets.

REFERENCE

The latest draft of the SIMLR manuscript is available as a preprint at http://biorxiv.org/content/early/2017/02/28/052225, and the paper is published in Nature Methods at http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.4207.html.

Also see http://biorxiv.org/content/early/2017/03/21/118901 for a description of the software.

DOWNLOAD

We provide both the R and MATLAB implementations of SIMLR (both standard and large scale) in the SIMLR branch, while the master (stable version) or the development (development version) branches provide the version of SIMLR available on Bioconductor.

Furthermore, we also provide a Python implementation for SIMLR which can be found at https://github.com/bowang87/SIMLR_PY.

INSTALLING SIMLR R Bioconductor IMPLEMENTATION

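As mentioned, SIMLR is also hosted on Bioconductor at https://bioconductor.org/packages/release/bioc/html/SIMLR.html and can be installed as follows. To install the package directly from Bioconductor, run the following commands from R:

source("https://bioconductor.org/biocLite.R")

biocLite("SIMLR")

On Bioconductor 3.8 and later, biocLite has been replaced by the BiocManager package, so use the following commands instead:

install.packages("BiocManager")

BiocManager::install("SIMLR")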

It is also possible to install the GitHub version of SIMLR from R by using the devtools package.

library(devtools)

install_github("BatzoglouLabSU/SIMLR", ref = 'master')

library(SIMLR)

or,

library(devtools)

install_github("BatzoglouLabSU/SIMLR", ref = 'development')

library(SIMLR)

Note that the "master" branch hosts the latest stable version of the code, which is also available in the Bioconductor release repository, while the "development" branch hosts the latest version available in the Bioconductor devel repository.

Next, we describe how to install SIMLR manually, should one wish to do so.

RUNNING SIMLR R IMPLEMENTATION

We provide the R code to run SIMLR on 4 examples in the script R_main_demo.R. Furthermore, we provide a large scale implementation of SIMLR (see large scale implementation) with 1 example in the script R_main_demo_large_scale.R. The R libraries required to run the 2 demos can be installed by running the script install_R_libraries.R. We now present a set of requirements to run the examples.

  1. Required R libraries. SIMLR requires 2 R packages to run, namely the Matrix package (see https://cran.r-project.org/web/packages/Matrix/index.html) to handle sparse matrices and the parallel package (see https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf) for a parallel implementation of the kernel estimation.

To run the large scale analysis, it is necessary to install 4 more packages, namely Rcpp package (see https://cran.r-project.org/web/packages/Rcpp/index.html), pracma package (see https://cran.r-project.org/web/packages/pracma/index.html), RcppAnnoy package (see https://cran.rstudio.com/web/packages/RcppAnnoy/index.html) and RSpectra package (see https://cran.r-project.org/web/packages/RSpectra/index.html).

Furthermore, to run the examples, we require the igraph package (see http://igraph.org/r/) to compute the normalized mutual information metric and the grDevices package (see https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/00Index.html) to color the plots.

All these packages can be installed with the R built-in install.packages function.

  2. External C code. SIMLR uses an external C program during its computations. The code is located in the R directory in the file projsplx_R.c. To compile the program, run the following command in a shell: R CMD SHLIB -c projsplx_R.c.

An OS X pre-compiled file is also provided. Note: if there are issues compiling the .c file, try removing the pre-compiled files (i.e., projsplx_R.o and projsplx_R.so).

  3. Example datasets. The 5 example datasets are provided in the directory data.

Specifically, the dataset of Test_1_mECS.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25599176, Test_2_Kolod.RData refers to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595712/, Test_3_Pollen.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25086649 and Test_4_Usoskin.RData refers to http://www.ncbi.nlm.nih.gov/pubmed/25420068.

Moreover, for the large scale example, the dataset of Zelsel.RData refers to https://www.ncbi.nlm.nih.gov/pubmed/25700174.
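The requirements above mention the normalized mutual information (NMI) metric, which the examples compute with igraph. To show what the metric measures, here is a minimal base R sketch. The helper names `entropy` and `nmi` are hypothetical, and the normalization by the mean of the two entropies is an assumption that may differ from the one igraph uses.

```r
# Entropy (in nats) of a discrete probability vector; zero entries are skipped.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# NMI between two label vectors, normalized by the mean of the two entropies.
nmi <- function(a, b) {
  stopifnot(length(a) == length(b))
  joint <- table(a, b) / length(a)   # joint distribution of the two labelings
  pa <- rowSums(joint)               # marginal distribution of a
  pb <- colSums(joint)               # marginal distribution of b
  mi <- entropy(pa) + entropy(pb) - entropy(as.vector(joint))
  denom <- (entropy(pa) + entropy(pb)) / 2
  # Two single-cluster labelings are treated as identical partitions.
  if (denom == 0) 1 else mi / denom
}

true_labels <- c(1, 1, 1, 2, 2, 2)
relabeled   <- c("b", "b", "b", "a", "a", "a")  # same partition, other names
print(nmi(true_labels, relabeled))

independent <- c(1, 2, 1, 2)                    # unrelated to c(1, 1, 2, 2)
print(nmi(c(1, 1, 2, 2), independent))
```

NMI depends only on how the cells are grouped, not on the cluster names, so it is suited to comparing an unsupervised clustering against known labels.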

RUNNING SIMLR MATLAB IMPLEMENTATION

We also provide the MATLAB code to run SIMLR on the 5 examples in the scripts main_demo.m and main_LARGE_demo_.m. Please refer to the directory MATLAB and the file ReadMe.txt within it for further details.
