
smallvis's Introduction

smallvis

An R package for small-scale dimensionality reduction using neighborhood-preservation methods, including t-Distributed Stochastic Neighbor Embedding (t-SNE), LargeVis and UMAP.

LargeVis and UMAP are of particular interest because they seem to give visualizations which are very competitive with t-SNE, but can use stochastic gradient descent to give faster run times and/or better scaling with dataset size than the typical Barnes-Hut t-SNE implementation.

This package is designed to make it easier to experiment with and compare these methods, by removing differences in implementation details.

One way it does this is by abandoning the more advanced nearest-neighbor methods, distance approximations, sampling, and multi-threaded stochastic gradient descent techniques. The price of this simplification is that the algorithms are back to being O(N^2) in both storage and computation (and are implemented in pure R). Unlike UMAP, the official implementation of LargeVis, and the Barnes-Hut implementation of t-SNE, this package is therefore not suitable for large-scale visualization. Hence the name smallvis.
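To make the quadratic scaling concrete, here is a back-of-the-envelope calculation (a sketch, not smallvis's actual memory accounting): a single dense N x N matrix of doubles for N = 6,000 already takes about 275 MiB, and methods like t-SNE need several such matrices (input and output affinities, pairwise distances).

```r
# Rough memory cost of one dense N x N matrix of 8-byte doubles
n <- 6000
bytes <- n^2 * 8
bytes / 2^20  # about 275 MiB for a single matrix
```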

Prerequisites

smallvis uses the vizier package to plot the coordinates during optimization. It's not on CRAN, and therefore requires a fairly new version of devtools (1.9 or greater) to install it as a dependency from GitHub.

There is also an optional dependency on the RSpectra package, which is used only if you want to initialize from a spectral method (set Y_init = "laplacian" or Y_init = "normlaplacian" to do this, and see the paper by Linderman and Steinerberger for details on why you might want to). If RSpectra is not present, the standard R function eigen is used instead, but this is much slower, because only the first few eigenvectors are needed and eigen calculates all of them. On my Sandy Bridge-era laptop running R 3.4.2 on Windows 10, using RSpectra::eigs to fetch the top 3 eigenvectors from a 6,000 x 6,000 affinity matrix takes about 6 seconds; using eigen takes around 25 minutes.
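For illustration, here is a minimal sketch of a normalized-Laplacian initialization on a toy affinity matrix using base eigen (the variable names are made up for this example, and this is not smallvis's internal code; RSpectra::eigs_sym would fetch just the leading eigenvectors instead of all of them):

```r
set.seed(42)
X <- matrix(rnorm(30), nrow = 10)       # 10 toy points in 3 dimensions
A <- exp(-as.matrix(dist(X))^2)         # Gaussian affinities
diag(A) <- 0
d_inv_sqrt <- 1 / sqrt(rowSums(A))
M <- A * outer(d_inv_sqrt, d_inv_sqrt)  # D^{-1/2} A D^{-1/2}
e <- eigen(M, symmetric = TRUE)         # eigen computes ALL eigenvectors
# For a connected graph the leading eigenvalue is 1 (the trivial eigenvector);
# the next two eigenvectors supply 2D starting coordinates
coords <- e$vectors[, 2:3]
```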

Installing

install.packages("devtools")
devtools::install_github("jlmelville/smallvis", subdir = "smallvis")
library(smallvis)

Using

# By default, all numeric columns found in a data frame are used, so you don't need to filter out factor or string columns
# set verbose = TRUE to log progress to the console
# Automatically plots the results during optimization
tsne_iris <- smallvis(iris, perplexity = 25, verbose = TRUE)

# Using a custom epoch_callback
uniq_spec <- unique(iris$Species)
colors <- rainbow(length(uniq_spec))
names(colors) <- uniq_spec
iris_plot <- function(x) {
  plot(x, col = colors[iris$Species])
}

tsne_iris <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, verbose = TRUE)

# Default method is t-SNE; use the LargeVis cost function instead
# LargeVis also depends on a gamma parameter (see below) and is not as easy to optimize as t-SNE:
# reduce the learning rate (eta) and increase the maximum number of iterations
largevis_iris <- smallvis(iris, method = "largevis", perplexity = 25, epoch_callback = iris_plot, 
                          eta = 0.1, max_iter = 5000, verbose = TRUE)
                          
# For extra control over method-specific parameters pass a list as the "method" parameter:
# In largevis, gamma controls the balance of repulsion vs attraction
# The smallvis man page lists the method-specific parameters which can be controlled in this way
largevis_iris <- smallvis(iris, method = list("largevis", gamma = 1), perplexity = 25, epoch_callback = iris_plot, 
                          eta = 0.1, max_iter = 5000, verbose = TRUE)

# UMAP: see https://github.com/lmcinnes/umap
# UMAP also has extra parameters, but we use the defaults here
umap_iris <- smallvis(iris, method = "umap", perplexity = 25, eta = 0.01)
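As a sketch of what those extra parameters do (this shows the general UMAP output weight curve, not smallvis's exact defaults; the function name umap_w is made up): UMAP replaces t-SNE's Cauchy output kernel with w = 1 / (1 + a * d^(2b)), and with a = 1, b = 1 it reduces to the t-SNE kernel.

```r
# UMAP-style output weight; a and b shape the curve (a = b = 1 gives the
# t-SNE / Cauchy kernel). The non-default values below are illustrative only.
umap_w <- function(d, a = 1, b = 1) {
  1 / (1 + a * d^(2 * b))
}
umap_w(1)                    # 0.5 with the t-SNE kernel
umap_w(1, a = 1.6, b = 0.9)  # smaller: a larger a gives a tighter embedding
```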

# use (scaled) PCA initialization so embedding is repeatable
tsne_iris_spca <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "spca")

# or initialize from Laplacian Eigenmap (similar to UMAP initialization)
tsne_iris_lap <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "lap")

# or initialize from normalized Laplacian eigenvectors (even closer to UMAP initialization)
tsne_iris_nlap <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, Y_init = "normlap")

# return extra information in a list, like with Rtsne
tsne_iris_extra <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot, ret_extra = TRUE)

# more (potentially large and expensive to calculate) return values, but you have to ask for them specifically
tsne_iris_extra_extra <- smallvis(iris, perplexity = 25, epoch_callback = iris_plot,
                              ret_extra = c("P", "Q", "DX", "DY", "X"))

# Repeat embedding 10 times and keep the one with the best cost
tsne_iris_best <- smallvis_rep(nrep = 10, X = iris, perplexity = 25, ret_extra = TRUE)
iris_plot(tsne_iris_best$Y)

# Let smallvis pick a perplexity for you, using the Intrinsic Dimensionality Perplexity
tsne_iris_idp <- smallvis(iris, epoch_callback = iris_plot, perplexity = "idp", Y_init = "spca",
                          exaggeration_factor = 4)
                          
# Classical momentum optimization instead of delta-bar-delta
umap_iris_mom <- smallvis(iris, scale = FALSE, opt = list("mom", eta = 1e-2, mu = 0.8),
                          method = "umap", Y_init = "spca")

# L-BFGS optimization via the mize package
umap_iris_lbfgs <- smallvis(iris, scale = FALSE, opt = list("l-bfgs", c1 = 1e-4, c2 = 0.9),
                            method = "umap", Y_init = "spca", max_iter = 300)
                            
# Early Exaggeration
tsne_iris_ex <- smallvis(iris, eta = 100, exaggeration_factor = 4, stop_lying_iter = 100)

# and Late Exaggeration as suggested by Linderman and co-workers
tsne_iris_lex <- smallvis(iris, eta = 100, exaggeration_factor = 4, stop_lying_iter = 100,
                          late_exaggeration_factor = 1.5, start_late_lying_iter = 900) 

Available Embedding Methods

Things To Be Aware Of

  • March 23 2019: Methods that use the exponential function (e.g. NeRV, JSE, SSNE, ASNE) are now more robust, but sadly a lot slower, due to me implementing the log-sum-exp "trick" to avoid numeric underflow. This mainly helps JSE, which showed a tendency to have its gradients suddenly explode. It's still difficult to optimize, though. Perhaps symmetric JSE can help under those circumstances.
  • Feb 13 2018: the UMAP paper is out, but I have yet to read and understand it fully, so the implementation in smallvis currently relies on my examination of the UMAP source code, with some much-appreciated clarification from UMAP creator Leland McInnes. Expect some bugs, and any horrifically bogus results should be double-checked with the output of the official UMAP implementation before casting calumnies on the quality of UMAP itself.
  • LargeVis requires a gamma parameter, which weights the relative contribution of the attractive and repulsive terms in the cost function. The real LargeVis recommends setting this value to 7, but that recommendation relies on the specifics of its stochastic gradient descent method. In smallvis, the best value is very dataset dependent: the more data you have, the smaller gamma should be to avoid over-emphasising repulsive interactions.
  • LargeVis partitions each pairwise interaction into either an attractive or repulsive contribution. In smallvis, each interaction is a combination of both.
  • Both UMAP and LargeVis use a classic stochastic gradient descent approach with a decaying learning rate. The implementation in smallvis uses the same delta-bar-delta optimization method used in t-SNE. It works well in my experience, but may require some tuning and more iterations compared to optimizing t-SNE.
  • In this setting, the LargeVis and UMAP gradients require quite a large epsilon value to avoid division by zero and get decent results. It's hard-coded to 0.1 in the LargeVis source code, so smallvis uses the same value by default. It can be controlled by the lveps parameter.
  • Mainly for my own benefit, there is also a theory page showing a comparison of cost functions and gradients. Also, some material on the various spectral methods, which justifies the use of Laplacian Eigenmaps (a bit).
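The log-sum-exp trick mentioned above is easy to demonstrate: shift the inputs by their maximum before exponentiating, so that at least one term is exp(0) = 1 and the sum cannot underflow to zero. A generic sketch, not smallvis's internal routine:

```r
x <- c(-1000, -1001, -1002)
log(sum(exp(x)))   # -Inf: every exp(x) underflows to zero

logsumexp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))  # exp(x - m) is at most 1, so no underflow
}
logsumexp(x)       # about -999.59
```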
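To illustrate the gamma and lveps points above, here is a toy per-pair LargeVis-style cost (a sketch of the general form, not smallvis's exact implementation; the function name lv_pair is made up). With w = 1/(1 + d^2), the attractive term -p * log(w) pulls neighbors together, the repulsive term -gamma * log(1 - w + eps) pushes pairs apart, and eps keeps the logarithm away from log(0) when points coincide.

```r
lv_pair <- function(d2, p, gamma = 7, eps = 0.1) {
  w <- 1 / (1 + d2)  # output similarity, as in t-SNE
  -p * log(w) - gamma * log(1 - w + eps)
}
# With no attraction (p = 0), the cost is lower when the pair is far apart:
lv_pair(d2 = 4, p = 0)     # ~0.74
lv_pair(d2 = 0.01, p = 0)  # ~15.5: strong pressure to separate close pairs
```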

My Idle Thoughts

smallvis exists mainly to satisfy my urge to answer the various minor, stamp-collecting questions that have occurred to me while reading the dimensionality reduction literature. The results I have cobbled together into something that demonstrates the use of smallvis can be found at the documentation page.

See Also

Also of relevance are:

  • UMAP (in Python)
  • UWOT, a package implementing LargeVis and UMAP.
  • LargeVis (in C++)
  • Spectra, the C++ library that RSpectra wraps.
  • FIt-SNE, an FFT-based t-SNE library. I have implemented the "late exaggeration" method that it uses. See their paper for more details.

Much of the code here is based on my fork of Justin Donaldson's R package for t-SNE.

License

GPLv2 or later. Any LargeVis-specific code (e.g. cost and gradient calculation) can also be considered Apache 2.0. Similarly, UMAP-related code is also licensed as BSD 3-clause.


smallvis's Issues

smallvis_rep() not working

Very useful and fun package. But trying to run the sample code, I get an error message.

library(smallvis)
tsne_iris_best <- smallvis_rep(nrep = 10, iris, perplexity = 25, ret_extra = TRUE)

23:16:10 Starting embedding # 1 of 10
Error in make_smallvis_cb(X) : argument "X" is missing, with no default

Thanks!

A reference for scaled PCA initialisation

Hey, do you by any chance know of any reference for the scaled PCA initialization? Perhaps there is some paper that at least mentions PCA initialization as a possibility and explicitly says that one needs to use scaled PCA in this case? I am not able to find any reference at all.

I looked here https://jlmelville.github.io/smallvis/init.html and here http://jlmelville.github.io/sneer/references.html but did not find anything. In fact, in your References list you cite our preprint in the Initialization setting, but I think I borrowed the scaled PCA idea from your pages :-) I am preparing a revision right now and it'd be good to have some reference for this...

Error in if (Z == 0) { : missing value where TRUE/FALSE needed

Thank you for such an extensive package and for including examples for each and every function. The example with iris runs with no problem; when I try to reproduce it with a different dataset of similar str(), I get the following error:

Load dataset, inspect str

i:

URL  <- "http://bit.ly/data_shannon_error"
goblins_df <- read.csv(URL) 
utils::str(goblins_df)

o:

'data.frame':	373 obs. of  9 variables:
 $ bone_length  : num  0.355 0.324 0.63 0.472 0.469 ...
 $ rotting_flesh: num  0.351 0.431 0.581 0.494 0.517 ...
 $ hair_length  : num  0.466 0.205 0.647 0.557 0.621 ...
 $ has_soul     : num  0.781 0.258 0.505 0.564 0.625 ...
 $ red          : int  255 255 255 255 255 255 255 255 255 255 ...
 $ green        : int  255 255 255 255 255 255 255 255 255 255 ...
 $ blue         : int  255 255 255 255 255 255 255 255 255 255 ...
 $ alpha        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ type         : Factor w/ 3 levels "Ghost","Ghoul",..: 2 1 2 2 3 2 2 2 3 3 ...

smallvis() call

i:

tsne <- smallvis(data, 
                          perplexity = 30, 
                          verbose    = TRUE)

Error:

o:

Error in if (Z == 0) { : missing value where TRUE/FALSE needed
6.shannon(Di, beta[i])
5.x2aff(X, perplexity, tol = 1e-05, kernel = kernel, verbose = verbose)
4.sne_init(cost, X, perplexity = perplexity, kernel = inp_kernel, symmetrize = "symmetric", normalize = TRUE, verbose = verbose, ret_extra = ret_extra)
3.cost$init(cost, X, verbose = verbose, ret_extra = ret_extra, max_iter = max_iter)
2.cost_init(cost_fn, X, max_iter = max_iter, verbose = verbose, ret_extra = ret_optionals)
1.smallvis(data, perplexity = 30, verbose = TRUE)

Does it have something to do with shannon()? I searched your code and the source code of pgirmess::shannon() but can't seem to detect where this error comes from.

Unable to install smallvis

Hi there, I followed your readme and installed RSpectra and vizier successfully, but could not install smallvis. Hopefully there is something small I missed!

Thank you.

> library(RSpectra)
> library(vizier)
> devtools::install_github("jlmelville/smallvis/smallvis")
Downloading GitHub repo jlmelville/smallvis@HEAD
Error in utils::download.file(url, path, method = method, quiet = quiet,  : 
  download from 'https://api.github.com/repos/jlmelville/smallvis/tarball/HEAD' failed
> R.version.string
[1] "R version 4.2.1 (2022-06-23)"
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.4.5        usethis_2.1.6         vizier_0.4.2          RSpectra_0.16-1       data.table_1.14.6     ComplexHeatmap_2.12.1
 [7] BiocParallel_1.30.4   forcats_0.5.2         stringr_1.4.1         dplyr_1.0.10          purrr_0.3.5           readr_2.1.3          
[13] tidyr_1.2.1           tibble_3.1.8          ggplot2_3.4.0         tidyverse_1.3.2       SeuratObject_4.1.3    Seurat_4.3.0         
[19] SCopeLoomR_0.13.0     SCENIC_1.3.1          AUCell_1.18.1         flowCore_2.8.0       
