
Comments (21)

lmcinnes commented on April 28, 2024

Thanks, I have a clearer understanding now. The catch compared to PCA is that UMAP is, in general, stochastic -- refitting the same data repeatedly will give different results (just like t-SNE). I believe it is more stable than t-SNE, but the results will still differ. For example:

from sklearn import datasets
import umap
import numpy as np

iris = datasets.load_iris()

X = iris.data
y = iris.target

embedding1 = umap.UMAP().fit_transform(X)
embedding2 = umap.UMAP().fit_transform(X)

np.testing.assert_array_almost_equal(embedding1, embedding2, decimal=14)

will raise an error. This is ultimately baked into the algorithm. It can be remedied by setting a fixed seed, but that is just a matter of making the randomness consistent rather than eliminating the random component.
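For instance, fixing the seed via random_state (UMAP's standard seed parameter) makes repeated fits reproducible -- a minimal sketch:

import numpy as np
import umap
from sklearn import datasets

X = datasets.load_iris().data

# Same seed, same result: the randomness is made consistent, not eliminated
e1 = umap.UMAP(random_state=42).fit_transform(X)
e2 = umap.UMAP(random_state=42).fit_transform(X)
np.testing.assert_array_almost_equal(e1, e2)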

The current transform function operates the same way, since it is using the same fundamental UMAP building blocks to perform the transformation (it isn't a deterministic parameterised function) -- repeated application to the same (new or otherwise) data will produce a slightly different result each time. This could possibly be remedied by fixing random seeds, and I will certainly look into making that a possibility. My goal so far has been to provide a method that would allow one to fit against some data (say the MNIST train set) and then perform a transformation on new data (say the MNIST test set) and have it work reasonably efficiently and embed the new data with respect to the prior learned embedding. This much I believe works, and I've tested it on MNIST, Fashion-MNIST and a few other datasets and it seems to place new data well.
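In sklearn-style terms, the intended workflow looks roughly like this (a sketch using the digits dataset as a stand-in for MNIST; fit and transform here follow the standard estimator API):

from sklearn import datasets
from sklearn.model_selection import train_test_split
import umap

X = datasets.load_digits().data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Learn an embedding on the training data, then place new points into it
model = umap.UMAP(random_state=42).fit(X_train)
test_embedding = model.transform(X_test)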

I will have to look into setting seeds for the transform so that one can fix it, however, to get more consistent results.


lmcinnes commented on April 28, 2024

No, you haven't missed anything. Right now UMAP is transductive -- it creates a single transform of all the data at once and you would need to redo the embedding for the combined old and new data. This is similar to, say, t-SNE.

On the other hand I am currently working on implementing a transform function that would do this. It's still experimental, and so isn't in the mainline codebase yet. Right now I am working on the necessary refactoring to make it easy to implement what I have sketched-out/hacked-together in some notebooks. Eventually it will appear in the 0.3dev branch.

You can also look at issue #40 which discusses some of these topics. An alternative approach is to train a neural network to learn the non-linear transformation as a parameterised function and then use the NN to transform new points. I am not much of a neural network person, but others have apparently had some success with those approaches.
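As a rough sketch of that neural-network idea (not part of umap itself; this just fits a generic regressor to mimic a learned embedding, so new points get a deterministic, parameterised transform):

from sklearn import datasets
from sklearn.neural_network import MLPRegressor
import umap

X = datasets.load_digits().data
X_train, X_new = X[:1500], X[1500:]

# Embed the training data once with UMAP
embedding = umap.UMAP(random_state=42).fit_transform(X_train)

# Train a network to approximate the map X -> embedding
nn = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=1000, random_state=0)
nn.fit(X_train, embedding)

# The learned function is deterministic for new points
new_embedding = nn.predict(X_new)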


lmcinnes commented on April 28, 2024

You are welcome to try it. It is still in a somewhat experimental state (and will be even when 0.3 comes out). That is to say, the basic theory is all there and the implementation should work, but it hasn't been well tested against a wide range of datasets and problems yet, and there may be some fine tuning to be done in both theory and implementation in the future. I would certainly welcome your experiments and comments if you are willing to take the trouble to try it out.


lmcinnes commented on April 28, 2024

So the first part is expected: for example, PCA fit + transform will give a different result than fitting on the whole dataset -- that's how it has to work if one expects to keep the initial results fixed, since calling transform in that case is essentially embedding the data twice. The stochastic nature of things is what makes it "unstable".
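That PCA analogy can be checked directly (a small sketch; the axes learned from a subset differ from those learned from the full data, so the transformed coordinates differ too):

from sklearn import datasets
from sklearn.decomposition import PCA
import numpy as np

X = datasets.load_iris().data

half = PCA(n_components=2).fit(X[:75])   # fit on the first half only
full = PCA(n_components=2).fit(X)        # fit on everything

# Different fits learn different axes, so the same rows get different coordinates
print(np.allclose(half.transform(X[75:]), full.transform(X)[75:]))  # False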

The second issue is because the transform itself is stochastic, just like the fit. In general the results should be close, but I believe one would have to fix a seed to make the transform repeatable, and I don't believe the sklearn API allows for that (a seed on a transform operation). I would welcome suggestions on the right approach under such circumstances.


lmcinnes commented on April 28, 2024

Fit for PCA learns a deterministic transform function, but the principal eigenvectors learned from the fit data may differ from those of the data you wish to transform (or of the fit and new data combined).

The catch with UMAP is that the fitting is stochastic rather than deterministic, and as a result the analogous transform function is also stochastic.


lmcinnes commented on April 28, 2024

I believe we may be talking at cross purposes here, which is probably my fault. My understanding was that the goal for a transform function was to be able to do something like the following:

from sklearn import datasets
from sklearn.decomposition import PCA
import numpy as np

iris = datasets.load_iris()

X = iris.data
y = iris.target

# Quick and dirty split, but imagine a test/train split
data_to_fit = X[:100]
data_to_transform = X[100:]

pca = PCA(n_components=2)
pca_model = pca.fit(data_to_fit)

fit_embedding = pca_model.transform(data_to_fit)        # embedding of the fit data
new_embedding = pca_model.transform(data_to_transform)  # embedding of new data

If you simply want the embedding produced by the initial fit you can access it as the embedding_ attribute of the model, just like the t-SNE model in sklearn. Am I understanding correctly that this is what you want?
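For instance (a minimal sketch; embedding_ stores the coordinates from the original fit, so no re-embedding happens):

from sklearn import datasets
import umap

X = datasets.load_iris().data

model = umap.UMAP(random_state=42).fit(X)
fit_embedding = model.embedding_  # coordinates produced by the initial fit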

Edit: Just to be clear -- thank you for being patient with me and clarifying the issue; it is greatly appreciated, and I'm sorry if I am slow to understand.


lmcinnes commented on April 28, 2024

I don't have an explicit timeline. The core code refactoring and new features are done, but I really want to have a much more comprehensive test suite and get some documentation in place. Hopefully some time in late June or early July.

The transform method just lets you add new points to an existing embedding. For MNIST, for example, I can add the 10000 test digits to a model trained on the 60000 train digits in around 20 seconds. That's not stunningly fast, but it should be respectable.

The supervised dimension reduction lets you use labels to inform your embedding. This means, for example, that you could embed Fashion-MNIST and have each clothing item cluster separately while still maintaining the internal structure of the clusters and the relative positioning among clusters (to some extent). See the example embedding below:

[image: supervised UMAP embedding of Fashion-MNIST with well-separated class clusters]
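A sketch of how this looks in code (supervised mode here is simply passing the labels y to fit_transform, assuming the 0.3-style API; digits stands in for Fashion-MNIST):

from sklearn import datasets
import umap

X, y = datasets.load_digits(return_X_y=True)

# Passing the labels makes the embedding supervised
sup_embedding = umap.UMAP(random_state=42).fit_transform(X, y)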


alexander-rakhlin commented on April 28, 2024

What I have noticed so far is that embedding the same data via fit + transform produces a different result than fit_transform. In the following, data is a 1207x768 array, and embedding != embedding2, not even close. Furthermore, consecutive calls to model.transform(data) give different results every time.

    model = umap.UMAP(n_neighbors=15, min_dist=1e-7, random_state=0, metric="euclidean").fit(data)
    embedding = model.transform(data)

    embedding2 = umap.UMAP(n_neighbors=15, min_dist=1e-7, random_state=0, metric="euclidean").fit_transform(data)


alexander-rakhlin commented on April 28, 2024

My ultimate goal, just like the original poster's, is embedding new data via a previously learned transformation. Deterministic if possible. That is what you have shown.

Transforming the same data was meant to confirm that PCA is deterministic. And yes, the embedding_ attribute would be useful too; I overlooked it. Thank you.


alexander-rakhlin commented on April 28, 2024

@lmcinnes thank you for your work and response. I can confirm that transform produces reasonably consistent results. For instance:

1. I fit my data set to 2D with UMAP, then cluster and label it with DBSCAN to obtain ~22 classes. I use this labeling as ground truth.
2. I split the data into train/test in a 9/1 proportion, refit on the train set, and label it with DBSCAN (plus establish a correspondence between this new labeling and the ground truth via majority matching; this correspondence isn't exact, of course).
3. I transform the test set with the model from step 2 and label it via KNN against the fitted train clusters.

The main result: the accuracy of the train and test labelings against the "ground truth" from step 1 is very similar and quite high, 85-95%.
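A minimal sketch of that pipeline (illustrative only: the digits dataset, DBSCAN parameters, and k are placeholders, and the majority-matching step is omitted):

from sklearn import datasets
from sklearn.cluster import DBSCAN
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import umap

data = datasets.load_digits().data  # stand-in for the real 1207x768 array

# 1) Cluster the full 2D embedding to get "ground truth" labels
full_embedding = umap.UMAP(random_state=0).fit_transform(data)
ground_truth = DBSCAN(eps=0.5).fit_predict(full_embedding)

# 2) Refit on the train split and re-cluster its embedding
#    (majority matching of these labels against ground_truth omitted)
train, test = train_test_split(data, test_size=0.1, random_state=0)
model = umap.UMAP(random_state=0).fit(train)
train_labels = DBSCAN(eps=0.5).fit_predict(model.embedding_)

# 3) Transform the test split and label it via KNN on the train clusters
knn = KNeighborsClassifier(n_neighbors=5).fit(model.embedding_, train_labels)
test_labels = knn.predict(model.transform(test))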


lmcinnes commented on April 28, 2024

The 0.3dev branch is largely stable and should be good enough for general use at this point, with the obvious caveat that it is still in development and there may be a few hidden quirks that could break things unexpectedly in less standard use cases. The transform function should now be consistent in the transformation (via a fixed transform seed which you can pick on instantiation if you wish). I've been testing it lately in combination with the supervised dimension reduction for metric learning and it seems to be performing pretty decently in that case.
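For example (a sketch; I'm assuming transform_seed is the constructor parameter referred to above):

import numpy as np
from sklearn import datasets
import umap

X = datasets.load_digits().data
train, test = X[:1500], X[1500:]

# transform_seed fixes the randomness used by transform()
model = umap.UMAP(random_state=0, transform_seed=42).fit(train)
e1 = model.transform(test)
e2 = model.transform(test)
np.testing.assert_array_almost_equal(e1, e2)  # consistent across calls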


LavoriniV commented on April 28, 2024

Thank you, I did not see that discussion. I'll give it a try!


alexander-rakhlin commented on April 28, 2024

@lmcinnes thank you for your work.
I suppose transform has already been implemented in the 0.3dev branch. Can we try it?


alexander-rakhlin commented on April 28, 2024

Why doesn't fit learn a deterministic transformation?
In the PCA case, I don't see a reason why it cannot be deterministic.


alexander-rakhlin commented on April 28, 2024

I mean, after PCA has learned the principal eigenvectors from the training data, its transformation should not depend on the new data being transformed.


lmcinnes commented on April 28, 2024

My understanding, and perhaps I am wrong here, is that the transform function is supposed to take new data and project it into the space that was fit. For example, in PCA the transform function projects previously unseen data onto the principal eigenvectors, generating an embedding for the new data. My goal was to produce something similar for UMAP. If what you need is for transform on the already-fit data to return the previous fit, then I can add a check to see if the input matches the original data and simply return the existing fit. Perhaps I am misunderstanding something here, though?


alexander-rakhlin commented on April 28, 2024

My understanding is that the transform function does not change the previously learned transformation, and that the transformation is a deterministic function. I am not certain in regard to UMAP, but PCA is:

from sklearn import datasets
from sklearn.decomposition import PCA
import numpy as np

iris = datasets.load_iris()

X = iris.data
y = iris.target

pca1 = PCA(n_components=2)
pca_model = pca1.fit(X)
X_r1 = pca_model.transform(X)

pca2 = PCA(n_components=2)
X_r2 = pca2.fit_transform(X)

np.testing.assert_array_almost_equal(X_r1, X_r2, decimal=14)


yueqiw commented on April 28, 2024

Hi, I'm wondering what the status of the transform function is? I've found that umap gives me very intuitive embeddings, and I'm hoping to be able to embed new data points onto existing embeddings.

I saw that there have been new commits in the 0.3dev branch, but I'm not sure if it's stable or whether I should wait for a while before using it? Thanks!


yueqiw commented on April 28, 2024

Thanks! I'll give it a try. Do you have a timeline on when the next stable version will be released?

Regarding "in combination with the supervised dimension reduction for metric learning" -- could you provide more details on this? And how does the transform function compare to the supervised dimension reduction in terms of performance? Thanks!


yueqiw commented on April 28, 2024

Thanks! If I understand correctly, standard UMAP embeddings of fashion-MNIST have clusters that are partially overlapping with each other (like the image on the homepage), but supervised dimension reduction separates the clusters much better. This is very interesting. Will this be part of the next release?


lmcinnes commented on April 28, 2024

Yes, that will be in the next release.

