
Comments (35)

lmcinnes commented on April 27, 2024

There are a few things to deal with here.

First, for your dataset (500,000) you will need code that is currently in a branch not merged with master. There are some issues once you get beyond 100,000 points that I have finally resolved but not merged.

Second, it will use vectorized instructions as much as numba will let it -- numba is a library that takes numerical Python code and compiles it to LLVM, so whatever LLVM and numba can do will be happening. Since I'm still working at the algorithmic level I have not concerned myself with that level of optimization.

Third, if it is using all CPUs and taking a long time ... that means that the spectral embedding is busy failing since there isn't a large enough eigengap. Try using init='random' to avoid this. I'm still trying to figure out the right way to handle that sort of issue.

Fourth, try to keep n_neighbors small; the larger it is, especially for large datasets, the worse the performance. Part of the efficiency of UMAP comes from its ability to make use of much smaller nearest neighbor sets than comparable algorithms.

Fifth, I do recommend trying on a sub-sample of your dataset first to make sure things are working before scaling up.
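
A minimal sketch of the third through fifth points (the data array and parameter values here are placeholders, not recommendations):

import numpy as np
import umap

data = np.random.rand(500000, 50)                # placeholder for your dataset

reducer = umap.UMAP(n_neighbors=15,              # keep n_neighbors small
                    init='random')               # avoid the spectral embedding stall
embedding = reducer.fit_transform(data[:50000])  # verify on a sub-sample before scaling up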

Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.

cjnolet commented on April 27, 2024

@candalfigomoro @mik1904 we just recently added a very similar capability to the GPU-accelerated UMAP in RAPIDS cuML, which uses Dask.

https://github.com/rapidsai/cuml/blob/branch-0.14/python/cuml/dask/manifold/umap.py

So far, initial evaluations are showing some very positive results, especially as the inference dataset grows very large.

We are continuing to work on distributing the training as well, however the embarrassingly parallel inference is showing to be very useful thus far.
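
A rough sketch of what that looks like, assuming the cuML 0.14-era API linked above (the exact import path and constructor arguments may differ between releases, and X_train_gpu / X_dask are placeholders):

from cuml.manifold import UMAP
from cuml.dask.manifold import UMAP as MNMG_UMAP

local_model = UMAP(n_neighbors=15).fit(X_train_gpu)   # fit on a single GPU
distributed_model = MNMG_UMAP(model=local_model)      # wrap the trained model for the Dask cluster
embedding = distributed_model.transform(X_dask)       # embarrassingly parallel inference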

snakers4 commented on April 27, 2024

Ran a bench on these params for 500k * 50 data:

import umap

reducer = umap.UMAP(
    n_neighbors=200,
    min_dist=0.0,
    n_components=2,
    metric='euclidean'
    )
reducer = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=2,
    metric='euclidean'
    )

Nothing more or less happened for hours.

Then I just tried:

reducer = umap.UMAP(
    n_components=2
    )

And after an hour or so it started using all the processors.

100518832 commented on April 27, 2024

@snakers4

import pandas as pd # import pandas
import numpy as np # import numpy

from multiprocessing import Process # import from standard python libs for multi-core threading and processing 
import glob # import to iterate through files in directories 
import os.path # import for creating os path objects
import matplotlib.pyplot as plt # import used to create and save scatter plots
#from mpl_toolkits.mplot3d import Axes3D
from numpy import genfromtxt # numpy import to read txt files as csv's
import umap # the embedding algorithm, UMAP, this is the reduction method (or class) used
#from sklearn.decomposition import PCA 
#from sklearn.preprocessing import MinMaxScaler


def data_grab(path,filename,fraction): # this is the function that grabs the dataset
    
    df = pd.read_csv(str(path)+str(filename)) # reading the dataset from a csv file and making it a pandas dataframe
    df = df.select_dtypes(include=[np.number]) # selecting only the columns that contain float/int; string and other non-numeric columns are ignored
    df = df.sample(frac=fraction, random_state = 32).reset_index(drop = True) # this takes a sample of our dataset to use for embedding, the random state controls the random seed
    df = df.to_numpy() # this converts the dataframe into a numpy ndarray (as_matrix() was removed in newer pandas versions)
    
    return df # return the data as a numpy ndarray


def UMAP(n_neighbors, min_dist, metric, input_data): # this is the embedding function
    
    global fraction, filename # we need to keep the filename of our dataset for folder creation, same for fraction 
    
    try:
        data = umap.UMAP(n_neighbors = n_neighbors, min_dist = min_dist, metric = metric, n_components = 2).fit_transform(input_data)
    except Exception: # catch any failure so one bad parameter combination does not kill the whole run
        print('Embedding_Failed')
        return None 
    
    df_reduct = pd.DataFrame(data) # our lower embedded data is saved as a pandas dataframe
    
    return df_reduct.to_csv('%s_embedding_%s/embedding,%s,%s,%s.csv' %(filename,fraction,n_neighbors, min_dist, metric)) # we are saving our embedding to a csv


def savefig(file): # this function creates the scatter plots and saves them; currently the cmap/colouring does not work with my Anaconda install (install problem), you can add it if it works for you
    
        global fraction, filename # make sure we maintain our global variables for file saving
    
        fileparams = file.replace('%s_embedding_%s/embedding,'%(filename,fraction),'') # taking the embedding csv filename and replacing the first part of the string
        # explanation:
        # since multiprocessing does not share variables between instances, if we try to build a global results dataframe (or any structure) each process will create its own copy
        # of it and append to it; therefore, at the end we would have an empty dataframe. Instead, each process writes a file of the lower embedding,
        # and the name of the file encodes the parameters used to generate it. There is a way to get around this using multiprocessing queues; however, I do not know that lib well enough
        
        fileparams = fileparams.replace('.csv','')
        fileparams = fileparams.strip('')
        fileparams = fileparams.split(',') # this will leave us with a list of 3 strings, which represent the 3 UMAP arguments used
        
        if os.path.isfile('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf'%(filename, fraction, fileparams[0],fileparams[1],fileparams[2])) == True: # checking to see if the graph already exists, if so, we return nothing
            return None
        
        fig = plt.figure() # this is just creating the embedding data from the file, and saving the graph
        plt.title('Embedding-%s-%s-%s'%(fileparams[0],fileparams[1],fileparams[2]))
        embedding = genfromtxt(file, delimiter=',', skip_header = 1, usecols = (1,2))
        plt.scatter(embedding[:,0], embedding[:,1])
        fig.savefig('%s_embedding_%s/figures/Embedding-%s-%s-%s.pdf'%(filename, fraction, fileparams[0],fileparams[1],fileparams[2])) 
        plt.close()
        
        return None


if __name__ == '__main__': # this is the main execution 
    
    fraction = 0.01 # percentage of original dataset used for UMAP and HDBSCAN algorithms, must be between (0 and 1.0]
    
    path = '/home/USER/anaconda3/envs/ML/scripts/ML_IDS_Dashboard/app1/static/input_files/' # CHECK THIS PATH, THIS NEEDS TO BE THE PATH TO THE DATASET
    
    filename = 'Data.csv' # CHECK THIS FILENAME THIS NEED TO MATCH THE DATASET CSV FILENAME, ALSO TAKE NOTE NO LABELS ARE USED
    
    if os.path.exists('%s_embedding_%s'%(filename, fraction)) == False: # checking and creating folders
        os.makedirs('%s_embedding_%s'%(filename, fraction))
        
    if os.path.exists('%s_embedding_%s/log_files'%(filename, fraction)) == False:
        os.makedirs('%s_embedding_%s/log_files'%(filename, fraction))
        
    if os.path.exists('%s_embedding_%s/figures'%(filename, fraction)) == False:
        os.makedirs('%s_embedding_%s/figures'%(filename, fraction))

    input_data = data_grab(path,filename,fraction) # calling the datagrab function
    
    procs_UMAP = []
    n_neighbors = [15,20,40] # these parameters will be looped through
    min_dist = [0.01,0.05] # these parameters will be looped through
    metric_iso = ['correlation','euclidean','manhattan'] # these parameters will be looped through, feel free to add more if you want to check more

    procs_savefig = []


    for index, a in enumerate(n_neighbors): # these loops will create 'processes', i.e. instance calls of the function (referred to as targets)
        for index, b in enumerate(min_dist):
            for index, c in enumerate(metric_iso):
                if os.path.isfile('%s_embedding_%s/embedding,%s,%s,%s.csv' %(filename,fraction,a,b,c)) == True:
                    continue
                proc = Process(target=UMAP, args=(a, b, c, input_data))
                procs_UMAP.append(proc) # appending all of our processes to the list 
                proc.start() # start the process
                
            for proc in procs_UMAP:
                proc.join() # join all the processes currently in the list, this is looped over as it will connect new processes just appended to procs_UMAP to ones already started

    filenames = glob.glob('%s_embedding_%s/*.csv'%(filename, fraction)) # looping through the files in a directory 
    #data = np.random.rand(1260,4)
    
    for d in filenames: # tried to use multiprocessing on savefig function; however, the matplotlib package has some sort of caching that prevents saving multiple figures in a multiprocess 
        #strfile = str(d)
        #proc = Process(target=savefig, args=(strfile,))
        #procs_savefig.append(proc)
        #proc.start()
        savefig(d)        
        #for proc in procs_savefig:
            #proc.join()

I have been playing around with testing various combinations of UMAP and multiprocessing in Python; feel free to use this script if you want.

lmcinnes commented on April 27, 2024

So the good news is that Nvidia will have a GPU-accelerated implementation of UMAP in their RAPIDS platform soon. I believe most of the theoretical hurdles to a multi-core CPU implementation have been resolved, and I hope the 0.5 release of UMAP will include multi-core/threaded support. So while we might not quite be there yet, we are certainly getting closer.

lmcinnes commented on April 27, 2024

As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.

stsievert commented on April 27, 2024

a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization

One nice thing about deep learning libraries is that they're declarative, and will do automatic differentiation. This only requires computing the loss for the input variables, then calling optimizer.step(). Surprisingly, PyTorch is faster than NumPy when hand-coding the gradients. I think this is because of the graph it defines, so it can be parallelized and the IO is optimized. I walk through benchmarks/more motivation in a blog post.
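
A toy illustration of that pattern (the quadratic loss here is just a stand-in for a real embedding objective):

import torch

embedding = torch.randn(1000, 2, requires_grad=True)   # low-dimensional coordinates to optimize
optimizer = torch.optim.SGD([embedding], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = (embedding ** 2).sum()   # placeholder loss; autograd supplies the gradients
    loss.backward()                 # automatic differentiation
    optimizer.step()                # gradient update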

Out-of-core computation (#62) + deep learning libraries would be easiest with a Dask parameter server (dask/dask-ml#171).

lmcinnes commented on April 27, 2024

@candalfigomoro The NN search is not currently pluggable at the top API level, but it is largely factored out in the code. For now that means you could likely monkey-patch things and define a new umap.umap_.nearest_neighbors function that does what you want. Down the road I think it is most likely that it will adapt to make use of the KNeighborsTransformer API/approach in scikit-learn. At that point you would need to wrap whatever ANN solution you have in a suitable transformer. All of that is likely to happen in v0.5 when PyNNDescent becomes a required dependency and will be the default ANN search.
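
A hedged sketch of that monkey-patching idea (the real nearest_neighbors signature and return value vary between UMAP versions, so this just wraps the original and passes everything through):

import umap.umap_ as umap_internals

_original_nn = umap_internals.nearest_neighbors

def patched_nearest_neighbors(X, n_neighbors, *args, **kwargs):
    # swap in your own ANN search here; it must return the same
    # (knn_indices, knn_dists, ...) tuple the original function returns
    return _original_nn(X, n_neighbors, *args, **kwargs)

umap_internals.nearest_neighbors = patched_nearest_neighbors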

TomAugspurger commented on April 27, 2024

candalfigomoro commented on April 27, 2024

@TomAugspurger "Unfortunately", I'm in a Spark (with pyspark) cluster/environment... no Dask for me. My inputs are Spark DataFrames (or RDDs). I can use UDFs to apply transforms, but I don't think there's a straightforward way to support the "fit()" part on out-of-memory datasets.

candalfigomoro commented on April 27, 2024

@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you do, the fit of UMAP is done locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already has the trained UMAP model. So the latter gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks

I didn't do it, but I think I'd do something similar to that (maybe I'd use a Pandas UDF instead of rdd.map()).
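
For what it's worth, a rough sketch of the "fit locally, transform on the executors" idea, using mapPartitions to transform whole partitions at once rather than single rows (feature_rdd, sample_np, and the parameter values are placeholders):

from functools import partial
import numpy as np
import umap

trained = umap.UMAP(n_neighbors=15).fit(sample_np)   # local fit on a sample, on the driver

def embed_partition(rows, model):
    batch = np.array(list(rows), dtype=np.float64)   # collect one partition into a dense array
    if batch.size == 0:
        return iter([])
    return iter(model.transform(batch).tolist())     # parallel transform on the executors

embedded_rdd = feature_rdd.mapPartitions(partial(embed_partition, model=trained))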

For training on larger-than-memory datasets, maybe we could try something like this:

  1. Use pyspark's LSH-based Approximate Nearest Neighbor Search for NN search, then
  2. Use ParametricUMAP with mini-batches of data

But it's not trivial. @lmcinnes Do you think it would make sense?

lmcinnes commented on April 27, 2024

I have used this implementation for a (sparse) dataset of size 100,000,000 x 5,500,000 on a very large memory SMP using 64 cores. Obviously a dense dataset of that size would not be possible due to memory constraints, so you would need a distributed implementation to manage that. In terms of the CuML implementation: it has sparse support coming, so at that point it may scale to very large feature sizes.

100518832 commented on April 27, 2024

@snakers4 I don't think there is direct GPU support, but it is listed on the roadmap with no priority assigned. The UMAP source code does make use of numba, which can be used to target the GPU, but you would have to make those changes yourself.

The current dataset I use is 120k * 15, but I am running a search algorithm to find the best possible combination of HDBSCAN and UMAP parameters using multiprocessing.

If you are using Linux, I would open the command line and type top, which will show you the top running processes. This will let you know what Python is doing; I am willing to bet that Python will have one process running the CPU at 100% (or close to it) when you use a dataset that large.

I hope this helps,

Ralph

snakers4 commented on April 27, 2024

Hi, first of all many thanks for your replies)

For this dataset PCA (5000 => 2) + eyeballing the clusters worked fine for some reason.
Data science is magic sometimes. I guess clusters are linearly separable. But this is a good benchmark nevertheless (see my comments below).

Also, it may be worth mentioning that a colleague of mine reported running 2M * 10 datasets without problems some time ago, just using standard params without any tweaking.

First, for your dataset (500,000) you will need code that is currently in a branch not merged with master

I guess I will wait for the code to be merged.

Try using init='random'
try to keep n_neighbors small
trying on a sub-sample of your dataset

Tried this

import umap

reducer = umap.UMAP(
    n_neighbors=5,
    n_components=2,
    metric='euclidean',
    init='random'
    )

umap_vectors = reducer.fit_transform(pca_vectors[0:50000])

Well, your advice helped:

  • on 50k * 15 sample (500k * 15 => 50k * 15) - 59.0s
  • and on 100k * 15 sample - 1m 55.3s
  • and on 500k * 15 sample - 15m 19s

As for the evaluation:

  1. 500k*5000 => PCA 500k * 50 => HDBSCAN performed poorly producing only 2 bad clusters
  2. PCA produced this
    https://pics.spark-in.me/upload/b27a448c2f19e8c450dce2c20c57454b.jpg

Just visually eyeballing the clusters worked - they are really similar.

  3. UMAP produced this
    50k https://pics.spark-in.me/upload/247670b5825c833f3c6964809468758f.png
    100k https://pics.spark-in.me/upload/cb77948b8c17ff262d7f60cd1b124f21.png
    500k https://pics.spark-in.me/upload/0780efa4b681b4290d13017a387ee7b1.png - I guess it broke here

I will report the visual composition of the above clusters a bit later.

Finally, UMAP ultimately relies on SGD for a portion of the computation. Of course, as these things go it is a bit of a custom SGD, so what is there now is essentially one I wrote myself from scratch. Presumably deep learning libraries that make use of GPUs can be suitably coerced into doing the right kind of optimization, but that is beyond my current knowledge (as it requires careful phrasing of the problem I suspect). If you would like to look into that we can discuss the possibilities further.

Well, if you could share some minimal reading (i.e. one paper this portion of the SGD pipeline is based on) and point to the part of the code with the SGD - if I am up to the task I will think about applying my skills to it.

In practice (work / competitions) I noticed that Adam is better than plain SGD in most cases. Modern high-level frameworks (PyTorch) have this implemented + they have GPU support, which may be a nice feature.

snakers4 commented on April 27, 2024

Visually evaluated UMAP embeddings.
They are really great. They seem to be like PCA embeddings, but further refined into a couple of sub-categories.
But as you noticed, 100k+ fails.

lmcinnes commented on April 27, 2024

This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged). With regard to the SGD part of the code you would want the optimize_layout function. The most relevant paper would probably be the LargeVis paper, which this bears some similarity to.

snakers4 commented on April 27, 2024

The code looks readable, except for the numba decorator, which is black magic for me.

I guess I will just end up comparing your latest release vs. just loading the data to GPU and optimizing using the same loss function.

snakers4 commented on April 27, 2024

This has been resolved (or should be resolved) as of v0.2.0 (i.e. the branch got merged)

Downloaded the latest release via

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master
sudo pip install -r requirements.txt
python setup.py install

Used these settings

import umap

reducer = umap.UMAP(
    n_neighbors=5,
    n_components=2,
    metric='euclidean',
    init='random'
    )

umap_vectors = reducer.fit_transform(pca_vectors_processed)

Ran these benchmarks

I guess there is some stochasticity in this process
I will try just looking at the blobs a bit later

Great job guys!
Looks like even this version of UMAP is really useful on top of PCA.

lmcinnes commented on April 27, 2024

Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.

snakers4 commented on April 27, 2024

As a side note I would highly recommend looking into datashader as a means to visualise the results rather than relying on KDEs.

Yup, datashader turned out to be the most inspiring and underrated visualization tool I ever saw.
In the end I ended up with this pipeline: 5000-dimension CNN vectors => 15-dimension PCA => 2-dimension UMAP vectors => HDBSCAN for naive clustering.
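
A condensed sketch of that pipeline (cnn_vectors and the min_cluster_size value are placeholders):

from sklearn.decomposition import PCA
import umap
import hdbscan

pca_vectors = PCA(n_components=15).fit_transform(cnn_vectors)          # 5000 -> 15 dims
umap_vectors = umap.UMAP(n_components=2).fit_transform(pca_vectors)    # 15 -> 2 dims
cluster_labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(umap_vectors)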

The results are amazing:

  • UMAP embedding
  • HDBSCAN + UMAP embedding
  • top1 and top2 most prominent clusters (I am feeding the clustered data into GAN, so having ~5-7 distinct clusters + noise is exactly what I was hoping for)
  • Also funnily enough there is a cluster of toilets and baths

LargeVis paper which this bears some similarity to.
Note that those are the gradients of the loss encoded by hand in that SGD. James Melville has a nice piece on UMAP loss here.

Read the paper and the note, looks very interesting.
Looks like I made a wrong assumption - since the task is unsupervised learning, autograd will not help, as we have no annotated data.

I guess, if so, then porting to GPU may give only a negligible performance boost.
I hope I will try porting your code into PyTorch directly when I find time.

lmcinnes commented on April 27, 2024

Of note is the fact that you can embed into higher dimensional spaces than just 2 -- if your goal is clustering rather than visualization then using a higher dimensional embedding can give the process more degrees of freedom to more accurately represent the data. So, for example, embedding to 4 or 8 dimensions may allow for better clustering. It is, at least, worth experimenting with.
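
For example, something like this (the dimension choice and input array are arbitrary placeholders):

# embed to 8 dimensions for clustering, rather than 2 for plotting
cluster_embedding = umap.UMAP(n_components=8).fit_transform(pca_vectors)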

snakers4 commented on April 27, 2024

higher dimensional spaces than just 2

Of course, I just presented 2-dimensional results for ease of interpretation.

LemonPi commented on April 27, 2024

@snakers4 Any luck porting to pytorch? Would be interested in this.

stsievert commented on April 27, 2024

So the good news is that Nvidia will have a GPU accelerated implementation of UMAP in their Rapids platform soon

Cool! Are you referencing the implementation at cuML/src/umap@993eb6? It looks like it got merged through rapidsai/cuml#261.

will include multi-core/threaded support

This is through lmcinnes/pynndescent#12, correct?

lmcinnes commented on April 27, 2024

khaled-rahman commented on April 27, 2024

Hi,

Do you have any plan for a Barnes-Hut approximation using a quad/oct-tree for 2D/3D rather than the RP-tree? It would be great to know. Thank you for the awesome work!

lmcinnes commented on April 27, 2024

@khaled-rahman: The RP-Trees are in support of NN-Descent for (approximate) nearest neighbor computation, for which Barnes-Hut is not applicable. Where t-SNE uses Barnes-Hut is in the optimization step; in contrast, UMAP uses a negative-sampling-based approximation there, giving O(N) work instead of the O(N log N) of a Barnes-Hut algorithm.
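
A toy sketch of the negative-sampling idea, not the actual UMAP gradients (the update rules, constants, and edge list are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_points, n_neg, lr = 1000, 5, 0.1
emb = rng.normal(size=(n_points, 2))
edges = rng.integers(0, n_points, size=(5000, 2))    # placeholder edge list

for i, j in edges:
    emb[i] -= lr * (emb[i] - emb[j])                 # one attractive update per edge
    for k in rng.integers(0, n_points, size=n_neg):  # a few random negative samples
        emb[i] += lr * 0.01 * (emb[i] - emb[k])      # cheap repulsive updates, no tree needed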

khaled-rahman commented on April 27, 2024

@lmcinnes : Thank you for the quick response. I am studying your tool. What is the running time of the spectral embedding in your code? I see it depends on sklearn.manifold, which says the following: https://scikit-learn.org/stable/modules/manifold.html

Though it is done only once, doesn't it depend on the number of data points?
More specifically, if we exclude the OptimizeEmbedding step from Algorithm 1 (i.e. the UMAP algorithm), what is the asymptotic running time of Algorithm 1? Isn't it greater than O(n^2) now?

candalfigomoro commented on April 27, 2024

For Spark (PySpark) environments, maybe we could rely on the LSH-based Approximate Nearest Neighbor Search (https://spark.apache.org/docs/latest/ml-features#approximate-nearest-neighbor-search); however, it only supports Euclidean and Jaccard distances at the moment.
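
A rough sketch of that Spark API (df with a "features" vector column, the query vector, and the parameter values are assumptions for illustration):

from pyspark.ml.feature import BucketedRandomProjectionLSH

lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(df)

# approximate k nearest neighbours of a single query vector
neighbours = model.approxNearestNeighbors(df, query_vector, numNearestNeighbors=15)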

@lmcinnes
Is there an easy way to swap the NN search algorithm (so we can easily plug in a different algorithm such as the one available in Spark)?

candalfigomoro commented on April 27, 2024

@lmcinnes Thank you for your reply. Another issue that I see is that, even if we implement a distributed NN search, UMAP currently expects a numpy array as input, so "fitting" out-of-memory datasets (e.g. HDFS distributed datasets with billions of rows) could be a problem. Maybe a workaround (although suboptimal) could be to "fit" UMAP on a random subsample of data and then apply a transform() on the remaining data in a massively parallel way. What do you think about this?

mik1904 commented on April 27, 2024

@candalfigomoro I am currently using a Spark cluster (pyspark) too. As you do, the fit of UMAP is done locally on the master node. For the transform I use rdd.map() and pass a partial function (functools) that already has the trained UMAP model. So the latter gets shipped to the various nodes in the cluster and the transform is parallelized. Out of curiosity, how did you approach this? Thanks

ydennisy commented on April 27, 2024

I am having trouble getting UMAP to saturate all available CPUs on a Google VM; in various stages it only uses 2-4 of the available 32 cores.

Is there anything one can do as of v0.4.x to increase CPU usage?

lmcinnes commented on April 27, 2024

Installing the latest version of pynndescent should help some, but even then it will depend, to a degree, on the dataset itself.

lmcinnes commented on April 27, 2024

I think that would make sense. I believe the UMAP implementation in CuML now has some level of support for multi-GPU / distributed computation, so that may also be an option worth investigating.

matanox commented on April 27, 2024

Sorry for joining the conversation like this, but has anyone here used any of the discussed implementations for multi-million-sized feature sets yet?
