
lmcinnes commented on April 28, 2024

I have finally found and fixed the issue that was causing this -- it was a (subtle) code bug in the SGD optimization. Moving to a different approach to the SGD optimization phase made this evident and resolved the issue. The latest master branch (v0.2.0+) should give better embeddings, particularly of larger datasets. For the data sample you provided I got the following:

image

from umap.

lmcinnes commented on April 28, 2024

That's an interesting and disconcerting phenomenon. It isn't immediately clear to me what would be causing this. I would speculate that the issue is "noise" -- points that are sufficiently far from everything that UMAP ends up trying to spread them all apart from one another, with the result that any points that are close end up getting packed into the point in the center to make them far from the scattered points around the outside. Assuming this speculation is correct I would expect the central dense cluster to have significant further substructure if you were to zoom in on only it and ignore the outlying points.

As to how to remedy this -- assuming my speculation is correct (it may not be) then increasing the n_neighbors value may help since it will ensure more points are connected into the overall manifold structure and reduce the effect of the outliers. The other alternative might be to increase min_dist to prevent UMAP from packing points quite so close, but that feels more like a hack than a solution. I would be interested, if you can share the data, to dig in a little and see if I can actually figure out what is causing this and whether it can be fixed easily, or whether it requires rethinking parts of the algorithm.

kylemcdonald commented on April 28, 2024

I ran it again with increased n_neighbors and then again with increased min_dist.

Here it is with n_neighbors=5 and again with min_dist=0.01:

download-10

download-5

The central dense cluster does appear to have more substructure than is visible from a distance:

download-4

And, interestingly enough, running UMAP again on the sparse cloud surrounding the dense cluster might reveal some other structure?

download-3
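For anyone who wants to repeat the "zoom in on the core" and "re-run on the outer cloud" experiments, here is a minimal sketch of one way to separate the two groups in an existing 2D embedding. The function name and the `core_fraction` knob are my own assumptions, not part of UMAP; it just treats the points nearest the median position as the dense core.

```python
import numpy as np

def split_core_and_cloud(embedding, core_fraction=0.5):
    """Return boolean masks (core, cloud) for a 2D embedding.

    core_fraction is a hypothetical knob: the fraction of points, closest to
    the per-axis median position, that are treated as the dense core.
    """
    # The median is used as the centre because it is robust to scattered outliers
    center = np.median(embedding, axis=0)
    radius = np.linalg.norm(embedding - center, axis=1)
    # Points inside the chosen radial quantile form the core
    cutoff = np.quantile(radius, core_fraction)
    core_mask = radius <= cutoff
    return core_mask, ~core_mask
```

Rendering `embedding[core_mask]` on its own rescales the plot to the core, and the original rows selected by `cloud_mask` can be passed through UMAP again. The right `core_fraction` depends on the data and probably needs to be found by eye.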

Here's a subset of 250Kx128 points https://drive.google.com/open?id=18tEzVM7nQ3KZhJNH6HuvEHL9rrDMmGAC (122MB). This should be enough to show the effect.

lmcinnes commented on April 28, 2024

I would guess you might need quite a large n_neighbors value, potentially around 30 or more, to manage to connect up, and hence significantly reduce, that outer cloud of points. Thanks for the data sample. I'll see if I can play with it a bit and work out what is actually happening here.

kylemcdonald commented on April 28, 2024

Sorry, I just realized I went in exactly the wrong direction with these values :) I'll re-run it.

edit: min_dist=0.5 helped a little, n_neighbors=30 did not.

Also, fwiw, here's the code I'm using to render things quickly:

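import numpy as np

def draw_embedding(embedding, size=(1024,1024), face_color=255, stroke_color=0):
    # Blank single-channel canvas filled with the background colour
    canvas = np.full(size, face_color, dtype=np.uint8)
    # Bounding box of the embedding along each axis
    emax = embedding.max(axis=0)
    emin = embedding.min(axis=0)
    erange = emax - emin
    # Scale the bounding box onto pixel indices 0 .. size - 1
    scale = np.subtract(canvas.shape[:2], 1) / erange
    indices = ((embedding - emin) * scale).astype(np.int32)
    # Mark every pixel that contains at least one point
    canvas[indices[:,0], indices[:,1]] = stroke_color
    return canvas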

download-11
download-12

lmcinnes commented on April 28, 2024

So I'm playing a little and the obvious potential issue (the simplicial set skeleton having lots of tiny connected components) is not the case. Something very odd is going on. Increasing n_neighbors helps, but not as much as I would like (although I'm now trying very large values like 128 out of curiosity). There is something going on here, and the structure of the "noise" cloud that you found implies there are some possibilities. I feel like UMAP isn't managing to adjust for a couple of very different scales of density and so it isn't managing to render things quite right, but I'm still looking into what's actually going on internally to cause results like this. It's certainly an intriguing dataset.

lmcinnes commented on April 28, 2024

Thanks for the update. It seems like min_dist is the key here: in trying to get all the distances "right" UMAP is compacting the dense region into a very tiny spot in the middle, and right now the only way to prevent that is to set min_dist large enough to not let it compact points too tightly. This isn't really a satisfactory answer however.

After more exploration I am more convinced that this is actually a structure of the data itself (a scattering of points that are all relatively different from one another and then a more interesting manifold that is essentially equidistant from all the "noise") rather than a "bug", but I do agree that this is not a helpful presentation. What would you like to see in this circumstance, however? I think the results with the larger min_dist certainly seem better, but I would prefer to have a more principled way to derive this as the right approach from a data-driven perspective rather than having to guess. I think I'll have to ponder this a little longer to come up with a good answer rather than merely an expedient one. In the meantime hopefully increasing min_dist further will help for now. Sorry I don't have better answers.

lmcinnes commented on April 28, 2024

An alternative possibility occurred to me: it could be the approximate nearest neighbors failing in enough cases, and that may be what the cloud around the outside is. That's a little harder to look into, but I'll see if I can at least find out if that's true this evening. If that is the case then it is certainly fixable as it is a bug, although exactly how to fix it will be an interesting question.

Edit: I thought about this some more and it seems like a likely candidate, as I am pretty sure it would produce the behaviour we are seeing here. As for a fix I have some initial heuristics that should work and hopefully I can refine them into something sensible that would do the job well. Definitely some work required though.

lmcinnes commented on April 28, 2024

I can confirm that the approx nearest neighbors is not working as well as would be desirable, and importantly the distribution of precision is quite wide, which leads me to believe that this is indeed the source of the issue. I still have to figure out the "right" way to fix this.

Edit: Making progress on this -- I think I can have an "interim" solution soon, and hopefully a more robust solution not too long after that. Sorry for the lack of visible progress, but I am now convinced that this is an implementation-related bug rather than anything fundamental to the algorithm, and so it's just a matter of figuring out how best to dig myself out of that particular implementation issue.
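For reference, a check like the precision measurement mentioned above can be sketched roughly as follows. This is not UMAP's internal code. It assumes you already have an `(n, k)` array of approximate neighbor indices from whatever index you are using, and it compares them to brute-force Euclidean neighbors on a small sample. Both function names are hypothetical.

```python
import numpy as np

def exact_knn(data, k):
    """Brute-force k nearest neighbours by Euclidean distance, excluding self.

    Builds a full n x n distance matrix, so only use it on a small sample.
    """
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    # A point is never counted as its own neighbour
    np.fill_diagonal(sq_dists, np.inf)
    return np.argsort(sq_dists, axis=1)[:, :k]

def per_point_precision(approx_indices, exact_indices):
    """Fraction of each point's true k neighbours found by the approximation."""
    k = exact_indices.shape[1]
    hits = [len(set(a) & set(e)) for a, e in zip(approx_indices, exact_indices)]
    return np.array(hits) / k
```

Looking at something like `np.percentile(precision, [5, 50, 95])` rather than only the mean shows how wide the distribution is. A good average can still hide a tail of points with badly wrong neighbors.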

jay-reynolds commented on April 28, 2024

I'm seeing a possible precision-related issue in some of my tests, but it goes away when I change the random_state seed. I'll work on getting some examples...

lmcinnes commented on April 28, 2024

So the good news is I made some progress figuring out how to improve the nearest neighbor issues. The current approach would cause some performance regressions, so I just need to tweak things a little more to work well in cases like this but not lose (too much) performance in general.

The bad news is that it didn't actually "fix" the problem, which tips me back toward it possibly being something structural in the data. I will have to play more to see if I can find a better way to give a nicer presentation.

lmcinnes commented on April 28, 2024

Alright, I have an appropriate solution that should work with the current code! The nearest neighbor approximation does need to be fixed, but that is not so much the problem here, because this is "structurally true" of the data. What we actually want is for the effective repulsive forces between data points to be dampened (since that is what is actually causing the packing). Fortunately there is already a parameter for this: gamma. If you raise min_dist and lower gamma you can get something much better. Here is what I got with min_dist=0.25 and gamma=0.01 (admittedly with better approx nearest neighbor code still turned on):

image

As my colleague pointed out, if you have a manifold and noise that is noise in the full dimensional ambient space (as opposed to noise off the manifold) then this is exactly what you expect to happen, and the only way to reasonably combat that is to reduce how hard we force the noise points away from everything else.
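A minimal sketch of passing the settings above to the UMAP constructor. One assumption to flag: the keyword is gamma in the version discussed here, but as far as I know later umap-learn releases expose the same control as repulsion_strength, so the hypothetical helper below picks whichever name the installed class accepts. Check this against your own version.

```python
import inspect

def repulsion_kwargs(umap_cls, min_dist=0.25, repulsion=0.01):
    """Build constructor kwargs that raise min_dist and dampen repulsion.

    Hypothetical helper: it inspects the constructor and uses `gamma` when
    that keyword exists, otherwise `repulsion_strength`.
    """
    accepted = inspect.signature(umap_cls.__init__).parameters
    if "gamma" in accepted:
        name = "gamma"
    elif "repulsion_strength" in accepted:
        name = "repulsion_strength"
    else:
        raise ValueError("no repulsion parameter found on this class")
    return {"min_dist": min_dist, name: repulsion}

# Usage, assuming umap-learn is installed and `data` is an (n, d) array:
#   import umap
#   reducer = umap.UMAP(**repulsion_kwargs(umap.UMAP))
#   embedding = reducer.fit_transform(data)
```

The values 0.25 and 0.01 are the ones from this thread. They worked for this dataset and are not general defaults.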

lmcinnes commented on April 28, 2024

Here is the same (min_dist=0.25 and gamma=0.01) with the original approximate nearest neighbors:

image

I feel like this is (hopefully) the solution you were probably looking for. Clearly some more documentation on parameters and what to tweak under different circumstances is needed. Let me know if this is sufficient in terms of what you were looking for, or if you had a different sort of result in mind.

kylemcdonald commented on April 28, 2024

with more data, i was hoping for more resolution and data points in these smaller clusters.

33568540-9808de06-d8db-11e7-8cf1-cd33e20c798e

and it happens up to a point, but once there are enough points these clusters turn into these "spiking" structures that shoot out. my ideal embedding would avoid those star-like spikes. but i need to look at the actual data closer and see if those small clusters are getting turned into spikes because they exist on a 1d manifold, or if it's just a "bug" and they really should be represented as a small cluster.

going to close this though, since it solves my original issue of everything collapsing to a point.

thanks so much for all your help and involvement in developing this tool :)

lmcinnes commented on April 28, 2024

For reference I have reproduced similar issues on another dataset, again at around the same amount of data. That seems a little suspicious to me, so I will continue digging. Sorry that I still don't have any good answers, but it is hard to understand exactly what is happening, let alone what the correct fix is.

kylemcdonald commented on April 28, 2024

wow this is great news! this embedding looks incredible! way more like what i would have expected!

edit: i double-checked for myself and confirm i get the same output. to be clear, this is with all default parameters, no gamma or min_dist customization 😱

download

arita37 commented on April 28, 2024

Why not create a repository of dataset
and pre-configured parameters ?
This would be easier for benchmarking.
