Comments (17)
I have finally found and fixed the issue that was causing this -- it was a (subtle) code bug in the SGD optimization. Moving to a different approach to the SGD optimization phase made this evident and resolved the issue. The latest master branch (v0.2.0+) should give better embeddings, particularly of larger datasets. For the data sample you provided I got the following:
from umap.
That's an interesting and disconcerting phenomenon. It isn't immediately clear to me what would be causing this. I would speculate that the issue is "noise" -- points that are sufficiently far from everything that UMAP ends up trying to spread them all apart from one another, with the result that any points that are close end up getting packed into the point in the center to make them far from the scattered points around the outside. Assuming this speculation is correct I would expect the central dense cluster to have significant further substructure if you were to zoom in on only it and ignore the outlying points.
As to how to remedy this -- assuming my speculation is correct (it may not be) then increasing the n_neighbors value may help since it will ensure more points are connected into the overall manifold structure and reduce the effect of the outliers. The other alternative might be to increase min_dist to prevent UMAP from packing points quite so close, but that feels more like a hack than a solution. I would be interested, if you can share the data, to dig in a little and see if I can actually figure out what is causing this and whether it can be fixed easily, or whether it requires rethinking parts of the algorithm.
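A minimal sketch of trying both suggestions side by side. It assumes the umap-learn package and its `umap.UMAP(n_neighbors=..., min_dist=...)` constructor; the helper names and the grid values are hypothetical and only illustrative, not taken from the thread.

```python
from itertools import product


def parameter_grid(n_neighbors_values, min_dist_values):
    """Return one dict of UMAP keyword arguments per parameter combination."""
    return [
        {"n_neighbors": n, "min_dist": d}
        for n, d in product(n_neighbors_values, min_dist_values)
    ]


def embed_with_grid(data, grid, random_state=42):
    """Fit one 2-D embedding per parameter set; requires umap-learn."""
    import umap  # imported lazily so the grid helper works without it

    return {
        (params["n_neighbors"], params["min_dist"]): umap.UMAP(
            random_state=random_state, **params
        ).fit_transform(data)
        for params in grid
    }


# Illustrative values only; the thread does not prescribe this grid.
grid = parameter_grid(
    n_neighbors_values=(15, 30, 60),
    min_dist_values=(0.1, 0.25, 0.5),
)
```

Calling `embed_with_grid(data, grid)` on a subsample first keeps the sweep cheap; each resulting embedding can then be rendered and compared by eye.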
from umap.
I ran it again with increased n_neighbors and then again with increased min_dist.
Here it is with n_neighbors=5 and again with min_dist=0.01:
The central dense cluster does appear to have more substructure than is visible from a distance:
And, interestingly enough, running UMAP again on the sparse cloud surrounding the dense cluster might reveal some other structure?
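One way to look at the two regions separately is to split the embedding by distance from its centre. This is only a sketch: `split_core_and_cloud` and its `core_fraction` knob are hypothetical names, and ranking points by distance from the coordinate-wise median is an assumed heuristic, not something UMAP provides.

```python
import numpy as np


def split_core_and_cloud(embedding, core_fraction=0.5):
    """Split a 2-D embedding into the points nearest its median and the rest.

    core_fraction is a hypothetical knob: the fraction of points, ranked by
    distance from the coordinate-wise median, treated as the dense core.
    """
    embedding = np.asarray(embedding, dtype=float)
    center = np.median(embedding, axis=0)
    radius = np.linalg.norm(embedding - center, axis=1)
    cutoff = np.quantile(radius, core_fraction)
    core_mask = radius <= cutoff
    return core_mask, ~core_mask
```

The returned boolean masks index both the embedding (to re-render the core at full resolution) and the original data rows (to re-run UMAP on the outer cloud alone).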
Here's a subset of 250Kx128 points https://drive.google.com/open?id=18tEzVM7nQ3KZhJNH6HuvEHL9rrDMmGAC (122MB). This should be enough to show the effect.
from umap.
I would guess you might need quite a large n_neighbors value, potentially around 30 or more, to manage to connect up, and hence significantly reduce, that outer cloud of points. Thanks for the data sample. I'll see if I can play with it a bit and work out what is actually happening here.
from umap.
Sorry, I just realized I went in exactly the wrong direction with these values :) I'll re-run it.
edit: min_dist=0.5 helped a little, n_neighbors=30 did not.
Also, fwiw, here's the code I'm using to render things quickly:
import numpy as np

def draw_embedding(embedding, size=(1024, 1024), face_color=255, stroke_color=0):
    # Blank canvas filled with the background colour.
    canvas = np.full(size, face_color, dtype=np.uint8)
    # Rescale the embedding so it spans the full canvas.
    emin = embedding.min(axis=0)
    erange = embedding.max(axis=0) - emin
    scale = np.subtract(canvas.shape[:2], 1) / erange
    indices = ((embedding - emin) * scale).astype(np.int32)
    # Mark every pixel that contains at least one point.
    canvas[indices[:, 0], indices[:, 1]] = stroke_color
    return canvas
from umap.
So I'm playing a little and the obvious potential issue (the simplicial set skeleton having lots of tiny connected components) is not the case. Something very odd is going on. Increasing n_neighbors helps, but not as much as I would like (although I'm now trying very large values like 128 out of curiosity). There is something going on here, and the structure of the "noise" cloud that you found implies there are some possibilities. I feel like UMAP isn't managing to adjust for a couple of very different scales of density and so it isn't managing to render things quite right, but I'm still looking as to what's actually going on internally to cause results like this. It's certainly an intriguing dataset.
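A rough version of the connected-components check can be run on a subsample with NumPy alone. This sketch uses the plain undirected k-nearest-neighbour graph as a stand-in for the simplicial set skeleton, which is an assumption; it is not UMAP's internal graph construction, and the brute-force distance matrix only suits a few thousand points.

```python
import numpy as np


def knn_indices(data, k):
    """Exact k nearest neighbours (excluding self) by brute force."""
    data = np.asarray(data, dtype=float)
    sq = np.sum(data ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(dist2, np.inf)
    return np.argsort(dist2, axis=1)[:, :k]


def component_sizes(neighbors):
    """Sizes of the connected components of the undirected kNN graph."""
    n = neighbors.shape[0]
    parent = np.arange(n)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union each point with every one of its neighbours.
    for i in range(n):
        for j in neighbors[i]:
            root_i, root_j = find(i), find(int(j))
            if root_i != root_j:
                parent[root_i] = root_j

    roots = np.array([find(i) for i in range(n)])
    _, counts = np.unique(roots, return_counts=True)
    return np.sort(counts)[::-1]
```

Many small entries in `component_sizes(knn_indices(sample, k))` would point at a fragmented graph; a single large component is consistent with the observation above that fragmentation is not the cause.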
from umap.
Thanks for the update. It seems like min_dist is the key here: in trying to get all the distances "right" UMAP is compacting the dense region into a very tiny spot in the middle, and right now the only way to prevent that is to set min_dist large enough to not let it compact points too tightly. This isn't really a satisfactory answer however.
After more exploration I am more convinced that this is actually a structure of the data itself (a scattering of points that are all relatively different from one another and then a more interesting manifold that is essentially equidistant from all the "noise") rather than a "bug", but I do agree that this is not a helpful presentation. What would you like to see in this circumstance however? I think the results with the larger min_dist certainly seem better, but I would prefer to have a more principled way to derive this as the right approach from a data driven perspective rather than having to guess. I think I'll have to ponder this a little longer to come up with a good answer rather than merely an expedient one. In the meantime hopefully increasing min_dist further will help for now. Sorry I don't have better answers.
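The structure described here can be reproduced on synthetic data. The sketch below is an illustration under assumed settings (a flat 2-D sheet, Gaussian noise, 128 ambient dimensions to match the shared sample's column count); it is not the poster's data or analysis.

```python
import numpy as np


def nearest_neighbor_distance(data):
    """Distance from each row to its closest other row (brute force)."""
    sq = np.sum(data ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(dist2, np.inf)
    return np.sqrt(np.maximum(dist2.min(axis=1), 0.0))


rng = np.random.default_rng(0)
ambient_dim = 128

# A 2-D sheet embedded linearly in the ambient space (the "manifold").
sheet = rng.uniform(size=(500, 2)) @ rng.normal(size=(2, ambient_dim))
# Noise drawn in the full ambient space rather than near the sheet.
noise = rng.normal(size=(100, ambient_dim))

nn_dist = nearest_neighbor_distance(np.vstack([sheet, noise]))
sheet_nn, noise_nn = nn_dist[:500], nn_dist[500:]
print("median NN distance, sheet:", np.median(sheet_nn))
print("median NN distance, noise:", np.median(noise_nn))
```

In this setup every noise point sits far from everything, including the other noise points, while the sheet points sit close to one another. That is the "scattering plus manifold" configuration that would push an embedding to pack the manifold tightly.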
from umap.
An alternative possibility occurred to me: it could be the approximate nearest neighbors failing in enough cases, and that may be what the cloud around the outside is. That's a little harder to look into, but I'll see if I can at least find out if that's true this evening. If that is the case then it is certainly fixable as it is a bug, although exactly how to fix it will be an interesting question.
Edit: I thought about this some more and it seems like a likely candidate, as I am pretty sure it would produce the behaviour we are seeing here. As for a fix I have some initial heuristics that should work and hopefully I can refine them into something sensible that would do the job well. Definitely some work required though.
from umap.
I can confirm that the approx nearest neighbors is not working as well as would be desirable, and importantly the distribution of precision is quite wide, which leads me to believe that this is indeed the source of the issue. I still have to figure out the "right" way to fix this.
Edit: Making progress on this -- I think I can have an "interim" solution soon, and hopefully a more robust solution not too long after that. Sorry for the lack of visible progress, but I am now convinced that this is an implementation-related bug rather than anything fundamental to the algorithm, and so it's just a matter of figuring out how best to dig myself out of that particular implementation issue.
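A sketch of how the spread in neighbour quality could be measured on a subsample. It assumes the approximate neighbour indices are already available as an `(n_points, k)` integer array, however they were obtained; the function names are hypothetical and the exact search is brute force, so it only suits a few thousand points.

```python
import numpy as np


def exact_knn(data, k):
    """Exact k nearest neighbours (excluding self) by brute force."""
    data = np.asarray(data, dtype=float)
    sq = np.sum(data ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(dist2, np.inf)
    return np.argsort(dist2, axis=1)[:, :k]


def per_point_recall(approx_indices, exact_indices):
    """Fraction of each point's true neighbours found by the approximate search."""
    k = exact_indices.shape[1]
    hits = [
        len(set(a.tolist()) & set(e.tolist()))
        for a, e in zip(approx_indices, exact_indices)
    ]
    return np.array(hits, dtype=float) / k
```

Looking at `np.percentile(recall, [5, 50, 95])` rather than only the mean shows whether a minority of points have badly wrong neighbourhoods, which is the "wide distribution" concern raised above.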
from umap.
I'm seeing a possible precision-related issue in some of my tests, but it goes away when I change the random_state seed. I'll work on getting some examples...
from umap.
So the good news is I made some progress figuring out how to improve the nearest neighbor issues. The current approach would cause some performance regressions, so I just need to tweak things a little more to work well in cases like this but not lose (too much) performance in general.
The bad news is that it didn't actually "fix" the problem, which tips me back toward it possibly being something structural in the data. I will have to play more to see if I can find a better way to give a nicer presentation.
from umap.
Alright, I have an appropriate solution that should work with the current code! The nearest neighbor approximation does need to be fixed, but that is not so much the problem here, because this is "structurally true" of the data. What we actually want is to have the effective repulsive forces between data points to be dampened (since that is what is actually causing the packing). Fortunately there is already a parameter for this: gamma. If you raise min_dist and lower gamma you can get something much better. Here is what I got with min_dist=0.25 and gamma=0.01 (admittedly with better approx nearest neighbor code still turned on):
As my colleague pointed out, if you have a manifold and noise that is noise in the full dimensional ambient space (as opposed to noise off the manifold) then this is exactly what you expect to happen, and the only way to reasonably combat that is to reduce how hard we force the noise points away from everything else.
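A hedged sketch of applying these settings with umap-learn. The thread names the parameter `gamma`; later releases appear to expose the same weight as `repulsion_strength`, which is an assumption here rather than something stated in the thread, so the helper checks which keyword the installed `umap.UMAP` accepts. Both function names are hypothetical.

```python
import inspect


def repulsion_kwargs(parameter_names, value):
    """Pick whichever repulsion keyword the installed UMAP class accepts."""
    for name in ("gamma", "repulsion_strength"):
        if name in parameter_names:
            return {name: value}
    raise ValueError("no known repulsion parameter found")


def embed_with_damped_repulsion(data, min_dist=0.25, repulsion=0.01):
    """Fit UMAP with the settings quoted above; requires umap-learn."""
    import umap  # imported lazily; only needed when actually embedding

    names = inspect.signature(umap.UMAP.__init__).parameters
    reducer = umap.UMAP(min_dist=min_dist, **repulsion_kwargs(names, repulsion))
    return reducer.fit_transform(data)
```

The defaults mirror the values quoted in the thread (min_dist=0.25, gamma=0.01); other datasets may need different values.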
from umap.
Here is the same (min_dist=0.25 and gamma=0.01) with the original approximate nearest neighbors:
I feel like this is (hopefully) the solution you were probably looking for. Clearly some more documentation on parameters and what to tweak under different circumstances is needed. Let me know if this is sufficient in terms of what you were looking for, or if you had a different sort of result in mind.
from umap.
with more data, i was hoping for more resolution and data points in these smaller clusters.
and it happens up to a point, but once there are enough points these clusters turn into these "spiking" structures that shoot out. my ideal embedding would avoid those star-like spikes. but i need to look at the actual data closer and see if those small clusters are getting turned into spikes because they exist on a 1d manifold, or if it's just a "bug" and they really should be represented as a small cluster.
going to close this though, since it solves my original issue of everything collapsing to a point.
thanks so much for all your help and involvement in developing this tool :)
from umap.
For reference I have reproduced similar issues on another dataset, again at around the same amount of data. That seems a little suspicious to me, so I will continue digging. Sorry that I still don't have any good answers, but it is hard to understand exactly what is happening, let alone what the correct fix is.
from umap.
wow this is great news! this embedding looks incredible! way more like what i would have expected!
edit: i double-checked for myself and confirm i get the same output. to be clear, this is with all default parameters, no gamma or min_dist customization 😱
from umap.
Why not create a repository of datasets and pre-configured parameters? This would be easier for benchmarking.
from umap.