
Comments (7)

lmcinnes avatar lmcinnes commented on April 28, 2024 5

Hi Graham,

Sorry for the very long delay in getting back to you on this. I got rather invested in building the new version of UMAP (which I was hoping would fix some of these issues) and then this fell off my radar for a while. The new UMAP, using numba, is now in place, and I think it does fix some of your issues, though not all. I believe some of the remaining apparent issues can be corrected by more careful plotting. The end result is that I don't believe we get what you want, but it looks less bad in doing so. In particular, the default UMAP on your data gives this:

image

This is, admittedly, somewhat underwhelming. If we turn down n_neighbors to 5 and set min_dist to 0.0 we get the following (which shows more structure, but certainly doesn't separate your classes):

image

On the other hand, if we plot the PCA result in the same way we get this:

image

I think in your original iteration the apparent separation was partly due to plotting artifacts, combined with the fact that the light blue class looks to have slightly larger variance (but ultimately they look like two overlaid Gaussian blobs).

Finally, the new version of UMAP does support cosine distance so we can, at least, compute with cosine distance which makes more sense for doc2vec vectors. That results in the following:

image

Still not much notable separation of classes, but then given the PCA result, and these results, I am not sure there is actually good separation in 2D. I know that's not an ideal answer, or even what you were looking for, but hopefully it helps somewhat.
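For anyone wanting to rerun the three UMAP settings described above, here is a minimal sketch. It assumes the `umap-learn` package and a NumPy array of doc2vec vectors passed in as `vectors` (a hypothetical name; the original data and notebook are not reproduced here). The run labels and the `embed` helper are illustrative only. It is not stated whether the cosine run also used the smaller `n_neighbors`, so the sketch leaves that run at the defaults.

```python
# Settings for the three UMAP runs discussed above.
# An empty dict means "use the library defaults".
UMAP_RUNS = {
    "default": {},
    "local": {"n_neighbors": 5, "min_dist": 0.0},
    "cosine": {"metric": "cosine"},  # assumption: other parameters left at defaults
}


def embed(vectors, run="default", random_state=42):
    """Return a 2D UMAP embedding of `vectors` for one of the named runs."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(random_state=random_state, **UMAP_RUNS[run])
    return reducer.fit_transform(vectors)
```

Calling `embed(vectors, run="cosine")` should give an array of shape `(n_samples, 2)` that can be scattered and coloured by class, in the same way as the PCA plot.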

from umap.

lmcinnes avatar lmcinnes commented on April 28, 2024

That definitely looks underwhelming. How does t-SNE compare, or PCA? There may be less structure in the data than one might like. It looks more likely, however, that those two outliers are somehow messing everything up. I'll see if I can get some time and look into exactly what is going on internally. I am fairly busy at the moment with other projects, so I can't promise anything immediate. Sorry.


lmcinnes avatar lmcinnes commented on April 28, 2024

If you have some time, the relevant thing to do is run the internals yourself step by step and look to see where things are getting swamped. In particular, if you can build the fuzzy simplicial set and look at the result (a sparse matrix), I suspect the distribution of non-zero entries will be suspicious (or, at least, the logs of them, since they are probably power law distributed). In particular, you should look at the rows (and columns) associated with those two points that seem to end up at extremes.
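A rough sketch of that inspection is below. It assumes the fuzzy simplicial set is available as a SciPy sparse matrix; in recent `umap-learn` releases a fitted `umap.UMAP` model exposes it as the `graph_` attribute, but that may differ in older versions. The helper names are hypothetical.

```python
import numpy as np


def log_weight_histogram(graph, bins=20):
    """Histogram of log10 of the non-zero entries of a sparse graph."""
    weights = graph.tocsr().data
    weights = weights[weights > 0]  # drop any explicitly stored zeros
    return np.histogram(np.log10(weights), bins=bins)


def row_log_weights(graph, row):
    """Sorted log10 weights of the non-zero entries in a single row."""
    csr = graph.tocsr()
    weights = csr.data[csr.indptr[row]:csr.indptr[row + 1]]
    return np.sort(np.log10(weights[weights > 0]))
```

With a fitted model `reducer`, `log_weight_histogram(reducer.graph_)` gives the overall picture and `row_log_weights(reducer.graph_, i)` the entries for point `i`. If the graph is symmetric, as I would expect here, the rows alone should be enough and the columns add nothing new.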

Another thing to look at is what happens if you don't use spectral initialisation.


gclen avatar gclen commented on April 28, 2024

I took a look at the things you suggested. Using a random initialisation still looks underwhelming but there are no huge outliers. There is slightly better separation using PCA but it is still not great (though I haven't messed around with parameters).

I constructed the fuzzy simplicial set and as suspected the distribution of logs of non-zero entries is suspicious. To compare the outlying rows to "normal" rows I calculated the log distributions (and sorted them) for the outlying rows and 10 rows selected at random. What I found was the largest values in the outlying distributions were much bigger than the largest values of the other rows. I'm not sure what this means but it's something. The updated notebook is located here. Let me know if you have any ideas for further tests.
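The comparison described above could be sketched roughly as follows. This is not the code from the notebook: the function names are hypothetical, `graph` is assumed to be the fuzzy simplicial set as a SciPy sparse matrix, and `outlier_rows` is assumed to hold the indices of the outlying points.

```python
import numpy as np


def max_log_weight_per_row(graph, rows):
    """Largest log10 non-zero entry in each of the given rows."""
    csr = graph.tocsr()
    result = {}
    for row in rows:
        weights = csr.data[csr.indptr[row]:csr.indptr[row + 1]]
        weights = weights[weights > 0]
        # Rows without any positive entry get -inf so they sort last.
        result[int(row)] = float(np.log10(weights).max()) if weights.size else float("-inf")
    return result


def compare_to_random_rows(graph, outlier_rows, n_random=10, seed=0):
    """Return (outlier maxima, maxima for a random sample of other rows)."""
    csr = graph.tocsr()
    rng = np.random.default_rng(seed)
    candidates = np.setdiff1d(np.arange(csr.shape[0]), outlier_rows)
    sample = rng.choice(candidates, size=min(n_random, candidates.size), replace=False)
    return max_log_weight_per_row(csr, outlier_rows), max_log_weight_per_row(csr, sample)
```

Comparing the two returned dictionaries side by side shows whether the outlying rows really do carry larger maxima than a typical row.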


vb690 avatar vb690 commented on April 28, 2024

I also played around a bit with language models and UMAP, and obtained some (marginally) more satisfying results, here and here.


lmcinnes avatar lmcinnes commented on April 28, 2024

Those are some nice results @vb690; would you mind if I referenced them in the example uses section of the documentation?


vb690 avatar vb690 commented on April 28, 2024

Hi @lmcinnes, sure thing!

