
Comments (4)

lmcinnes avatar lmcinnes commented on April 28, 2024

Docs are still a work in progress; I've been diverted by a number of other things. You can look at issue #25, which has some discussion of what works and some of the potential pitfalls. Right now that's the best documentation there is.

from umap.

birdsarah avatar birdsarah commented on April 28, 2024

Another follow-up question.

I had problems using hdbscan on my UMAP embedding, but I may be interpreting things wrong.

I colored each cluster with a random color, or red where hdbscan returned -1 (i.e. the point was not assigned to any cluster). Here's an example output.
[Image: index]
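The coloring scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the plot: the helper name `label_colors`, the seed, and the hex color format are all assumptions; the only detail taken from the thread is that a label of -1 is drawn in red.

```python
import random

NOISE_LABEL = -1          # the label hdbscan returned for unclustered points
NOISE_COLOR = "#ff0000"   # red, as in the plot described above


def label_colors(labels, seed=0):
    """Map each cluster label to a random hex color; noise is always red.

    `label_colors` is a hypothetical helper written for this sketch.
    """
    rng = random.Random(seed)
    palette = {}
    for label in sorted(set(labels)):
        if label == NOISE_LABEL:
            palette[label] = NOISE_COLOR
            continue
        # Draw a random 24-bit color, retrying on the (unlikely) exact red
        # so that a real cluster is never confused with noise.
        color = NOISE_COLOR
        while color == NOISE_COLOR:
            color = "#{:06x}".format(rng.randrange(0x1000000))
        palette[label] = color
    # One color per input point, in the same order as `labels`.
    return [palette[label] for label in labels]


colors = label_colors([0, 1, -1, 1, 0])
```

The resulting list can be passed as the per-point color argument of most plotting libraries. With many clusters, random colors can land close to each other, so similar colors in one region are not always evidence of mixed labels.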

Once colored I was surprised to see:

  • the location of the uncategorized clusters
  • the mixed colors in a number of locations

But, my understanding of UMAP is that the 2d representation that we see (if we asked for 2d), is the representation, and that clusters we see with our eyes are the clusters.

Whether that's true is the big "if" that I would really appreciate your input on.

If that's true, then I found that for my data, I needed to not use hdbscan to get clusters labeled in the way that I visually expect. I'm more than happy to share what I did, but would appreciate clarification on whether I'm understanding correctly.


lmcinnes avatar lmcinnes commented on April 28, 2024

That's an intriguing result! I admit I don't entirely understand what has happened here. Are you clustering in high dimensional space and then coloring the embedding accordingly? If that's the case then something is astray indeed -- there is quite a cloud of points that were assigned to clusters but have been cast out by UMAP. I would love to have more details about what you did, and the data involved, because I feel like there is a failure case in the making here that I might need to investigate.

As to how to interpret things... UMAP isn't really a clustering algorithm, it's more for dimension reduction, and it handles noise differently than something like HDBSCAN. In particular it will tend to pull noise in toward whatever is nearest. That means that if your data is not very noisy it can be effective, but for very noisy data it may not do entirely as one might expect. In contrast, HDBSCAN is very conservative about noise, which can be problematic for high dimensional data where density is scarce and everything starts to look like noise.

Looking at your pictures here my feeling is that you have some number of clusters which UMAP clumped into tight little blobs, and a fair amount of noise which UMAP didn't really know what to do with and is generally just getting pushed away from (and squished between) the blobs.

I feel like I am not really answering your question well, but I feel like I am not understanding what is happening here either. If you can be patient with me and let me know if I'm on the right track I would appreciate it.


birdsarah avatar birdsarah commented on April 28, 2024

Thanks @lmcinnes for pointing me in the right direction offline. A datashaded view of my data shows the areas of density:

[Image: dbscan_sample_0_embedding_15_script_netloc_func_name_counts]

Here are the results colored by cluster label with HDBSCAN (default params); red is no cluster:

[Image: hdbscan_sample_0_embedding_15_script_netloc_func_name_cats]

and here's scikit-learn's DBSCAN (default params):

[Image: dbscan_sample_0_embedding_15_script_netloc_func_name_cats]

In playing with this a lot more, I realized that HDBSCAN, for my data, is very sensitive to the parameters that you pass it. I need to figure out the right params for my use case, and decide whether excluding data, in the way that hdbscan is inclined to do by default, is helpful for my needs.
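One way to explore that sensitivity is to sweep a parameter and record how many clusters come out and what fraction of points is labeled as noise. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's `DBSCAN` and its `eps` parameter as a stand-in, and a synthetic 2-D point cloud (tight blobs plus uniform noise) rather than my actual embedding. The helper `summarize` is made up for this example. The same loop should carry over to `hdbscan.HDBSCAN` by varying `min_cluster_size` or `min_samples` instead.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in for a 2-D embedding: three tight blobs plus
# uniformly scattered noise points (not the data from this thread).
rng = np.random.default_rng(0)
blobs, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.4,
    random_state=0,
)
noise = rng.uniform(-8, 8, size=(60, 2))
embedding = np.vstack([blobs, noise])


def summarize(labels):
    """Return (number of clusters, fraction of points labeled -1)."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction


# Sweep the neighborhood radius and record the effect on the labeling.
results = {}
for eps in (0.2, 0.5, 1.0, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(embedding)
    results[eps] = summarize(labels)
    print(f"eps={eps}: clusters={results[eps][0]}, noise={results[eps][1]:.2f}")
```

For DBSCAN with a fixed `min_samples`, increasing `eps` can only shrink the set of noise points, so the noise fraction gives a simple handle on how much data a given setting excludes.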

