Looks like a great alternative to t-SNE! The readme mentions how fast it is, but I won

[Question] What's the scaling complexity? about umap HOT 6 CLOSED

lmcinnes commented on April 28, 2024

[Question] What's the scaling complexity?

from umap.

Comments (6)

lmcinnes commented on April 28, 2024 7

The complexity is essentially bounded by the nearest neighbor search phase. By making use of algorithms for approximate nearest neighbor graph construction I can get the scaling with input dimension down to essentially linear (effectively scaling with regard to the complexity of the distance computation). With regard to number of sample the scaling is a little harder to pin down (the algorithm I'm using, NN-descent, doesn't have a published theoretical complexity) and is, of course, dataset dependent (i.e. it will depend on properties of the distribution of the samples). Empirically I believe it should be something like O(d*n^1.14) or so. A better theoretical analysis of the complexity is definitely on my list of things to do, but its currently a long list.

from umap.

lmcinnes commented on April 28, 2024 2

It certainly does, but it is an empirical estimate of an average case complexity. I have not seen any complete estimates of NNDescent complexity. Some work by Richard Darling and Jacob Baron demonstrate both some cases where NNDescent will fail (with complexity greater than O(N^2)), and classes of problem for which a complexity of O(N log N) is provable. The catch is that both are relatively theoretical cases: the first involves metrics that fail friend-of-a-friend principles, which isn't true for most of the metrics UMAP would be used with; the second is a fairly constrained class of problem that one is unlikely to see in practice (but for which complexity is at least provable).

So, given that worst cases do exist, I guess the real answer (though it is a highly deceptive one, and would equally apply to any algorithm that requires nearest neighbor computations) is O(N^2). For practical purposes it certainly empirically scales at O(N log N) and there are certainly problem classes for which that is the provable complexity, so perhaps that is a reasonable estimate?

from umap.

jc-healy commented on April 28, 2024 2

Great response Leland. I think this highlights a fundamental problem O() notation. It's a worst case complexity which is potentially the tail end of a distribution of complexities and may not represent real world scaling in practice (even ignoring those pesky constants). As such it can be quite deceptive for many classes of algorithms. I've noticed this in particular with graph algorithms. The worst case graph distribution for many algorithms is a single long connected chain which *may* be incredibly unlikely for any particular family of graphs that users care about. While this is definitely an awful corner case for many graph algorithms it doesn't tell a user much about how the algorithm will perform in practice on their data. For that reason I'm a big fan of the less popular average case complexity. Of course, to do anything useful with average case complexity you need to come up with a distribution that your data is being sampled from which can be hard. The advantage in the case of nearest neighbour algorithms is that it forces you to explicitly think about things like the metric space your algorithm will be running over. While such notions certainly raise the complexity of the complexity analysis they do provide significantly more insight into the question of how well will this algorithm perform on *my* data which, I think, is the question we all want to answer. Now that sounds quite down on O-notation which I really don't mean to be. It's an upper limit on the average case complexity and if an algorithm has great O complexity then you know it will also have great average case complexity on your data. Instead I mean it as a warning, that all O(n^2) algorithms aren't born equal. I too often see folks making statements like all those algorithms are O(n^2) so use any of them. The max is a useful summary statistic of any distribution but it isn't necessarily the most useful descriptive statistic for that distribution.

…

On Mon, Jan 18, 2021 at 12:27 PM Leland McInnes ***@***.***> wrote: It certainly does, but it is an empirical estimate of an average case complexity. I have not seen any complete estimates of NNDescent complexity. Some work by Richard Darling and Jacob Baron demonstrate both some cases where NNDescent will fail (with complexity greater than O(N^2)), and classes of problem for which a complexity of O(N log N) is provable. The catch is that both are relatively theoretical cases: the first involves metrics that fail friend-of-a-friend principles, which isn't true for most of the metrics UMAP would be used with; the second is a fairly constrained class of problem that one is unlikely to see in practice (but for which complexity is at least provable). So, given that worst cases do exist, I guess the real answer (though it is a highly deceptive one, and would equally apply to *any* algorithm that requires nearest neighbor computations) is O(N^2). For practical purposes it certainly empirically scales at O(N log N) and there are certainly problem classes for which that is the provable complexity, so perhaps that is a reasonable estimate? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#8 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC3IUWV5AUXSWL4DHL2PEK3S2RVP7ANCNFSM4EDKKQPQ> .

from umap.

matanox commented on April 28, 2024 1

The max statistic is useful in matching the human trait of taking assurance and building social trust by being able to commit to some amount with what they consider certainty. It's helpful for planning. Of course the finer view of considering the distribution of the complexity can be very useful in augmenting that approach.

If we have a particular way to frame the factors affecting this algorithm's runtime complexity beyond just the size of the input, or even testing/sampling the data for predicting the runtime, that might be pragmatically helpful.

from umap.

vmarkovtsev commented on April 28, 2024

Thanks a lot!

from umap.

matanox commented on April 28, 2024

O(d*n^1.14) sounds almost too good to be true doesn't it?

from umap.

[Question] What's the scaling complexity? about umap HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent