I have 50 million 1000-dimensional vectors that need to be indexed, but I don't have that much RAM t

When the data size is too large to fit into the memory? about annoy HOT 7 CLOSED

spotify commented on May 22, 2024

When the data size is too large to fit into the memory?

from annoy.

Comments (7)

piskvorky commented on May 22, 2024

I don't think so; and any workaround would likely be too slow (swapping during indexing/search). Annoy targets the scenario where the whole index fits in main memory.

A scalable solution here would be distributing the index across multiple nodes, in a cluster. But that's not on the roadmap for Annoy AFAIK. Interesting and not that difficult algorithmically, but it would bring a lot of devops complexity.
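As a rough illustration of the partitioning idea, the sketch below splits the items into shards, asks each shard for its own top k, and merges the candidates. The brute-force `search_shard` is only a stand-in for a per-partition index lookup; the function names, shard layout, and Euclidean metric are assumptions for the example, not part of Annoy.

```python
import heapq
import math


def search_shard(shard, query, k):
    """Brute-force stand-in for querying one partition's index."""
    scored = ((math.dist(vec, query), item_id) for item_id, vec in shard.items())
    return heapq.nsmallest(k, scored)


def search_all(shards, query, k):
    """Collect each shard's top k, then keep the k closest overall."""
    candidates = []
    for shard in shards:
        candidates.extend(search_shard(shard, query, k))
    return [item_id for _, item_id in heapq.nsmallest(k, candidates)]


# Two toy shards mapping item id -> vector (hypothetical data).
shards = [
    {0: (0.0, 0.0), 1: (5.0, 5.0)},
    {2: (1.0, 0.0), 3: (9.0, 9.0)},
]
nearest = search_all(shards, (0.0, 0.0), k=2)
print(nearest)
```

In a real deployment each shard would live on its own node and the merge step would run on whichever node receives the query, which is where the devops complexity comes in.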

erikbern commented on May 22, 2024

I haven't tried it, but in theory Annoy works with out of core data. If you put it on an SSD drive then it probably could perform OK. I think you would experience a slowdown of say 10-100x.

One issue that comes to mind is that Annoy builds up the index in RAM and then writes it to disk – it would be more efficient to build it up directly to disk. That should actually be pretty easy to support.

Spinning disk would be ridiculously slow – probably 1000x slower at least.

Harder things you can do to optimize for out-of-core use:

- Partition the index (as @piskvorky said)
- Don't store the vectors in the index, just the search structure
- Support axis-aligned splits instead of arbitrary vectors. That way each split takes only a few bytes rather than 4 kB (at 1000 dimensions)
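The size difference in the last point can be checked with a quick calculation. This is a sketch that assumes 32-bit float components and a hypothetical axis-aligned layout of one 32-bit axis index plus one 32-bit float threshold; Annoy's actual node layout is not modeled here.

```python
import struct

DIM = 1000  # dimensionality mentioned in the thread

# Arbitrary hyperplane: one float32 coefficient per dimension.
hyperplane_bytes = struct.calcsize(f"<{DIM}f")

# Axis-aligned split (hypothetical layout): uint32 axis index + float32 threshold.
axis_aligned_bytes = struct.calcsize("<If")

print(hyperplane_bytes, axis_aligned_bytes)
```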

I also encourage you to do some sort of dimensionality reduction before putting the data into Annoy – even a simple SVD down to 100D would probably help tremendously.
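To see why dimensionality reduction helps so much, here is a back-of-the-envelope estimate of the raw vector storage for the 50 million items from the original question. It assumes 4-byte float components and counts only the vectors themselves, so the real index would be somewhat larger; `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(n_items, dim, bytes_per_component=4):
    """Bytes needed for the vectors alone, ignoring any tree overhead."""
    return n_items * dim * bytes_per_component


N_ITEMS = 50_000_000

full = raw_vector_bytes(N_ITEMS, 1000)    # original 1000-dimensional vectors
reduced = raw_vector_bytes(N_ITEMS, 100)  # after reducing to 100 dimensions

print(f"1000D: {full / 1e9:.0f} GB, 100D: {reduced / 1e9:.0f} GB")
```

Under these assumptions the vectors shrink from roughly 200 GB to roughly 20 GB.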

erikbern commented on May 22, 2024

See https://github.com/spotify/annoy/blob/master/src/annoylib.h#L399 for how memory is used while adding items. It isn't until you call save that the index is written to disk. Then later, if you call load, it will perform an mmap. It would make more sense to support mmap during insertion too. The problem is that, afaik, mmap doesn't support resizing, but you could probably just write to the file pointer, flush, then munmap/mmap again (not sure how fast it would be). This way you would always use the file system for persistence, and the kernel would use the page cache to fit as much as possible in RAM. With an SSD this would be pretty reasonable.
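The write, flush, unmap, and remap cycle described above can be sketched with Python's standard `mmap` module. This is only an illustration of the idea, not how Annoy is implemented: the `GrowableVectorFile` class, its method names, and the flat float32 record layout are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile


class GrowableVectorFile:
    """Hypothetical helper: append float32 vectors to a file, read them via mmap."""

    def __init__(self, path, dim):
        self.record = struct.Struct(f"<{dim}f")
        self.file = open(path, "a+b")  # append mode: every write lands at the end
        self.map = None
        self._remap()

    def _unmap(self):
        if self.map is not None:
            self.map.close()
            self.map = None

    def _remap(self):
        self.file.flush()
        # An empty file cannot be mapped, so only map once there is data.
        if os.fstat(self.file.fileno()).st_size > 0:
            self.map = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)

    def append(self, vector):
        self._unmap()                               # drop the old mapping
        self.file.write(self.record.pack(*vector))  # grow the file
        self._remap()                               # map the larger file

    def get(self, index):
        return self.record.unpack_from(self.map, index * self.record.size)

    def close(self):
        self._unmap()
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
store = GrowableVectorFile(path, dim=2)
store.append([1.0, 2.0])
store.append([3.0, 4.0])
second = store.get(1)
store.close()
print(second)
```

Remapping after every single append is the simplest version and likely the slowest; batching many writes before each remap would be the obvious next step.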

jonbakerfish commented on May 22, 2024

@piskvorky @erikbern Thanks for your advice. I'm going to try partitioning the index first. I also added two functions, get_nnsd_by_item and get_nnsd_by_vector, which mirror get_nns_by_item and get_nns_by_vector but return both ids and distances.

saustar commented on May 22, 2024

@erikbern Hi, I am using an Annoy tree to find the most accurate match for product names. My question is about the size of the tree. When I saved the tree to disk, its size was 2.3 GB. But while creating the tree, I can see it using 3 times more memory. Is this okay?

erikbern commented on May 22, 2024

Not clear why. How are you measuring it? Measuring memory consumption of a process is notoriously unreliable.

saustar commented on May 22, 2024

Umm, true, but I am running it on PySpark and I can see a memory error.

On 27 Dec 2016 10:22 p.m., "Erik Bernhardsson" wrote:
> Not clear why, how are you measuring it? Measuring memory consumption of a process is notoriously unreliable
