Comments (7)

piskvorky avatar piskvorky commented on May 22, 2024

I don't think so; and any workaround would likely be too slow (swapping during indexing/search). Annoy targets the scenario where the whole index fits in main memory.

A scalable solution here would be distributing the index across multiple nodes, in a cluster. But that's not on the roadmap for Annoy AFAIK. Interesting and not that difficult algorithmically, but it would bring a lot of devops complexity.

from annoy.

erikbern avatar erikbern commented on May 22, 2024

I haven't tried it, but in theory Annoy works with out-of-core data. If you put the index on an SSD, it could probably perform OK. I think you would experience a slowdown of, say, 10-100x.

One issue that comes to mind is that Annoy builds up the index in RAM and then writes it to disk – it would be more efficient to build it up directly to disk. That should actually be pretty easy to support.

Spinning disk would be ridiculously slow – probably 1000x slower at least.

Harder things you can do to optimize for out-of-core use:

  • Partition the index (like @piskvorky said)
  • Don't store the vectors in the index, just the search structure
  • Support axis-aligned splits instead of arbitrary vectors. That way each split takes only a few bytes rather than 4 kB (for 1000-dimensional vectors)
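As a rough check of the numbers in the last bullet, here is a minimal sketch that packs one split of each kind and compares the sizes. It assumes 32-bit floats, and the axis-aligned layout (a dimension index plus a threshold) is hypothetical. It is not Annoy's actual node format, which stores more than the split itself.

```python
import struct

DIM = 1000  # dimensionality mentioned in the bullet above

# Arbitrary hyperplane split: one 32-bit float per dimension for the normal vector.
hyperplane_split = struct.pack(f"<{DIM}f", *([0.0] * DIM))

# Axis-aligned split (hypothetical layout): a 32-bit dimension index
# plus a 32-bit float threshold along that dimension.
axis_aligned_split = struct.pack("<If", 42, 0.5)

print(len(hyperplane_split))    # 4000
print(len(axis_aligned_split))  # 8
```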

I also encourage you to do some sort of dimensionality reduction before putting the data into Annoy – even a simple SVD down to 100D would probably help tremendously.
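A minimal sketch of such a reduction step, assuming NumPy is available and the data fits in memory for the SVD itself. The function names, the random data, and the sizes (64D down to 16D) are illustrative stand-ins, not part of Annoy. Queries have to be projected with the same mean and components as the indexed items.

```python
import numpy as np

def fit_svd_projection(data, target_dim):
    """Return (mean, components) for projecting rows of `data` to `target_dim` dimensions."""
    mean = data.mean(axis=0)
    # Rows of vt are the right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:target_dim]

def project(vectors, mean, components):
    """Project one vector or a matrix of row vectors into the reduced space."""
    return (vectors - mean) @ components.T

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 64)).astype(np.float32)  # stand-in for real item vectors

mean, components = fit_svd_projection(data, target_dim=16)
reduced = project(data, mean, components)      # vectors to add to the index
query = project(data[0], mean, components)     # queries use the same projection
```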

erikbern avatar erikbern commented on May 22, 2024

See https://github.com/spotify/annoy/blob/master/src/annoylib.h#L399 for how memory is used while adding items. The index isn't written to disk until you call save. If you later call load, it performs an mmap. It would make more sense to support mmap during insertion too. The problem is that, AFAIK, mmap doesn't support resizing, but you could probably just write to the file pointer, flush, then munmap/mmap again (not sure how fast it would be). This way you would always use the file system for persistence, and the kernel would use the page cache to fit as much as possible in RAM. With an SSD this would be pretty reasonable.
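The write, flush, and remap idea can be sketched with Python's standard mmap module. This is only an illustration of the pattern, not how Annoy is implemented. The GrowableMmapStore class, the 4-dimensional items, and the file name are all hypothetical, and remapping after every single item is the naive version; batching the remaps would likely be needed in practice.

```python
import mmap
import os
import struct
import tempfile

ITEM_DIM = 4               # small illustrative dimensionality
ITEM_SIZE = ITEM_DIM * 4   # bytes per item, stored as 32-bit floats

class GrowableMmapStore:
    """Append-only vector store: every item goes to the file, which is then remapped."""

    def __init__(self, path):
        self._file = open(path, "a+b")
        self._map = None
        self._remap()

    def _remap(self):
        # Drop the old mapping (munmap) and map the file again at its new size.
        if self._map is not None:
            self._map.close()
            self._map = None
        size = os.fstat(self._file.fileno()).st_size
        if size > 0:  # an empty file cannot be mapped
            self._map = mmap.mmap(self._file.fileno(), size, access=mmap.ACCESS_READ)

    def add_item(self, vector):
        self._file.write(struct.pack(f"<{ITEM_DIM}f", *vector))
        self._file.flush()
        self._remap()

    def get_item(self, index):
        return struct.unpack_from(f"<{ITEM_DIM}f", self._map, index * ITEM_SIZE)

    def close(self):
        if self._map is not None:
            self._map.close()
        self._file.close()

path = os.path.join(tempfile.mkdtemp(), "items.bin")
store = GrowableMmapStore(path)
store.add_item([1.0, 2.0, 3.0, 4.0])
store.add_item([5.0, 6.0, 7.0, 8.0])
second = store.get_item(1)
store.close()
```

Reads go through the mapping, so the kernel's page cache decides how much of the file stays in RAM.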

jonbakerfish avatar jonbakerfish commented on May 22, 2024

@piskvorky @erikbern Thanks for your advice. I'm going to try partitioning the index first. I have also added two functions, get_nnsd_by_item and get_nnsd_by_vector, which are variants of get_nns_by_item and get_nns_by_vector that return both ids and distances.
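A minimal sketch of how a partitioned search could return both ids and distances: query every partition for its n best candidates, then keep the n closest overall. The brute-force search_partition function is only a stand-in for a per-partition Annoy index, and all names here are hypothetical.

```python
import heapq
import math

def search_partition(partition, query, n):
    """Stand-in for one partition's index: brute-force n nearest (distance, id) pairs."""
    scored = ((math.dist(vector, query), item_id) for item_id, vector in partition.items())
    return heapq.nsmallest(n, scored)

def search_all(partitions, query, n):
    """Query every partition, then keep the n globally closest results."""
    candidates = []
    for partition in partitions:
        candidates.extend(search_partition(partition, query, n))
    best = heapq.nsmallest(n, candidates)
    ids = [item_id for _, item_id in best]
    distances = [distance for distance, _ in best]
    return ids, distances

# Two tiny partitions with globally unique item ids.
partitions = [
    {0: (0.0, 0.0), 1: (5.0, 5.0)},
    {2: (1.0, 0.0), 3: (9.0, 9.0)},
]
ids, distances = search_all(partitions, query=(0.0, 0.0), n=2)
print(ids, distances)  # [0, 2] [0.0, 1.0]
```

Merging needs the distances from each partition, which is why returning them alongside the ids matters here.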

saustar avatar saustar commented on May 22, 2024

@erikbern Hi, I am using an Annoy tree to find the most accurate match for product names. My question is about the size of the tree. When I saved the tree to disk, its size was 2.3 GB. But while creating the tree, I can see it using three times more memory. Is this okay?

erikbern avatar erikbern commented on May 22, 2024

It's not clear why. How are you measuring it? Measuring the memory consumption of a process is notoriously unreliable.
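For reference, one common way to get a number on Unix-like systems is the peak resident set size from the standard resource module. This is a sketch of one measurement method, not a recommendation from the thread. The peak_rss_mib helper is hypothetical, and the large bytes object is only a stand-in for building an index in RAM.

```python
import resource
import sys

def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor

before = peak_rss_mib()
payload = b"\x01" * (64 * 1024 * 1024)  # stand-in for an index being built in memory
after = peak_rss_mib()
print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

Peak RSS includes allocator overhead and temporary buffers, so it can sit well above the size of the saved file.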
