meilisearch / arroy

Spotify/Annoy-inspired Approximate Nearest Neighbors in Rust, based on LMDB and optimized for memory usage :boom:

Home Page: https://docs.rs/arroy

License: MIT License

Rust 100.00%

annoy approximate-nearest-neighbor-search lmdb rust diskann

arroy's Issues

Try using `MDB_INTEGERKEY` along with a custom comparator

If we can use both options simultaneously, we can avoid storing the length of the keys and compare the keys as big-endian.
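The reason big-endian comparison matters can be shown with a minimal sketch: fixed-width big-endian keys sort byte-wise in the same order as the integers they encode, while little-endian keys do not. The helper names below are hypothetical and the sketch only uses the standard library; it does not touch the LMDB API.

```rust
/// Encodes an item ID as a fixed-width big-endian key.
/// `encode_key` is a hypothetical helper, not part of arroy's API.
fn encode_key(id: u32) -> [u8; 4] {
    id.to_be_bytes()
}

/// Returns true when the byte-wise (memcmp-style) ordering of the
/// big-endian keys agrees with the numeric ordering of the IDs.
fn big_endian_order_matches(a: u32, b: u32) -> bool {
    encode_key(a).cmp(&encode_key(b)) == a.cmp(&b)
}

/// Same check with little-endian bytes, shown for contrast:
/// this one fails for pairs such as (1, 256).
fn little_endian_order_matches(a: u32, b: u32) -> bool {
    a.to_le_bytes().cmp(&b.to_le_bytes()) == a.cmp(&b)
}
```

Because the keys have a fixed width, no length prefix is needed to compare them.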

Update the pre-processor

About preprocess:
- We can probably compute the max-norm on the fly while adding the items
- If the max-norm doesn't change (from a doc addition OR deletion), then we don't need to run the second part of

Propose to list Arroy into the Annoy README

Check that the `Distance` is the correct one

Make sure that a user doesn't open a database without the original distance.

Improve the deletion of `tmp_nodes`

Sometimes we need to delete new nodes previously inserted in the tmp_nodes.
From my understanding, it was always either the last one or the second-to-last one, but it looks like it's not.

It would be good to measure whether we often do a lot of iterations and, if that's the case, try to optimize it.

arroy/src/parallel.rs

Line 83 in 5e23882

if let Some(el) = self.ids.iter_mut().rev().take(2).find(|i| **i == item) {
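A minimal sketch of how the search could be generalized and measured at the same time, assuming `ids` is a plain slice of `u32` (the real type in `parallel.rs` may differ). The hypothetical helper below scans from the end without the `take(2)` limit and reports how many entries it visited, which is the number worth logging before optimizing.

```rust
/// Searches `ids` from the end for `item`.
/// Returns the index of the match and the number of entries visited.
/// `find_from_end` is a hypothetical helper, not arroy's code.
fn find_from_end(ids: &[u32], item: u32) -> Option<(usize, usize)> {
    ids.iter()
        .rev()
        .position(|&id| id == item)
        // `steps` counts the entries skipped before the match.
        .map(|steps| (ids.len() - 1 - steps, steps + 1))
}
```

If the visited count is almost always 1 or 2, the linear scan is fine as it is; otherwise a different data structure may be worth it.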

Fix the failing test

By reducing the number of digits we print for each float.

The `Side` function behaves strangely in two dimensions

I wrote a visualizer of every iteration of two_means, plus the final distribution of items between the left children and right children of a split node.

At the very end of the split node function, we found two pretty good centroids:

But the final result we chose is this one:

The more dimensions we add, the less the issue shows.

Measure the time it takes for Arroy vs Annoy to build trees in parallel

Update the README accordingly

By keeping a list of unreachable_items that corresponds to the newly added or modified items, we can make sure that the Writer::build method keeps track of them and inserts them incrementally into all of the trees.

However, it must keep the split planes balanced and therefore recompute some of the normals along the way.

It must also make sure that the tree depth grows with the number of items, by appending new split planes if necessary and reducing the branch size. That could be a rebuild method that erases everything and starts from the beginning to keep a good balance.

Remove the memcopies of the vectors

Copying the vectors takes up to 22%, zeroing them in advance (bzero) another 5%, and dropping the allocated vectors 5% more. It seems like the AVX implementation can be switched to non-aligned f32 slices.

Make it possible to filter directly in Arroy

Modify the README feature list

The `Writer::new` method is error-free but can return an error

We renamed the Writer::prepare method to Writer::new; this new method can no longer fail. We must change the API to return the Writer directly.

https://github.com/meilisearch/arroy/blob/2ff15674f2a9f3f8f2f2feb020884bca860c3eb5/src/writer.rs#L42C5-L46C6

Log more what is happening in the build function

Speed up filter ANDs operations

By using the latest version of the RoaringBitmaps crate we can do the same as we did on meilisearch/meilisearch#4682.

Make the normal in the splitnode null instead of a vector of 0s

Update the documentation example description

Measure the relevancy when changing

Re-use IDs

Don’t make multiple centroid iterations on the same leaf

arroy/src/distance/mod.rs

Line 136 in 4f193fd

let node_k = leafs.choose(rng)?.unwrap();

Here we should remove the selected leaf from the list of possible leafs.
That means that when we have fewer than 200 leafs to split, we can exit early.
It’ll impact relevancy, but I’m pretty sure it won’t have a negative impact.
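A minimal sketch of the idea, under stated assumptions: `pick_index` stands in for the RNG (given a length `n`, it must return an index in `0..n`), and both helper names are hypothetical. Each selected leaf is removed with `swap_remove`, so it cannot be picked twice, and the loop exits early once the leafs are exhausted.

```rust
/// Picks a leaf and removes it from the candidates so that it cannot be
/// selected again. `pick_index` is a stand-in for a random number generator.
fn take_random_leaf<T>(
    leafs: &mut Vec<T>,
    mut pick_index: impl FnMut(usize) -> usize,
) -> Option<T> {
    if leafs.is_empty() {
        return None;
    }
    let idx = pick_index(leafs.len());
    // O(1) removal; the order of the remaining leafs does not matter here.
    Some(leafs.swap_remove(idx))
}

/// Runs at most `max_iterations` iterations and stops early when there
/// are fewer leafs than iterations.
fn visit_leafs<T>(
    mut leafs: Vec<T>,
    max_iterations: usize,
    mut pick_index: impl FnMut(usize) -> usize,
) -> Vec<T> {
    let mut visited = Vec::new();
    for _ in 0..max_iterations {
        match take_random_leaf(&mut leafs, &mut pick_index) {
            Some(leaf) => visited.push(leaf),
            None => break, // every leaf has been used once
        }
    }
    visited
}
```

With three leafs and a budget of 200 iterations, only three iterations run.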

Appending items no longer works

It is no longer possible to append items in the database because we are updating the updated item IDs that live after the item entries. We could move the metadata information before the item entries. We must add a test for this method.

arroy/src/writer.rs

Lines 166 to 176 in 19e0a07

    let mut updated = self
        .database
        .remap_data_type::<RoaringBitmapCodec>()
        .get(wtxn, &Key::updated(self.index))?
        .unwrap_or_default();
    updated.insert(item);
    self.database.remap_data_type::<RoaringBitmapCodec>().put(
        wtxn,
        &Key::updated(self.index),
        &updated,
    )?;

Dead code to investigate

Hey,

I noticed some dead code in the euclidean & manhattan distance;

arroy/src/distance/euclidean.rs

Lines 54 to 60 in 973f093

    normal.header.bias = normal
        .vector
        .iter()
        .zip(node_p.vector.iter())
        .zip(node_q.vector.iter())
        .map(|((n, p), q)| -n * (p + q) / 2.0)
        .sum();

arroy/src/distance/manhattan.rs

Lines 57 to 63 in 973f093

    normal.header.bias = normal
        .vector
        .iter()
        .zip(node_p.vector.iter())
        .zip(node_q.vector.iter())
        .map(|((n, p), q)| -n * (p + q) / 2.0)
        .sum();

Before deleting it, we should investigate whether this operation was supposed to happen in the pre-process or somewhere else.
Is the bias even set anywhere?

Support binary quantization

It would be great to support binary quantization in arroy. The main principle is to convert the dimension values x <= 0 to 0 and x > 0 to 1. This way, we can represent the quantized vector with 32x less space and compute the distances in a much faster and CPU-friendly way. We are currently limited to something like 15M vectors (32-bit floats, 768 dimensions) on a 63GiB machine, but with binary quantization, we can go up to 480M vectors on the same machine.

Here is an example of implementing the Euclidean distance with binary data. In two dimensions, the formula is $\sqrt{(p_1-q_1)^2+(p_2-q_2)^2}$.
On binary values, squaring the difference is equivalent to an XOR:
$(0-1)^2 = (-1)^2 = 1$
$(1-0)^2 = 1^2 = 1$
$(0-0)^2 = 0$
$(1-1)^2 = 0$

Ultimately, the Euclidean distance is the square root of the sum of the XORed dimensions of both vectors: $\sqrt{(p_1 \oplus q_1)+(p_2 \oplus q_2)}$. All the necessary operations can be SIMD-optimized, or maybe using the u8::BitXor and u8::count_ones methods will be SIMD-optimized by itself 🤔
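The derivation above can be sketched in a few lines of Rust. This is a minimal illustration using only the standard library, not arroy's implementation: the function names and the bit layout (8 dimensions per byte, least significant bit first) are assumptions.

```rust
/// Quantizes a vector: dimensions with x > 0 become a 1 bit, x <= 0 a 0 bit.
/// Bits are packed 8 per byte, least significant bit first (an assumption).
fn binary_quantize(vector: &[f32]) -> Vec<u8> {
    let mut out = vec![0u8; (vector.len() + 7) / 8];
    for (i, &x) in vector.iter().enumerate() {
        if x > 0.0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

/// Sum of the XORed dimensions, i.e. the Hamming distance, which equals the
/// squared Euclidean distance between two binary vectors.
fn squared_distance(p: &[u8], q: &[u8]) -> u32 {
    p.iter().zip(q).map(|(a, b)| (a ^ b).count_ones()).sum()
}

/// Euclidean distance between two quantized vectors of the same length.
fn euclidean_distance(p: &[u8], q: &[u8]) -> f32 {
    (squared_distance(p, q) as f32).sqrt()
}
```

A 768-dimension vector takes 96 bytes once quantized, against 3072 bytes as 32-bit floats, which is where the 32x factor comes from.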

Use append when we're creating the last index

When building the indexes, if we are writing the last prefix of the database, we can use append instead of put, which should be faster.

Multi-thread the `make_tree`

We should be able to run one make_tree per thread without issue

Do not use the vector dimensions as the number of items in a descendant node

When porting the build tree functions from Annoy, we kept the constant value for the number of descendants we could fit into a descendant node. Annoy did that because it needed constant-length nodes. However, our system no longer needs this, as LMDB entries can have any length.

Expose the universe `RoaringBitmap` of arroy

We must expose a method to get the set of known items.

Panic when doing a search in an empty index

    let mut wtxn = handle.env.write_txn().unwrap();
    let writer = Writer::new(handle.database, 0, 2);
    writer.build(&mut wtxn, &mut rng(), None).unwrap();
    wtxn.commit().unwrap();

    let rtxn = handle.env.read_txn().unwrap();
    let reader = Reader::open(&rtxn, 0, handle.database).unwrap();
    let ret = reader.nns_by_vector(&rtxn, &[0., 0.], 10, None, None).unwrap(); // panic here

panicked at src/reader.rs:212:75:
argument of integer logarithm must be positive
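The panic message is the one the standard library's integer logarithm methods (such as `usize::ilog2`) raise on zero. A minimal sketch of a guard, assuming the logarithm is taken over the number of items; both helper names and the budget formula are hypothetical and only illustrate the early exit:

```rust
/// Integer logarithm of the item count that returns `None` for an empty
/// index instead of panicking. Hypothetical helper, not arroy's code.
fn log2_of_item_count(n_items: usize) -> Option<u32> {
    // `usize::ilog2` panics on 0; `checked_ilog2` returns `None` instead.
    n_items.checked_ilog2()
}

/// Illustrative only: derives a search budget from the item count and
/// returns 0 for an empty index so the caller can return no results.
fn search_budget(n_items: usize, n_trees: usize) -> usize {
    match log2_of_item_count(n_items) {
        Some(depth) => n_trees * (depth as usize + 1),
        None => 0,
    }
}
```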

Implement a median-based top-k algorithm

https://quickwit.io/blog/top-k-complexity
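One way to realize a median-based top-k, sketched here as an assumption about what this issue has in mind: collect candidates in a buffer of capacity 2k, and each time it fills up, partition it around the k-th smallest element with `select_nth_unstable_by`, keep the lower half, and use that element as a threshold to reject later candidates early. The function name is hypothetical.

```rust
/// Returns the `k` smallest distances of a stream, sorted ascending.
/// Hypothetical sketch; arroy's actual search code may look different.
fn top_k_smallest(distances: impl IntoIterator<Item = f32>, k: usize) -> Vec<f32> {
    if k == 0 {
        return Vec::new();
    }
    let mut buffer: Vec<f32> = Vec::with_capacity(2 * k);
    let mut threshold = f32::INFINITY;
    for d in distances {
        // We already hold k values <= threshold, so `d` cannot improve the result.
        if d >= threshold {
            continue;
        }
        buffer.push(d);
        if buffer.len() == 2 * k {
            // Linear-time partition: the k smallest values end up in buffer[..k].
            buffer.select_nth_unstable_by(k - 1, f32::total_cmp);
            threshold = buffer[k - 1];
            buffer.truncate(k);
        }
    }
    // Final cleanup: at most 2k - 1 values are left to sort.
    buffer.sort_unstable_by(f32::total_cmp);
    buffer.truncate(k);
    buffer
}
```

Compared with a binary heap, each candidate costs an amortized constant amount of work instead of a logarithmic one.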

Measure and improve the performance of the incremental indexing

It would be a good thing to create a small benchmark showing the incremental indexing performance.

Once we have that, we should profile an indexing process and see where most of the time is spent and whether that's easy to fix (related to #60).

Expose `ItemIter` publicly

This ItemIter type is not public and must be. Is there a way to break the CI whenever we forget to expose a type?

Do we need to re-compute the normal of all the impacted nodes when incremental indexing?

This could improve the relevancy, but it needs to be tested.

The clear function throws a panic

A user tried to call the clear function in arroy and got an interesting result, to say the least:
https://discord.com/channels/1006923006964154428/1197422056711667792/1198185699401289789

That simply shouldn't happen

Inform Erik Bernhardsson that we ported Arroy to LMDB

https://twitter.com/bernhardsson

Reader does not detect that the database has been partially updated

We need to do an investigation

Fix append_item by moving the item Node mode at the end of the enum

Allow users to omit the dimensions when creating the `Writer`

When the database already contains items and vectors we can guess the dimensions from them.

Make it possible to filter directly from arroy

At Meilisearch, we must ensure documents or vectors are hidden from some queries. The first version of the vector store feature iterated over the best results in the order found in the HNSW and conditionally ignored them. The complexity was O(n), which is not great when the filter only selects a few items, e.g., 5, 10, or 100.

We can do better than that now that we control the whole source code. There are two main solutions to implement:

- When there is a fairly small number of selected items, compute the distances directly without going through the whole tree/graph. However, we need to define a correct threshold.
- When many more items are selected, the algorithm must be more clever. We will filter the items from the Descendant variant when "iterating" over the tree nodes in the nns_by_item/by_vector. We barely need to touch the build phase. However, it would also be great to store bitmaps instead of lists of u32s in the Descendant nodes to be able to perform faster intersections.
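The two solutions above can be sketched as follows. This is a minimal illustration under stated assumptions: a `BTreeSet<u32>` stands in for a `RoaringBitmap`, and `Strategy`, `choose_strategy`, `filter_descendant`, and the `threshold` parameter are hypothetical names, not arroy's API.

```rust
use std::collections::BTreeSet;

/// Stand-in for a `RoaringBitmap` of item IDs (assumption: any sorted set
/// of `u32` is enough to illustrate the idea).
type ItemSet = BTreeSet<u32>;

#[derive(Debug, PartialEq)]
enum Strategy {
    /// Few candidates: compute the distance to each one directly.
    BruteForce,
    /// Many candidates: walk the trees and filter the descendant nodes.
    TreeSearch,
}

/// Picks a strategy from the number of candidates. `threshold` is a
/// hypothetical tuning knob whose value would need to be measured.
fn choose_strategy(candidates: &ItemSet, threshold: usize) -> Strategy {
    if candidates.len() <= threshold {
        Strategy::BruteForce
    } else {
        Strategy::TreeSearch
    }
}

/// Keeps only the items of a descendant node that the filter allows.
fn filter_descendant(descendant: &ItemSet, candidates: &ItemSet) -> Vec<u32> {
    descendant.intersection(candidates).copied().collect()
}
```

Storing the descendants as a set rather than a list is what makes the intersection cheap, which is the motivation for keeping bitmaps in the Descendant nodes.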

We will also probably need to provide an nns_builder to reduce the number of conditional parameters to specify. For example, we can provide the RoaringBitmap to filter with, the number of trees to explore, and maybe two methods to query the database: with and without the distances (is it useful?).

TODO

Update the README's feature list

Bump dependencies again

Measure and improve the constant numbers used when building the tree

We must take three parameters into account:

- Time to build the tree
- Relevancy of the searches
- Time to search in the tree

Fun fact: the lower you are in the tree, the less impact a dummy plane has on the search cost.

arroy/src/writer.rs

Lines 248 to 259 in 7fc6031

        if split_imbalance(children_left.len(), children_right.len()) < 0.95
            || remaining_attempts == 0
        {
            break normal;
        }

        remaining_attempts -= 1;
    };

    // If we didn't find a hyperplane, just randomize sides as a last option
    // and set the split plane to zero as a dummy plane.
    while split_imbalance(children_left.len(), children_right.len()) > 0.99 {

fn split_imbalance(left_indices_len: usize, right_indices_len: usize) -> f64 {
    let ls = left_indices_len as f64;
    let rs = right_indices_len as f64;
    let f = ls / (ls + rs + f64::EPSILON); // Avoid 0/0
    f.max(1.0 - f)
}

fn main() {
    dbg!(split_imbalance(29464, 18394));
    dbg!(split_imbalance(30000, 30000));
    dbg!(split_imbalance(30000, 1580));
}

Look into binary indexes

One of our customers is interested in binary indexes. It could be interesting to look into this. We can potentially find a good fit with this.

BIN_FLAT
This index is exactly the same as FLAT except that this can only be used for binary embeddings.

For vector similarity search applications that require perfect accuracy and depend on relatively small (million-scale) datasets, the BIN_FLAT index is a good choice. BIN_FLAT does not compress vectors, and is the only index that can guarantee exact search results. Results from BIN_FLAT can also be used as a point of comparison for results produced by other indexes that have less than 100% recall.

BIN_FLAT is accurate because it takes an exhaustive approach to search, which means for each query the target input is compared to every vector in a dataset. This makes BIN_FLAT the slowest index on our list, and poorly suited for querying massive vector data. There are no parameters for the BIN_FLAT index in Milvus, and using it does not require data training or additional storage.

BIN_IVF_FLAT
This index is exactly the same as IVF_FLAT except that this can only be used for binary embeddings.

BIN_IVF_FLAT divides vector data into nlist cluster units, and then compares distances between the target input vector and the center of each cluster. Depending on the number of clusters the system is set to query (nprobe), similarity search results are returned based on comparisons between the target input and the vectors in the most similar cluster(s) only — drastically reducing query time.

By adjusting nprobe, an ideal balance between accuracy and speed can be found for a given scenario. Query time increases sharply as both the number of target input vectors (nq), and the number of clusters to search (nprobe), increase.

BIN_IVF_FLAT is the most basic BIN_IVF index, and the encoded data stored in each unit is consistent with the original data.
