Comments (4)

Kerollmops commented on May 13, 2024

A simple solution

A solution could be to index new documents by batch, each batch generating a new index. The index is saved somewhere and the engine is notified of this additional index. It must search in it too when new queries come, aggregate the results found, and do the job like before (sort using the criteria, filter documents, select the requested range of documents...).

This is a viable solution: thanks to the great fst library it is easy to do set operations on multiple fsts at the same time; the .idx file is composed of arrays of document indexes, so using the sdset library it is easy to do set operations on them without any overhead; and finally the .sst file can be loaded into RocksDB without problem.
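To make the sdset side concrete, here is a minimal sketch on plain sorted document ids (the contents of the two batches are hypothetical): because the .idx arrays are already sorted, a union costs nothing beyond the output buffer.

```rust
use sdset::duo::Union;
use sdset::{Set, SetBuf, SetOperation};

fn main() {
    // Hypothetical document ids of two indexed batches, already sorted.
    let batch1 = Set::new(&[0u32, 1, 2]).unwrap();
    let batch2 = Set::new(&[3u32, 4, 5, 6]).unwrap();

    // Merge the two id lists without any intermediate btree/hashset.
    let merged: SetBuf<u32> = Union::new(batch1, batch2).into_set_buf();
    assert_eq!(merged.as_slice(), &[0u32, 1, 2, 3, 4, 5, 6][..]);
}
```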

The scenario

So we can imagine a little scenario: we have two indexes and want to search over them by doing a union, in other terms searching for all the documents in these two indexes and returning the top 10 best documents based on all these matches.

[figure: positive-indexes]

A dump of what is contained in these two indexes

A dump of index1

doc(0): The Philosopher's Stone
doc(1): The Chamber of Secrets
doc(2): The Prisoner of Azkaban

A dump of index2

doc(3): The Goblet of Fire
doc(4): The Order of the Phoenix
doc(5): The Half-Blood Prince
doc(6): The Deathly Hallows

Note that the documents have globally unique ids; these are not auto-incremented ids, and they are stored in order in the .idx index file.


We start the engine with the paths of the two indexes named index1 and index2, so it will be able to search across the two indexes by doing a union.

[figure: positive-indexes-union-op]

Now we will ask the engine to search for the query "ph". For that we must construct a RankedStream. This construction must be informed about which automatons we want to use to search; here we will need only one automaton for the "word" "ph", but the metadata::OpBuilder can accept any number of automatons that could be used to search the fsts in one shot. Once the automatons have been defined we specify which fsts we want to search in. The metadata::OpBuilder is now ready to understand which set operation we want to do on all these streams (fsts) using all these automatons; we want to do a union. Note that it uses sdset to do the same set operation on the document indexes (.idx) and avoid using a btree/hashset (avoiding heap allocations).
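As a rough illustration of the fst side, here is a minimal sketch using the fst crate directly, not the engine's actual metadata::OpBuilder API: a single automaton matching words that start with "ph" is run over both fsts, and the matching words come out as one union stream.

```rust
use fst::automaton::{Automaton, Str};
use fst::{Set, Streamer};

// Search both fsts with one prefix automaton and stream the union of the
// matching words, in lexicographic order.
fn search_ph(index1: &Set<Vec<u8>>, index2: &Set<Vec<u8>>) -> Vec<String> {
    let aut = Str::new("ph").starts_with();

    let mut union = fst::set::OpBuilder::new()
        .add(index1.search(&aut))
        .add(index2.search(&aut))
        .union();

    let mut words = Vec::new();
    while let Some(word) = union.next() {
        words.push(String::from_utf8_lossy(word).into_owned());
    }
    words
}
```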

All of these function calls give us objects that will yield documents when we ask them to. We can now call RankedStream::retrieve_documents, which calls the RankedStream::retrieve_all_documents function; it will use the union stream constructed above, which gives us each string associated with all the document indexes remaining after the set operation (union).

After having retrieved all the documents, the RankedStream::retrieve_documents function will simply do a bucket sort and should return doc(0) then doc(4) according to the default criteria.
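The exact ranking code is internal to the engine, but a hypothetical sketch of the bucket-sort idea could look like this: each criterion only breaks the ties left by the previous one, and only the requested range of documents is kept.

```rust
use std::cmp::Ordering;

// A criterion compares two documents; criteria are applied in order.
type Criterion<D> = fn(&D, &D) -> Ordering;

fn rank_documents<D>(mut docs: Vec<D>, criteria: &[Criterion<D>], limit: usize) -> Vec<D> {
    docs.sort_by(|a, b| {
        // Each criterion is only consulted when the previous ones tie.
        criteria
            .iter()
            .map(|criterion| criterion(a, b))
            .find(|ord| *ord != Ordering::Equal)
            .unwrap_or(Ordering::Equal)
    });
    // Keep only the requested range, e.g. the top 10.
    docs.truncate(limit);
    docs
}
```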

So everything works perfectly in this simple scenario, but note that all the indexes given to the engine so far are positive ones, never negative.

Kerollmops commented on May 13, 2024

What about deleting documents?

Deleting documents when incremental indexing is part of the core engine can be a big problem if not handled properly.

The scenario

We have four pre-existing indexes, just to help us focus on the real problem. Indexes can be seen as positive or negative. Note that the engine works with batches of additions and subtractions stored in indexes.

[figure: positives-negative-indexes-1]

A dump of what is contained in these four indexes

A dump of index1, which is positive.

doc(0): The Philosopher's Stone
doc(1): The Chamber of Secrets
doc(2): The Prisoner of Azkaban
doc(3): The Goblet of Fire

A dump of index2, which is positive.

doc(4): The Order of the Phoenix
doc(5): The Half-Blood Prince
doc(6): The Deathly Hallows

A dump of index3, which is negative.

doc(1): The Chamber of Secrets
doc(3): The Goblet of Fire
doc(5): The Half-Blood Prince

A dump of index4, which is positive.

doc(1): The Chamber of Secrets

Note that the documents have globally unique ids; these are not auto-incremented ids, and they are stored in order in the .idx index file.


We will start the engine with only one index (as it works today), the first one, named "index1". Nothing strange happens: the engine will do a simple union on all the indexes that are present, resulting in a final stream that will yield all of our Harry Potter book titles. Simple.

We add the second index, named "index2", to the party. Nothing more complex than before: we do a union of the documents and the engine will be able to return all the documents from doc(0) to doc(6).

[figure: positives-negative-indexes-1-union-op]

Deleting documents

Now comes the funny part: what about deleting some documents? Why not? We can consider our engine as a database, so go ahead and delete book titles!

We inform the engine that "index3" must be considered a negative index: all the documents specified in it must not appear in the search results. Note that the .sst file is not needed here; the engine already knows which fields are stored in its RocksDB.

So the solution could be to do a union of all the positive indexes, like before, and a union of all the negative indexes. This way we obtain two streams: the positive one on the left and the negative one on the right.

[figure: positives-negative-indexes-1-union-ops]

The fst library can do a difference on multiple streams, but this will not be useful for us: the difference must be done on the documents that each word in the fst yields.

The word "of" must return doc(1, 2, 3, 4), the union of the index1 and index2 is a stream that will gives us the word "of" and the index3 which is negative will return "of" too, if we do the difference at this level, the word will not be present in the final stream and the doc(2) will never be reached.

So the solution could be to do unions like before, positives and negatives, and implement an fst::Stream that returns the list of documents of each stream along with a flag indicating whether the stream that returned this list of ids is a positive or a negative one. With this information it will be possible to ignore documents and not words.

[figure: positives-negative-indexes-1-diff-op]

Note that in the current implementation the information stored under each fst word is not a list of document ids but a list of DocIndexes.
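Ignoring that DocIndexes detail and using plain document ids for simplicity, here is a minimal sketch of the per-word difference with sdset, using the "of" example above:

```rust
use sdset::duo::Difference;
use sdset::{Set, SetBuf, SetOperation};

fn main() {
    // Ids yielded for the word "of" by the positive union (index1 + index2)
    // and by the negative index3, as in the dumps above.
    let positives = Set::new(&[1u32, 2, 3, 4]).unwrap();
    let negatives = Set::new(&[1u32, 3]).unwrap();

    // Subtract the negative ids at the document level, not the word level,
    // so the word "of" itself survives in the final stream.
    let remaining: SetBuf<u32> = Difference::new(positives, negatives).into_set_buf();
    assert_eq!(remaining.as_slice(), &[2u32, 4][..]); // doc(2) and doc(4) survive
}
```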

Ok, so we are able to delete/ignore documents with this solution, but what happens if we want to index documents that have already been deleted? For example, someone made a mistake by deleting doc(1) and wants to fix it by re-indexing it, informing the engine that a new positive index containing doc(1) must be taken into account.

The previous solution would have made two big unions, one of all the positive and one of all the negative documents, but this is not viable: doc(1) would have been deleted and it would not have been possible to re-index it that way, because the big union of all the negative indexes takes effect on all the positive documents ever indexed, regardless of order.

[figure: positives-negative-indexes-1-union-op-2]

[figure: positives-negative-indexes-1-diff-op-wrong]

A solution is to do unions on groups of positive and negative indexes. This way "index1" and "index2" can be unified, the negative "index3" forms a group of its own, and "index4" too. A difference is made between "index1+index2" and "index3" (index(1, 2) - index(3)) and a union can be made with the result (index(index(1, 2) - index(3), 4)). Note that it is the version of the document indexed by the newest index that must be stored in the key-value store.

[figure: positives-negative-indexes-1-union-op-3]

[figure: positives-negative-indexes-1-diff-op-3]
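A minimal sketch of this grouping on plain document ids, using the contents of the four indexes above: the negative index3 only affects the positive indexes that precede it, and the newer index4 is unioned afterwards, which lets doc(1) come back.

```rust
use sdset::duo::{Difference, Union};
use sdset::{Set, SetBuf, SetOperation};

fn main() {
    let index1_2 = Set::new(&[0u32, 1, 2, 3, 4, 5, 6]).unwrap(); // index1 + index2
    let index3 = Set::new(&[1u32, 3, 5]).unwrap(); // negative
    let index4 = Set::new(&[1u32]).unwrap(); // re-indexes doc(1)

    // index(1, 2) - index(3)
    let after_deletion: SetBuf<u32> = Difference::new(index1_2, index3).into_set_buf();
    assert_eq!(after_deletion.as_slice(), &[0u32, 2, 4, 6][..]);

    // index(index(1, 2) - index(3), 4)
    let final_docs: SetBuf<u32> = Union::new(&after_deletion, index4).into_set_buf();
    assert_eq!(final_docs.as_slice(), &[0u32, 1, 2, 4, 6][..]);
}
```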

Many other questions persist: What about the fields in the key-value store when a document is re-indexed (e.g. doc(1) here)? Do we need to remove the fields of documents that have been deleted? What happens if fewer fields are added? The engine could return fields of an older version of a document.

Kerollmops commented on May 13, 2024

How do we store all of these indexes?

As previously detailed, the indexes can be positive or negative, and their order is important.
This means that we should follow some ACID properties / multiversion concurrency control.

The first idea that was proposed was to use an SQLite database to store the ids of the indexes along with their sign. This is an SQL database that follows the ACID properties, so it would be easy to query it without fear; there are also good wrappers in Rust.

The little problem is not that "log" database itself but the way to add these index files to it.

The scenario

The engine can be launched without any index; this is not a problem at all, it will just respond without any results. If we want it to respond we need to feed it with documents.

Due to the nature of the indexes, which are files, we must use the file system to store them.

The first idea is to specify a folder when we run the engine, a folder that contains the SQLite database along with all the index files. When the engine must take a new index into account, the index is added to the folder, and only after the files have been verified to be flushed to disk is the SQLite database updated with a new entry.

This way the engine is safe to shut down; the state is atomic and consistent.

The next time the engine is queried it will look at the SQLite database and see a new file, positive or negative, to add to its list of indexes to search into.
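A minimal sketch of this registration step using the rusqlite wrapper, assuming a simple hypothetical table layout (the actual schema was never specified):

```rust
use rusqlite::{Connection, Result};

// Hypothetical "log" table: one row per index file, with its sign; the
// insertion order is preserved by the auto-incremented primary key.
fn register_index(conn: &Connection, name: &str, positive: bool) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS indexes (
             id       INTEGER PRIMARY KEY AUTOINCREMENT,
             name     TEXT NOT NULL UNIQUE,
             positive INTEGER NOT NULL
         );",
    )?;
    // The index files must already be flushed to disk; only then is the
    // entry committed, so a crash never leaves the log pointing at
    // missing files.
    conn.execute(
        "INSERT INTO indexes (name, positive) VALUES (?1, ?2)",
        (name, positive),
    )?;
    Ok(())
}
```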

Compressing indexes

The previous solution is more or less correct, but it is not viable for a long-living program. Over time the database will have to manage an ever-growing number of indexes, and doing unions on many indexes is slow.

To solve the problem we can compress the indexes; the advantage of this solution is that it will reduce the number of disk lookups and their cost. To do so we can query the engine with an AlwaysMatch automaton that will return every word in the fsts, and aggregate the corresponding documents into a final index. Another good point of doing so is that this query's output is already ordered lexicographically, so it can use only a little bit of RAM, because it can be streamed directly to disk (without using a BTreeMap).
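A minimal sketch of this compaction with the fst crate: a union over the old fsts yields exactly what an AlwaysMatch query would, already sorted, so it can be streamed into a SetBuilder writing straight to disk.

```rust
use std::fs::File;
use std::io::BufWriter;
use fst::{Set, SetBuilder, Streamer};

fn compress_indexes(indexes: &[Set<Vec<u8>>], dest: &str) -> fst::Result<()> {
    // Union every word of the old fsts; the stream is lexicographically
    // ordered, so nothing needs to be buffered in memory.
    let mut op = fst::set::OpBuilder::new();
    for index in indexes {
        op.push(index);
    }
    let mut union = op.union();

    let mut builder = SetBuilder::new(BufWriter::new(File::create(dest)?))?;
    while let Some(word) = union.next() {
        // The real engine would also merge the document indexes (.idx)
        // associated with each word; this sketch only rebuilds the word fst.
        builder.insert(word)?;
    }
    builder.finish()?;
    Ok(())
}
```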

Once the compressed index has been produced we want to remove the old indexes, which have now become duplicates, replacing them with the compressed version.

But when will it be allowed to do so? We cannot remove the files before we create an entry in the "log" database. We can create a transaction that removes the old index entries and creates an entry for the new one. That will be ok: subsequent queries to the engine will not see the old ones but only the new compressed index.

What about the old files? When will it be possible to remove them? We cannot do it right away because the engine can have a query that is still searching through them. Do we need to implement some sort of reference counting on each index?
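One possible shape for that reference counting, as a hypothetical sketch: every running query clones an Arc over the index files it searches, and a file is removed from disk only when the last holder drops it.

```rust
use std::path::PathBuf;
use std::sync::Arc;

struct IndexFile {
    path: PathBuf,
}

impl Drop for IndexFile {
    fn drop(&mut self) {
        // No query can reach this file anymore: the log database already
        // points to the compressed index, so the old file can go.
        let _ = std::fs::remove_file(&self.path);
    }
}

// The engine swaps this list atomically after a compaction; in-flight
// queries keep their own clone of the previous list alive.
type LiveIndexes = Arc<Vec<Arc<IndexFile>>>;
```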

from meilisearch.

Kerollmops avatar Kerollmops commented on May 13, 2024

Incremental indexing is implemented directly on top of the key-value store.
More information in issue #17.
