jina-ai / annlite
⚡ A fast embedded library for approximate nearest neighbor search
License: Apache License 2.0
rebuild the index (sqlite, and vector index) from the local lmdb data
BaseIndex
fit
a function to check whether the training is valid
stat and clear APIs
I am trying to install annlite
on my MacBook with an M1 chip using pip install annlite,
but I receive the following error:
clang: error: the clang compiler does not support '-march=native'
error: command '/usr/bin/clang' failed with exit code 1
Is there any suggestion to fix it?
CI workflow
CD workflow
Pros:
Cons:
Support different metrics:
Improve the following parts (probably cython):
The main issue with the current version is that unless a lot of data is filtered, the code ends up being slower than cdist.
Hi!
I am currently taking a look at jina-ai. The plan is to get a simple text-based document search going and so far I've managed to make a simple demo locally which uses the PQLiteIndexer
(based on AnnLite).
flow = Flow(port=5050)
flow = (
flow
.add(uses=TfIdfEncoder, uses_with=dict(tfidf_fp=tfidf_fp))
.add(uses='jinahub://PQLiteIndexer/latest', install_requirements=True, uses_with=dict(dim=dim))
)
The next step would be for me to see how I can deploy a prototype to Google Cloud Platform (GCP) and, if possible, use Cloud Run in order to keep costs at minimum.
However, since AnnLite requires access to a local file-system I am not sure if that's possible. I intended to use Cloud Storage but it seems AnnLite would not support this.
What options do I have here?
Hi,
I'm reading the main body of annlite,
and I found some core functions lack comments, which may cause some confusion (at least to me).
Maybe I can add some comments while I'm reading, and open a PR for that?
Explore the effect of n_probe
and propose a better search-cell strategy.
Now, the filtering conditions only support the AND combination.
The goal is to convert the following condition into a SQL WHERE clause (supporting both AND and OR).
The input DSL
{"and": [
{"eq": ("foo", 3)},
{"gt": ("bar", 4)},
]
}
The resulting SQL WHERE Clause:
WHERE foo = 3 AND bar > 4
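A recursive walk over the DSL is one way to get there. The sketch below is a hypothetical helper, not annlite's actual implementation; it handles nested "and"/"or" nodes and emits a parameterized clause (with `?` placeholders) rather than inlining the literal values, which is safer with sqlite:

```python
# Hypothetical DSL-to-WHERE-clause converter supporting nested AND/OR.
# Returns a parameterized clause string plus the parameter list.
OPS = {'eq': '=', 'ne': '!=', 'gt': '>', 'gte': '>=', 'lt': '<', 'lte': '<='}

def to_where(node):
    if isinstance(node, dict) and len(node) == 1:
        key, value = next(iter(node.items()))
        if key in ('and', 'or'):
            # value is a list of child conditions: translate each, join.
            parts, params = [], []
            for child in value:
                clause, child_params = to_where(child)
                parts.append(f'({clause})')
                params.extend(child_params)
            return f' {key.upper()} '.join(parts), params
        if key in OPS:
            column, operand = value
            return f'{column} {OPS[key]} ?', [operand]
    raise ValueError(f'unsupported node: {node!r}')

dsl = {'and': [{'eq': ('foo', 3)}, {'gt': ('bar', 4)}]}
clause, params = to_where(dsl)
print('WHERE ' + clause, params)  # WHERE (foo = ?) AND (bar > ?) [3, 4]
```

Because every node is translated the same way, an "or" list nests inside an "and" list for free.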
Since we moved to jcloud
deployment, it's necessary to support uploading/downloading the PCA/PQ model to/from Hubble.
Thus, we need to implement these APIs:
self._projector_codec.upload(artifact='...')
self._projector_codec.download(artifact='...')
The artifact
is determined by users and should be consistent throughout the whole pipeline. It should also be passed to jcloud.yaml.
Hi, do we consider using clang-format
or something else to format all the cpp/h
files?
If it's necessary, maybe I can open a PR for that.
Hi,
it would be nice to have an option to retrieve a document by its id from lmdb.
Now it is possible via
_index.doc_store(0).get(['document_id'])
which is cumbersome given the number of different cells.
thank you
The quantization process expects the whole dataset to be in memory, which does not scale to big datasets.
In order to reduce memory usage, we need to implement PCA inside ANNlite.
There will be two PRs for this feature:
scikit-learn
ANNlite
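For the scikit-learn side, batch-wise training is already available via IncrementalPCA, so the whole dataset never has to sit in memory. A sketch (the batch sizes and dimensions here are illustrative, not annlite's defaults):

```python
# Train PCA one batch at a time, e.g. batches streamed out of LMDB
# instead of loading everything into RAM at once.
import numpy as np
from sklearn.decomposition import IncrementalPCA

dim, n_components, batch_size = 128, 32, 1024
ipca = IncrementalPCA(n_components=n_components)

for _ in range(10):
    # Stand-in for a batch fetched from storage.
    batch = np.random.rand(batch_size, dim).astype(np.float32)
    ipca.partial_fit(batch)

reduced = ipca.transform(np.random.rand(5, dim).astype(np.float32))
print(reduced.shape)  # (5, 32)
```

Each partial_fit call updates the running estimate of the components, so memory usage stays proportional to the batch size rather than the dataset size.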
Hello,
my use case is search in long text documents.
Documents are split into chunks (let's say sentences) and each chunk has its embedding. The root document has no embedding.
I am not able to index documents with the annlite indexer because of the missing embedding of the root document; only chunks may be indexed.
If I store documents directly to lmdb via self._index.doc_store(0).insert(root_docs),
then loading the query flow throws an error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.
10 means 5 root docs and 5 chunks together (dummy data).
Can you please help me
Thanks
HM-ANN enables billion-scale similarity search on a single machine without compression technologies.
HM-ANN: Efficient Billion-Point Nearest Neighbor
Search on Heterogeneous Memory
Currently, only the HNSWPostgresIndexer supports persistence of Documents. Could we add database persistence to the PQLiteIndexer, not necessarily Postgres, but any database provider?
Line 59 in e4e706e
Performance benchmark experiment
Something like this:
Sometimes we need to train the PCA model when we have already created an indexer (for example, there is a memory issue after we have indexed thousands or even millions of documents, and we need PCA to fix it).
We need to fetch training data from lmdb,
but this is tricky when we move to jcloud,
since we need to fetch data from the server instead of the local machine.
One way to solve this is to add a new endpoint in the client called /fetch:
data = client.post('/fetch', params={'batch_size': 1024})
For training we can use partial_train()
:
annlite.partial_train(data)
Add support for brute-force search with dense numpy arrays, and add an example in the README.
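A README example for this could be as small as the following sketch (a plain numpy implementation of exact search, not annlite's API):

```python
# Exact (brute-force) nearest-neighbor search over a dense numpy corpus.
import numpy as np

def bruteforce_search(queries, corpus, limit=10):
    # Squared Euclidean distance between every query and corpus vector,
    # via broadcasting: (n_queries, 1, dim) - (1, n_corpus, dim).
    dists = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(dists, axis=1)[:, :limit]
    return np.take_along_axis(dists, idx, axis=1), idx

corpus = np.random.rand(1000, 64).astype(np.float32)
queries = np.random.rand(3, 64).astype(np.float32)
dists, idx = bruteforce_search(queries, corpus, limit=5)
print(idx.shape)  # (3, 5)
```

For large corpora the pairwise-distance matrix gets big, so a production version would chunk the corpus, but the exact results make this a useful recall baseline.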
I have a simple flow that looks like this:
f = (
Flow(port_expose=8082, protocol='http', monitoring=True, port_monitoring=9090)
.add(name='encoder', uses='jinahub+docker://CLIPEncoder')
.add(name='processor',
uses='jinahub+docker://PQLiteIndexer/latest',
uses_with={
'dim': 512,
'columns': columns
},
)
)
Indexing works fine and I can verify it using /status
endpoint where it shows the number of indexed documents. When I hit the /search
endpoint, I can search and retrieve results correctly.
I also verified that filtering works by testing it with $eq
. However, when I test it with $in
, things go south. Not only does it not return any results, but it also seems to crash my entire database where I can't make calls to endpoints like /status
and /search
. Does anyone have any idea as to what is happening? Here is how I am structuring my filter query:
# this query searches the files with a tag 'owners' of type array which includes the given string
search_results = c.post(on="/search",
parameters={
"query": QUERIES[0],
"traversal_paths": '@r,c',
"limit": 3,
"filter":{"owners": {"$in": ["EGGWLJSUHT6GLWU2KIB0"]}}
})
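For scalar columns, $in would normally lower to a parameterized SQL IN clause, roughly as sketched below (a hypothetical translation, not annlite's code). Note that the `owners` tag in the query above is array-valued, and an array tag does not map directly onto a scalar SQL column, which may be related to the crash:

```python
# Hypothetical lowering of a {"col": {"$in": [...]}} filter to SQL.
import sqlite3

def in_clause(column, values):
    # One ? placeholder per candidate value.
    placeholders = ', '.join('?' for _ in values)
    return f'{column} IN ({placeholders})', list(values)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (doc_id TEXT, owners TEXT)')
conn.executemany('INSERT INTO docs VALUES (?, ?)',
                 [('a', 'EGGWLJSUHT6GLWU2KIB0'), ('b', 'OTHER')])

clause, params = in_clause('owners', ['EGGWLJSUHT6GLWU2KIB0'])
rows = conn.execute(f'SELECT doc_id FROM docs WHERE {clause}', params).fetchall()
print(rows)  # [('a',)]
```

If the stored value is a list serialized into one cell, IN compares against the whole serialized string and silently matches nothing; supporting $in on array tags needs a different encoding (e.g. a join table).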
The bottleneck of search is the table SQL query:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
71 @line_profile
72 def ivf_search(
73 self,
74 x: np.ndarray,
75 cells: np.ndarray,
76 where_clause: str = '',
77 where_params: Tuple = (),
78 limit: int = 10,
79 ):
80 15 18.0 1.2 0.0 dists = []
81
82 15 11.0 0.7 0.0 doc_idx = []
83 15 7.0 0.5 0.0 cell_ids = []
84 15 6.0 0.4 0.0 count = 0
85 30 141.0 4.7 0.0 for cell_id in cells:
86 15 23.0 1.5 0.0 cell_table = self.cell_table(cell_id)
87 15 54765.0 3651.0 4.5 cell_size = cell_table.count()
88 15 20.0 1.3 0.0 if cell_size == 0:
89 continue
90
91 15 6.0 0.4 0.0 indices = None
92 15 10.0 0.7 0.0 if where_clause or (cell_table.deleted_count() > 0):
93 15 11.0 0.7 0.0 indices = []
94 500030 806655.0 1.6 66.3 for doc in cell_table.query(
95 15 9.0 0.6 0.0 where_clause=where_clause, where_params=where_params
96 ):
97 500000 274113.0 0.5 22.5 indices.append(doc['_id'])
98
99 15 27.0 1.8 0.0 if len(indices) == 0:
100 continue
101
102 15 13655.0 910.3 1.1 indices = np.array(indices, dtype=np.int64)
103
104 30 63932.0 2131.1 5.3 _dists, _doc_idx = self.vec_index(cell_id).search(
105 15 32.0 2.1 0.0 x, limit=min(limit, cell_size), indices=indices
106 )
107
108 15 22.0 1.5 0.0 if count >= limit and _dists[0] > dists[-1][-1]:
109 continue
110
111 15 24.0 1.6 0.0 dists.append(_dists)
112 15 9.0 0.6 0.0 doc_idx.append(_doc_idx)
113 15 41.0 2.7 0.0 cell_ids.extend([cell_id] * len(_dists))
114 15 13.0 0.9 0.0 count += len(_dists)
115
116 15 113.0 7.5 0.0 cell_ids = np.array(cell_ids, dtype=np.int64)
117 15 13.0 0.9 0.0 if len(dists) != 0:
118 15 459.0 30.6 0.0 dists = np.hstack(dists)
119 15 125.0 8.3 0.0 doc_idx = np.hstack(doc_idx)
120
121 15 105.0 7.0 0.0 indices = dists.argsort(axis=0)[:limit]
122 15 28.0 1.9 0.0 dists = dists[indices]
123 15 14.0 0.9 0.0 cell_ids = cell_ids[indices]
124 15 9.0 0.6 0.0 doc_idx = doc_idx[indices]
125
126 15 6.0 0.4 0.0 doc_ids = []
127 165 163.0 1.0 0.0 for cell_id, offset in zip(cell_ids, doc_idx):
128 150 1750.0 11.7 0.1 doc_id = self.cell_table(cell_id).get_docid_by_offset(offset)
129 150 94.0 0.6 0.0 doc_ids.append(doc_id)
130 15 8.0 0.5 0.0 return dists, doc_ids, cell_ids
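The profile above spends roughly 89% of its time iterating query rows and appending each `doc['_id']` in Python. A common fix, sketched here with an illustrative table and schema (not annlite's real one), is to select only the id column and let the sqlite driver materialize all rows in one fetchall call:

```python
# Fetch matching ids in one driver call instead of a per-row Python loop.
import sqlite3
import numpy as np

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cell0 (_id INTEGER, price REAL)')
conn.executemany('INSERT INTO cell0 VALUES (?, ?)',
                 [(i, float(i)) for i in range(100)])

rows = conn.execute('SELECT _id FROM cell0 WHERE price > ?', (89.5,)).fetchall()
# Build the int64 index array directly, avoiding list.append per row.
indices = np.fromiter((r[0] for r in rows), dtype=np.int64, count=len(rows))
print(indices[:3])  # [90 91 92]
```

This keeps the filtering in the SQL engine and moves the row materialization into C, which is usually much cheaper than a Python-level loop over half a million rows.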
None of the code in the readme is working due to API changes
Benchmark among the candidates; the goal is to make a decision on which storage engine to use in pqlite:
In Jina 3.x, the in-memory sqlite cannot work in an executor as before.
Python version: 3.9
MacOS version: 12.2.1
CMD used: pip install https://github.com/jina-ai/annlite/archive/refs/heads/main.zip (or pip install "docarray[full]")
Error log:
Building wheels for collected packages: annlite
Building wheel for annlite (PEP 517) ... error
ERROR: Command errored out with exit status 1:
command: /opt/anaconda3/envs/jina/bin/python /opt/anaconda3/envs/jina/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmps65jnhoh
cwd: /private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-req-build-zbgplt5p
Complete output (48 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.macosx-11.1-arm64-cpython-39
creating build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/enums.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/profile.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/container.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/utils.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/helper.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/filter.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
copying annlite/math.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core
copying annlite/core/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/kv.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/table.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
copying annlite/storage/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/storage
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/vq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/pq.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
copying annlite/core/codec/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/codec
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/pq_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/flat_index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
copying annlite/core/index/base.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index
creating build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/index.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
copying annlite/core/index/hnsw/init.py -> build/lib.macosx-11.1-arm64-cpython-39/annlite/core/index/hnsw
running build_ext
creating var
creating var/folders
creating var/folders/8m
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn
creating var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include/python3.9 -c /var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.cpp -o var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/tmplkc2w81j.o -std=c++14
building 'annlite.hnsw_bind' extension
creating build/temp.macosx-11.1-arm64-cpython-39
creating build/temp.macosx-11.1-arm64-cpython-39/bindings
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/opt/anaconda3/envs/jina/include -fPIC -O2 -isystem /opt/anaconda3/envs/jina/include -arch arm64 -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/pybind11/include -I/private/var/folders/8m/b8y6xg410sz19rwf9v_6fqch0000gn/T/pip-build-env-ibiuvri9/overlay/lib/python3.9/site-packages/numpy/core/include -I./include/hnswlib -I/opt/anaconda3/envs/jina/include/python3.9 -c ./bindings/hnsw_bindings.cpp -o build/temp.macosx-11.1-arm64-cpython-39/./bindings/hnsw_bindings.o -O3 -march=native -stdlib=libc++ -mmacosx-version-min=10.7 -DVERSION_INFO="0.3.2" -std=c++14
clang: error: the clang compiler does not support '-march=native'
error: command '/usr/bin/clang' failed with exit code 1
+++
ERROR: Failed building wheel for annlite
Failed to build annlite
ERROR: Could not build wheels for annlite which use PEP 517 and cannot be installed directly
The performance difference is actually introduced by the disk (SSD vs. physical disk). The index procedure is blocked by LMDB operations.
Originally posted by @numb3r3 in #26 (comment)
TODO:
references:
I was just chatting with @davidbp about filtering in PQLite. For my fashion search example I'm looking at adding filters (similar to Amazon) to pre-filter results. This work is being done in a separate branch of my repo.
At present I'm able to easily search in ranges (e.g. price, year), or above a certain threshold (e.g. rating):
filter = {
"$and": {
"year": {"$gte": 2011, "$lte": 2014},
"price": {"$gte": 100, "$lte": 200},
"rating": {"$gte": 3},
},
}
But what would be really useful is a convenient way to search for AND and XOR.
Previously I tried something (which actually works) like:
filter = {
"$and": {
"year": {"$lte": 2014, "$gte": 2011},
"price": {"$gte": 0, "$lte": 200},
},
"$or": {
"baseColour": {"$eq": "Black"},
"$or": {
"baseColour": {"$eq": "White"},
"$or": {
"baseColour": {"$eq": "Blue"}
}
}
}
}
But this is:
Some new operators: $one_of
and $all_of
filter = {
"$and": {
"year": {"$gte": 2011, "$lte": 2014},
"price": {"$gte": 100, "$lte": 200},
"rating": {"$gte": 3},
"baseColour": {"$one_of": ['White', 'Blue', 'Black']},
"season": {"$all_of": ['Summer', 'Spring', 'Fall']},
},
}
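Since $one_of and $all_of are only a proposal here, a small sketch of the intended semantics (hypothetical code): $one_of matches if the tag value is any of the listed values, while $all_of matches if the tag, itself a list, contains all of them.

```python
# Hypothetical in-memory evaluation of the proposed operators.
def matches(tags, condition):
    for column, ops in condition.items():
        for op, operand in ops.items():
            value = tags.get(column)
            if op == '$one_of' and value not in operand:
                return False
            if op == '$all_of' and not set(operand) <= set(value or []):
                return False
    return True

doc = {'baseColour': 'Blue', 'season': ['Summer', 'Spring', 'Fall', 'Winter']}
print(matches(doc, {'baseColour': {'$one_of': ['White', 'Blue', 'Black']},
                    'season': {'$all_of': ['Summer', 'Spring', 'Fall']}}))  # True
```

In SQL terms, $one_of maps naturally onto an IN clause, whereas $all_of needs a containment check against an array-valued tag.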
In Commsor (our community analysis tool) we use a lot of filters that are useful in the real world:
So I'd also like to propose the following operators:
$contains
$notcontains
(e.g. we often want to filter out universities since we focus on enterprises, so we would say company_name $notcontains "university")
Implement pure filtering without involving vector search.
From today's meeting we've commented:
- README
- TYPE_CHECKING to protect against unnecessary imports
- from_bytes and to_bytes for reading/writing the binary of a Document
After some iterations, python examples/hnsw_benchmark.py
included in the PR seems to fail. Can you reproduce the following?
Xtr: (124980, 128) vs Xte: (20, 128)
2021-11-23 11:42:03.020 | WARNING | pqlite.index:train:131 - The pqlite has been trained or is not trainable. Please use ``force_retrain=True`` to retrain.
2021-11-23 11:42:46.358 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:43:30.497 | DEBUG | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.95, 'recall': 0.95, 'train_time': 0.000240325927734375, 'index_time': 87.68162178993225, 'query_time': 0.1407299041748047, 'query_qps': 142.1162056300232, 'index_qps': 1425.384219049087, 'indexer_hyperparams': {'n_cells': 1, 'n_subvectors': 64}}
2021-11-23 11:43:30.908 | INFO | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:43:31.179 | INFO | pqlite.index:__init__:86 - Initialize VQ codec (K=8)
2021-11-23 11:43:33.298 | INFO | pqlite.index:train:137 - Start training VQ codec (K=8) with 20480 data...
2021-11-23 11:43:34.021 | INFO | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:43:34.021 | INFO | pqlite.index:dump_model:282 - Save the trained parameters to data/c905ae006031e55b1d8d51e87803d278
2021-11-23 11:44:19.429 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:23.197 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:24.466 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:28.833 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:30.179 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:33.951 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:35.024 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:44:38.036 | DEBUG | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.99, 'recall': 0.99, 'train_time': 0.7391390800476074, 'index_time': 64.20390892028809, 'query_time': 0.2022690773010254, 'query_qps': 98.8781887319096, 'index_qps': 1946.6104494536003, 'indexer_hyperparams': {'n_cells': 8, 'n_subvectors': 64}}
2021-11-23 11:44:38.510 | INFO | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:44:38.736 | INFO | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:44:38.951 | INFO | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:44:39.172 | INFO | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:44:39.610 | INFO | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:44:39.836 | INFO | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:44:40.064 | INFO | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:44:40.290 | INFO | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:44:40.518 | INFO | pqlite.index:__init__:86 - Initialize VQ codec (K=16)
2021-11-23 11:44:46.918 | INFO | pqlite.index:train:137 - Start training VQ codec (K=16) with 20480 data...
2021-11-23 11:44:47.653 | INFO | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:44:47.653 | INFO | pqlite.index:dump_model:282 - Save the trained parameters to data/75115be8393181300ec49112b88b2445
2021-11-23 11:45:30.760 | DEBUG | pqlite.core.index.hnsw.index:_expand_capacity:93 - HNSW index capacity is expanded by 10240
2021-11-23 11:45:46.906 | DEBUG | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 0.9850000000000001, 'recall': 0.9850000000000001, 'train_time': 0.7362098693847656, 'index_time': 59.48536229133606, 'query_time': 0.28374195098876953, 'query_qps': 70.48658095958322, 'index_qps': 2101.021077889663, 'indexer_hyperparams': {'n_cells': 16, 'n_subvectors': 64}}
2021-11-23 11:45:47.490 | INFO | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:45:47.970 | INFO | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:45:48.488 | INFO | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:45:48.952 | INFO | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:45:49.400 | INFO | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:45:49.650 | INFO | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:45:50.114 | INFO | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:45:50.552 | INFO | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:45:50.987 | INFO | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:45:51.448 | INFO | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:45:51.888 | INFO | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:45:52.329 | INFO | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:45:52.773 | INFO | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:45:53.211 | INFO | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:45:53.662 | INFO | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:45:54.112 | INFO | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:45:54.555 | INFO | pqlite.index:__init__:86 - Initialize VQ codec (K=32)
2021-11-23 11:46:10.758 | INFO | pqlite.index:train:137 - Start training VQ codec (K=32) with 20480 data...
2021-11-23 11:46:11.500 | INFO | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:46:11.500 | INFO | pqlite.index:dump_model:282 - Save the trained parameters to data/8f37c3b2ffd1c67e4c81e81f64db0eea
2021-11-23 11:47:01.267 | DEBUG | pqlite.container:insert:180 - => 124980 new docs added
{'precision': 1.0, 'recall': 1.0, 'train_time': 0.7422680854797363, 'index_time': 49.93944001197815, 'query_time': 0.30017614364624023, 'query_qps': 66.62754660333749, 'index_qps': 2502.6311862933007, 'indexer_hyperparams': {'n_cells': 32, 'n_subvectors': 64}}
2021-11-23 11:47:01.804 | INFO | pqlite.index:clear:259 - Clear the index of cell-0
2021-11-23 11:47:02.336 | INFO | pqlite.index:clear:259 - Clear the index of cell-1
2021-11-23 11:47:02.945 | INFO | pqlite.index:clear:259 - Clear the index of cell-2
2021-11-23 11:47:03.533 | INFO | pqlite.index:clear:259 - Clear the index of cell-3
2021-11-23 11:47:04.493 | INFO | pqlite.index:clear:259 - Clear the index of cell-4
2021-11-23 11:47:05.305 | INFO | pqlite.index:clear:259 - Clear the index of cell-5
2021-11-23 11:47:06.152 | INFO | pqlite.index:clear:259 - Clear the index of cell-6
2021-11-23 11:47:06.923 | INFO | pqlite.index:clear:259 - Clear the index of cell-7
2021-11-23 11:47:07.710 | INFO | pqlite.index:clear:259 - Clear the index of cell-8
2021-11-23 11:47:08.489 | INFO | pqlite.index:clear:259 - Clear the index of cell-9
2021-11-23 11:47:09.025 | INFO | pqlite.index:clear:259 - Clear the index of cell-10
2021-11-23 11:47:09.471 | INFO | pqlite.index:clear:259 - Clear the index of cell-11
2021-11-23 11:47:09.901 | INFO | pqlite.index:clear:259 - Clear the index of cell-12
2021-11-23 11:47:10.344 | INFO | pqlite.index:clear:259 - Clear the index of cell-13
2021-11-23 11:47:10.775 | INFO | pqlite.index:clear:259 - Clear the index of cell-14
2021-11-23 11:47:11.211 | INFO | pqlite.index:clear:259 - Clear the index of cell-15
2021-11-23 11:47:11.639 | INFO | pqlite.index:clear:259 - Clear the index of cell-16
2021-11-23 11:47:12.076 | INFO | pqlite.index:clear:259 - Clear the index of cell-17
2021-11-23 11:47:12.503 | INFO | pqlite.index:clear:259 - Clear the index of cell-18
2021-11-23 11:47:12.932 | INFO | pqlite.index:clear:259 - Clear the index of cell-19
2021-11-23 11:47:13.367 | INFO | pqlite.index:clear:259 - Clear the index of cell-20
2021-11-23 11:47:13.815 | INFO | pqlite.index:clear:259 - Clear the index of cell-21
2021-11-23 11:47:14.261 | INFO | pqlite.index:clear:259 - Clear the index of cell-22
2021-11-23 11:47:14.700 | INFO | pqlite.index:clear:259 - Clear the index of cell-23
2021-11-23 11:47:15.132 | INFO | pqlite.index:clear:259 - Clear the index of cell-24
2021-11-23 11:47:15.587 | INFO | pqlite.index:clear:259 - Clear the index of cell-25
2021-11-23 11:47:16.032 | INFO | pqlite.index:clear:259 - Clear the index of cell-26
2021-11-23 11:47:16.472 | INFO | pqlite.index:clear:259 - Clear the index of cell-27
2021-11-23 11:47:16.903 | INFO | pqlite.index:clear:259 - Clear the index of cell-28
2021-11-23 11:47:17.350 | INFO | pqlite.index:clear:259 - Clear the index of cell-29
2021-11-23 11:47:17.787 | INFO | pqlite.index:clear:259 - Clear the index of cell-30
2021-11-23 11:47:18.227 | INFO | pqlite.index:clear:259 - Clear the index of cell-31
2021-11-23 11:47:18.661 | INFO | pqlite.index:__init__:86 - Initialize VQ codec (K=64)
2021-11-23 11:47:56.044 | INFO | pqlite.index:train:137 - Start training VQ codec (K=64) with 20480 data...
2021-11-23 11:47:57.003 | INFO | pqlite.index:train:148 - The pqlite is successfully trained!
2021-11-23 11:47:57.004 | INFO | pqlite.index:dump_model:282 - Save the trained parameters to data/e01ce8063d859fe594084b33a10515e8
2021-11-23 11:48:55.512 | DEBUG | pqlite.container:insert:180 - => 124980 new docs added
Traceback (most recent call last):
File "examples/hnsw_benchmark.py", line 95, in <module>
pq.search(docs, limit=top_k)
File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/index.py", line 238, in search
match_dists, match_docs = self.search_cells(
File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 144, in search_cells
dists, doc_ids, cells = self.ivf_search(
File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/container.py", line 107, in ivf_search
_dists, _doc_idx = self.vec_index(cell_id).search(
File "/Users/davidbuchaca1/Documents/jina_stuff/pqlite/pqlite/core/index/hnsw/index.py", line 77, in search
ids, dists = self._index.knn_query(query, k=limit)
RuntimeError: Cannot return the results in a contigious 2D array. Probably ef or M is too small
Originally posted by @davidbp in #18 (comment)
From my testing:
If the number of valid data points is small, we can directly apply brute-force search over the subset to speed things up.
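The shortcut described above can be sketched as follows (illustrative code, not annlite's implementation): when the filter leaves only a handful of valid ids, skip the ANN index and compute exact distances on that subset.

```python
# Exact search restricted to the ids that survived the filter.
import numpy as np

def search_subset(query, vectors, valid_ids, limit=10):
    subset = vectors[valid_ids]                  # gather the surviving rows
    dists = ((subset - query) ** 2).sum(axis=1)  # exact squared L2 distances
    order = np.argsort(dists)[:limit]
    return dists[order], np.asarray(valid_ids)[order]

vectors = np.random.rand(10000, 64).astype(np.float32)
query = vectors[42]                              # should match itself first
dists, ids = search_subset(query, vectors, valid_ids=[7, 42, 99, 512], limit=2)
print(ids[0])  # 42
```

The cost is linear in the subset size, so below some threshold this beats probing HNSW cells and then post-filtering.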
Hello,
I have a large dataset. My workspace is 27 GB after indexing (I have sentence embeddings, token embeddings and metadata).
But after the first inference the workspace grows to 53 GB, which is totally strange.
I have
annlite==0.3.5
jina==3.6.9
Any clues why this is happening?
Thanks
Support brute-force asymmetric distance computation (ADC) in the quantized space to lower memory costs.
With this PR, we can directly add new features to the hnsw lib.
I've been trying to use the example from Alex's multimodal search demo and also tested the example code for the PQLite extension on the Jina Hub page. Testing both examples, I get the following errors with jina 2.6.0, 2.6.2 and the latest version.
↪ python ./app.py -t index -n 10
⠏ Fetching PQLiteIndexer from Jina Hub ...DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses (raised from /home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/flatbuffers/compat.py:19)
image_encoder@263243[W]:Pea is being closed before being ready. Most likely some other Pea in the Flow or Pod failed to start
Traceback (most recent call last):
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 444, in main
run()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/debugpy/server/cli.py", line 285, in run_file
runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "./app.py", line 94, in <module>
main()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "./app.py", line 88, in main
index(csv_file=CSV_FILE, max_docs=num_docs)
File "./app.py", line 43, in index
with flow_index:
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1132, in __enter__
return self.start()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/flow/base.py", line 1179, in start
self.enter_context(v)
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
result = _cm_type.__enter__(cm)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 208, in __enter__
return self.start()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 692, in start
self.enter_context(self.replica_set)
File "/home/markussagen/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 425, in enter_context
result = _cm_type.__enter__(cm)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/pods/__init__.py", line 476, in __enter__
self._peas.append(BasePea(_args).start())
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 135, in __init__
self.runtime_cls = self._get_runtime_cls()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/__init__.py", line 427, in _get_runtime_cls
update_runtime_cls(self.args)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/peapods/peas/helper.py", line 106, in update_runtime_cls
_args.uses = HubIO(_hub_args).pull()
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 672, in pull
executor, from_cache = HubIO.fetch_meta(
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/helper.py", line 323, in wrapper
result = func(*args, **kwargs)
File "/home/markussagen/.pyenv/versions/jina/lib/python3.8/site-packages/jina/hubble/hubio.py", line 588, in fetch_meta
image_name=resp['image'],
KeyError: 'image'
We don't support loading and saving in HNSW now; every time, we need to rebuild the whole graph from the lmdb,
which is very slow when the dataset is huge. The better way is to save the HNSW graph directly and load it when initializing the indexer.
We need two APIs inside HNSW:
hnsw_indexer.load_index()
and hnsw_indexer.save_index()
will continue this ticket after #18 is merged
The pqlite has breaking changes, and hence the following two use cases are supported:
(Basic) For small-scale data (e.g., < 10M docs):
dtype=np.float32
(Advanced) For large-scale data (e.g., > 10M docs): combine 1) Product Quantization, 2) IVF, and 3) HNSW:
dtype=np.uint8