hdidx / hdidx Goto Github PK

View Code? Open in Web Editor NEW

89.0 14.0 26.0 171 KB

Approximate Nearest Neighbor (ANN) search for high-dimensional data.

Home Page: http://wanji.me/hdidx

License: MIT License

Python 81.03% C 7.99% C++ 10.97%

hdidx's Introduction

HDIdx: Indexing High-Dimensional Data

What is HDIdx?

HDIdx is a python package for approximate nearest neighbor (ANN) search. Nearest neighbor (NN) search is very challenging in high-dimensional space because of the Curse of Dimensionality problem. The basic idea of HDIdx is to compress the original feature vectors into compact binary codes, and perform approximate NN search instead of extract NN search. This can largely reduce the storage requirements and can significantly speed up the search.

Architecture

HDIdx has three main modules: 1) Encoder which can compress the original feature vectors into compact binary hash codes, 2) Indexer which can index the database items and search approximate nearest neighbor for a given query item, and 3) Storage module which encapsulates the underlying data storage, which can be memory or NoSQL database like LMDB, for the Indexer.

The current version implements following feature compressing algorithms:

Product Quantization[1].
Spectral Hashing[2].

To use HDIdx, first you should learn a Encoder from some learning vectors. Then you can map the base vectors into hash codes using the learned Encoder and building indexes over these hash codes by an Indexer, which will write the indexes to the specified storage medium. When a query vector comes, it will be mapped to hash codes by the same Encoder and the Indexer will find the similar items to this query vector.

Installation

HDIdx can be installed by pip:

[sudo] pip install cython
[sudo] pip install hdidx

By default, HDIdx use kmeans algorithm provided by SciPy. To be more efficient, you can install python extensions of OpenCV, which can be installed via apt-get on Ubuntu. For other Linux distributions, e.g. CentOS, you need to compile it from source.

[sudo] apt-get install python-opencv

HDIdx will use OpenCV automatically if it is available.

Windows Guide

General dependencies:

After install the above mentioned software, download stdint.h and put it under the include folder of Visual C++, e.g. C:\Users\xxx\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\include. Then hdidx can be installed by pip from the Anaconda Command Prompt.

Example

Here is a simple example. See this notebook for more examples.

# import necessary packages

import hdidx
import numpy as np

# generating sample data
ndim = 16      # dimension of features
ndb = 10000    # number of dababase items
nqry = 10      # number of queries

X_db = np.random.random((ndb, ndim))
X_qry = np.random.random((nqry, ndim))

# create Product Quantization Indexer
idx = hdidx.indexer.IVFPQIndexer()
# build indexer
idx.build({'vals': X_db, 'nsubq': 8})
# add database items to the indexer
idx.add(X_db)
# searching in the database, and return top-10 items for each query
ids, dis = idx.search(X_qry, 10)
print ids
print dis

Citation

Please cite the following paper if you use this library:

@article{wan2015hdidx,
  title={HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search},
  author={Wan, Ji and Tang, Sheng and Zhang, Yongdong and Li, Jintao and Wu, Pengcheng and Hoi, Steven CH},
  journal={Neurocomputing },
  year={2016}
}

Reference

[1] Jegou, Herve, Matthijs Douze, and Cordelia Schmid.
    "Product quantization for nearest neighbor search."
    Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.1 (2011): 117-128.
[2] Weiss, Yair, Antonio Torralba, and Rob Fergus.
    "Spectral hashing."
    In Advances in neural information processing systems, pp. 1753-1760. 2009.

hdidx's People

Contributors

Stargazers

Watchers

Forkers

wanji chubbymaggie pombredanne webjunkie01 sthomp gojomo ericshape steerapi benchan lixiang0 aimago cwlseu allansp84 vishalbelsare codes-kzhan aagnone3 wjianxz legend147 xdtin greatwallisme

hdidx's Issues

Question: How many dimensions can hdidx support?

The published paper mentions empirical results on a dataset with 128 dimensions. How does hdidx work with 1000 dimensions? Looking at Annoy it works quite well with hundreds of dimensions but breaks down in thousands. Have you investigated empirical results with more dimensions?

Does it support multiple vector features per sample?

I need to search over a data where each item can have from 1 to 1000 descriptors with the dimension of 40. From the examples, I see that's it's not possible as it can work only with one descriptor per item. Or is it possible to index using this tool?

Fix bottleneck version problem

Hi @wanji @stevenhoi

Since from bottleneck 1.2.0, functions partsort and argpartsort have been renamed to partition and argpartition to match NumPy Release Notes
So in util.py line L83, if the bottleneck version is 1.2.0 or higher, it will cause an error, one should do fix the error by replacing

ids = bottleneck.argpartsort(dist, topk)[:topk]

ids = bottleneck.argpartition(dist, topk-1)[:topk]

Support `cosine` and `dotproduct` similarity

Current implementation only supports euclidean distance.
To support cosine and dotproduct similarity, the following modifications are required:

k-means
indexer

Error running sample code

I'm trying to run the sample code on the github page and I get this error:

$ python hdidx_test.py
Traceback (most recent call last):
  File "hdidx_test.py", line 1, in <module>
    import hdidx
  File "/usr/local/Cellar/[email protected]/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/hdidx/__init__.py", line 12, in <module>
    import indexer
ModuleNotFoundError: No module named 'indexer'

I'm using Python 3.8 and installed hdidx from pip.

Installation require Cython

I got an error running pip install that
ImportError: No module named Cython.Distutils
Command "python setup.py egg_info" failed with error code 1 in ...
Solution: remember pip install Cython first

indexing larger dataset

i have 30 million vector with every vector about 2048D. I want store it a use ANN to query..How do it?... Please. Thanks you

Import Hdidx issue on Windows 10

Hi,

I have a windows 10 with Python 3.6.5. I followed the instructions:

download stdint.h and put the file in the specified path in windows
run the command "pip install opencv-python"
run the command "pip install cython"
run the command "pip install hdidx"

when I launch Python and do "import hdidx", I get this error:

File "C:\Temp\Python\Python36\lib\site-packages\hdidx_init_.py", line 12, in
import indexer
ModuleNotFoundError: No module named 'indexer'

Any suggestions?

Many thanks,
Ivan

TypeError: Required argument 'flags' (pos 6) not found

Thanks for creating this package! It works great.

I'm posting this for anyone else who may run into the same issue.

I got the following error, which stems from OpenCV's KMeans function:

/hdidx/util.py in kmeans(vs, ks, niter)
     61         flags = cv2.KMEANS_RANDOM_CENTERS
     62         compactness, labels, centers = cv2.kmeans(
---> 63             vs, ks, criteria, 1, flags)
     64         return centers
     65 except ImportError:

TypeError: Required argument 'flags' (pos 6) not found

I think that in newer versions of OpenCV the KMeans function requires an extra input. My fix was to change line 62 and 63 to:

compactness, labels, centers = cv2.kmeans(vs, ks, None, criteria, 1, flags)

Implement inverted-index for Product Quantization

Implement the inverted-index introduced in the following paper:

Jegou, Herve, Matthijs Douze, and Cordelia Schmid.
"Product quantization for nearest neighbor search."
Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.1 (2011): 117-128.