
nearpy's Introduction

NearPy

NearPy is a Python framework for fast (approximated) nearest neighbour search in high dimensional vector spaces using different locality-sensitive hashing methods.

It lets you experiment with and evaluate new methods, but is also production-ready. It comes with a Redis storage adapter.

To install, simply run pip install NearPy. This will also install the scipy, numpy and redis packages.

Both dense and sparse (scipy.sparse) vectors are supported right now.

Read more here: http://pixelogik.github.io/NearPy/

Principle

To find approximate nearest neighbours for a query vector, the vectors to be searched must first be indexed. For each vector that is indexed ('stored'), a hash is generated, which is a string value. This hash is used as the bucket key of the bucket the vector is then stored in. Buckets are in most cases just lists of vectors; 'bucket' is simply the terminology used in these applications.

On its own this would not help with finding nearest neighbours to a query vector. The 'secret' of this mechanism is to use so-called locality-sensitive hashes (LSHs). These hashes take spatiality into account, so they tend to generate identical hash values (bucket keys) for vectors that are close to each other. This makes it very fast to retrieve close vectors for a given query vector. Because this is a rough approach it is called approximate nearest neighbour search: you might not get the true nearest neighbours. In many applications this is fine, because you just want, say, 20 vectors that are 'close enough'.
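
The sketch below illustrates the principle with the kind of hash described later for RandomBinaryProjections: the sign pattern of a vector against a few random hyperplanes forms the bucket key, so nearby vectors usually end up with the same key. This is only an illustration of the idea, not NearPy's internal implementation.

import numpy

# Illustration of the LSH bucketing idea (not NearPy internals):
# the sign pattern against a few random hyperplanes forms the bucket key.
numpy.random.seed(0)
dim, n_projections = 100, 10
planes = numpy.random.randn(n_projections, dim)

def bucket_key(v):
    # '1' if v lies on the positive side of a hyperplane, '0' otherwise
    return ''.join('1' if numpy.dot(p, v) > 0 else '0' for p in planes)

a = numpy.random.randn(dim)
b = a + 0.01 * numpy.random.randn(dim)  # a vector very close to a

# Close vectors usually share a bucket key; distant ones usually do not.
print(bucket_key(a) == bucket_key(b))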

Engine

When using NearPy you will mostly work with an Engine object, which you configure and then use. Engines are configurable pipelines for approximate nearest neighbour search (ANNS) using locality-sensitive hashes (LSHs).

(Figure: the engine pipeline)

Engines are configured using the constructor that accepts the different components along the pipeline:

def __init__(self, dim, lshashes=[RandomBinaryProjections('default', 10)],
             distance=CosineDistance(),
             vector_filters=[NearestFilter(10)],
             storage=MemoryStorage()):

The ANNS pipeline is configured for a fixed dimensionality of the feature space, set via the dim parameter of the constructor. This must be a positive integer.

The engine can use multiple LSHs, which it takes from the lshashes parameter; this must be a list of LSHash objects.

Depending on the filters used during querying, a distance measure can be specified. This is only needed for filters that require a distance (such as NearestFilter or DistanceThresholdFilter).

Filters are applied as the last step when querying for nearest neighbours. Existing implementations are NearestFilter, DistanceThresholdFilter and UniqueFilter.
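
For example, a pipeline that keeps the 20 nearest candidates by Euclidean distance could be configured like this (a minimal sketch; the dimensionality of 100 and the count of 20 are arbitrary choices):

from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections
from nearpy.filters import NearestFilter
from nearpy.distances import EuclideanDistance

# Engine with an explicit distance and a filter that keeps the 20 nearest candidates
engine = Engine(100,
                lshashes=[RandomBinaryProjections('rbp', 10)],
                distance=EuclideanDistance(),
                vector_filters=[NearestFilter(20)])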

There is another parameter to the engine called fetch_vector_filters, which is applied after fetching candidate vectors from the buckets and before distances or vector filters are applied. By default this is a single UniqueFilter, and it should normally stay that way. It is kept as a parameter in case people want to experiment with it.

The engine supports different ways of storing the indexed vectors (and the buckets). Current storage implementations are MemoryStorage and RedisStorage.

There are two main methods of the engine:

store_vector(self, v, data=None)
neighbours(self, v)

store_vector() hashes vector v with all configured LSHs and stores it in all matching buckets in the storage. The optional data argument must be JSON-serializable. It is stored with the vector and will be returned in search results. It is best practice to use only database ids as attached 'data' and store the actual data in a database.

neighbours() hashes vector v with all configured LSHs, collects all candidate vectors from the matching buckets in storage, applies the (optional) distance function and finally the (optional) filter function to construct the returned list of either (vector, data, distance) tuples or (vector, data) tuples.
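
A minimal round trip with these two methods might look like this (a sketch; the 20-dimensional space and the attached id are just for illustration):

import numpy
from nearpy import Engine

# Default pipeline: RandomBinaryProjections('default', 10), CosineDistance, NearestFilter(10)
engine = Engine(20)

v = numpy.random.randn(20)
engine.store_vector(v, data='item-42')  # attach a database id as 'data'

# With a distance configured, results are (vector, data, distance) tuples
for vector, data, distance in engine.neighbours(numpy.random.randn(20)):
    print(data, distance)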

To remove indexed vectors and their data from the engine these two methods can be used:

clean_all_buckets(self)
clean_buckets(self, hash_name)
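
Continuing the sketch above, where the default hash is named 'default':

engine.clean_buckets('default')  # remove only the buckets of the hash named 'default'
engine.clean_all_buckets()       # remove all buckets of all configured hashes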

Hashes

All LSH implementations in NearPy subclass nearpy.hashes.LSHash, which has one main method besides the constructor and reset methods.

    hash_vector(self, v)

hash_vector() hashes the specified vector and returns a list of bucket keys with one or more entries.
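
Hashes can also be used on their own, outside an engine. A small sketch (reset(dim) sets the dimension the hash projects from, which the engine normally calls for you):

import numpy
from nearpy.hashes import RandomBinaryProjections

rbp = RandomBinaryProjections('rbp', 10)
rbp.reset(100)  # dimension of the vectors to be hashed

v = numpy.random.randn(100)
print(rbp.hash_vector(v))  # a list with one bucket key, e.g. ['0110100011']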

The LSH RandomBinaryProjections projects the specified vector onto n random normalized vectors in the feature space and returns a string of zeros and ones. If v lies on the positive side of the n-th normal vector, the n-th character in the string is '1'; if v lies on the negative side, the n-th character is '0'. This way the LSH maps each possible vector of the feature space ('input space') into one of many possible buckets.

The LSH RandomDiscretizedProjections is almost identical to RandomBinaryProjections. The only difference is that it divides each projection value by a bin width and uses the resulting bin index of each random projection as part of the bucket key. Given the same number of random projection vectors as RandomBinaryProjections, this results in more buckets for the same vector set. The density of buckets along the projections is controlled by the bin width, which is a constructor parameter.
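
For example (a sketch, assuming the bin width is the third constructor argument as described above; the value 2.0 is arbitrary):

from nearpy import Engine
from nearpy.hashes import RandomDiscretizedProjections

# 10 random projections, each discretized into bins of width 2.0
rdp = RandomDiscretizedProjections('rdp', 10, 2.0)
engine = Engine(100, lshashes=[rdp])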

The LSH PCABinaryProjections is trained with a training set of vectors specified in the constructor. It performs PCA (principal component analysis) to find the directions of highest variance in the training set and then uses the first n principal components as projection vectors (i.e. the dimensions of the subspace being projected into). The idea is that this makes it more likely to get a good distribution of the vectors among the buckets. I do not have any tests on this and don't know if it makes sense at all.

The LSH PCADiscretizedProjections is the PCA version of RandomDiscretizedProjections, using not random vectors but the first n principal components of the training set, as PCABinaryProjections does.
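
A construction sketch for the PCA hashes (the training-set layout of one vector per row is an assumption here; check the class docstrings for the exact expected shape):

import numpy
from nearpy import Engine
from nearpy.hashes import PCABinaryProjections

# Assumed layout: 1000 training vectors of dimension 100, one per row
training_set = numpy.random.randn(1000, 100)

pca = PCABinaryProjections('pca', 10, training_set)
engine = Engine(100, lshashes=[pca])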

Hash configurations

To save your index in Redis and re-use it after the indexing process is done, you should persist your hash configurations, so that afterwards you can re-create the same engine for further indexing or querying.

This is done like this:

# Imports needed for this snippet
from redis import Redis
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections
from nearpy.storage import RedisStorage

# Create redis storage adapter
redis_object = Redis(host='localhost', port=6379, db=0)
redis_storage = RedisStorage(redis_object)

# Get hash config from redis
config = redis_storage.load_hash_configuration('MyHash')

if config is None:
    # Config does not exist, so create a new hash with 10 projections
    lshash = RandomBinaryProjections('MyHash', 10)
else:
    # Config exists, so create the hash with dummy (None) parameters
    lshash = RandomBinaryProjections(None, None)
    # Apply configuration loaded from redis
    lshash.apply_config(config)

# Create engine for feature space of 100 dimensions and use our hash.
# This will set the dimension of the lshash only the first time, not when
# using the configuration loaded from redis. Use redis storage to store
# buckets.
engine = Engine(100, lshashes=[lshash], storage=redis_storage)

# Do some stuff like indexing or querying with the engine...

# Finally store hash configuration in redis for later use
redis_storage.store_hash_configuration(lshash)

===========

Example usage:

import numpy

from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

# Dimension of our vector space
dimension = 500

# Create a random binary hash with 10 bits
rbp = RandomBinaryProjections('rbp', 10)

# Create engine with pipeline configuration
engine = Engine(dimension, lshashes=[rbp])

# Index 100000 random vectors (set their data to a unique string)
for index in range(100000):
    v = numpy.random.randn(dimension)
    engine.store_vector(v, 'data_%d' % index)

# Create random query vector
query = numpy.random.randn(dimension)

# Get nearest neighbours
N = engine.neighbours(query)

===========

nearpy's People

Contributors

adityapatadia, amorgun, erramuzpe, filitchp, giacobenin, mariusvniekerk, mikluko, owo, pixelogik, saketkc, shixing, timgates42, toogle, wanji


nearpy's Issues

nearpy.io down

Seems like it has been out of service for more than a week.

Thanks for the project, it is really nice :)

Sparse matrix support

Hi,
The readme says that sparse vectors are currently supported. Is that true?
I've just tried the recent version from GitHub and I'm getting ValueError: dimension mismatch when it calls the hash_vector function.
It could just be my misunderstanding of how to work with this library. I can provide the code where I reproduced the error.

Memory Usage

Hi,
I have a problem with memory usage. My vector dim is 2**14; I know it's very large, but in #3 you mention turning vectors into sparse matrices.
This is my Redis memory usage when my vector count is 239:

   # Memory
   used_memory:178110184
   used_memory_human:169.86M
   used_memory_rss:181440512
   used_memory_peak:178248520
   used_memory_peak_human:169.99M
   used_memory_lua:33792
   mem_fragmentation_ratio:1.02
   mem_allocator:jemalloc-3.0.0

Why does it use about 169M of RAM? Is that normal for my data size?

Retrieving `k` nearest neighbours?

Is there a process/parameter setting that will make sure queries return no less than k ANN? (and preferably also not much more than k... exactly k would be ideal)

Like when the user asks for exactly top-20 most similar items.

I'm imagining something like searching nearby buckets if there are too few candidates. I haven't checked the code yet, maybe it's already there :)

A bug in class Engine __init__() function

In engine.py, the class Engine __init__() function, the default value for storage is MemoryStorage().
However, in this case, if I create two Engine instances without passing a storage instance (i.e. using the default value), the two engine instances share the same storage instance. I am not sure if you want that behaviour, but to me it's a little bit awkward.

I recommend changing it to this:

def __init__(..., storage=None):
    if storage is None:
        storage = MemoryStorage()
    self.storage = storage

Use python table storage instead of pickle for hashes

Hi,

I've been doing some testing with NearPy, and I think it would be a good idea to replace the pickle storage with Python tables for the hash matrices, to be able to work with data in large dimensions.
Real-life problems (CF recommendation etc.) have billions of dimensions, and the more dimensions we have, the more we need hashing methods as well.
Pickle starts to struggle when it comes to storing those large matrices (while numpy can handle matrices with billions of rows and thousands of columns without problems, if you have the memory for it).
Pickle also has performance and memory issues compared to tables:
http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/

I guess more scalable solutions are probably Hadoop-based:
https://github.com/takahi-i/likelike
https://github.com/mrsqueeze/spark-hash

(still a very handy lib though!)

Thanks!

Mismatch between docstring and functionality

There is at least one mismatch that I'm noticing right now. Specifically, the functionality specified in the docstring of Engine doesn't match what happens.

For example, in engine.py, you say:

There are four different modes of the engine:

. . .

2) No distance - The distance argument is None.
                In this case only the vector filters are applied to
                the bucket contents and the result is a list of
                filtered (vector, data) tuples.

But you're specifically replacing all incoming None arguments with default values:

def __init__(self, ...):
    #. . .
    if distance is None: distance = CosineDistance()
    self.distance = distance

Then, in Engine.neighbors, you check for self.distance being None and execute based on that:

def neighbors(self, v):
    #. . .
    # Apply distance implementation if specified
    if self.distance:
        candidates = self._append_distances(v, self.distance, candidates)

multiple hash tables

In LSH there are two factors that affect the accuracy: the number of bits in the code (projections) and the number of hash tables.
When you create an engine like:

rbp = RandomBinaryProjections('rbp', 10)
engine = Engine(dimension, lshashes=[rbp])

Does it only do the random projection once?
Can we redo the hashing multiple times to increase the accuracy?

TypeError: unhashable type: 'numpy.ndarray'

Hi,
I'm trying NearPy and I got an error. Below is some code and the error message.

import scipy.io as sio

mat_file = sio.loadmat("C:\\Users\\RA2\\Desktop\\Workspaces\\LightenedCNN_A_lfw.mat")
fea = mat_file["features"]
labels = mat_file["labels_original"]
img_path = mat_file["image_path"]

from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

# Dimension of our vector space
dimension = 256

n_d = 10
rbp1 = RandomBinaryProjections('rbp', n_d)


# Create engine with pipeline configuration
engine = Engine(dimension, lshashes=[rbp1])

for v, i in zip(fea[1:,:], range(1, len(fea)+1)):
    engine.store_vector(v, img_path[i])

query = fea[0]

# Get nearest neighbours
N = engine.neighbours(query)
File "C:\Users\RA2\Anaconda3\lib\site-packages\nearpy\filters\uniquefilter.py", line 51, in filter_vectors
    unique_dict[v[1]] = v

TypeError: unhashable type: 'numpy.ndarray'

The code crashes when n_d is below 15.
Is this a bug?

Thanks
Zehao Shi

Storage wrapper for Postgres?

In my opinion, Postgres is a widely used database and there should be storage support for it as well. I am willing to take up this task.
I am planning to use the Psycopg2 interface.

ValueError: dimension mismatch, When using space coo matrix

Hi there,
I get this error after updating NearPy to the new version:

    neighbours = nearpy_engine.neighbours(vector)
  File "/home/vahid/dev/mirad/backend/venv/local/lib/python2.7/site-packages/nearpy/engine.py", line 153, in neighbours
    nv = v / np.linalg.norm(v)
  File "/home/vahid/dev/mirad/backend/venv/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 2126, in norm
    sqnorm = dot(x, x)
  File "/home/vahid/dev/mirad/backend/venv/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 317, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch

my vector is:

    basic_vector = vectorizer.transform(
        ['some text'])
    vector = basic_vector.transpose().tocoo()

and vectorizer is a sklearn hashing vectorizer (n_features=2**18). np.linalg.norm calls dot(v, v), and since v is a sparse matrix this error occurs.

Python 3 compatibility

I've done some work in this direction. Tests in run_tests.py are passing correctly. A pull request is coming.

ImportError: No module named permutation.hashpermutations

I'm trying to get the latest version of NearPy so I'm doing the following:

root@67f7a4cf4faf:/# pip install git+git://github.com/pixelogik/NearPy.git
*** cut ***
root@67f7a4cf4faf:/# pip freeze
NearPy==0.1.2
argparse==1.2.1
bitarray==0.8.1
numpy==1.9.1
redis==2.10.3
scipy==0.14.0
virtualenv==1.11.6
wsgiref==0.1.2

No virtualenv in my case. Everything is fine until I try to use it:

root@67f7a4cf4faf:/# python -c 'import nearpy'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nearpy/__init__.py", line 20, in <module>
    from nearpy.engine import Engine
  File "/usr/local/lib/python2.7/dist-packages/nearpy/engine.py", line 27, in <module>
    from nearpy.hashes import RandomBinaryProjections
  File "/usr/local/lib/python2.7/dist-packages/nearpy/hashes/__init__.py", line 31, in <module>
    from nearpy.hashes.permutation.hashpermutations import HashPermutations
ImportError: No module named permutation.hashpermutations

I believe this could be related to 90239b7.

Use Hilbert space filling curve?

I'm curious about the choice of random binary projections versus Hilbert curve. Does rbp outperform it?

Would it make sense to add another hasher?

engine.store_vector

For a large number of vectors, it takes a long time to call "engine.store_vector()" for all the data points.
Is there a faster way to index the vectors?

BTW, thank you for answering quickly.

Nearest Neighbours in Cosine distance

A vector V is hashed into a bucket B, and then V is compared with all vectors in B. Using only this single bucket B usually leads to bad performance.

A good way to solve this is to find the several closest neighbour buckets in terms of Hamming distance. E.g. for B = "1001", the neighbour buckets could be "1000", "1011", ... Then compare V with all vectors in these buckets. A sketch of the idea follows below.
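
A minimal sketch of generating such neighbour bucket keys by flipping single bits (not part of NearPy; the assumed key format is the binary string produced by RandomBinaryProjections):

def neighbour_bucket_keys(key):
    # The key itself plus all keys within Hamming distance 1
    keys = [key]
    for i in range(len(key)):
        keys.append(key[:i] + ('0' if key[i] == '1' else '1') + key[i+1:])
    return keys

print(neighbour_bucket_keys('1001'))
# ['1001', '0001', '1101', '1011', '1000']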

What do you think about it? I've implemented an extension class based on your Engine class and would like to incorporate it into your code.

Thanks,
Xing

Crashing in storing and retrieve when data is string or list of string

Try this library when data is a string or a list of strings.
Expected: storage succeeds and nearest neighbours etc. are returned upon querying.

Code:
import numpy
from numpy import random
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections
from nearpy.filters import NearestFilter
from nearpy.distances import EuclideanDistance

dimension=5000
encoding_length = 10
rbp = RandomBinaryProjections('rbp', encoding_length) #or PCABinaryProjection
engine = Engine(dimension, lshashes=[rbp],
                distance=EuclideanDistance(),
                vector_filters=[NearestFilter(encoding_length)])
engine.clean_all_buckets()
i = numpy.random.randn(dimension)
s = 'abcde'
engine.store_vector(i, s)
y = engine.neighbours(s)

CRASH:
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/nearpy/engine.py", line 93, in neighbours
for bucket_key in lshash.hash_vector(v):
File "/usr/local/lib/python2.7/dist-packages/nearpy/hashes/randombinaryprojections.py", line 63, in hash_vector
return [''.join(['1' if x > 0.0 else '0' for x in projection])]
TypeError: 'NotImplementedType' object is not iterable
ret = engine.neighbours(s)

Struggle in sparse vector

I used csr_matrix from the scipy library to represent a sparse vector, but it does not work with the store_vector function of this library, because a vector in scipy is still a 1xn matrix, so its dimension of 1 mismatches the lshash. Could you give me an example, or let me know a suitable library for computing with sparse vectors that can be used with this library?

Possibility to update vectors

Hi,

I think it would be very useful to have the possibility to update the vector index.
Typically, a CF application would need to update movie/user vectors regularly as users buy/see movies. From my understanding of the model, this would require 1) vectorizing the movie before the update, 2) retrieving the buckets containing the movie vector and removing it from the set, 3) vectorizing the movie after the update and pushing it.
While it's certainly doable, there might be some intermediary data structures that could help with that task.

Thanks!

Support for generating and searching in L hash tables

Hi,
I didn't see this feature while experimenting with NearPy API, so apologies if I missed something...
As I understand it, NearPy.Engine creates a single LSH hash table. Most ANN algorithms use a number of hash tables to reduce the false-negative rate (missing a close neighbour).
Is it possible to create a simple API for this in NearPy?

Thanks!

Improvement on README about role of RandomBinaryProjections

The information about the role of RandomBinaryProjections is sparse.
Could you please improve that?

The experiment code is clearer about what it is doing than the example usage.
I suggest improving the example usage. Also, add a folder 'example' containing self-contained samples.

Add function to `engine` to remove an item from the index

I would like to be able to remove things from a bucket.

I think the function should look like this:

engine.remove(id)
# what should it return?
# what should happen if `id` is not in the index?

So simple I might come back with a PR.

scipy.sparse.issparse(v) object has no attribute 'sparse'

Hi, pixelogik. I'm very interested in locality-sensitive hashing and plan to use it for image retrieval. I installed the NearPy and redis packages. Then I tested the following example:

import numpy
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

# Dimension of our vector space
dimension = 500

# Create a random binary hash with 10 bits
rbp = RandomBinaryProjections('rbp', 10)

# Create engine with pipeline configuration
engine = Engine(dimension, lshashes=[rbp])

# Index 1000000 random vectors (set their data to a unique string)
for index in range(100000):
    v = numpy.random.randn(dimension)
    engine.store_vector(v, 'data_%d' % index)

# Create random query vector
query = numpy.random.randn(dimension)

# Get nearest neighbours
N = engine.neighbours(query)

It throws this error:

C:\Python27\python.exe D:/python/projects/lsh/lsh-image-search.py
Traceback (most recent call last):
  File "D:/python/projects/lsh/lsh-image-search.py", line 17, in <module>
    engine.store_vector(v, 'data_%d' % index)
  File "C:\Python27\lib\site-packages\nearpy\engine.py", line 79, in store_vector
    for bucket_key in lshash.hash_vector(v):
  File "C:\Python27\lib\site-packages\nearpy\hashes\randombinaryprojections.py", line 65, in hash_vector
    if scipy.sparse.issparse(v):
AttributeError: 'module' object has no attribute 'sparse'

This happens because the scipy module doesn't have any attribute named sparse. That attribute only gets defined when you import scipy.sparse.

So in randombinaryprojections.py it would be good to import scipy.sparse. After I did this, it succeeded.

Number of returned neighbors

I have the problem that I always get 10 results from the engine.neighbours() function, no matter how many projections I have or what the bucket size is.

Am I missing something? I could not find any setting for the number of neighbours returned.

Adding new images and removing an existing image's index

I'm new to using the NearPy package. I have a set of images whose feature vectors are already indexed. If I get a new image, is there a function that allows me to add it to a specific bucket according to its similarity to the images already indexed?

Also, if I need to remove the index entry of a specific image that I deleted from my dataset, is there a function for that?

Thanks

Please fix the Redis store_many_vectors_case

@xieqihui Sorry for the hassle.

Could you please create a new pull request for your store_many_vectors addition and make the little adjustment I suggested, so that store_many_vectors works even if Redis is not used for storage?

nearest neighbour list is empty

On running the example code, the list of nearest neighbours is empty.
N = engine.neighbours(query)

len(N) # 0

Can you explain why that is?

Empty Sequence returned by Nearest Neighbour query

Hi,

I am using the example usage code snippet to hash vectors into buckets and find the nearest neighbour. I have stored 20 vectors of 4096 dimensions each in the engine. Querying the engine for the nearest neighbour of a test vector occasionally returns an empty sequence.

I tried increasing the number of random projections to 4000 with the same results. Can you help me to understand why this is happening?

Thanks!
Deshana

what's the effect of permutations hash object?

At first, I only put the RandomBinaryProjections hash in my code, and it did not yield good results. But I could repeatedly run my query function without including the index-vector function, e.g. engine.store_vector(vector, data);

When I add a HashPermutations object to my code, such as:
permutations = HashPermutations('p')
rbp_conf = {'num_permutation':50, 'beam_size':10, 'num_neighbour':10}
permutations.add_child_hash(rbp_perm, rbp_conf)
......
redis_storage.store_hash_configuration(rbp_perm)
permutations.build_permuted_index()
Finally, the result is better than before.
But if I want to run the query function, I can't remove 'engine.store_vector(vector, data);'.
So how can I execute only the query code, without the store-vector code?
Just like this:

redis_object = redis.Redis(host='localhost', port=0, db=0, unix_socket_path='/tmp/redis.sock')
redis_storage = RedisStorage(redis_object)
config = redis_storage.load_hash_configuration('v')
if config is None:
    rbp_perm = RandomBinaryProjections('v', 30)
else:
    rbp_perm = RandomBinaryProjections(None, None)
    rbp_perm.apply_config(config)

 engine.candidate_count(query)
 engine.neighbours(query)

I hope you will soon answer my question. Thanks very much.
