Giter VIP home page Giter VIP logo

falconn-lib / falconn Goto Github PK

View Code? Open in Web Editor NEW
1.1K 46.0 191.0 5.04 MB

FAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)

Home Page: http://falconn-lib.org/

License: MIT License

Makefile 0.59% C++ 17.92% C 78.71% Shell 0.05% Python 2.74%
nearest-neighbor-search lsh cosine-similarity locality-sensitive-hashing sketches fast-lookups falconn

falconn's Introduction

FALCONN - FAst Lookups of Cosine and Other Nearest Neighbors

FALCONN is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures.

Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.

FALCONN is written in C++ and consists of several modular core classes with a convenient wrapper around them. Many mathematical operations in FALCONN are vectorized through the Eigen and FFHT libraries. The core classes of FALCONN rely on templates in order to avoid runtime overhead.

How to use FALCONN

We provide a C++ interface for FALCONN as well as a Python wrapper (that uses NumPy). In the future, we plan to support more programming languages such as Julia. For C++, FALCONN is a header-only library and has no dependencies besides Eigen (which is also header-only), so FALCONN is easy to set up. For further details, please see our documentation.

How fast is FALCONN?

On data sets with about 1 million points in around 100 dimensions, FALCONN typically requires a few milliseconds per query (running on a reasonably modern desktop CPU).

For more detailed results, see ann-benchmarks of Erik Bernhardsson. Let us point out that FALCONN is especially competitive, when the RAM budget is quite restrictive, which is not the regime the above benchmarks use.

Questions

Maybe your question is already answered in our Frequently Asked Questions. If you have additional questions about using FALCONN, we would be happy to help. Please send an email to [email protected].

Authors

FALCONN is mainly developed by Ilya Razenshteyn and Ludwig Schmidt. FALCONN has grown out of a research project with our collaborators Alexandr Andoni, Piotr Indyk, and Thijs Laarhoven.

Many of the ideas used in FALCONN were proposed in research papers over the past 20 years (see the documentation).

If you want to cite FALCONN in a publication, here is the bibliographic information of our research paper (bibtex):

Practical and Optimal LSH for Angular Distance Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt NIPS 2015

License

FALCONN is available under the MIT License (see LICENSE.txt). Note that the third-party libraries in the external/ folder are distributed under other open source licenses. The Eigen library is licensed under the MPL2. The googletest and googlemock libraries are licensed under the BSD 3-Clause License. The pybind11 library is licensed under a BSD-style license.

falconn's People

Contributors

akashin avatar azadkuh avatar csssaz avatar danix800 avatar dblalock avatar fil avatar ludwigschmidt avatar maumueller avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

falconn's Issues

Write a contributor guide

Points to cover:

  • Linting clang-format --style=Google -i <file-name>
  • Running unit tests (ideally also set up some continuous integration service, see #13)
  • Pull requests
  • What else?

Migrate to Eigen's sparse vector class?

It would be nice to use Eigen's sparse vector class because it supports linear algebra operations without additional work on our side (as opposed to the std::vector<std::pair<int, float>> we are currently using). But we should first do a small benchmark to compare the two options.

R / Rcpp

Seeing as you're planning a future Julia wrapper, I thought I'd ask for an R one too, so I can try FALCONN instead of RcppAnnoy.

Migrate Python wrapper to pybind11

pybind11 is a fairly new project for accessing C++ code from Python:

https://github.com/pybind/pybind11

There are three advantages for us:

  • They translate from Scipy sparse matrices to Eigen sparse matrices. This would make #24 easy to implement.
  • It's a header-only library, so we can add it to 'external'.
  • The project is actively developed and has very good documentation. It's specifically for connecting C++ and Python, so the semantics are often clearer than in the more generic swig.

pybind11 is planning to release v2 in early January. v2 brings many improvements for the Eigen wrapper, so it's probably good to start with v2 on our side.

How to find near neighbors in given candidates instead of all candidates

Hi,

I know the definition of find_near_neighbors are as follows:

def find_near_neighbors(self, query, threshold)
Find all the points within threshold distance from query.

I would like to know how to calculate near neighbors within distance of threshold from a GIVEN candidates. For example, given smaller candidates is provided as a list of indexes.

Actually, I have a large dataset, I only want to calculate a document's nearest neighbors in a time window [-3 days, +3 days]. And I want to avoid to build a LSH index for each days' data.

Thanks very much.

find_near_neighbors for only one NN

Is there a way to get only one result when using the find_near_neighbors? I am interesting in solving the decision problem; is there a point in the pointset thats lies within an r radius with the query?

By using the find_near_neighbors it tries to find all the points that satisfy this condition.

Write a Julia wrapper

It might make sense to wait with this until Cxx.jl (https://github.com/Keno/Cxx.jl) is supported by a default Julia installation (maybe in 0.5?). Otherwise, we might need to write a C wrapper around FALCONN (which might actually be useful in other cases, too).

Store only hash tables in memory

As far as I understand, the underlying algorithm doesn't require storing the dataset in memory, only the generated hash tables.

Given the favorable computational complexity, falconn is very interesting to use with large datasets, but then storing whole datasets in memory becomes either infeasible or very expensive.

All in all, I'm not sure where the requirement of storing the whole dataset comes from โ€“ is it an inherent limtation of the method or was it just not implemented yet (and if so, are there plans to?)

setup.py should NOT check if libraries (e.g. numpy) are installed

It is not common practice, and for a good reason.
With that code in place, it's impossible to do a one-pass pip install of a requirements.txt file including both numpy and falconn, because the setup.py of falconn would be run before numpy is installed (and fail!).

clang missing brace warnings

Remove the following warnings. From a quick search it looked like clang might be too cautious here.

In file included from falconn/swig/falconn_wrap.cc:3591:
falconn/src/include/falconn/lsh_nn_table.h:179:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
    "unknown",
    ^~~~~~~~~~
falconn/src/include/falconn/lsh_nn_table.h:209:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
    "unknown",
    ^~~~~~~~~~
2 warnings generated.

Release v1.3

  • Expose linear scan
  • Make the inner tables return the amount of memory they use
  • Parameter search in the wrapper
  • Add query objects to wrapper for multi-threaded querying (#6)
  • Add check in wrapper for k and bit-packed table (#45)

pip install error on OS X 10.11

Hello,
I'm experiencing the following error while trying to instal falconn with pip install.
Gist
My Setup: OS X 10.11, virtual environment with python 3.5 and numpy 1.11.1 . Shouldn't make a difference, but I have Eigen installed with homebrew.

Make possible to link all src/test/*.cc to one binary

Now linker fails because of not inlined functon in src/test/test_utils.h.
The fix is trivial:

--- a/src/test/test_utils.h
+++ b/src/test/test_utils.h
@@ -11,7 +11,7 @@
 namespace falconn {
 namespace test {
 
-void compare_vectors(const std::vector<float>& expected,
+inline void compare_vectors(const std::vector<float>& expected,
                      const std::vector<float>& result, float eps) {
   ASSERT_EQ(expected.size(), result.size());
   for (size_t ii = 0; ii < expected.size(); ++ii) {

pip install FALCONN failed on windows 7

I use Anaconda2 64 bit on my Windows 7 system. After I typed "pip install FALCONN", it downloads the FALCONN package and attempt to install but I got the following errors:

KeyType = int; PointSet = falconn::PlainArrayPointSet]'
falconn/swig/python_wrapper.h:337:31: required from here
falconn/src/include/falconn/wrapper/../core/polytope_hash.h:90:9: error: 'posix_memalign' was not declared in this scope
In file included from falconn/external/eigen/Eigen/Core:344:0,
from falconn/external/eigen/Eigen/Dense:1,
from falconn/src/include/falconn/eigen_wrapper.h:11,
from falconn/src/include/falconn/falconn_global.h:8,
from falconn/swig/falconn_wrap.cc:3689:
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h: In function 'Packet Eigen::internal::pmadd(const Packet&, const Packet&, const Packet
&) [with Packet = __vector(2) double]':
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:250:134: warning: control reaches end of non-void function [-Wreturn-type]
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h: In function 'Packet Eigen::internal::pmadd(const Packet&, const Packet&, const Packet
&) [with Packet = __vector(4) float]':
falconn/external/eigen/Eigen/src/Core/arch/SSE/PacketMath.h:249:134: warning: control reaches end of non-void function [-Wreturn-type]
error: command 'C:\Anaconda2\MinGW\bin\gcc.exe' failed with exit status 1

----------------------------------------

Command "C:\Anaconda2\python.exe -u -c "import setuptools, tokenize;file='c:\users\admini1\appdata\local\temp\pip-build-xmpbnu\FALCONN\se
tup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record c:\users\admini
1\ap
pdata\local\temp\pip-yj92v2-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\admini~1\app
data\local\temp\pip-build-xmpbnu\FALCONN\

Rename find_closest ?

We have the following methods for getting near(est) neighbors:

  • find_closest
  • find_k_nearest_neighbors
  • find_near_neighbors

I find this a little inconsistent. Should we rename find_closest to find_nearest_neighbor?

Python wrapper for the sparse case

Currently, the Python wrapper supports only dense data (via Numpy arrays), would be nice to generalize it to handle sparse data as well.

Release v1.2

  • A bit-packed storage hash table.
  • Multi-threaded table setup.
  • Remove sorted candidate query methods.
  • Update Python wrapper.
  • Run GloVe benchmarks.
  • Test multi-threaded setup on a few more machines / platforms.
  • Update Documentation.
  • Create github release.
  • Update PyPI package.
  • Ludwig run GloVe locally.

Postponed to v1.3:

  • Add query objects to wrapper for multi-threaded querying. (#6)
  • Add check in wrapper for k and bit-packed table. (#45)

Write a Python wrapper

  • First version of the wrapper
  • Documentation (make clear that the NumPy array must stay valid!)
  • GloVe example
  • Helper functions (compute_number_of_hash_functions, etc.)
  • Performance debugging to get the wrapper closer to C++ performance.
  • Update README (include new license information for numpy.i)
  • Create a Python package.
  • Add a thin wrapper on the Python side.

Release v1.2.1

The main point of this release is to upload a package to PyPi built with a corrected version of swig (see #58 for the discussion). However, it would be good to address the following minor issues:

  • update the list of contributors
  • add more detailed instructions (especially, regarding recent idiosyncrasies with clang and swig, see #58)
  • re-generate and re-upload documentation

Slow and memory-costly compilation

Currently, compiling anything that includes FALCONN takes quite a bit of time and RAM (on towhee running make python_package_install eats more than 2.4 Gb of RAM).

It's a bit painful: for instance, one can't run make python_package_install on the machine, where the website is hosted, which is unfortunate, since it'd be way more convenient to generate docs on it, instead of pulling them from somewhere.

Make a list of contributors

We should keep track of people who report bugs, help with testing, etc.!

So far we have:

  • Grey Ballard
  • Dhiraj Holden
  • Ilya Kornakov
  • Andrew Sabisch

Refactor the Makefile

The build targets for all the tests should be generated fully automatically by a script. The list of headers can also be generated by a script.

Maximum inner product search

The readme states: "Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search."

So how would one go about using the maximum inner product search?

Use helpers during the construction

Need to use the code like below instead of "exponential" case analysis in the wrapper.

#include <iostream>
#include <memory>

class Base {
public:
    virtual void impl() = 0;

    virtual ~Base() {}
};

template<bool u, bool v> class Derived : public Base {
public:
    void impl() {
        std::cout << u << " " << v << std::endl;
    }
};

class Factory {
public:
    static std::shared_ptr<Base> construct(bool u, bool v) {
        if (u) {
            return construct_1<true>(v);
        }
        else {
            return construct_1<false>(v);
        }
    }
private:
    template<bool u> static std::shared_ptr<Base> construct_1(bool v) {
        if (v) {
            return construct_2<u, true>();
        }
        else {
            return construct_2<u, false>();
        }
    }

    template<bool u, bool v> static std::shared_ptr<Base> construct_2() {
        return std::shared_ptr<Base>(new Derived<u, v>);
    }
};

int main() {
    std::shared_ptr<Base> x = Factory::construct(false, true);
    x->impl();
    return 0;
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.