Giter VIP home page Giter VIP logo

vpsearch's Introduction

VP-tree nearest neighbor search

A relatively simple and readable Rust implementation of Vantage Point tree search algorithm.

The VP tree algorithm doesn't need to know coordinates of items, only distances between them. It can efficiently search multi-dimensional spaces and abstract things as long as you can define similarity between them (e.g. points, colors, and even images).

Please see the API reference or examples for details.

This algorithm does not work with squared distances. When implementing Euclidean distance, you MUST use sqrt(). Vantage Point trees require metric spaces.

#[derive(Copy, Clone)]
struct Point {
    x: f32, y: f32,
}

/// `MetricSpace` makes items comparable. It's a bit like Rust's `PartialOrd`.
impl vpsearch::MetricSpace for Point {
    type UserData = ();
    type Distance = f32;

    fn distance(&self, other: &Self, _: &Self::UserData) -> Self::Distance {
        let dx = self.x - other.x;
        let dy = self.y - other.y;

        // You MUST use sqrt here! The algorithm will give wrong results for squared distances.
        (dx*dx + dy*dy).sqrt()
    }
}

fn main() {
    let points = vec![Point{x:2.0,y:3.0}, Point{x:0.0,y:1.0}, Point{x:4.0,y:5.0}];

    let vp = vpsearch::Tree::new(&points);
    let (index, _) = vp.find_nearest(&Point{x:1.0,y:2.0});

    println!("The nearest point is at ({}, {})", points[index].x, points[index].y);
}

Implementing MetricSpace for Rust built-in types

This library includes a workaround for orphan rules. You need to add your crate's type when implementing MetricSpace:

struct MyImpl; // it can be any type, as long as it's yours
impl vpsearch::MetricSpace<MyImpl> for Vec<u8> {
    // continue as usual
}

Memory efficiency tip

Tree clones all the items and puts them in the tree. If the items are too big to clone and you'd rather keep the items elsewhere, you can!

Instead of storing the items, make the tree store indices into your items storage, and pass actual items as user_data to your MetricSpace::distance() function.

let items = /* your actual items are here */;
let indices: Vec<_> = (0...items.len() as u32).collect();
let tree = Tree::new_with_user_data_ref(&items, &indices);
let res = tree.find_nearest(&needle, &items);

vpsearch's People

Contributors

gijsvs avatar kornelski avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

vpsearch's Issues

Query item's lifetime must same to the tree's lifetime?

Hi,

I'm a newbie to Rust, I want use this library to write an application for sequence processing.
But there are somethig confusing me. Here is my simplified code:

use strsim::{hamming};
extern crate vpsearch;

#[derive(Copy, Clone)]
struct Barcode<'a> {
    code: &'a String
}

impl<'a> Barcode<'a> {
    fn new(code: &'a String) -> Self {
        Barcode {
            code: code
        }
    }
}

impl<'a> vpsearch::MetricSpace for Barcode<'a> {
    type UserData = ();
    type Distance = f32;

    fn distance<'b>(&self, other: &'b Self, _: &Self::UserData) -> Self::Distance {
        let d = hamming(self.code, other.code).unwrap();
        d as f32
    }
}


fn main() {
    let seqs: Vec<String> = vec!["AATT", "TTCC"].iter().map(|s| s.to_string()).collect();
    let mut items = vec![];
    for i in 0..seqs.len() {
        items.push(Barcode::new(&seqs[i]));
    }
    let tree = vpsearch::Tree::new(&items);

    {
        let q = "TATT".to_string();
        let q = Barcode::new(&q);
        let (index, dist) = tree.find_nearest(&q);
        println!("{} {}", index, dist);
    }
    {
        let q = "TTCA".to_string();
        let q = Barcode::new(&q);
        let (index, dist) = tree.find_nearest(&q);
        println!("{} {}", index, dist);
    }
}

Compile it, got:

error[E0597]: `q` does not live long enough
  --> src/main.rs:38:30
   |
38 |         let q = Barcode::new(&q);
   |                              ^^ borrowed value does not live long enough
...
41 |     }
   |     - `q` dropped here while still borrowed
...
45 |         let (index, dist) = tree.find_nearest(&q);
   |                             ---- borrow later used here

error: aborting due to previous error

For more information about this error, try `rustc --explain E0597`.
error: could not compile `vptest`.

In the real application, for saving the memory the query should be droped when it has been processed. Is there are any solution to make the query item and tree have independent lifetimes?

Implement Serialize and Deserialize for Tree

Would it be possible to implement serde's Serialize and Deserialize for the Tree struct? This would allow me to generate the tree, save it to a file and load it when I need it to speed up things.

Parallelise Tree creation with Rayon

First of all, I have to say thank you for this library; I have found it extremely useful, ergonomic and it is an essential part of the project I am using it in.

As the size of the input data I am processing increases I have found the creation of the tree becomes a bottleneck as it is CPU bound and limited to only 1 core with the current implementation.

I have only skimmed the codebase and am generally unfamiliar with the creation of VP trees, but it appears a few choice applications of Rayon's parallel iterators (in the index sorting for example) would greatly reduce the creation time of the tree structure.

I'll be experimenting with it over the coming days but wanted to raise this issue to get your thoughts on this approach.

Thank you again.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.