Giter VIP home page Giter VIP logo

setsimilaritysearch's Introduction

Set Similarity Search

Python package

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation.

What is set similarity search?

Let's say we have a database of users and the books they have read. Assume that we want to recommend "friends" for each user, and the "friends" must have read very similar set of books as the user have. We can model this as a set similarity search problem, by representing each user's books as a set:

Alice: {"Anna Karenina", "War and Peace", "The Chameleon", ...}
Bob: {"Lolita", "The Metamorphosis", "The Judgement", ...}
Joey: {"Anna Karenina", "The Chameleon" ...}

A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.

There are two versions of set similarity search problem, both can be defined given a collection of sets, a similarity function and a threshold:

  1. All-Pairs: find all pairs of sets that have similarities greater than (or equal to) the threshold;
  2. Query: given a query set, from the collection of sets, find all that have similarities greater than (or equal to) the threshold with respect to the query set.

Both versions of the problem can be very computationally expensive as the collection can be large and the set sizes can be large. The simple brute-force algorithm is O(n^2) for (1) and O(n) for (2).

This package includes a Python implementation of the "All-Pair-Binary" algorithm in Scaling Up All Pairs Similarity Search paper, with additional position filter optimization. This algorithm still has the same worst-case complexity as the brute-force algorithm, however, by taking advantage of skewness in empirical distributions of set sizes and frequencies, it often runs much faster (even better than MinHash LSH).

Benchmarks

Run All-Pairs on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5. The running time of datasketch.MinHashLSH is also shown below for comparison (num_perm=32).

Dataset Input Sets Avg. Size SetSimilaritySearch Runtime datasketch Runtime datasketch Accuracy
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements 1432693 27.31 10m49s 11m4s Precision: 0.73; Recall: 0.67
LiveJournal: from-nodes are set IDs; to-nodes are elements 4308452 16.01 28m51s 31m58s Precision: 0.79; Recall: 0.74

Although datasketch.MinHashLSH is an approximate algorithm, and I am using num_perm=32 which is quite low, it is still a bit slower than the exact algorithm SetSimilaritySearch. The time for creating datasketch.MinHash is also included in the end-to-end time, while in practice this time can be saved through pre-computation. However, for ad hoc computation of All-Pairs, SetSimilaritySearch is still the better choice, especially when sets are small and fit in memory.

Run Query on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5. The query sets are sampled from the dataset itself. The running time of datasketch.MinHashLSH is also shown below for comparison (num_perm=32).

Dataset Indexed Sets Query Sets Avg. Size SetSimilaritySearch Indexing & Querying Time datasketch Indexing & Querying Time datasketch Accuracy
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements 1432693 10k 27.31 Indexing: 1m7s; Querying (90pct): 2.3ms Indexing: 9m23s; Querying (90pct): 0.72ms Precision: 0.90; Recall: 0.88
LiveJournal: from-nodes are set IDs; to-nodes are elements 4308452 10k 16.01 Indexing: 2m32s; Querying (90pct): 1.6ms Indexing: 30m58s; Querying (90pct): 2.1ms Precision: 0.85; Recall: 0.78

The indexing time for datasketch.MinHashLSH, including the time for creating datasketch.MinHash, is much worse than SetSimilaritySearch -- nearly 10x and 15x. Therefore SetSimilaritySearch is much better for ad hoc computation of the Query problem. For the scenario in which the same search index is reused for many Query problems, datasketch.MinHashLSH is faster than SetSimilaritySearch when the set sizes are large. This is easy to understand: the size of datasketch.MinHash is constant, wheres a set can be arbitrarily large, so the query time for large sets is faster when sketch is used. However, when the set sizes become smaller, the sketch looses its advantage.

Install

Pip

pip install -U SetSimilaritySearch

Conda

conda install -c conda-forge setsimilaritysearch

Library usage

For All-Pairs, it takes an input of a list of sets, and output pairs that meet the similarity threshold.

from SetSimilaritySearch import all_pairs

# The input sets must be a Python list of iterables (i.e., lists or sets).
sets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]
# all_pairs returns an iterable of tuples.
pairs = all_pairs(sets, similarity_func_name="jaccard", 
        similarity_threshold=0.1)
list(pairs)
# [(1, 0, 0.2), (2, 0, 0.5), (2, 1, 0.5), (3, 1, 0.2)]
# Each tuple is (<index of the first set>, <index of the second set>, <similarity>).
# The indexes are the list indexes of the input sets.

For Query, it takes an input of a list of sets, and builds a search index that can compute any number of queries. Currently the search index only supports a static collection of sets with no updates.

from SetSimilaritySearch import SearchIndex

# The input sets must be a Python list of iterables (i.e., lists or sets).
sets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]
# The search index cannot be updated.
index = SearchIndex(sets, similarity_func_name="jaccard", 
    similarity_threshold=0.1)
# The query function takes input a set.
results = index.query([5,3,4])
results
# [(1, 1.0), (0, 0.2), (2, 0.5), (3, 0.2)]
# Each tuple is (<index of the found set>, <similarity>).
# The index is the list index of the sets in the search index.

Supported similarity functions (more to come):

  • Jaccard: intersection size divided by union size; set similarity_func_name="jaccard".
  • Cosine: intersection size divided by square root of the product of sizes; set similarity_func_name="cosine".
  • Containment: intersection size divided by the size of the first set (or query set); set similarity_func_name="containment".

Command line usage

You can also use the command line program all_pairs.py. The input must be one or two files with each line a unique SetID Token tuple. For example:

# Line starts with # will be ignored.
# Each line is <Set ID> <Token (i.e. Set Element)>, separate by a whitespace or tab.
# Every line must be unique.
1 a
1 b
1 c
1 d
2 a
2 b
2 c
3 d
3 e

When one input file is given, it computes All-Pairs; when two input files are given, it computes Query by building a search index on the first collection and querying with sets from the second collection -- effectively computes cross-collection pairs.

Example usage (All-Pairs):

all_pairs.py --input-sets testdata/example_input.txt \
    --output-pairs testdata/example_output.txt \
    --similarity-func jaccard \
    --similarity-threshold 0.1

setsimilaritysearch's People

Contributors

ardate avatar ekzhu avatar sarthakpati avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

setsimilaritysearch's Issues

What does it mean that SearchIndex cannot be updated?

Hi,

Interesting work. I'd like to evaluate it for my use case. However, I am not getting what you mean by "Search index cannot be updated." Are you saying that it cannot be updated at all (maybe blow up the database and do over?) or that it has no update method? What if I update the sets and then create the index again? That would be creating a brand new index. To me, that's a method of updating.

Please clarify.

Thanks so much,

Incorrect output during Query

I have two files where some tokens are disjoint between the files (regardless of the set they belong to). The following line filters any tokens from the query set that are not present in the index file:

s1 = np.sort([self.order[token] for token in s if token in self.order])

This generates erroneous Jaccard calculations as the filtered size of the query set is being used.

I have included some data to reproduce this:
jaccard.csv
query.txt
index.txt

set_ID_x set_ID_y set_size_x set_size_y similarity
CP000438.1:1214715 PAGI-9|6|6 5 1 0.5

Here is a reduced dataset:
index.txt
query.txt

Debug Output
2022-07-12 17:16:30,568: Reading set tuples from /home/nolan/Projects/curated_GIs/gene_order.txt (reversed=False)...
2022-07-12 17:16:30,591: 24737 tuples read.
2022-07-12 17:16:30,591: Creating sets...
2022-07-12 17:16:30,600: 379 sets created.
2022-07-12 17:16:30,601: Reading set tuples from /home/nolan/Projects/curated_GIs/wild_gis/pseudomonas/genes_blast_sets.tabular (reversed=False)...
2022-07-12 17:16:30,672: 67023 tuples read.
2022-07-12 17:16:30,672: Creating sets...
2022-07-12 17:16:30,682: 3010 sets created.
2022-07-12 17:16:30,687: Building search index on 379 sets.
2022-07-12 17:16:30,687: Building SearchIndex on 379 sets.
2022-07-12 17:16:30,687: Start frequency transform.
2022-07-12 17:16:30,687: Applying frequency order transform on tokens...
2022-07-12 17:16:30,700: Done applying frequency order.
2022-07-12 17:16:30,700: Finish frequency transform, 14048 tokens in total.
2022-07-12 17:16:30,700: Start indexing sets.
2022-07-12 17:16:30,705: Finished indexing sets.
2022-07-12 17:16:30,705: Finished building search index.
2022-07-12 17:16:30,705: Find pairs with similarity >= 0.5.
2022-07-12 17:16:30,705: 29 original tokens and 10 tokens after applying frequency order.
2022-07-12 17:16:30,705: 1 candidates found.
2022-07-12 17:16:30,705: 0 verified sets found.
2022-07-12 17:16:30,705: 9 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,705: 0 candidates found.
2022-07-12 17:16:30,705: 0 verified sets found.
2022-07-12 17:16:30,705: 29 original tokens and 11 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 25 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 53 original tokens and 18 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 31 original tokens and 11 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 5 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 3 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 24 original tokens and 13 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 44 original tokens and 14 tokens after applying frequency order.
2022-07-12 17:16:30,706: 1 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 13 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 19 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,706: 0 candidates found.
2022-07-12 17:16:30,706: 0 verified sets found.
2022-07-12 17:16:30,706: 69 original tokens and 22 tokens after applying frequency order.
2022-07-12 17:16:30,706: 1 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 11 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 34 original tokens and 11 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 55 original tokens and 18 tokens after applying frequency order.
2022-07-12 17:16:30,707: 1 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 9 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 13 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 1 original tokens and 0 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 55 original tokens and 20 tokens after applying frequency order.
2022-07-12 17:16:30,707: 1 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 31 original tokens and 8 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 15 original tokens and 8 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,707: 23 original tokens and 8 tokens after applying frequency order.
2022-07-12 17:16:30,707: 0 candidates found.
2022-07-12 17:16:30,707: 0 verified sets found.
2022-07-12 17:16:30,708: 5 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 3 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 5 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 17 original tokens and 8 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 5 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 17 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 5 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 24 original tokens and 8 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 9 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 11 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 21 original tokens and 13 tokens after applying frequency order.
2022-07-12 17:16:30,708: 1 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 11 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,708: 0 candidates found.
2022-07-12 17:16:30,708: 0 verified sets found.
2022-07-12 17:16:30,708: 32 original tokens and 12 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 23 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 11 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 40 original tokens and 20 tokens after applying frequency order.
2022-07-12 17:16:30,709: 3 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 25 original tokens and 10 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 19 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 17 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 44 original tokens and 13 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 9 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,709: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,709: 0 candidates found.
2022-07-12 17:16:30,709: 0 verified sets found.
2022-07-12 17:16:30,710: 23 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 50 original tokens and 17 tokens after applying frequency order.
2022-07-12 17:16:30,710: 1 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 11 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 9 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 7 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 9 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 35 original tokens and 17 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 1 original tokens and 0 tokens after applying frequency order.
2022-07-12 17:16:30,710: 0 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,710: 55 original tokens and 17 tokens after applying frequency order.
2022-07-12 17:16:30,710: 1 candidates found.
2022-07-12 17:16:30,710: 0 verified sets found.
2022-07-12 17:16:30,711: 17 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 15 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 30 original tokens and 14 tokens after applying frequency order.
2022-07-12 17:16:30,711: 1 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 23 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 11 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 39 original tokens and 15 tokens after applying frequency order.
2022-07-12 17:16:30,711: 1 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 18 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 13 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,711: 17 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,711: 0 candidates found.
2022-07-12 17:16:30,711: 0 verified sets found.
2022-07-12 17:16:30,712: 75 original tokens and 29 tokens after applying frequency order.
2022-07-12 17:16:30,712: 2 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 83 original tokens and 30 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 16 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 13 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 14 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 23 original tokens and 9 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 7 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 15 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,712: 0 candidates found.
2022-07-12 17:16:30,712: 0 verified sets found.
2022-07-12 17:16:30,712: 41 original tokens and 14 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 59 original tokens and 20 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 43 original tokens and 17 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 9 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 7 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 32 original tokens and 9 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 9 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,713: 0 candidates found.
2022-07-12 17:16:30,713: 0 verified sets found.
2022-07-12 17:16:30,713: 39 original tokens and 11 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 24 original tokens and 9 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 13 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 9 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 13 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 5 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 21 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 117 original tokens and 51 tokens after applying frequency order.
2022-07-12 17:16:30,714: 4 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 35 original tokens and 11 tokens after applying frequency order.
2022-07-12 17:16:30,714: 1 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,714: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,714: 0 candidates found.
2022-07-12 17:16:30,714: 0 verified sets found.
2022-07-12 17:16:30,715: 52 original tokens and 16 tokens after applying frequency order.
2022-07-12 17:16:30,715: 1 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 9 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 15 original tokens and 9 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 23 original tokens and 7 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 1 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 13 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 17 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 39 original tokens and 12 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 11 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 11 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,715: 5 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,715: 0 candidates found.
2022-07-12 17:16:30,715: 0 verified sets found.
2022-07-12 17:16:30,716: 79 original tokens and 26 tokens after applying frequency order.
2022-07-12 17:16:30,716: 1 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 15 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 3 original tokens and 1 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 11 original tokens and 4 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 9 original tokens and 3 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 29 original tokens and 9 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 3 original tokens and 2 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
2022-07-12 17:16:30,716: 17 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,716: 0 candidates found.
2022-07-12 17:16:30,716: 0 verified sets found.
...
2022-07-12 17:16:30,944: 17 original tokens and 6 tokens after applying frequency order.
2022-07-12 17:16:30,945: 0 candidates found.
2022-07-12 17:16:30,945: 0 verified sets found.
2022-07-12 17:16:30,945: 11 original tokens and 5 tokens after applying frequency order.
2022-07-12 17:16:30,945: 0 candidates found.
2022-07-12 17:16:30,945: 0 verified sets found.
2022-07-12 17:16:30,945: Found 21 pairs.
2022-07-12 17:16:30,945: Average query time: 7.890807433777869e-05.
2022-07-12 17:16:30,945: Median query time: 7.890807433777869e-05.
2022-07-12 17:16:30,945: 90pct query time: 0.00011157989501953125.

ordering of x,y coordinates when converting to identity matrix

Hi - I noticed an unexpected ordering of x,y coordinates when converting the output of all_pairs to an identity matrix. Here's the behavior that I see with identical sets:

import numpy as np
from SetSimilaritySearch import all_pairs
import random

nsets = 10
population = list(range(100))

sets = [set(population) for i in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)

arr = np.nan * np.empty((nsets, nsets))
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))

The output is a nice lower-triangular matrix.

[[nan nan nan nan nan nan nan nan nan nan]
 [ 1. nan nan nan nan nan nan nan nan nan]
 [ 1.  1. nan nan nan nan nan nan nan nan]
 [ 1.  1.  1. nan nan nan nan nan nan nan]
 [ 1.  1.  1.  1. nan nan nan nan nan nan]
 [ 1.  1.  1.  1.  1. nan nan nan nan nan]
 [ 1.  1.  1.  1.  1.  1. nan nan nan nan]
 [ 1.  1.  1.  1.  1.  1.  1. nan nan nan]
 [ 1.  1.  1.  1.  1.  1.  1.  1. nan nan]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1. nan]]

But when the sets are not identical, the x and y indices seem to be ordered arbitrarily:

sets = [set(population) - set(random.choices(population, k=10)) for i in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)

arr = np.nan * np.empty((nsets, nsets))
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))
[[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan]
 [0.83  nan 0.83 0.81 0.81  nan 0.81  nan 0.87  nan]
 [0.84  nan  nan  nan  nan  nan  nan  nan  nan  nan]
 [0.8   nan 0.8   nan  nan  nan  nan  nan  nan  nan]
 [0.84  nan 0.8  0.82  nan  nan  nan  nan  nan  nan]
 [0.84 0.83 0.82 0.82 0.84  nan 0.84 0.83 0.82  nan]
 [0.84  nan 0.84 0.82 0.8   nan  nan  nan  nan  nan]
 [0.81 0.84 0.83 0.87 0.81  nan 0.87  nan 0.83  nan]
 [0.82  nan 0.82 0.82 0.82  nan 0.82  nan  nan  nan]
 [0.82 0.87 0.82 0.86 0.84 0.84 0.84 0.85 0.82  nan]]

I can restore the lower-triangular matrix by adding the following line before assigning to the array:

x, y = zip(*[sorted(pair) for pair in zip(x, y)])

So I can still accomplish what I want with minimal difficulty, but I thought I'd let you know because it seems like generating an identity matrix might be a common use case and the behavior is a bit surprising.

Thanks a lot for sharing this project!

Using all_pairs() on a 14k dataset causes MemoryErrors

Hi,

I'm trying to use the all_pairs() function to find all the (near-)duplicates in a set of about 14000 text documents (after turning them into ngram shingles first). However, I'm running against MemoryErrors crashing my script, despite the virtual machine I work on having 16GB of RAM. I've checked, and indeed the entire RAM and swap get maxed out before the script stops working.

Do you have any advice on how to reduce RAM usage, or any indication of how much memory the algorithm uses? I don't run into any memory issues when I use LSH with datasketch, but I'd rather have an exact list of duplicates.

all_pairs: get *all* pairwise indices, including 0?

I noticed that not all possible pairs are returned when I tried the all_pairs function and that even with threshold = 0.0, the lowest values are still above 0. Is there a possibility to get all pairwise similarities?

Unexpected multiset behaviour

Firstly thanks for this package and the datasketch one—they're both great.

I noticed some unexpected behaviour when using the all_pairs function with input data that aren't set-like:

sets = [["a", "b"], ["a", "a"]]
list(all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.1))
# [(1, 0, 1.0)]

sets = [["a", "a", "b"], ["a", "a"]]
list(all_pairs(sets, similarity_func_name="jaccard", similarity_threshold=0.1))
# [(1, 0, 1.5)]

I assume that this package doesn't support multisets, and that the outputs in such cases are undefined (setting the threshold to 0.75 in the second example leads to an empty result set, for instance), but if that's the case perhaps it would be a good idea to make this explicit in the documentation, and to mention that it's the user's responsibility to ensure that there are no duplicates in their input sets/lists.

In my case this simply means that I have to convert my lists to sets before passing them to all_pairs, but it did catch me off guard because that step wouldn't be necessary if I were applying MinHash LSH.

Very slow performance on large sets

Hi, this package looks really cool and I'd love to use it for my use case.

I have about 7,000 sets with about 1,000 elements each that I'm using as my index. I also have a set of about 1,000 queries with similar sizes, about 1,000 elements each, as queries. However, when I profile the times for queries for this package vs. datasketch's MinHashLSHEnsemble method, the results are pretty wildly off-base from the numbers presented in the readme.

In general, a single minhash LSH ensemble query in my case is taking about 10ms, and the SetSimilarity query is taking anywhere from 300ms to 500ms, even whole seconds in some cases. Are these numbers to be expected, and is SetSimilaritySearch simply not suitable for sets this large? My sets are exclusively integers, if that matters.

Any insight or help is appreciated.

Allow 'sets' argument to be any iterable

Currently

https://github.com/ekzhu/SetSimilaritySearch/blob/master/SetSimilaritySearch/search.py#L27
https://github.com/ekzhu/SetSimilaritySearch/blob/master/SetSimilaritySearch/all_pairs.py#L28

state:

if not isinstance(sets, list) or len(sets) == 0:
        raise ValueError("Input parameter sets must be a non-empty list.")

I propose to change this to:

if not isinstance(sets, Iterable) or len(sets) == 0:
        raise ValueError("Input parameter sets must be a non-empty iterable.")

Which then allows inputs as tuple as well, as well as ordered key-sets. Was helpful in my use case, rather than having to create a copy of the data in list form.

Setting up a PR from my fork, let me know what you think.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.