fmmfonseca / completely


Java autocomplete library.

Home Page: http://miguelfonseca.com/completely/

License: Other

Java 99.97% Dockerfile 0.03%

java autocomplete library indexing text trie approximate-string-matching

completely's Introduction

Description

Completely is a Java autocomplete library.

Autocomplete involves predicting a word or phrase that the user may type based on a partial query. The goal is to provide instant feedback and avoid unnecessary typing as the user formulates queries. Performance is a key issue since each keystroke from the user could invoke a query, and each query should be answered within a few milliseconds. What's more, because users often make spelling mistakes while typing, autocomplete should tolerate errors and differences in representation.

Needless to say, a standard sequential search is bound to be ineffective for anything other than small data sets. By contrast, Completely relies on text preprocessing to create an in-memory index for efficiently answering searches in large data sets. All in all, there are three fundamental components at play:

Analyzer: a function that filters, tokenizes and/or transforms text prior to indexing;
Index: a data structure that stores the mapping from text to the corresponding sources;
Automaton: an engine that matches text when searching.

Together these can be used to tackle a variety of use cases; the choice of components, or combination thereof, depends solely on the application at hand.
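
To make the division of labour concrete, here is a minimal, self-contained sketch of the three roles in plain Java. It does not use Completely's own classes: `MiniAutocomplete`, its `analyze` method, the `TreeMap`-based index and the exact prefix matcher are all hypothetical stand-ins chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of the analyzer / index / matcher roles.
// None of these names come from Completely itself.
class MiniAutocomplete {
    // Index: maps each analyzed token to the original texts that contain it.
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Analyzer: lower-cases the text and splits it on whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(String text) {
        for (String token : analyze(text)) {
            index.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(text);
        }
    }

    // Matcher: an exact prefix match over the sorted keys. Results for
    // several query tokens are merged (OR semantics). An approximate
    // matcher could be swapped in here.
    List<String> search(String query) {
        Set<String> results = new LinkedHashSet<>();
        for (String token : analyze(query)) {
            for (Map.Entry<String, Set<String>> e : index.tailMap(token, true).entrySet()) {
                if (!e.getKey().startsWith(token)) {
                    break; // keys are sorted, so no later key can match
                }
                results.addAll(e.getValue());
            }
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        MiniAutocomplete engine = new MiniAutocomplete();
        engine.add("Western Sahara");
        engine.add("Samoa");
        System.out.println(engine.search("sa")); // prints [Western Sahara, Samoa]
    }
}
```

Because the three roles only meet in `add` and `search`, each one can be replaced independently, which is the kind of flexibility the description above refers to.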

Download

All release artifacts are available for download from the Maven Central repository.

Build from source

Building Completely requires Maven 3 and Java 11, or newer.

Download the source code:

git clone https://github.com/fmmfonseca/completely.git

Build the JAR package:

mvn clean package -DskipTests

Run the sample

Install artifacts to the local repository:

mvn install

Execute the main class:

mvn exec:java -pl sample

References

Bořivoj Melichar. Approximate String Matching by Finite Automata;
Gonzalo Navarro. A Guided Tour to Approximate String Matching;
Leonid Boytsov. Indexing Methods for Approximate Dictionary Searching: Comparative Analysis;
Marios Hadjieleftheriou and Divesh Srivastava. Approximate String Processing;
Surajit Chaudhuri and Raghav Kaushik. Extending Autocompletion To Tolerate Errors;

License

Released under The Apache Software License, Version 2.0

completely's Issues

Sorting concept

Currently sorting works using the Comparator in the AutocompleteEngine.
There can be a default comparator, and one can be given to the search() method to override the default.

After sorting, there is also the limit feature to cut the result at some point.

This flexibility requires 4 public methods in the interface right now.

Unfortunately, the Comparator logic is not enough for my use case, and I believe it won't be enough for others either.

No one is forced to use it: it can be left as null, and custom sorting can then be applied after searching in the user's code.

What I need is to include the search query in the comparison. It is not enough to look at the results in isolation.
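
As an illustration of what a query-aware ordering could look like when applied after the search, here is a small sketch in plain Java. `QueryAwareSort`, `byRelevanceTo` and the ranking rules (prefix matches first, then shorter results, then alphabetical order) are assumptions made for this example, not part of the library's interface.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: sorts results after the search, using the query itself.
class QueryAwareSort {
    // Builds a comparator that closes over the query: results that start
    // with the query come first, then shorter results, then natural order.
    static Comparator<String> byRelevanceTo(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        return Comparator
            .comparing((String s) -> !s.toLowerCase(Locale.ROOT).startsWith(q))
            .thenComparingInt(String::length)
            .thenComparing(Comparator.<String>naturalOrder());
    }

    // Sorts a copy of the results and keeps at most `limit` of them.
    // Assumes limit is not negative.
    static List<String> sort(List<String> results, String query, int limit) {
        List<String> sorted = new ArrayList<>(results);
        sorted.sort(byRelevanceTo(query));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<String> results = List.of("New South Wales", "South Africa", "South Korea", "Asia South");
        System.out.println(sort(results, "south", 3));
        // prints [South Korea, South Africa, Asia South]
    }
}
```

A plain `Comparator` over results cannot express this ranking on its own; the query has to be captured when the comparator is built, which is why a factory method is used here.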

Upload to Maven Central

It would be great if this library could be included as a Maven-style dependency in a project without having to clone and "install" it locally. Uploading it to Maven Central, or one of its equivalents or mirrors, would certainly help.

Regarding reverse searching

I have the following strings to index, each paired with a record:
a. [Panasonic lcd item] [SampleRecord that I saved [count: 10, location: delhi]]
b. [Iphone 7 buy] [SampleRecord that I saved [count: 100, location: delhi]]
c. [5 seater sofa set] [SampleRecord that I saved [count: 50, location: new york]]
d. [iphone 7 buy] [SampleRecord that I saved [count: 400, location: jersey]]
Here count is the number of times a particular text was searched, and location is where the search was made.

I want to run a search like "Top 100 searches made in delhi".

Your code should return records 'a' and 'b', not 'c' and 'd'.

Also, if someone searches for "iphone 7 delhi", it should return 'c' 'd'.
Can you suggest how to do this with your code?
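
One possible way to answer the first kind of query is to filter and rank the records outside the autocomplete index. The sketch below is plain Java and makes several assumptions: the `SearchRecord` class with `text`, `count` and `location` fields is a hypothetical stand-in for the SampleRecord mentioned above, and `topByLocation` is an illustrative helper, not a library method.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the SampleRecord described in the issue.
class SearchRecord {
    final String text;
    final int count;
    final String location;

    SearchRecord(String text, int count, String location) {
        this.text = text;
        this.count = count;
        this.location = location;
    }
}

class TopSearches {
    // Keeps records from the given location, most searched first,
    // and returns at most `limit` of them.
    static List<SearchRecord> topByLocation(List<SearchRecord> records, String location, int limit) {
        return records.stream()
            .filter(r -> r.location.equalsIgnoreCase(location))
            .sorted(Comparator.comparingInt((SearchRecord r) -> r.count).reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SearchRecord> records = List.of(
            new SearchRecord("Panasonic lcd item", 10, "delhi"),
            new SearchRecord("Iphone 7 buy", 100, "delhi"),
            new SearchRecord("5 seater sofa set", 50, "new york"),
            new SearchRecord("iphone 7 buy", 400, "jersey"));
        for (SearchRecord r : topByLocation(records, "delhi", 100)) {
            System.out.println(r.text + " (" + r.count + ")");
        }
    }
}
```

With the four records from the issue, this keeps only the two delhi entries and lists the one with count 100 before the one with count 10.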

Regarding Top 10 results

Can your logic be tweaked to return only the top K results?
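
For reference, a common way to keep only the best K items without sorting the whole result set is a bounded heap. The sketch below is a generic plain-Java helper; `TopK.select` is a hypothetical name and is not something the library is known to provide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: selects the k best items with a heap of size at most k.
class TopK {
    // `order` ranks better items first. The returned list is best-first.
    static <T> List<T> select(Iterable<T> items, int k, Comparator<T> order) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Reversing the order puts the worst kept item at the head of the heap.
        PriorityQueue<T> heap = new PriorityQueue<>(k, order.reversed());
        for (T item : items) {
            heap.add(item);
            if (heap.size() > k) {
                heap.poll(); // drop the current worst item
            }
        }
        // Draining yields worst-first, so reverse to get best-first.
        List<T> result = new ArrayList<>(heap.size());
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> counts = List.of(5, 1, 9, 3, 7);
        // Largest counts are treated as best.
        System.out.println(select(counts, 3, Comparator.<Integer>reverseOrder()));
        // prints [9, 7, 5]
    }
}
```

The heap never holds more than K + 1 items, so memory stays bounded even when the candidate set is large.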

Autocompletion breaks for first character after space characters

Hey,

I have been having an issue that can be reproduced even with the Completely sample application. Basically, whenever a search term consists of two words, Completely stops working if only one character is entered for the second word. To show this using an example from your sample application, here's what happens if you add one character between searches:

Query: "Western"

Western Sahara

Query: "Western "

Western Sahara

Query: "Western S"

No Results

Query: "Western Sa"

Western Sahara

In other words:

When searching "Western S", I would expect "Western Sahara" to be returned; however, Completely returns nothing. Once one more character is added (in this case the letter "a"), Completely becomes functional again. I have looked at your source code but have not been able to see why this happens; I may have just missed something obvious.

Support multiple lookup types using one index

There are different concepts built-in already for index lookups. It's very flexible. Some examples:

exact match, lower case:
using a HashMultiMap in the Index, EqualityAutomaton in the index lookup, and a LowerCaseTransformer as the Analyzer in the engine
starts-with exact match, lower case:
using a PatriciaTrie in the Index, EqualityAutomaton in the index lookup, and a LowerCaseTransformer as the Analyzer in the engine
starts-with fuzzy match, lower case:
using a PatriciaTrie in the Index, EditDistanceAutomaton in the index lookup, and a LowerCaseTransformer as the Analyzer in the engine

In my use case, with a million indexed entries, I want to perform "exact starts-with" matching first. If that brings good results, fine, I take them. If not, I go ahead and try "fuzzy starts-with" matching.

For this I currently either need 2 indexes (not an option, and technically not necessary), or some ugly syntax.

The problem is that the IndexLookupStrategy.lookup() method is not flexible. One option would be to pass in a closure that specifies how the lookup should be done. Then it could be controlled from the outside.

My current solution is this:
Create a PatriciaTrie instance.
Create 2 instances of IndexAdapter, give both the same trie. One uses the EqualityAutomaton and the other the EditDistanceAutomaton.
Create 2 engines, one per IndexAdapter.
Now feed my data for indexing to only 1 engine.
Now both engines are ready for querying.

What do you think?
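
The idea of one shared structure with two lookup strategies can be sketched in plain Java as follows. This is not the library's API: `TwoStageLookup` is a hypothetical class, a `TreeSet` stands in for the trie, and a linear scan with a prefix edit distance stands in for an edit-distance automaton.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch: one shared term set, exact lookup first, fuzzy fallback.
class TwoStageLookup {
    private final NavigableSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term);
    }

    // Stage 1: exact starts-with over the sorted terms.
    List<String> exactPrefix(String query) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(query, true)) {
            if (!term.startsWith(query)) {
                break; // sorted order: no later term can match
            }
            hits.add(term);
        }
        return hits;
    }

    // Stage 2: fuzzy starts-with, scanning every term (fine for a sketch,
    // too slow for a million entries).
    List<String> fuzzyPrefix(String query, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (prefixDistance(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }

    // Tries the exact stage and falls back to the fuzzy one only if it is empty.
    List<String> search(String query, int maxEdits) {
        List<String> exact = exactPrefix(query);
        return exact.isEmpty() ? fuzzyPrefix(query, maxEdits) : exact;
    }

    // Smallest Levenshtein distance between the query and any prefix of term.
    static int prefixDistance(String query, String term) {
        int[] prev = new int[term.length() + 1];
        for (int j = 0; j <= term.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= query.length(); i++) {
            int[] curr = new int[term.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= term.length(); j++) {
                int cost = query.charAt(i - 1) == term.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        int best = prev[0];
        for (int d : prev) {
            best = Math.min(best, d);
        }
        return best;
    }

    public static void main(String[] args) {
        TwoStageLookup lookup = new TwoStageLookup();
        lookup.add("sahara");
        lookup.add("samoa");
        lookup.add("sweden");
        System.out.println(lookup.search("sa", 1));  // exact stage: [sahara, samoa]
        System.out.println(lookup.search("sxh", 1)); // fuzzy fallback: [sahara]
    }
}
```

The data is indexed once and both strategies read from the same structure, which is the property the workaround above achieves with two IndexAdapter instances over one trie.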

AutocompleteEngine.search() should require an input

Currently, the search method accepts

null (explicitly, via the @Nullable annotation)
empty string

Running search(null) throws an NPE, which contradicts the documentation.
Running search("") returns 0 results, as expected.

Instead of fixing the NPE case, I recommend rejecting both of those inputs. Why? They are user errors. Searching for nothing is useless, because it is clear from the start that nothing can be found. The best option is therefore to document this and throw an IllegalArgumentException.
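
A minimal sketch of the proposed behaviour, written as a wrapper in plain Java so that it does not depend on the engine's actual signature. `ValidatingSearch` and its `delegate` function are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical wrapper that rejects null and empty queries up front.
class ValidatingSearch {
    private final Function<String, List<String>> delegate;

    ValidatingSearch(Function<String, List<String>> delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    List<String> search(String query) {
        // Both cases are treated as caller errors rather than empty searches.
        if (query == null || query.isEmpty()) {
            throw new IllegalArgumentException("query must not be null or empty");
        }
        return delegate.apply(query);
    }

    public static void main(String[] args) {
        ValidatingSearch search = new ValidatingSearch(q -> List.of(q.toUpperCase()));
        System.out.println(search.search("ab")); // prints [AB]
        try {
            search.search("");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```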