universal-automata / liblevenshtein-java

Various utilities regarding Levenshtein transducers. (Java)

License: MIT License

Java 85.16% HTML 2.60% Smalltalk 4.14% Protocol Buffer 0.14% Groovy 5.03% XSLT 0.97% Shell 1.46% CSS 0.29% JavaScript 0.22%

approximate-string-matching bioinformatics computational-biology computer-science data-science dictionary distance-metric edit-distance finite-state-automata finite-state-transducer fuzzy-search genomics information-retrieval levenshtein-automata levenshtein-distance machine-learning natural-language-processing search-engine spelling-correction universal-automata

liblevenshtein-java's People

khalefa asb-capfan superharry catap alexander-myltsev gitter-badger ubuntu733 realhumanbean1 finamtrade shannonyu renthunt lushers captainteknics pombredanne khatchad trident1998 andy-wagner albertoandreottiatgmail

liblevenshtein-java's Issues

Support the construction of SortedDawg from a BufferedInputStream

If the collection is already sorted, don't require it to be loaded into memory as a list.
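A minimal sketch of the idea, assuming the input is newline-delimited UTF-8 text. `SortedTermStream`, `forEachSortedTerm`, and the `sink` callback are hypothetical names standing in for whatever would insert terms into the DAWG; none of them are part of the library. Only the previous term is held in memory, and out-of-order input is rejected as it is encountered.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper, not part of liblevenshtein.
final class SortedTermStream {

    /**
     * Reads newline-delimited terms from the stream and forwards each one to
     * the sink, keeping only the previous term in memory. Returns the number
     * of terms forwarded. Equal neighbours (duplicates) are allowed.
     */
    static long forEachSortedTerm(InputStream in, Consumer<String> sink) {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String previous = null;
        long count = 0;
        try {
            String term;
            while ((term = reader.readLine()) != null) {
                // Reject input that is not in ascending order.
                if (previous != null && previous.compareTo(term) > 0) {
                    throw new IllegalArgumentException(
                        "Terms must be in ascending order, but \"" + term
                            + "\" follows \"" + previous + "\"");
                }
                sink.accept(term);
                previous = term;
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Any `InputStream` works here, including a `BufferedInputStream`; the ordering check uses `String.compareTo`, which may or may not match the ordering the DAWG expects.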

Use better data structures and algorithms for the core matching logic

E.g. use specialized linked-list structures for merging and unsubsuming state positions.

Remove unnecessary generic parameters

Not everything has to be modeled as a generic parameter ...

Merge documentation generators

Create task to update dependencies

Remove the recycle methods of the factories

Make the library and tests threadsafe.

Remove @NonNull where unnecessary

Replace TestNG's assertions with AssertJ

AssertJ's assertions tend to be more readable and more powerful.

Standard algorithm seems to be off

In the example below, the distance between "foo" and "foo" should be 0, not 1:

% gradle shell
:compileJava
:processResources UP-TO-DATE
:classes
> Building 75% > :shell
Groovy Shell (2.2.2, JVM: 1.8.0)
Type 'help' or '\h' for help.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groovy:000> import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder
===> [import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder]
groovy:000> new TransducerBuilder().dictionary(['foo', 'bar', 'baz']).build().transduce('foo')
===> CandidateCollection.WithDistance[maxCandidates=2147483647,candidates=[Candidate(term=foo, distance=1), Candidate(term=bar, distance=3), Candidate(term=baz, distance=3)]]
:shell

BUILD SUCCESSFUL

Total time: 1 mins 41.327 secs
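For cross-checking reports like this one, a plain dynamic-programming implementation of the standard Levenshtein distance can serve as a reference. The sketch below is independent of the library (`ReferenceLevenshtein` is a hypothetical name). By this reference, "foo" is at distance 0 from itself and at distance 3 from both "bar" and "baz", so only the first candidate in the output above looks off.

```java
// Hypothetical reference implementation, not part of liblevenshtein.
final class ReferenceLevenshtein {

    /** Standard Levenshtein distance (insert, delete, substitute) using two rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String term : new String[] {"foo", "bar", "baz"}) {
            System.out.println(term + " -> " + distance("foo", term));
        }
    }
}
```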

Integrate with Artifactory, Bintray, and Sonatype for deployments

Dictionary phrases sometimes appear to be dropped

Thanks again for the library!

I think I might have found an issue: it appears that some dictionary entries will be dropped by the transducer when (i) the dictionary is sufficiently large and (ii) there is a longer entry that includes the dropped entry.

Demo code below. The transducer will forget that "Resource" is one of the dictionary entries. Note that the issue will not occur if "Resource" and "Resources" are the only two entries in the mockDictionary.

    List<String> mockDictionary = new ArrayList<String>();

    mockDictionary.add("Representatives");
    mockDictionary.add("Resource");
    mockDictionary.add("Resources");

    final ITransducer<Candidate> transducer = new TransducerBuilder()
            .algorithm(Algorithm.TRANSPOSITION)
            .defaultMaxDistance(2)
            .dictionary(mockDictionary)
            .build();

    for (String query : mockDictionary) {

        boolean exactMatchFound = false;

        for (Candidate candidate : transducer.transduce(query)) {
            if (candidate.distance() == 0) {
                exactMatchFound = true;
                break;
            }
        }

        // There should be an exact match for each query.
        if (!exactMatchFound) {
            System.out.println(query);
            for (Candidate candidate : transducer.transduce(query)) {
                System.out.println("\t" + candidate.term());
            }
        }

    }

Replace POJOs with protobuf models

This will reduce logic, improve cross-platform maintainability, and speed up (de)serialization, as the models don't have to be copied to and fro.

Enforce coding standards with Checkstyle

Tune the bytecode serializer

Serialization to a common format

The format should be shareable across programming languages.
YAML or a related format should work well.

Generate all strings within a given levenshtein distance of a string?

Hi, is it possible to use the library so that, given a string X, it returns every string Y_i such that lev(X, Y_i) < K?
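Independently of the library, the set being asked about can be enumerated directly by applying single edits repeatedly. The sketch below is an assumption-laden illustration, not the library's approach: `EditNeighborhood`, `within`, and the explicit `alphabet` parameter are hypothetical, and the result grows very quickly with the distance bound and the alphabet size. For the strict bound lev(X, Y_i) < K, call it with a maximum distance of K - 1.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of liblevenshtein.
final class EditNeighborhood {

    /**
     * Returns every string reachable from x by at most maxDistance single
     * edits (insertion, deletion, substitution), drawing inserted and
     * substituted characters from the given alphabet.
     */
    static Set<String> within(String x, int maxDistance, String alphabet) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(x);
        Set<String> frontier = new LinkedHashSet<>(seen);
        // Breadth-first expansion: round d produces strings at distance d + 1.
        for (int d = 0; d < maxDistance; d++) {
            Set<String> next = new LinkedHashSet<>();
            for (String s : frontier) {
                for (int i = 0; i <= s.length(); i++) {
                    String head = s.substring(0, i);
                    String tail = s.substring(i);
                    for (int c = 0; c < alphabet.length(); c++) {
                        char ch = alphabet.charAt(c);
                        addIfNew(head + ch + tail, seen, next);                  // insertion
                        if (i < s.length()) {
                            addIfNew(head + ch + tail.substring(1), seen, next); // substitution
                        }
                    }
                    if (i < s.length()) {
                        addIfNew(head + tail.substring(1), seen, next);          // deletion
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    private static void addIfNew(String candidate, Set<String> seen, Set<String> next) {
        if (seen.add(candidate)) {
            next.add(candidate);
        }
    }
}
```

For example, `EditNeighborhood.within("ab", 1, "ab")` contains "a", "b", "aab", and "bb", but not "ba", which needs two edits under the standard (non-transposition) distance.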

Make the dictionary automaton indexable

Return a set of indexed objects by key, like a fuzzy associative map.
Use KeyProviders to extract key terms from arbitrary objects, and index those objects accordingly.
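One way to picture the requested behaviour is a map whose lookups tolerate a bounded edit distance. The sketch below is purely illustrative: `FuzzyIndex` is a hypothetical class, a plain `Function` stands in for a KeyProvider, and a brute-force scan over all keys stands in for the dictionary automaton that a real implementation would presumably use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch of a fuzzy, associative map; not part of liblevenshtein.
final class FuzzyIndex<V> {
    private final Map<String, List<V>> byKey = new TreeMap<>();
    private final Function<V, String> keyProvider; // stand-in for a KeyProvider

    FuzzyIndex(Function<V, String> keyProvider) {
        this.keyProvider = keyProvider;
    }

    /** Indexes the value under the key extracted by the key provider. */
    void add(V value) {
        byKey.computeIfAbsent(keyProvider.apply(value), k -> new ArrayList<>()).add(value);
    }

    /** Returns all values whose key is within maxDistance edits of the query. */
    List<V> get(String query, int maxDistance) {
        List<V> matches = new ArrayList<>();
        // Brute-force scan in key order; an automaton would avoid visiting every key.
        for (Map.Entry<String, List<V>> entry : byKey.entrySet()) {
            if (distance(query, entry.getKey()) <= maxDistance) {
                matches.addAll(entry.getValue());
            }
        }
        return matches;
    }

    /** Standard Levenshtein distance using two rows. */
    private static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```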

Consider making IDawg inherit from java.util.Collection

Set the default max edit distance to 2

Support (de)serialization directly to/from Paths

This will be useful for DRY-ing up a lot of serialization code.
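A minimal sketch of what Path-based helpers might look like, assuming the objects involved implement `java.io.Serializable` and that plain Java object serialization is acceptable. `PathSerialization`, `write`, and `read` are hypothetical names, not the library's API.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of liblevenshtein.
final class PathSerialization {

    /** Serializes the object to the given path, replacing any existing file. */
    static void write(Serializable object, Path path) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deserializes an object of the expected type from the given path. */
    static <T> T read(Path path, Class<T> type) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return type.cast(in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Unknown serialized class", e);
        }
    }
}
```

Keeping the stream handling in one place like this is what would let callers drop their own open/close boilerplate.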

Add support for serializing query results

This would be useful for caching frequently-misspelled terms, etc.

Remove the VersionEye API key from gradle.properties and generate a new one

Even if it is a read-only key, it shouldn't be in the source code ...

Steps

Change API key
Remove API key from source code

Merge simpler classes into main codebase

Convert the graph-based SortedDawg to an array-based structure.

This is a good next step toward something akin to a double-array trie.

Drop "dylon" from package names and make them saner

Package common task flags into a map for templates

Fully-automate the continuous integration (CI) pipeline

I currently perform a lot of manual work during deployments, which is burdensome and error-prone. Automate this process so I can focus on the core library without the corresponding ops work.

Release Tasks to Automate

Annotate the transition operations

Describe what the transition functions are doing, such as where insertions, deletions, substitutions, transpositions, merges, and splits are occurring.

Make .gitignore a whitelist instead of a blacklist

Add support for (de)compressing serialization streams
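As a rough illustration of the idea, the JDK's GZIP streams can wrap whatever bytes a serializer produces. The sketch assumes GZIP is an acceptable format and works on in-memory byte arrays for brevity; `CompressedStreams` is a hypothetical name, and the library may choose a different codec or API shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of liblevenshtein.
final class CompressedStreams {

    /** GZIP-compresses already-serialized bytes. */
    static byte[] compress(byte[] serialized) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the buffer afterwards.
        try (OutputStream out = new GZIPOutputStream(buffer)) {
            out.write(serialized);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses compress(), returning the original serialized bytes. */
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```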

Integrate with Slf4j logging

Don't go wild with logging, but make it useful and replace the current logging statements with Slf4j statements.

Fix every warning emitted by the tools in use

Warnings are potential errors. They also pollute the logs.

Tests are failing; dev setup instructions would be most welcome

After a fair amount of fiddling with the directory structure so that the resources in the shared sibling project are read during gradle test, the tests on the default git branch still seem to fail:

com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp FAILED
    java.lang.IllegalArgumentException: Due to caveats with the current DAWG implementation, terms must be inserted in ascending order
        at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.add(SortedDawg.java:97)
        at com.github.dylon.liblevenshtein.collection.dawg.AbstractDawg.addAll(AbstractDawg.java:121)
        at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.<init>(SortedDawg.java:85)
        at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:86)
        at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:24)
        at com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder.dictionary(TransducerBuilder.java:129)
        at com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp(MergeAndSplitTransducerTest.java:37)

So either the setup instructions are missing, or the test is "really" failing.
If the tests cannot be run after cloning only the Java project, it would be very helpful to describe the recommended dev setup for the overall project at the bottom of the readme.

gradle 2.8 build failed on master branch

What is the gradle way to fix that?

$ gradle help
:help

Welcome to Gradle 2.8.
...
$ gradle test --info
Starting Build
Settings evaluated using settings file '/liblevenshtein-java/settings.gradle'.
Projects loaded. Root project using build file '/liblevenshtein-java/build.gradle'.
Included projects: [root project 'liblevenshtein']
Evaluating root project 'liblevenshtein' using build file '/liblevenshtein-java/build.gradle'.
All projects evaluated.
Selected primary task 'test' from project :

FAILURE: Build failed with an exception.

* What went wrong:
Could not determine the dependencies of task ':processTestResources'.
> Source directory '/liblevenshtein-java/src/test/resources' is not a directory.

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.

BUILD FAILED

Total time: 7.829 secs
Stopped 0 compiler daemon(s).

Clean up code with PMD

https://pmd.github.io/

Use better package name

Instead of com.github.dylon.liblevenshtein, use something like com.github.liblevenshtein. The former is a holdover from my personal repos, from before I created the universal-automata org.

Generate development snapshots

... accessible via JCenter and Maven Central

Out-of-dictionary results returned

I have confirmed locally that, with Algorithm.TRANSPOSITION, findAll returns results that are not in the provided dictionary.

I think I could add an assertion for this case in TranspositionTransducerTest.java, where this condition does not currently seem to be checked. Or is it already tested in some other part of the test suite? After that, I would try to reproduce the problem on the existing test dictionary.

For now, though, I am unable to get the existing tests working.

Add additional serializers

Additional serializers:

PlainTextSerializer
- Serializes dictionaries to plain text files (newline-delimited terms)
PropertiesSerializer
- Serializes transducer attributes
XMLSerializer
JSONSerializer
YAMLSerializer
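As an illustration of the simplest entry in the list, a plain-text dictionary format can be sketched with the JDK alone. The class and method names below are hypothetical, not the proposed PlainTextSerializer's API. The sketch writes terms sorted and de-duplicated, on the assumption that a reader may want to feed them to a structure that requires ascending order.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of liblevenshtein.
final class PlainTextDictionary {

    /** Writes the terms as newline-delimited UTF-8 text, sorted and de-duplicated. */
    static void write(Collection<String> terms, Path path) {
        try {
            Files.write(path, new TreeSet<>(terms), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads newline-delimited UTF-8 terms back in file order. */
    static List<String> read(Path path) {
        try {
            return Files.readAllLines(path, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```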