universal-automata / liblevenshtein-java

Various utilities regarding Levenshtein transducers. (Java)

License: MIT License

Languages: Java 85.16%, Groovy 5.03%, Smalltalk 4.14%, HTML 2.60%, Shell 1.46%, XSLT 0.97%, CSS 0.29%, JavaScript 0.22%, Protocol Buffer 0.14%

Topics: approximate-string-matching, bioinformatics, computational-biology, computer-science, data-science, dictionary, distance-metric, edit-distance, finite-state-automata, finite-state-transducer, fuzzy-search, genomics, information-retrieval, levenshtein-automata, levenshtein-distance, machine-learning, natural-language-processing, search-engine, spelling-correction, universal-automata

liblevenshtein-java's People

Contributors

alexander-myltsev, dylon

liblevenshtein-java's Issues

Standard algorithm seems to be off

In the example below, the distance between "foo" and "foo" should be 0, not 1:

% gradle shell
:compileJava
:processResources UP-TO-DATE
:classes
> Building 75% > :shellGroovy Shell (2.2.2, JVM: 1.8.0)
Type 'help' or '\h' for help.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groovy:000> import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder
===> [import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder]
groovy:000> new TransducerBuilder().dictionary(['foo', 'bar', 'baz']).build().transduce('foo')
===> CandidateCollection.WithDistance[maxCandidates=2147483647,candidates=[Candidate(term=foo, distance=1), Candidate(term=bar, distance=3), Candidate(term=baz, distance=3)]]
:shell

BUILD SUCCESSFUL

Total time: 1 mins 41.327 secs
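
For comparison, the expected distances can be checked against a plain dynamic-programming implementation of standard Levenshtein distance. This is only a reference sketch and not the library's code; the class name ReferenceLevenshtein is made up for illustration.

```java
/** Reference sketch of standard Levenshtein distance (hypothetical helper, not part of the library). */
final class ReferenceLevenshtein {
    private ReferenceLevenshtein() {}

    /** Returns the minimum number of insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this reference, "foo" against "foo" gives 0, while "foo" against "bar" and "baz" gives 3 each, so only the first candidate in the transducer output above looks wrong.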

Dictionary phrases sometimes appear to be dropped

Thanks again for the library!

I think I might have found an issue: it appears that some dictionary entries will be dropped by the transducer when (i) the dictionary is sufficiently large and (ii) there is a longer entry that includes the dropped entry.

Demo code below. The transducer will forget that "Resource" is one of the dictionary entries. Note that the issue will not occur if "Resource" and "Resources" are the only two entries in the mockDictionary.

    List<String> mockDictionary = new ArrayList<String>();

    mockDictionary.add("Representatives");
    mockDictionary.add("Resource");
    mockDictionary.add("Resources");

    final ITransducer<Candidate> transducer = new TransducerBuilder()
        .algorithm(Algorithm.TRANSPOSITION)
        .defaultMaxDistance(2)
        .dictionary(mockDictionary)
        .build();

    for (String query : mockDictionary) {
        boolean exactMatchFound = false;

        for (Candidate candidate : transducer.transduce(query)) {
            if (candidate.distance() == 0) {
                exactMatchFound = true;
                break;
            }
        }

        // There should be an exact match for each query.
        if (!exactMatchFound) {
            System.out.println(query);
            for (Candidate candidate : transducer.transduce(query)) {
                System.out.println("\t" + candidate.term());
            }
        }
    }
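
One way to pin down what the transducer should return here is a brute-force baseline that scans the whole dictionary. The sketch below assumes the optimal-string-alignment variant of transposition distance, which may differ from the library's TRANSPOSITION algorithm in edge cases; the class BruteForceBaseline and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Brute-force baseline (hypothetical helper): scans every term instead of using an automaton. */
final class BruteForceBaseline {
    private BruteForceBaseline() {}

    /** Edit distance with adjacent transpositions (optimal string alignment variant, an assumption). */
    static int transpositionDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                // Swapping two adjacent characters counts as a single edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** Returns the dictionary terms within maxDistance of the query, in dictionary order. */
    static List<String> withinDistance(List<String> dictionary, String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary) {
            if (transpositionDistance(query, term) <= maxDistance) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

For the three-entry mockDictionary and a maximum distance of 2, this baseline returns both "Resource" and "Resources" for the query "Resource", which is the result the transducer would be expected to reproduce.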

Replace POJOs with protobuf models

This will reduce logic, improve cross-platform maintainability, and speed up (de)serialization, since the models won't have to be copied to and fro.

Make the dictionary automaton indexable

  • Return a set of indexed objects by key, like a fuzzy, associative map
  • Use KeyProviders to extract key terms from arbitrary objects, and index those objects accordingly
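
A rough sketch of how the two bullets might fit together is shown below. Everything in it is hypothetical: the issue only names "KeyProviders", so the KeyProvider signature, the KeyIndex class, and its methods are assumptions made for illustration. The idea is that keys() would supply the terms for the dictionary automaton, and each term the transducer matches would then be passed to get() to recover the indexed objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical: extracts the key term under which a value is indexed. */
interface KeyProvider<T> {
    String keyOf(T value);
}

/** Hypothetical associative index from key terms to the objects that produced them. */
final class KeyIndex<T> {
    // TreeMap keeps the keys in ascending order, ready to be fed to a sorted dictionary.
    private final Map<String, Set<T>> valuesByKey = new TreeMap<>();
    private final KeyProvider<T> keyProvider;

    KeyIndex(KeyProvider<T> keyProvider) {
        this.keyProvider = keyProvider;
    }

    /** Indexes the value under the key extracted by the provider. */
    void add(T value) {
        valuesByKey.computeIfAbsent(keyProvider.keyOf(value), k -> new LinkedHashSet<>()).add(value);
    }

    /** Returns all key terms in ascending order. */
    List<String> keys() {
        return new ArrayList<>(valuesByKey.keySet());
    }

    /** Returns the values indexed under an exact key, or an empty set if there are none. */
    Set<T> get(String key) {
        Set<T> values = valuesByKey.get(key);
        return values == null ? Collections.<T>emptySet() : Collections.unmodifiableSet(values);
    }
}
```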

Fully-automate the continuous integration (CI) pipeline

I currently perform a lot of manual work during deployments, which is burdensome and error-prone. Automate this process so I can focus on the core library without the corresponding ops work.

Release Tasks to Automate
  • Ensure the source is well-tested.
  • Update the dependencies and ensure they don't break anything.
  • Make sure there are no quality or style violations.
  • Update the submodules in universal-automata/liblevenshtein
  • Update the README.md
  • Update the CHANGELOG.md
  • Update the LICENSE
  • Update the release branch
  • Tag the new version
  • Update the wiki submodule
  • Update the javadoc
  • Generate the POM and related files for the release.
  • Upload the release files to Artifactory.
  • Upload the release files to Bintray.
  • Upload the release files to Sonatype.
  • Release the files to Maven Central.

Annotate the transition operations

Describe what the transition functions are doing, such as where insertions, deletions, substitutions, transpositions, merges, and splits are occurring.
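
To illustrate the vocabulary such annotations could use, the sketch below labels the insertions, deletions, and substitutions in a minimal edit script between two strings. It does not touch the transducer's transition functions and it ignores transpositions, merges, and splits; the EditScript class is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that describes one minimal edit script between two strings. */
final class EditScript {
    private EditScript() {}

    static List<String> describe(String source, String target) {
        int m = source.length();
        int n = target.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= n; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }

        // Walk back from the bottom-right cell, recording which operation produced each step.
        List<String> ops = new ArrayList<>();
        int i = m;
        int j = n;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && source.charAt(i - 1) == target.charAt(j - 1) && d[i][j] == d[i - 1][j - 1]) {
                i--; // characters match, nothing to annotate
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("substitute '" + source.charAt(i - 1) + "' with '" + target.charAt(j - 1) + "' at " + (i - 1));
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("delete '" + source.charAt(i - 1) + "' at " + (i - 1));
                i--;
            } else {
                ops.add("insert '" + target.charAt(j - 1) + "' at " + i);
                j--;
            }
        }
        Collections.reverse(ops);
        return ops;
    }
}
```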

Integrate with Slf4j logging

Don't go wild with logging, but make it useful and replace the current logging statements with Slf4j statements.

Tests failing or dev setup instructions are most welcome

After a fair amount of fiddling with the directory structure so that the resources in the shared sibling project are read during gradle test, the tests on the default git branch still seem to fail:

com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp FAILED
    java.lang.IllegalArgumentException: Due to caveats with the current DAWG implementation, terms must be inserted in ascending order
        at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.add(SortedDawg.java:97)
        at com.github.dylon.liblevenshtein.collection.dawg.AbstractDawg.addAll(AbstractDawg.java:121)
        at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.<init>(SortedDawg.java:85)
        at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:86)
        at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:24)
        at com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder.dictionary(TransducerBuilder.java:129)
        at com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp(MergeAndSplitTransducerTest.java:37)


So either the setup instructions are missing, or the test is really failing.
If the tests cannot be run after cloning only the Java project, it would be very helpful to describe the recommended dev setup for the overall project at the bottom of the README.
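
Given the exception message above, one possible workaround is to sort the terms before handing them to the builder. The sketch below assumes that "ascending order" means natural String ordering and that dropping duplicates is acceptable; neither is confirmed by the source, and DictionaryOrdering is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for preparing terms before they are inserted into a sorted structure. */
final class DictionaryOrdering {
    private DictionaryOrdering() {}

    /** Returns true if each term compares less than or equal to its successor (natural ordering). */
    static boolean isAscending(List<String> terms) {
        for (int i = 1; i < terms.size(); i++) {
            if (terms.get(i - 1).compareTo(terms.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns the terms in ascending natural order with duplicates removed. */
    static List<String> sortedUnique(Collection<String> terms) {
        return new ArrayList<>(new TreeSet<>(terms));
    }
}
```

The sorted list could then be passed to TransducerBuilder.dictionary(...), as in the stack trace above, instead of the raw test resource.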

gradle 2.8 build failed on master branch

What is the gradle way to fix that?

$ gradle help
:help

Welcome to Gradle 2.8.
...
$ gradle test --info
Starting Build
Settings evaluated using settings file '/liblevenshtein-java/settings.gradle'.
Projects loaded. Root project using build file '/liblevenshtein-java/build.gradle'.
Included projects: [root project 'liblevenshtein']
Evaluating root project 'liblevenshtein' using build file '/liblevenshtein-java/build.gradle'.
All projects evaluated.
Selected primary task 'test' from project :

FAILURE: Build failed with an exception.

* What went wrong:
Could not determine the dependencies of task ':processTestResources'.
> Source directory '/liblevenshtein-java/src/test/resources' is not a directory.

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.

BUILD FAILED

Total time: 7.829 secs
Stopped 0 compiler daemon(s).

Use better package name

Instead of com.github.dylon.liblevenshtein, use something like com.github.liblevenshtein. The former is a holdover from my personal repos, from before I created the universal-automata org.

Out-of-dictionary results returned

I have confirmed locally that findAll returns results that are not in the provided dictionary when using the Algorithm.TRANSPOSITION algorithm.

I think I could add an assertion for this case in TranspositionTransducerTest.java, which currently does not seem to assert against this condition. Or is it already tested elsewhere in the test suite? After that, I would try to reproduce it on the existing test dictionary.

For now, though, I am unable to get the existing tests working.
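
The missing assertion could be as simple as checking every returned term against the dictionary. The helper below is a hypothetical sketch that works on plain strings, so the candidate terms would first have to be extracted from the results (for example with candidate.term(), as used in the demo code of another issue on this page).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical test helper for detecting results that are not dictionary terms. */
final class DictionaryAssertions {
    private DictionaryAssertions() {}

    /** Returns the candidate terms that do not occur in the dictionary, in encounter order. */
    static List<String> outOfDictionary(Collection<String> dictionary, Iterable<String> candidateTerms) {
        Set<String> known = new HashSet<>(dictionary);
        List<String> unknown = new ArrayList<>();
        for (String term : candidateTerms) {
            if (!known.contains(term)) {
                unknown.add(term);
            }
        }
        return unknown;
    }
}
```

A test would then assert that outOfDictionary(...) is empty for each query, and the returned list doubles as a readable failure message.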

Add additional serializers

Additional serializers:
  • PlainTextSerializer
    • Serializes dictionaries to plain text files (newline-delimited terms)
  • PropertiesSerializer
    • Serializes transducer attributes
  • XMLSerializer
  • JSONSerializer
  • YAMLSerializer
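
As a rough idea of what the plain-text format might look like, the sketch below writes and reads newline-delimited terms in UTF-8. The class PlainTextTerms and its method names are hypothetical and are not the proposed PlainTextSerializer API; skipping blank lines on read is also an assumption.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch of newline-delimited dictionary (de)serialization. */
final class PlainTextTerms {
    private PlainTextTerms() {}

    /** Writes one term per line, UTF-8 encoded. The stream is flushed but not closed. */
    static void write(Collection<String> terms, OutputStream out) {
        try {
            Writer writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            for (String term : terms) {
                writer.write(term);
                writer.write('\n');
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one term per line, skipping empty lines. */
    static List<String> read(InputStream in) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            List<String> terms = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    terms.add(line);
                }
            }
            return terms;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```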

Remove unnecessary interfaces

As per the preferences of those I've worked with, only use interfaces when they are necessary. Not everything requires an interface.
