universal-automata / liblevenshtein-java
Various utilities regarding Levenshtein transducers. (Java)
License: MIT License
It broke somehow ...
If the collection is already sorted, don't require it to be loaded into memory as a list.
E.g. use specialized, linked-list structures for merging and unsubsuming state positions.
Not everything has to be modeled as a generic parameter ...
They tend to be more-readable and are more powerful.
In the example below, the distance between "foo" and "foo" should be 0, not 1:
% gradle shell
:compileJava
:processResources UP-TO-DATE
:classes
> Building 75% > :shellGroovy Shell (2.2.2, JVM: 1.8.0)
Type 'help' or '\h' for help.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groovy:000> import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder
===> [import com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder]
groovy:000> new TransducerBuilder().dictionary(['foo', 'bar', 'baz']).build().transduce('foo')
===> CandidateCollection.WithDistance[maxCandidates=2147483647,candidates=[Candidate(term=foo, distance=1), Candidate(term=bar, distance=3), Candidate(term=baz, distance=3)]]
:shell
BUILD SUCCESSFUL
Total time: 1 mins 41.327 secs
Thanks again for the library!
I think I might have found an issue: it appears that some dictionary entries will be dropped by the transducer when (i) the dictionary is sufficiently large and (ii) there is a longer entry that includes the dropped entry.
Demo code below. The transducer will forget that "Resource" is one of the dictionary entries. Note that the issue will not occur if "Resource" and "Resources" are the only two entries in the mockDictionary.
List<String> mockDictionary = new ArrayList<String>();
mockDictionary.add("Representatives");
mockDictionary.add("Resource");
mockDictionary.add("Resources");

final ITransducer<Candidate> transducer = new TransducerBuilder()
    .algorithm(Algorithm.TRANSPOSITION)
    .defaultMaxDistance(2)
    .dictionary(mockDictionary)
    .build();

for (String query : mockDictionary) {
    boolean exactMatchFound = false;
    for (Candidate candidate : transducer.transduce(query)) {
        if (candidate.distance() == 0) {
            exactMatchFound = true;
            break;
        }
    }
    // There should be an exact match for each query.
    if (!exactMatchFound) {
        System.out.println(query);
        for (Candidate candidate : transducer.transduce(query)) {
            System.out.println("\t" + candidate.term());
        }
    }
}
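One way to cross-check a report like this without the library is a brute-force oracle that computes every distance directly. The sketch below is not the library's transducer. It assumes that Algorithm.TRANSPOSITION counts a swap of two adjacent characters as a single edit (the optimal string alignment variant), and the names TranspositionOracle, distance, and expectedCandidates are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TranspositionOracle {

    // Optimal string alignment distance: insertions, deletions, substitutions,
    // and swaps of two adjacent characters each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1);
                best = Math.min(best, d[i - 1][j - 1] + cost);
                boolean swapped = i > 1 && j > 1
                    && a.charAt(i - 1) == b.charAt(j - 2)
                    && a.charAt(i - 2) == b.charAt(j - 1);
                if (swapped) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    // Every dictionary term within maxDistance (inclusive) of the query.
    static List<String> expectedCandidates(String query, List<String> dictionary, int maxDistance) {
        List<String> out = new ArrayList<>();
        for (String term : dictionary) {
            if (distance(query, term) <= maxDistance) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("Representatives", "Resource", "Resources");
        for (String query : dictionary) {
            System.out.println(query + " -> " + expectedCandidates(query, dictionary, 2));
        }
    }
}
```

Comparing this list with what the transducer returns for the same query would show which entries are being dropped.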
This will reduce logic, improve cross-platform maintainability, and speed up (de)serialization, as the models don't have to be copied to and fro.
Hi, is it possible to use the library so that given a string X it returns {Y_i} such that lev(X, Y_i) < K?
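To make the question concrete, here is a minimal, library-independent sketch of the set being asked for: all Y_i in a dictionary with lev(X, Y_i) < K, using the classic dynamic-programming Levenshtein distance. The class and method names (WithinDistance, lev, closerThan) are hypothetical. If the builder's defaultMaxDistance seen elsewhere on this page is an inclusive bound, which this page does not confirm, the strict bound K would correspond to a maximum distance of K - 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WithinDistance {

    // Classic Levenshtein distance (insert, delete, substitute) with two rolling rows.
    static int lev(String x, String y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); j++) prev[j] = j;
        for (int i = 1; i <= x.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length(); j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                int best = Math.min(prev[j] + 1, curr[j - 1] + 1);
                curr[j] = Math.min(best, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length()];
    }

    // All dictionary terms y with lev(x, y) strictly less than k.
    static List<String> closerThan(String x, List<String> dictionary, int k) {
        List<String> out = new ArrayList<>();
        for (String y : dictionary) {
            if (lev(x, y) < k) {
                out.add(y);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("foo", "bar", "baz", "food");
        System.out.println(closerThan("foo", dictionary, 2));
    }
}
```

This scans the whole dictionary for every query, so it is only a baseline for small inputs or for checking results.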
This will be useful for DRY-ing up a lot of serialization code.
This would be useful for caching frequently-misspelled terms, etc.
Even if it is a readonly key, it shouldn't be in the source code ...
This is a good, next step to something akin to a double-array trie, etc.
I currently perform a lot of manual work during deployments, which is burdensome and error-prone. Automate this process so I can focus on the core library without the corresponding ops work.
Describe what the transition functions are doing, such as where insertions, deletions, substitutions, transpositions, merges, and splits are occurring.
Don't go wild with logging, but make it useful and replace the current logging statements with Slf4j statements.
Warnings are potential errors. They also pollute logs and stuff.
After a fair amount of directory-structure fiddling to get the resources in the shared sibling project read by gradle test, the tests on the default git branch seem to fail:
com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp FAILED
java.lang.IllegalArgumentException: Due to caveats with the current DAWG implementation, terms must be inserted in ascending order
at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.add(SortedDawg.java:97)
at com.github.dylon.liblevenshtein.collection.dawg.AbstractDawg.addAll(AbstractDawg.java:121)
at com.github.dylon.liblevenshtein.collection.dawg.SortedDawg.<init>(SortedDawg.java:85)
at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:86)
at com.github.dylon.liblevenshtein.collection.dawg.factory.DawgFactory.build(DawgFactory.java:24)
at com.github.dylon.liblevenshtein.levenshtein.factory.TransducerBuilder.dictionary(TransducerBuilder.java:129)
at com.github.dylon.liblevenshtein.levenshtein.MergeAndSplitTransducerTest.setUp(MergeAndSplitTransducerTest.java:37)
So I find that either setup instructions are missing, or the test is "really" failing.
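The exception in the trace says that terms must be inserted in ascending order. A minimal sketch of how a caller might check and normalize a word list before handing it to a builder follows. It assumes that "ascending" means Java's natural String ordering (compareTo), which the trace does not confirm, and the helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class SortedTerms {

    // True if each term is not smaller than its predecessor under String.compareTo.
    static boolean isAscending(List<String> terms) {
        for (int i = 1; i < terms.size(); i++) {
            if (terms.get(i - 1).compareTo(terms.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    // Returns the terms sorted in natural order with duplicates removed.
    static List<String> sortedUnique(Collection<String> terms) {
        return new ArrayList<>(new TreeSet<>(terms));
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("foo", "bar", "baz", "bar");
        System.out.println(isAscending(raw));
        System.out.println(sortedUnique(raw));
    }
}
```

Note that natural String ordering places uppercase letters before lowercase ones, so a file sorted with a locale-aware tool may not pass this check.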
I think that, if the tests cannot be run after cloning only the Java project, the recommended dev setup for the overall project should be provided at the bottom of the readme.
What is the gradle way to fix that?
$ gradle help
:help
Welcome to Gradle 2.8.
...
$ gradle test --info
Starting Build
Settings evaluated using settings file '/liblevenshtein-java/settings.gradle'.
Projects loaded. Root project using build file '/liblevenshtein-java/build.gradle'.
Included projects: [root project 'liblevenshtein']
Evaluating root project 'liblevenshtein' using build file '/liblevenshtein-java/build.gradle'.
All projects evaluated.
Selected primary task 'test' from project :
FAILURE: Build failed with an exception.
* What went wrong:
Could not determine the dependencies of task ':processTestResources'.
> Source directory '/liblevenshtein-java/src/test/resources' is not a directory.
* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.
BUILD FAILED
Total time: 7.829 secs
Stopped 0 compiler daemon(s).
Instead of com.github.dylon.liblevenshtein, use something like com.github.liblevenshtein. The former is a residue of my personal repos from before I created the universal-automata org.
... accessible via JCenter and Maven Central
I have locally confirmed getting findAll results that are not in the dictionary provided, using the Algorithm.TRANSPOSITION algorithm.
I think I could add an assertion for this case in TranspositionTransducerTest.java, where currently it seems that this condition is not being asserted against. Or is this already tested for in some other part of the test suite? Then try to reproduce on the existing test dictionary...
For now I am unable to get the existing tests working, though.
Most of them add a lot of unnecessary code to maintain.
Create demos for the following JVM-based languages:
When the dictionary depth is greater than 100, the default behavior of protobuf's CodedInputStream is to throw an exception. The recursion depth limit needs to be increased, probably to Integer.MAX_VALUE.
Some of them are pulling-in snapshot releases, etc.
As per the preferences of those I've worked with, only use interfaces when they are necessary. Everything does not require an interface.