Comments (7)
Glad you like my suggestion and thank you for the quick response :)
I'd like to convey in the APIs somehow that this is specifically configuring the default values for those aspects of the fuzzy search (they could be overridden explicitly at a query level), so something like FuzzySearchDefaultMaxEditDistance and FuzzySearchDefaultMaxSequentialEdits - they're quite a mouthful though! 😄
Maybe an additional configuration builder for the fuzzy search or its defaults then? The existing AssumeFuzzySearch() could optionally be renamed and moved into it so the fluent API could read something like these (exclusive) options:
.WithQueryParser(options => options
    .FuzzySearch(fuzzy => fuzzy
        .EnableByDefault() // this was options.AssumeFuzzySearch() before
        .WithMaxEditDistance(term => term.Length / 4)
        .WithMaxSequentialEdits(term => term.Length / 10)))

.WithQueryParser(options => options
    .FuzzySearch(fuzzy => fuzzy
        .EnableByDefault() // this was options.AssumeFuzzySearch() before
        .WithDefaults(
            maxEditDistance: term => term.Length / 4,
            maxSequentialEdits: term => term.Length / 10)))

.WithQueryParser(options => options
    .AssumeFuzzySearch() // remains unchanged
    .WithFuzzySearchDefaults(
        maxEditDistance: term => term.Length / 4,
        maxSequentialEdits: term => term.Length / 10))
The last option would be the backwards-compatible approach.
2.1 feels very similar to 3 as a concept, and I'm not sure what additional context we'd be able to get there.
Yes, 1, 2.1 and 3 are all similar and only differ by what gets passed into the callback Func<> that returns the short value for the setting. The only useful info I can think of ATM when it comes to guessing an appropriate value for maxEditDistance or maxSequentialEdits would be the term itself (?) or even only its length (option 3).
Option 1 is comparatively useless, as it can always be achieved by returning a constant value from the callback.
If you can think of other useful information in this context, passing it alongside the term/length into the callback as a custom object (option 2.1) or as multiple parameters (via method overloads) may be the way to go.
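To make the comparison of options 1 and 3 concrete, here is a minimal sketch, assuming the callbacks take the term and return a ushort. The class and field names are hypothetical and only illustrate the shape of the idea; they are not part of LIFTI.

```csharp
using System;

// Hypothetical names for illustration only; none of this is LIFTI's API.
public static class FuzzyDefaultsSketch
{
    // Option 1 expressed as a callback: it ignores its input and returns a constant.
    public static readonly Func<string, ushort> Constant = _ => 2;

    // Option 3 style: derive the value from the term, here only from its length.
    // Integer division truncates, so terms shorter than 4 characters yield 0.
    public static readonly Func<string, ushort> ByLength =
        term => (ushort)(term.Length / 4);
}
```

With these definitions, FuzzyDefaultsSketch.Constant("search") returns 2 for any term, while FuzzyDefaultsSketch.ByLength("search") returns 1 and FuzzyDefaultsSketch.ByLength("configuration") returns 3, which is why a callback-based setting covers the constant case as well.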
The main challenge is to not confuse configuration parts of the syntax with the token text being matched, so using an additional ? to delimit the configuration from the term, e.g. ?2,0?term might be less ambiguous.
That's fine with me. That would enable fuzzy-searching for numbers as well, which both 2.2.1 and 2.2.2 would indeed have ambiguous syntax for. And that's why I'd leave the consistency of the API and query syntax in your hands - I'm sure you've spent more thought on it than I have :)
from lifti.
.WithQueryParser(options => options
    .AssumeFuzzySearch() // remains unchanged
    .WithFuzzySearchDefaults(
        maxEditDistance: term => term.Length / 4,
        maxSequentialEdits: term => term.Length / 10))
I'm going to have a play with this option, mainly because it maintains backwards compatibility and because there aren't enough configuration points on the query parser to warrant splitting fuzzy matching out into its own builder code. If the query builder itself starts to become too bloated, that decision can always be revisited for a breaking change release.
Thanks for the feedback! This code commit and docs commit should address all of these points.
- The query syntax will support the comma being omitted, i.e. ?2?term will be allowed.
- The default calculation for max sequential edits will change from termLength / 4 to termLength < 4 ? 1 : termLength / 4 - yes this was an unintended consequence of the change! :)
- All the docs have been updated and links fixed
In version 3.4
- the query syntax now works as I would have expected it without reading the manual and
- the default calculation for max sequential edits works better for short fuzzy search terms.
Also, the doco links work again :)
Thanks a bunch <3
Great suggestions, thanks for making the effort to write them up so clearly.
Generally speaking this is definitely something I'd like to get added in - when I was putting together the original fuzzy matching implementation I was aware this was missing, but I ran out of spare time to get it fully implemented. Having some level of fuzzy matching was better than none at all, and it's possible to manually create a Query containing a custom configured FuzzyMatchQueryPart if it was a real issue.
Addressing your specific points/suggestions:
Index-level configuration
Points 1) and 3): I like both these suggestions - my original thinking was only along the lines of the former, but allowing for a dynamic calculation of the parameters based on search term length is a neat idea.
I'd like to convey in the APIs somehow that this is specifically configuring the default values for those aspects of the fuzzy search (they could be overridden explicitly at a query level), so something like FuzzySearchDefaultMaxEditDistance and FuzzySearchDefaultMaxSequentialEdits - they're quite a mouthful though! 😄
Query/search term level configuration
2.1 feels very similar to 3 as a concept, and I'm not sure what additional context we'd be able to get there.
2.2 The main challenge is to not confuse configuration parts of the syntax with the token text being matched, so using an additional ? to delimit the configuration from the term, e.g. ?2,0?term might be less ambiguous. ? is already a special character in the query language anyway, so it's not adding anything additional there.
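As a rough illustration of how a second ? can separate the configuration from the term, here is a minimal parsing sketch. It is not LIFTI's actual query parser: the class name, the tuple shape, and the rule that a missing value means "use the configured default" are all assumptions made for this example.

```csharp
using System;

// Hypothetical sketch; this is not LIFTI's actual query parser.
public static class FuzzyPrefixSketch
{
    // Accepts "?term", "?2?term", "?2,?term", "?,1?term" and "?2,0?term".
    // A null value means "not specified, fall back to the configured default".
    public static (ushort? MaxEdits, ushort? MaxSequential, string Term) Parse(string token)
    {
        if (string.IsNullOrEmpty(token) || token[0] != '?')
            throw new FormatException("A fuzzy token must start with '?'.");

        // Look for the second '?' that closes the configuration part.
        int close = token.IndexOf('?', 1);
        if (close < 0)
            return (null, null, token.Substring(1)); // plain "?term"

        string config = token.Substring(1, close - 1);
        string term = token.Substring(close + 1);

        string[] parts = config.Split(',');
        if (parts.Length > 2)
            throw new FormatException("Expected at most two comma-separated values.");

        ushort? maxEdits = ParseOptional(parts[0]);
        ushort? maxSequential = parts.Length == 2 ? ParseOptional(parts[1]) : (ushort?)null;
        return (maxEdits, maxSequential, term);
    }

    // An empty slot (as in "?2,?term") is treated as unspecified.
    private static ushort? ParseOptional(string text) =>
        text.Length == 0 ? (ushort?)null : ushort.Parse(text);
}
```

Because the configuration sits between two ? characters, a numeric term such as ?2,0?1234 stays unambiguous in this sketch: everything after the second ? is the term.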
I'll try to find some time to have a look at this
@h0lg Ok, v3.3.0 is published now. Let me know how you get on!
Works like a charm at first glance - thank you for the prompt solution <3
- I've tried the index-level config and ended up with the following for a start - after realizing I want maxSequentialEdits to return at least 1. In other words, returning zero for maxSequentialEdits reverts the fuzzy search back to an exact search, independent of what is returned from maxEditDistance.
.WithQueryParser(o => o.WithFuzzySearchDefaults(
maxEditDistance: termLength => (ushort)(termLength / 3),
// avoid returning zero here to allow for edits in the first place
maxSequentialEdits: termLength => (ushort)(termLength < 6 ? 1 : termLength / 6)))
Maybe that's something that should be sanity-checked or corrected at runtime? Hinted at in the XML and/or online docs? I'm not sure. It was kind of obvious to me what the problem was after realizing that termLength / x rounds down to 0 if termLength < x and that querying for ?2,0?term yielded only exact matches.
- The query syntax for overriding the defaults seems to work just fine. One remark: I tried ?3?term instead of ?3,?term at first and expected that to only configure the maxEditDistance before reading about the syntax. So that was unexpected, but not a problem if it makes things easier to parse or less ambiguous.
- And even the config-less defaults return better matches for short fuzzy search terms now :)
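The rounding issue from the first point can be reproduced without the library. The following is a small sketch with hypothetical helper names; it only shows the integer arithmetic and one possible guard, not how LIFTI calculates its defaults.

```csharp
using System;

// Hypothetical helpers for illustration; not part of LIFTI.
public static class SequentialEditsSketch
{
    // Integer division truncates, so this yields 0 for any term
    // shorter than the divisor.
    public static ushort Unguarded(int termLength, int divisor) =>
        (ushort)(termLength / divisor);

    // Clamp to a minimum of 1 so that short terms still allow an edit.
    public static ushort AtLeastOne(int termLength, int divisor) =>
        (ushort)Math.Max(1, termLength / divisor);
}
```

For a four-character term and a divisor of 6, Unguarded returns 0 while AtLeastOne returns 1; for a twelve-character term both return 2. For non-negative lengths the clamp gives the same results as the termLength < 6 ? 1 : termLength / 6 expression used in the configuration above.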
I've also noticed that the internal doco-links on page https://mikegoatly.github.io/lifti/docs/searching/fuzzy-matching/ are broken.
I'll leave the issue open for you to further work on my feedback if you choose to do so, but consider it done myself.
Thanks again for your efforts and keep up the good work!