Comments (7)

h0lg commented on May 11, 2024

Glad you like my suggestion and thank you for the quick response :)

> I'd like to convey in the APIs somehow that this is specifically configuring the default values for those aspects of the fuzzy search (they could be overridden explicitly at a query level), so something like FuzzySearchDefaultMaxEditDistance and FuzzySearchDefaultMaxSequentialEdits - they're quite a mouthful though! 😄

Maybe an additional configuration builder for the fuzzy search or its defaults then? The existing AssumeFuzzySearch() could optionally be renamed and moved into it so the fluent API could read something like these (exclusive) options:

.WithQueryParser(options => options
    .FuzzySearch(fuzzy => fuzzy
        .EnableByDefault() // this was options.AssumeFuzzySearch() before
        .WithMaxEditDistance(term => term.Length / 4)
        .WithMaxSequentialEdits(term => term.Length / 10)))

.WithQueryParser(options => options
    .FuzzySearch(fuzzy => fuzzy
        .EnableByDefault() // this was options.AssumeFuzzySearch() before
        .WithDefaults(
            maxEditDistance: term => term.Length / 4,
            maxSequentialEdits: term => term.Length / 10)))

.WithQueryParser(options => options
    .AssumeFuzzySearch() // remains unchanged
    .WithFuzzySearchDefaults(
        maxEditDistance: term => term.Length / 4,
        maxSequentialEdits: term => term.Length / 10))

The last option would be the backwards-compatible approach.

> 2.1 feels very similar to 3 as a concept, and I'm not sure what additional context we'd be able to get there.

Yes, 1, 2.1 and 3 are all similar and only differ by what gets passed into the callback Func<> that returns the short value for the setting. The only useful info I can think of ATM when it comes to guessing an appropriate value for maxEditDistance or maxSequentialEdits would be the term itself (?) or even only its length (option 3).
Option 1 is comparatively useless, as it can always be achieved by returning a constant value from the callback.

If you can think of other useful information in this context, passing it alongside the term/length into the callback as a custom object (option 2.1) or as multiple parameters (via method overloads) may be the way to go.
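
To illustrate why option 1 is covered by the callback-based options, here is a minimal sketch in plain C#. It does not use LIFTI itself: `FuzzyCallbacks`, `Constant` and `FromLength` are hypothetical names, and the `ushort` return type is an assumption about what the "short value" would look like in the final API.

```csharp
using System;

// Hypothetical helpers, not part of LIFTI. They only show how the
// callback shapes discussed above relate to each other.
public static class FuzzyCallbacks
{
    // Option 1 (a fixed value) expressed as a callback that ignores its input.
    public static Func<string, ushort> Constant(ushort value) => _ => value;

    // Option 3 (length only) lifted to a callback that receives the whole term.
    public static Func<string, ushort> FromLength(Func<int, ushort> byLength) =>
        term => byLength(term.Length);
}
```

With these, `FuzzyCallbacks.Constant(2)("whatever")` yields 2 regardless of the term, and `FuzzyCallbacks.FromLength(len => (ushort)(len / 4))("elephant")` yields 2 because the term is 8 characters long.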

> The main challenge is to not confuse configuration parts of the syntax with the token text being matched, so using an additional ? to delimit the configuration from the term, e.g. ?2,0?term might be less ambiguous.

That's fine with me. That would enable fuzzy-searching for numbers as well, which both 2.2.1 and 2.2.2 would indeed have ambiguous syntax for. And that's why I'd leave the consistency of the API and query syntax in your hands - I'm sure you've spent more thought on it than I have :)

from lifti.

mikegoatly commented on May 11, 2024
.WithQueryParser(options => options
    .AssumeFuzzySearch() // remains unchanged
    .WithFuzzySearchDefaults(
        maxEditDistance: term => term.Length / 4,
        maxSequentialEdits: term => term.Length / 10))

I'm going to have a play with this option mainly because it maintains backwards compatibility and also there aren't so many configuration points on the query parser to warrant splitting fuzzy matching out into its own builder code. If the query builder itself starts to become too bloated, that decision can always be revisited for a breaking change release.


mikegoatly commented on May 11, 2024

Thanks for the feedback! This code commit and docs commit should address all of these points.

  • The query syntax will support omitting the comma, i.e. ?2?term will be allowed.
  • The default calculation for max sequential edits will change from termLength / 4 to termLength < 4 ? 1 : termLength / 4 - yes, this was an unintended consequence of the change! :)
  • All the docs have been updated and the links fixed.
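
The effect of the changed default on short terms can be sketched as follows. The two methods simply restate the formulas quoted in the list above; `SequentialEditDefaults`, `Before` and `After` are hypothetical names, and the real implementation inside LIFTI may be structured differently.

```csharp
using System;

// Hypothetical stand-ins for the two formulas quoted above,
// not code taken from LIFTI.
public static class SequentialEditDefaults
{
    // Previous default: integer division truncates, so terms shorter
    // than 4 characters yield 0 sequential edits.
    public static ushort Before(int termLength) => (ushort)(termLength / 4);

    // New default: terms shorter than 4 characters still allow one
    // sequential edit; longer terms behave as before.
    public static ushort After(int termLength) =>
        (ushort)(termLength < 4 ? 1 : termLength / 4);
}
```

For a 3-character term, `Before(3)` is 0 while `After(3)` is 1. For an 8-character term both return 2, so only short terms are affected.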


h0lg commented on May 11, 2024

In version 3.4

  • the query syntax now works as I would have expected it without reading the manual and
  • the default calculation for max sequential edits works better for short fuzzy search terms.

Also, the doco links work again :)

Thanks a bunch <3


mikegoatly commented on May 11, 2024

Great suggestions, thanks for making the effort to write them up so clearly.

Generally speaking this is definitely something I'd like to get added in - when I was putting together the original fuzzy matching implementation I was aware this was missing, but I ran out of spare time to get it fully implemented. Having some level of fuzzy matching was better than none at all, and it's possible to manually create a Query containing a custom configured FuzzyMatchQueryPart if it was a real issue.

Addressing your specific points/suggestions:

Index-level configuration

Points 1) and 3): I like both these suggestions - my original thinking was only along the lines of the former, but allowing for a dynamic calculation of the parameters based on search term length is a neat idea.

I'd like to convey in the APIs somehow that this is specifically configuring the default values for those aspects of the fuzzy search (they could be overridden explicitly at a query level), so something like FuzzySearchDefaultMaxEditDistance and FuzzySearchDefaultMaxSequentialEdits - they're quite a mouthful though! 😄

Query/search term level configuration

2.1 feels very similar to 3 as a concept, and I'm not sure what additional context we'd be able to get there.

2.2 The main challenge is to not confuse configuration parts of the syntax with the token text being matched, so using an additional ? to delimit the configuration from the term, e.g. ?2,0?term might be less ambiguous. ? is already a special character in the query language anyway, so it's not adding anything additional there.
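
As a rough illustration of why a second ? removes the ambiguity, the sketch below splits a token of the form ?2,0?term into its configuration and its term. This is not LIFTI's parser: `FuzzyTokenSketch` and `TryParse` are hypothetical names, and the sketch deliberately handles only the full two-value form shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch, not LIFTI's actual query parser.
public static class FuzzyTokenSketch
{
    // ?<maxEditDistance>,<maxSequentialEdits>?<term>
    private static readonly Regex Pattern = new Regex(
        @"^\?(?<edits>[0-9]+),(?<sequential>[0-9]+)\?(?<term>.+)$");

    public static bool TryParse(
        string token,
        out ushort maxEditDistance,
        out ushort maxSequentialEdits,
        out string term)
    {
        // Defaults returned when the token carries no configuration.
        maxEditDistance = 0;
        maxSequentialEdits = 0;
        term = token;

        Match match = Pattern.Match(token);
        if (!match.Success
            || !ushort.TryParse(match.Groups["edits"].Value, out ushort edits)
            || !ushort.TryParse(match.Groups["sequential"].Value, out ushort sequential))
        {
            return false;
        }

        // Everything after the second '?' is the term, even if it is numeric.
        maxEditDistance = edits;
        maxSequentialEdits = sequential;
        term = match.Groups["term"].Value;
        return true;
    }
}
```

Because the configuration is enclosed between two ? characters, a token such as ?2,0?2024 parses as edit distance 2, sequential edits 0 and the term 2024, so the digits of the term are never mistaken for settings.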

I'll try to find some time to have a look at this.


mikegoatly commented on May 11, 2024

@h0lg Ok, v3.3.0 is published now. Let me know how you get on!


h0lg commented on May 11, 2024

Works like a charm at first glance - thank you for the prompt solution <3

  1. I've tried the index-level config and ended up with the following for a start - after realizing I want maxSequentialEdits to return at least 1. In other words, returning zero for maxSequentialEdits reverts the fuzzy search back to an exact search, independent of what is returned from maxEditDistance.
.WithQueryParser(o => o.WithFuzzySearchDefaults(
    maxEditDistance: termLength => (ushort)(termLength / 3),
    // avoid returning zero here to allow for edits in the first place
    maxSequentialEdits: termLength => (ushort)(termLength < 6 ? 1 : termLength / 6)))

Maybe that's something that should be sanity-checked or corrected at runtime? Hinted at in the XML and/or online docs? I'm not sure. It was kind of obvious to me what the problem was after realizing that termLength / x rounds down to 0 if termLength < x and that querying for ?2,0?term yielded only exact matches.

  2. The query syntax for overriding the defaults seems to work just fine.
    One remark: I tried ?3?term instead of ?3,?term at first and expected that to only configure the maxEditDistance before reading about the syntax. So that was unexpected, but not a problem if it makes things easier to parse or less ambiguous.

  3. And even the config-less defaults return better matches for short fuzzy search terms now :)
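
The rounding pitfall from the first point (termLength / x truncating to 0 when termLength < x) could be guarded against on the caller's side with a small wrapper. This is only a sketch: `FuzzyDefaultsGuard` and `AtLeastOne` are hypothetical names and not part of LIFTI. The callback shape (term length in, `ushort` out) is taken from the configuration snippet in the first point.

```csharp
using System;

// Hypothetical helper, not part of LIFTI.
public static class FuzzyDefaultsGuard
{
    // Wraps a length-based callback so that it never returns less than 1,
    // which keeps a fuzzy search from silently degrading to an exact match.
    public static Func<int, ushort> AtLeastOne(Func<int, ushort> calculate) =>
        termLength => Math.Max((ushort)1, calculate(termLength));
}
```

For example, `FuzzyDefaultsGuard.AtLeastOne(len => (ushort)(len / 6))` returns 1 for a 4-character term (where the unwrapped callback would return 0) and 2 for a 12-character term. A callback wrapped like this could presumably be passed as maxSequentialEdits in place of the explicit termLength < 6 ? 1 : ... check.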


I've also noticed that the internal doco-links on page https://mikegoatly.github.io/lifti/docs/searching/fuzzy-matching/ are broken.

I'll leave the issue open for you to further work on my feedback if you choose to do so, but consider it done myself.
Thanks again for your efforts and keep up the good work!
