
Comments (3)

mikegoatly commented on May 24, 2024

@h0lg If no field is specified, then the default index tokenizer is currently used to parse and normalize the search text. Only when a specific field is being searched does LIFTI use the tokenizer that was configured for that field.

In that respect, you're right in that searching across all fields will be a problem if different tokenization has been used for them, and that's exactly the same as the problem that needs to be solved here.

I'd need to spend a bit more time thinking about this than I have right now, but I'm wondering if when searching for text across multiple fields:

  • All affected fields are collected (all fields, or a subset when a wildcarded field name is specified)
  • Each unique tokenizer is used to parse the search text.
  • The distinct search terms yielded from the tokenizers are combined with a field filter operator with the appropriate field ids. (A search term in this context could be any number of tokens if a bracketed statement is encountered.)

Edge cases to consider:

  • When searching across all fields, if all tokenizers are the same or all unique tokenizers produce the same search terms, then no field filters need to be applied.
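The steps and the edge case above can be sketched roughly as follows. This is a minimal illustration in Python rather than LIFTI's C#, and every name in it (`plan_field_search`, `lower_tokenizer`, `exact_tokenizer`, the integer field ids) is hypothetical, not part of LIFTI's API. It also treats a search term as a flat tuple of tokens and ignores bracketed statements.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical tokenizer type: turns search text into a tuple of tokens.
Tokenizer = Callable[[str], Tuple[str, ...]]


def lower_tokenizer(text: str) -> Tuple[str, ...]:
    """Hypothetical case-insensitive tokenizer."""
    return tuple(text.lower().split())


def exact_tokenizer(text: str) -> Tuple[str, ...]:
    """Hypothetical case-sensitive tokenizer."""
    return tuple(text.split())


def plan_field_search(
    search_text: str,
    field_tokenizers: Dict[int, Tokenizer],
) -> List[Tuple[Tuple[str, ...], List[int]]]:
    """Group field ids by the search terms their tokenizer yields.

    Returns (terms, field_ids) pairs. If a single pair covers every
    affected field, no field filter is needed.
    """
    parsed: Dict[Tokenizer, Tuple[str, ...]] = {}
    fields_by_terms: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for field_id, tokenizer in field_tokenizers.items():
        # Parse the search text only once per unique tokenizer.
        if tokenizer not in parsed:
            parsed[tokenizer] = tokenizer(search_text)
        # Fields whose tokenizers yield identical terms share one group.
        fields_by_terms[parsed[tokenizer]].append(field_id)
    return [(terms, sorted(ids)) for terms, ids in fields_by_terms.items()]


fields = {1: lower_tokenizer, 2: lower_tokenizer, 3: exact_tokenizer}

# Tokenizers disagree: two groups, each needing its own field filter.
mixed_case_plan = plan_field_search("Hello World", fields)

# Tokenizers agree on the output: one group covering all fields.
lower_case_plan = plan_field_search("hello world", fields)
```

Here `mixed_case_plan` holds two groups (fields 1 and 2 with the lowercased terms, field 3 with the original casing), while `lower_case_plan` collapses to a single group, which is the case where the field filters could be dropped.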

I think this will require quite a bit of rework in the query parser logic, but it's certainly not impossible...

from lifti.

h0lg commented on May 24, 2024

I understand that in your example it is unclear which tokenizer to apply to the search text if the index itself uses a different tokenizer than the field(s) being searched. I never thought about this configuration and don't have an answer.

But how does lifti decide which tokenizer to use for the search text when searching across all fields with different configured tokenizers? Isn't that a similar question? Or am I missing some important difference?


h0lg commented on May 24, 2024

I see, thanks for the clarification and sharing your thoughts.

Explaining the intricacies of the tokenization during the field search process and what happens in which case seems daunting to me. Maybe we're overcomplicating this? You could go with a rule that's easy to communicate and doesn't require you to explain the underlying mechanics, even if it has limitations, e.g.:

If you search for the same term/query across multiple fields (using wildcards, pipes, or whatever), you can only do so if they share the same tokenizer. Otherwise you have to write separate field queries.

Would that make things easier?
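The simpler rule suggested above could be enforced with a check along these lines. Again this is only a sketch in Python, not LIFTI's C# or its API: `require_shared_tokenizer` and the field and tokenizer names are made up for illustration, and tokenizers are identified by plain strings.

```python
from typing import Dict, Iterable


def require_shared_tokenizer(
    field_names: Iterable[str],
    tokenizer_by_field: Dict[str, str],
) -> str:
    """Return the one tokenizer shared by all given fields, else raise.

    Hypothetical helper: tokenizers are represented by an identifying name.
    """
    names = list(field_names)
    distinct = {tokenizer_by_field[name] for name in names}
    if len(distinct) != 1:
        # Zero fields or more than one tokenizer: the query is rejected
        # and the user is asked to write separate field queries instead.
        raise ValueError(
            f"Fields {names} do not share a single tokenizer "
            f"(found {sorted(distinct)}); write separate field queries."
        )
    return distinct.pop()


# Hypothetical index configuration.
tokenizers = {"title": "default", "body": "default", "code": "case_sensitive"}

shared = require_shared_tokenizer(["title", "body"], tokenizers)
```

With this configuration `shared` is `"default"`, whereas asking for `["title", "code"]` raises a `ValueError`. The trade-off is that the rule is easy to explain but rejects some queries that the per-tokenizer approach could answer.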

