Comments (4)
Hi @xbaha - is this your explicit use case? I.e. you're storing names, and want to be able to search for them appearing in those sorts of permutations?
from lifti.
Not explicitly, I just gave an example. I am building a search for YouTube channels. A channel name might be pronounced as two (or more) separate words but is actually a single word in the channel title, e.g. (mrwhoestheboss), so people will usually type variations like (Mr who is the boss, whose the boss, mrwhosthe boss) because it's hard to remember the exact title. Thank you.
Ok, that makes sense. The tricky bit here is that each entry will only have one token (word) associated with it: the channel name. Normally you'd be indexing a body of text against an entry, and in this case there's no easy way to split the text to index ("mrwhoestheboss") into separate tokens.
I'm not sure how good a fit LIFTI will be for you, but I can think of a couple of ways you could approach the problem:
Wildcard searches
Create a wildcard search query, e.g. *mr* & *whose* & *the* & *boss*. This would look for all of the search terms appearing anywhere in the indexed text. The drawback here is that you won't be able to use any sort of fuzzy matching on the words, so *who* & *is* wouldn't match. Also, wildcards at the start of a search term are a bit more computationally expensive.
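As a rough illustration of the wildcard approach, the query string could be assembled from whatever the user typed. This is only a sketch: `WildcardQueryBuilder` and `Build` are hypothetical names, not part of LIFTI, and the helper does not escape characters that have a special meaning in the query syntax.

```csharp
using System;
using System.Linq;

public static class WildcardQueryBuilder
{
    // Hypothetical helper: wraps each whitespace-separated word in wildcards
    // and joins the words with the "and" operator used in the example above.
    public static string Build(string userInput)
    {
        var words = userInput.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" & ", words.Select(word => $"*{word}*"));
    }
}
```

For example, `WildcardQueryBuilder.Build("mr whose the boss")` returns `*mr* & *whose* & *the* & *boss*`, which can then be passed to whichever search call you already use.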
Index multiple substrings for each channel name and use starts-with searches
Each channel name could be indexed against all of its trailing substrings, e.g.
whoistheboss
hoistheboss
oistheboss
istheboss
...
boss
oss
ss
You could then speed up query time by using only starts-with search terms (e.g. who*) instead of leading wildcards. One way to do this would be to provide a custom index tokenizer like this:
    using System;
    using System.Collections.Generic;
    using Lifti;
    using Lifti.Tokenization;

    // Indexes every trailing substring of the text as its own token, so that a
    // starts-with term such as who* can match in the middle of a channel name.
    public class SubstringTokenizer : IIndexTokenizer
    {
        // Shortest substring that is indexed; shorter text is indexed whole.
        private const int MinTokenLength = 3;

        // Never split: the whole channel name is treated as one piece of text.
        public bool IsSplitCharacter(char character)
        {
            return false;
        }

        public string Normalize(ReadOnlySpan<char> tokenText)
        {
            return tokenText.ToString().ToLower();
        }

        public IReadOnlyCollection<Token> Process(IEnumerable<DocumentTextFragment> input)
        {
            var results = new List<Token>();
            foreach (var fragment in input)
            {
                results.AddRange(Process(fragment.Text.Span));
            }

            return results;
        }

        public IReadOnlyCollection<Token> Process(ReadOnlySpan<char> tokenText)
        {
            if (tokenText.Length < MinTokenLength)
            {
                return new[] { new Token(tokenText.ToString(), new TokenLocation(0, 0, (ushort)tokenText.Length)) };
            }

            // Number of start positions that leave at least MinTokenLength characters.
            var endIndex = tokenText.Length - MinTokenLength + 1;
            var results = new List<Token>(endIndex);
            for (var i = 0; i < endIndex; i++)
            {
                var substring = tokenText.Slice(i).ToString();
                results.Add(new Token(substring, new TokenLocation(i, i, (ushort)substring.Length)));
            }

            return results;
        }
    }

    public override async Task RunAsync()
    {
        // Create a full text index that uses the substring tokenizer and
        // treats search terms as fuzzy by default
        var index = new FullTextIndexBuilder<int>()
            .WithQueryParser(x => x.AssumeFuzzySearchTerms())
            .WithDefaultTokenization(x => x.WithFactory(o => new SubstringTokenizer()))
            .Build();

        // Index
        await index.AddAsync(1, "mrwhoestheboss");
        await index.AddAsync(2, "someotherchannel");
        await index.AddAsync(3, "awesomesauce");

        RunSearch(index, "who* the* boss*");
        /* OUTPUT:
         * Executing query: who* the* boss*
         * Matched items total score:
         * 1 (2.431662135269188)
         */

        // Using ANDs between the search terms means this won't work, because "is" doesn't appear in the channel name at all
        RunSearch(index, "who* is* the* boss*");
        /* OUTPUT:
         * Executing query: who* is* the* boss*
         * Matched items total score:
         */

        // But you could use ORs - you'd get more results back, but because each matched substring contributes
        // to the overall score, you're likely to get the best match returned at the top of the list:
        RunSearch(index, "who* | is* | the* | boss*");
        /* OUTPUT:
         * Executing query: who* | is* | the* | boss*
         * Matched items total score:
         * 1 (2.431662135269188)
         * 2 (0.4400033975917526)
         */

        WaitForEnterToReturnToMenu();
    }
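To see why the OR query still puts the right channel first, here is a dependency-free sketch that mimics the idea without using LIFTI at all. It assumes a crude stand-in for scoring (simply counting how many starts-with terms match some trailing substring of the name); LIFTI's real scoring is different, and `SuffixMatchDemo`, `Suffixes` and `CountMatches` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuffixMatchDemo
{
    // All trailing substrings of `name` that are at least `minLength` characters long.
    // Names shorter than `minLength` yield nothing in this simplified version.
    public static IEnumerable<string> Suffixes(string name, int minLength = 3)
    {
        for (var i = 0; i + minLength <= name.Length; i++)
        {
            yield return name.Substring(i);
        }
    }

    // Counts how many of the prefixes start at least one trailing substring of `name`.
    public static int CountMatches(string name, IEnumerable<string> prefixes)
    {
        var suffixes = Suffixes(name).ToList();
        return prefixes.Count(p => suffixes.Any(s => s.StartsWith(p, StringComparison.Ordinal)));
    }

    public static void Main()
    {
        var channels = new[] { "mrwhoestheboss", "someotherchannel", "awesomesauce" };
        var terms = new[] { "who", "is", "the", "boss" };

        // Rank channels by the number of matched terms, best first.
        foreach (var channel in channels.OrderByDescending(c => CountMatches(c, terms)))
        {
            Console.WriteLine($"{channel}: {CountMatches(channel, terms)} of {terms.Length} terms");
        }
    }
}
```

With this stand-in, "mrwhoestheboss" matches three of the four terms, "someotherchannel" matches only "the" (via "therchannel"), and "awesomesauce" matches none, which mirrors the ordering of the OR query results above.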
Closing this issue for now as I don't think this needs any changes to the library.