
lifti's Introduction


LIFTI

A lightweight full text indexer for .NET

Key features:

Getting started

📖 Read the documentation - there's lots of useful information and examples there, along with some getting started guides.

🧑‍💻 Check out some sample code - the repo contains examples that can be run as a console application.

🤹‍♀️ Use LIFTI in a Blazor app - try out various queries against Wikipedia content

Support

If you find LIFTI useful, why not buy me a coffee to power the development work?


Contribute

It would be great to have more people contributing to LIFTI - how can you help?

  • Create issues for bugs you find - level 1
  • Create feature suggestions - level 2
  • Create pull requests for documentation changes - level 3
  • Create pull requests for bug fixes or features - boss level

lifti's People

Contributors

dependabot[bot], mikegoatly, mikegoatly-coeo

lifti's Issues

Non-English fuzzy search

Thanks for your library, it looks amazing.
I want to use it in the psqqq.com service for product matching.

The attached file contains the list of product names:
product_names_samples.csv

Here is my test code to search for one product from the mentioned file.

using Lifti;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace FullTextSearchTest
{
    internal class Program
    {
        static async Task Main(string[] args)
        {
            await RunAsync();
        }

        static async Task RunAsync()
        {
            var lines = File.ReadAllLines(@"product_names_samples.csv", Encoding.UTF8);

            var index = new FullTextIndexBuilder<int>()
            .WithDefaultTokenization(o => o.WithStemming())
            .WithQueryParser(o => o.AssumeFuzzySearchTerms())
            .Build();

            index.BeginBatchChange();

            var idx2line = new Dictionary<int, string>();
            var i = 1;
            foreach (var l in lines)
            {
                idx2line[i] = l;
                if (i % 100 == 0)
                {
                    Console.WriteLine(i);
                }
                await index.AddAsync(i++, l.Trim('"'));
            }
            await index.CommitBatchChangeAsync();
            //Lifti.Querying.QueryParserException: 'Token expected: EndAdjacentTextOperator'

            var sk = "720 мм x 20 мкн. x 3000 м., 3\", KDX MATT, шт";
            sk = "Нож для TFD-550  1 комплект - 4 ножа ((2 шт  90 град, 2 шт  120 град))";
            sk = sk.Replace("\"", "");

            var res = index.Search(sk);
            //var res = index.Search("720 мм");
            //var res = index.Search("?720 ?мм");

            //var res = index.Search();

            Console.WriteLine($"Found: {res.Count()}");

            foreach (var sa in res)
            {
                var l = idx2line[sa.Key];
                Console.WriteLine($"{sa.Score} - {l}");
                //sa.
            }
        }
    }
}

Any ideas how to match the best result here?

PS.
I think I'm building the query incorrectly, but I have no idea how to adjust it properly...

Chaining multiple SerializeAsync/DeserializeAsync - header data issue

I need to serialize two FullTextIndex instances into a single file.

When trying to read the file, I receive the following error:

System.AggregateException: One or more errors occurred. (Unable to read header data from serialized index content.)

My deserialization is done inside a task:

return Task.Run(async () =>
{
   await serializer.DeserializeAsync(index1, stream, disposeStream: false);
   await serializer.DeserializeAsync(index2, stream, disposeStream: true);
});

The index is serialized by a separate program, similarly inside a task:

return Task.Run(async () =>
{
   await serializer.SerializeAsync(index1, stream, disposeStream: false);
   await serializer.SerializeAsync(index2, stream, disposeStream: true);
});
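
A possible workaround sketch (an untested assumption on my part, not a confirmed fix) is to give each index its own length-prefixed segment, so that the second DeserializeAsync starts at exactly the right offset instead of reading into the first payload's bytes:

using var file = File.Open("indexes.bin", FileMode.Create);

foreach (var index in new[] { index1, index2 })
{
    using var buffer = new MemoryStream();
    await serializer.SerializeAsync(index, buffer, disposeStream: false);

    file.Write(BitConverter.GetBytes((int)buffer.Length)); // 4-byte length prefix
    buffer.Position = 0;
    await buffer.CopyToAsync(file);
}

Reading mirrors this: consume the prefix, then hand the serializer exactly one segment.

using var file2 = File.OpenRead("indexes.bin");

foreach (var index in new[] { index1, index2 })
{
    var prefix = new byte[4];
    file2.ReadExactly(prefix); // .NET 7+; loop over Read() on older targets

    var payload = new byte[BitConverter.ToInt32(prefix)];
    file2.ReadExactly(payload);

    using var segment = new MemoryStream(payload);
    await serializer.DeserializeAsync(index, segment, disposeStream: false);
}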

Wild card matching in exact sequence seems broken in v4

Querying with wild cards in phrases seems to be broken in v4 compared to v3.5.2.

E.g. querying the above sentence for "* to *" (note that the *s are inside quotes, denoting a phrase) matched "seems to be" and "compared to v3.5.2" in v3.5.2, but doesn't match anything in v4.

I'm unaware of query syntax changes in this regard between the two versions. If I'm using the syntax differently from how it was intended for matching such phrases in v4, please let me know how I'd have to change it. Thanks!

Wildcard

I was hoping to use the wildcard to find "stationary" and "stationery" using %. Searching for "station%ry" did not return any results - "stationary stationery" works as expected.

Enclosed is the JSON:
umClasses.zip

// Note: the generic type arguments below were lost in the original formatting;
// UmClass is an assumed model name with ClassDescriptionId and GLGuidelines
// properties, and XmlTextExtractor is assumed from this user's related issue below.
var jsonFile = File.ReadAllText(@"C:\Users\manderson\Documents\umClasses.json");
var searchModels =
    JsonConvert.DeserializeObject<System.Collections.Generic.List<UmClass>>(jsonFile);
var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(o => o.WithStemming())
    .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
    .WithTextExtractor<XmlTextExtractor>()
    .WithObjectTokenization<UmClass>(
        itemOptions => itemOptions
            .WithKey(c => c.ClassDescriptionId)
            .WithField("GLGuidelines", f => f.GLGuidelines)
    ).Build();
await index.AddRangeAsync(searchModels);

// Was hoping these two counts would be the same
var searchResults = index.Search("stationary stationery").Count();
var searchResults2 = index.Search("station%ry").Count();

The positional near operator ~> sometimes reverses the match direction

Depending on the number of documents matched on the left vs the right of the operator, the match direction is reversed: some ~> thing could return documents containing thing some rather than some thing.

This is due to a bug in the optimization logic where the left and right results are swapped around, but the left/right tolerances are not.

Storage

At the moment the index needs to be rebuilt every time, right?
I want to use this in a desktop application, so could we store the index/snapshot to a file after building it, ship it with the setup, and load it whenever needed at run time?
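
For reference, a minimal sketch of what that could look like using the BinarySerializer that appears elsewhere in these issues (the file name and key type here are illustrative):

var serializer = new BinarySerializer<int>();

// After building and populating the index once:
using (var writeStream = File.Open("index.dat", FileMode.Create))
{
    await serializer.SerializeAsync(index, writeStream);
}

// At application start-up, rehydrate a freshly built index with the same configuration:
using (var readStream = File.OpenRead("index.dat"))
{
    await serializer.DeserializeAsync(index, readStream);
}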

Documentation

This is a placeholder issue to track the documentation that needs to be written:

  • Programmatically traversing the index using index.CreateNavigator()
  • Different tokenization options, including stemming
  • Support for async indexing of an item's field
  • Serialization of non-standard index key types
  • Enumeration of indexed words
  • Getting set up - installing from nuget
  • Responding to index changes using index modification actions
  • Using the XML tokenizer
  • Improving indexing with batch processing
  • Writing custom text extractors
  • Using a custom tokenizer
  • Directly using a tokenizer to verify the tokens that are being returned
    ... more to come

Create a builder for TokenizationOptions

So it can be used like this:

bookIndex.WithItemTokenization<Book>(
    options => options
        .WithKey(b => b.BookId)
        .WithField("Title", b => b.Title, t => t.ContainingXml().WithStemming()));

Which is a bit more concise than:

bookIndex.WithItemTokenization<Book>(
    options => options
        .WithKey(b => b.BookId)
        .WithField("Title", b => b.Title, new TokenizationOptions(TokenizerKind.Xml) { Stem = true }));

sharding

Hi,
just a question:
is there a way, or a plan, to do index sharding, so that the index itself is split across different files or memory streams? Before searching, for example, 100,000 documents, only the relevant part of the index would be loaded and read into memory.

Thanks for the good work.

Surrogate pair characters crash serializer

var stream = new MemoryStream();
var serializer = new BinarySerializer<string>();
var index = new FullTextIndexBuilder<string>().Build();
await index.AddAsync("A", "🎶");

await serializer.SerializeAsync(index, stream);

This throws System.ArgumentException: Unicode surrogate characters must be written out as pairs together in the same call, not individually. Consider passing in a character array instead.

Item not getting indexed

First off, this is an amazing library and I really appreciate it. It's so well organized and documented!

The problem I'm having is this: I have indexed a corpus of 792 objects. The word "broker" appears in four of these objects, but does not appear to be indexed at all. When I build the index and search for the word "broker" it returns 0 search results.

I've experimented with just selecting the TopicContentSearchModels with the word "broker" and indexing those. When I do this, the word "broker" does get indexed and the search works properly, returning four search results.

I also experimented with sorting the original list of objects and building the index that way. When I do that, "broker" gets indexed and the search works properly, returning four search result records.

For reference I've included the search model as well as all my 'sandbox' code. I've serialized the search models and attached a file if you have any interest in reproducing the issue, as well as having some real-world data to play with.

public class TopicContentSearchModel
    {
        public int TopicId { get; set; }
        public string TopicName { get; set; }
        public string Content { get; set; }
    }

private async Task SetUpTestForLifti()
        {
            //*** BEGIN STUFF FOR LIFTI testing
            //load objects to be indexed
            var jsonFile = File.ReadAllText("C:\\Users\\manderson\\Documents\\umdata.json");
            var searchModels =
                JsonConvert.DeserializeObject<System.Collections.Generic.List<TopicContentSearchModel>>(jsonFile);
            //****SCENARIO 1
            //**** This is the defect - the word 'broker' exists 4 times in this corpus
            //**** but when the search is performed it is not found
            
            var indexWithAllModels = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();

            await indexWithAllModels.AddRangeAsync<TopicContentSearchModel>(searchModels);
            //Test Search A will have zero records - this is the issue
            var searchAllModels = indexWithAllModels.Search("broker").ToList();
            
            //Serialize search models for possible shipping off to Lifti author for help
            var topicJson = JsonConvert.SerializeObject(searchModels);

            var brokerTopics =
                searchModels.Where(x =>
                    x.TopicName.IndexOf("broker", StringComparison.CurrentCultureIgnoreCase) != -1 ||
                    x.Content.IndexOf("broker", StringComparison.CurrentCultureIgnoreCase) != -1
                ).ToList();

            var nonBrokerTopics = searchModels.Except(brokerTopics).ToList();

            //For this test, add topics with "broker" first, then
            //add other items one by one and see if the number
            //of items found when searching for "broker" changes
            var indexWithBrokersThenAddNonBrokerTopics = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();

            //Add records with term 'broker'
            await indexWithBrokersThenAddNonBrokerTopics.AddRangeAsync(brokerTopics);
            //add in each record that does not have broker, see if search returns other than 4
            foreach (var nonBrokerTopic in nonBrokerTopics)
            {
                await indexWithBrokersThenAddNonBrokerTopics.AddAsync(nonBrokerTopic);
                var testAddingNonBrokerTopicToBrokerTopics = indexWithBrokersThenAddNonBrokerTopics.Search("broker").ToList();
                if (testAddingNonBrokerTopicToBrokerTopics.Count != 4)
                {
                    throw new Exception($"Look out non-broker topic {nonBrokerTopic.TopicId} threw up");
                }
            }

            //for this test, add all topics without the word "broker"
            //then add in the topics with the word broker one at a time
            //and see if the index can find the words
            var testIndex2 = new FullTextIndexBuilder<int>()
                .WithDefaultTokenization(o => o.WithStemming())
                .WithQueryParser(o => o.WithDefaultJoiningOperator(QueryTermJoinOperatorKind.Or))
                .WithTextExtractor<XmlTextExtractor>()
                .WithObjectTokenization<TopicContentSearchModel>(
                    itemOptions => itemOptions
                        .WithKey(c => c.TopicId)
                        .WithField("TopicName", f => f.TopicName)
                        .WithField("Content", f => f.Content))
                .Build();
            await testIndex2.AddRangeAsync(nonBrokerTopics);
            for (var index = 0; index < brokerTopics.Count; index++)
            {
                var brokerTopic = brokerTopics[index];
                await testIndex2.AddAsync(brokerTopic);
                var testSearch2 = testIndex2.Search("broker").ToList();
                if (testSearch2.Count != index + 1)
                {
                    throw new Exception($"Look out non-broker topic {brokerTopic.TopicId} threw up");
                }
            }
        }

umdata.zip

Generating search result phrases from match locations

I've got the following code working and I'm able to display "matching phrases" in a popup that appears over any given search result. It is a distinct list of terms and phrases found. Searching for "toy store" in a topic that includes the full search term plus separate instances of both words would result in the list "toy store, toy, store". I'm using stemming, so it can also look like "toy store, toys, toy, store, storage". This is fine w/me. It allows the user to know at a glance what terms will be in a topic before clicking and navigating to that topic.

I just wanted to double-check that I'm using Lifti properly. If you could skim this at your convenience - hopefully it only takes a minute or two - I'd appreciate confirmation that I'm on the right track; I don't intend to make you feel like you're doing a full code review (though any comments are welcome).

First is a summary of my logic and after that copy in code just for reference.

  • Get searchResults from Lifti
  • For each search result, look up its business object by Id.
  • For each indexed field on the business object ("TopicName" and "Content"), find the locations of each match in the field text.
  • If the token indexes of any match locations are sequential, infer that this is a phrase and add the phrase to a list by taking a substring from the first position of the first item all the way through the end of the last sequential item
  • If item is not sequential then just add it to the list.
  • The distinct set of items will be the matching terms/phrases eventually displayed to the user.
searchResults = _index.Search(searchTerm);

foreach (var searchResult in searchResults)
{
  var topic = allVms.FirstOrDefault(t => t.TopicId == searchResult.Key);
  if (topic != null)
  {
    var matchPhrases = new List<string>();
    
    foreach (var match in searchResult.FieldMatches)
    {
      if (match.FoundIn == "TopicName")
        matchPhrases.AddRange(LiftiUtils.MakePhrases(topic.TopicContent.TopicName, match.Locations));
      
      if (match.FoundIn == "Content")
        matchPhrases.AddRange(LiftiUtils.MakePhrases(topic.TopicContent.Content, match.Locations));
    }
    
    topic.Phrases = matchPhrases.Distinct().ToArray();
    topic.Score = searchResult.Score;
  }
}
			
public static List<string> MakePhrases(string text, IReadOnlyList<TokenLocation> matchLocations)
{
  var phrases = new List<string>();
  if (matchLocations.Count == 0) return phrases;
  var runLength = 1;
  text = text.ToLower();
  for (var i = 1; i <= matchLocations.Count; i++)
  {
    // A run ends when we reach the end of the list or the next token isn't adjacent
    if (i == matchLocations.Count || matchLocations[i].TokenIndex - matchLocations[i - 1].TokenIndex != 1)
    {
      // Length from the start of the first token in the run to the end of the last
      var length = matchLocations[i - 1].Start + matchLocations[i - 1].Length -
        matchLocations[i - runLength].Start;
      phrases.Add(text.Substring(matchLocations[i - runLength].Start, length));
      runLength = 1;
    }
    else
    {
      runLength++;
    }
  }

  // De-duplicate case-insensitively, ordering the most frequent phrases first
  return phrases.GroupBy(x => x, StringComparer.InvariantCultureIgnoreCase)
    .Select(g => new { value = g.Key, count = g.Count() })
    .OrderByDescending(x => x.count)
    .Select(f => f.value)
    .ToList();
}

Use field tokenization options by default when searching in a field

Given this:

bookIndex.WithItemTokenization<Book>(
                options => options
                    .WithKey(b => b.BookId)
                    .WithField("Title", b => b.Title, new TokenizationOptions(TokenizerKind.Default) { Stem = true }));

bookIndex.Search("Title=foo");

The search will not use stemming, so you may not get exactly the results you were expecting given the field was indexed with stemming enabled.

You can work around this by indicating that stemming should be used when searching:

bookIndex.Search("Title=foo", new TokenizationOptions(TokenizerKind.Default) { Stem = true });

But it would be much better if when doing a field search it used the tokenization options of the field by default.

Replicate the reader/writer locking mechanism from the previous project

Lock access to reading/writing from/to the index using a reader/writer locking mechanism.

Still need to consider thread safety for multiple readers though, and I'll probably need to revisit the XmlTokenizer and InputPreprocessorPipeline implementations if multiple concurrent readers are to be supported.
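
For reference, the general shape of such a mechanism (purely illustrative - not LIFTI's actual implementation) might be:

var rwLock = new ReaderWriterLockSlim();

IEnumerable<SearchResult<int>> GuardedSearch(FullTextIndex<int> index, string query)
{
    rwLock.EnterReadLock(); // multiple concurrent readers are allowed
    try
    {
        return index.Search(query).ToList();
    }
    finally
    {
        rwLock.ExitReadLock();
    }
}

void GuardedAdd(FullTextIndex<int> index, int key, string text)
{
    rwLock.EnterWriteLock(); // writers get exclusive access
    try
    {
        // ReaderWriterLockSlim is thread-affine, so the asynchronous add is completed
        // synchronously before the lock is released on the same thread.
        index.AddAsync(key, text).GetAwaiter().GetResult();
    }
    finally
    {
        rwLock.ExitWriteLock();
    }
}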

Performing search CPU spikes

Hi, we haven't gone live yet but are very close. I have deployed our app with Lifti to a QA server and a staging server, and both exhibit CPU spikes when searches are performed. In each case the server is a medium AWS instance with 32 GB of RAM and a medium AMD processor. They idle at around 3%; when a single search is performed, one of the server's CPUs can range up into the 70% range while the other, for some reason, goes into the 90s. FWIW, there are 4 indexes that get hit asynchronously. They are fairly small, each one under a couple of thousand database records, comprising a cumulative total of 6 MB. At runtime the user submits a search term and each of the indexes is queried.
When I do the search on my laptop the CPU only spikes to around 60%.

Any thoughts on this? I'll try to get more CPU profiling information to see if I can tell specifically what is happening.

Searching with punctuation

Hi,

If I have
await index.AddAsync(1, "Murphy's law");

then
index.Search("*Murphy's*")
will match that entry, but
index.Search("*Murphys*")
will not - but I would like it to.

What's the best approach to solving this? I could strip punctuation from the strings I input to the index and the query, but there may be a nicer way. I have played around with fuzzy matching, but never quite got it right.

Thanks

Dynamic fields (was: DictionaryTokenization)

Total noob here, but rather than using a POCO as the source document I want to use a class that has a composite key and a Dictionary<string,string> for the "fields". There does not seem to be any way to do this. Could I build a DictionaryTokenizationBuilder?

I would need to iterate through the pairs in the dictionary using the key for the Field name and the Value for the text that needs to tokenized. Am I barking up the wrong tree? Is this even possible or sensible?

Lifti is a perfect fit for my use case so I hope there's a way.
Thanks
Jim
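
Given that this issue evolved into the dynamic fields feature tracked in the V5 checklist below, here is a rough sketch of how a dictionary-backed object might be registered - the method name and signature are assumptions, so check the current docs:

public class Document
{
    public string CompositeKey { get; set; }
    public Dictionary<string, string> Fields { get; set; }
}

var index = new FullTextIndexBuilder<string>()
    .WithObjectTokenization<Document>(o => o
        .WithKey(d => d.CompositeKey)
        // Assumed API shape: each dictionary key becomes a dynamically registered
        // field name, and its value is the text indexed under that field.
        .WithDynamicFields("Fields", d => d.Fields))
    .Build();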

Can field restrictions be used with fuzzy and wild card matching?

I notice unexpected behavior when combining field restrictions with either fuzzy or wild card matching.

If I have an indexed model Video with term in the keywords, but not in the title, the query title=term will not match the keywords - as expected. Also exact word queries like title="exact term" seem to work fine.

But both queries title=?term and title=term* seem to ignore the title field filter and match term in the keywords as well.

Neither the doco nor the tests hint at this behavior - making it feel like a bug.

Score boosting

This is a brain dump of some thoughts around how LIFTI could make its scoring system more flexible. This issue will be updated as the thinking evolves.

Extend the object tokenization builder to provide:

.WithObjectTokenization(o =>
  o.WithField("Name", c => c.Name, scoreBoost: 3) // Boost any scores from this field by x3
   .WithScoreBoosting(b =>
     b.Freshness(item => item.LastModified, multiplier: 3) // Boost results on a scale between oldest and newest. The value returned by the delegate must be a DateTime. E.g. if using DateTimeOffset, dto.DateTimeUtc can be used.
      .Magnitude(item => item.Rating, multiplier: 3))) // Boost results on a scale based on a numeric value. The value returned by the delegate must be a double.
      

Questions: Need to think about score boosting dynamic fields.

We could also add dynamic score boosting to the LIFTI query syntax (similar to Lucene):

term^3

Where, if the term matches, its score is boosted by that amount.

Wildcard searching

Currently you can only use a wildcard operator (*) to query for words starting with a fragment of text, e.g.

| Search | Example matches   |
|--------|-------------------|
| foo*   | food foolish foot |

This proposal is to extend wildcard searching in two ways:

  1. Add to support the * operator anywhere in a word search to match any number of characters
  2. Add the '%' operator to match a single character

Examples:

| Search | Example matches               |
|--------|-------------------------------|
| f*d    | food feed fiend fad           |
| %ish   | fish dish wish                |
| %%cket | bucket locket                 |
| *cket  | cricket locket thicket ticket |
| wi*    | wink win window               |

The current StartsWithWordQueryPart implementation will become deprecated in favour of this new implementation as it provides a strict subset of the functionality proposed here.

Proposal for general rules:

  • Multi-character wildcards (*):
    • When between two text patterns (f*d): Match zero or more characters between the end of the first text and start of the following text. Only tokens that end with the second text will be returned.
    • When appearing at the end of some search text (f*): any tokens starting with the first text will be returned. This is the same behaviour as the old "starts with" operator.
    • More than one multi-character wildcard can be used in a query, e.g. f*o*d
    • Multiple sequential multi-character wildcards will be reduced to a single wildcard, e.g. w**n will be reduced to w*n. The two are semantically identical, so this doesn't matter.
  • Single character wildcards (%)
    • Can appear anywhere in the search text, at the start, middle or end
    • Single character wildcards can appear sequentially to indicate a fixed number of substitute letters, e.g. f%%d will match f followed by any two characters and then a d.
    • Single character wildcards immediately preceding a multi-character wildcard can be used to require that n or more characters are matched, e.g. c%%%* would match cake and cakes but not cat, because at least 3 characters are required after the c.
    • Any single character wildcards immediately following a multi-character wildcard will cause an error to be thrown (e.g. d*%%). Semantically, f* and f*%% are very different, so we can't collapse them - the first can have any number of characters following the f, whereas the second can have any number of characters but at least two at the end. I currently think that implementing this would increase search complexity significantly, but it's possible it could be implemented at a later stage.
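
As a quick sketch, assuming the syntax lands exactly as proposed above:

var index = new FullTextIndexBuilder<int>().Build();
await index.AddAsync(1, "food feed fiend fad");
await index.AddAsync(2, "fish dish wish");
await index.AddAsync(3, "bucket locket cricket");

var multi = index.Search("f*d");   // document 1: food, feed, fiend and fad all match
var single = index.Search("%ish"); // document 2: fish, dish and wish all match
var mixed = index.Search("*cket"); // document 3: bucket, locket and cricket all match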

Fuzzy Search

Really cool search library.

The only thing I'm still missing is a fuzzy search.
Did I miss something?
Is fuzzy search planned?

Fluent query building

Have a think about what a fluent API for building queries would look like. This would likely be provided as an additional nuget package, e.g. Lifti.FluentQueries.

Perhaps something like:

index.Query().For("search text").And.For("something else").Execute();

index.Query()
    .InField("FieldName", fieldSearch => fieldSearch.For("something").Or.For("another"))
    .Or.InField("FIeld2", fieldSearch => fieldSearch.For("foo"))
    .Execute();

serialization options

Hi,
I'm just brainstorming and I have a few questions;
I'm comparing your serialization options with other libraries.
I have 117 KB of data, indexed it (WithObjectTokenization, 3 fields) and serialized it to binary, written to a 193 KB file.
The question is: could it get even smaller?

And could we serialize it to JSON?
If I try JsonSerializer.Serialize
it fails with

System.ReadOnlyMemory1[System.Char] is invalid for serialization or deserialization because 
it is a pointer type, is a ref struct, or contains generic parameters that have not been replaced
by specific types.

And the last, fundamental question: is the index always bigger than the data, and is it loaded fully into memory or only partly? (Thinking here about a future option of paging the index and search results - iterating, so as not to load them all.)

regards

How to search for combined words?

How do I get all the keys when I search for the term "john smith" or "johnsmith"?


var index = new FullTextIndexBuilder<int>()
        .WithQueryParser(o => o.AssumeFuzzySearchTerms())
        .Build();

await index.AddAsync(1, "john smith");
await index.AddAsync(2, "johnsmith");
await index.AddAsync(3, "smithjohn");


var results = index.Search("john smith").ToList(); // ==> 1
results = index.Search("johnsmith").ToList(); // ==> 2

In this example only the exact key is returned; I couldn't find anything in the docs that helps in such a case.
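
One partial workaround might be a trailing wildcard, since "johnsmith" starts with "john" - for example (note it still wouldn't match key 3, where "john" is embedded mid-token):

var combined = index.Search("john*").ToList(); // ==> 1 and 2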

Add EnumerateIndexedWords to IndexNavigator

Replicates functionality that was available in original LIFTI.

Once a navigator has been used to traverse an index, EnumerateIndexedWords should return all the words that start with the traversed text. E.g.

navigator.Navigate("tr");
navigator.EnumerateIndexedWords();

Would return all words in the index starting with "tr".

Adjacent words query treats a single word as being adjacent to itself

The following example prints 1; however, it should print 0 because the document doesn't contain "some" followed by "some":

using System;
using System.Threading.Tasks;
using System.Linq;
using Lifti;
					
public class Program
{
	public static async Task Main()
	{
		var index = new FullTextIndexBuilder<string>()
			.Build();

		await index.AddAsync("Foo", "Some text");

		Console.WriteLine(index.Search("\"some some\"").Count());
	}
}

.NET Fiddle

Split out text extraction and tokenization into two separate concepts

When configuring tokenization options, you can specify WithXml to indicate that the text is wrapped in XML so only the element text content should be tokenized.

This is mixing two concepts:

  • Text extraction from source text (picking only the bits of text that should be tokenized from a document)
  • Tokenization of extracted text

Add support for stop words

It would be nice to configure an index so that it doesn't contain stop words, e.g. `the`, `and`, `it`.

The list should be configurable against the tokenizer.
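
Purely as an illustration of the requested shape (this method does not exist yet - the name is invented here):

var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(o => o.WithStopWords("the", "and", "it")) // hypothetical
    .Build();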

V5 checklist

A quick braindump of the bits that will need to be tidied up before V5 can be released:

  • Add documentation for the concept of DynamicFields
  • Add documentation for new ObjectTokenizationBuilder methods for registering dynamic field readers
  • Update documentation for binary serializer to mention how fields can be re-mapped and constraints of changing the index definition when deserializing an existing index
  • Update documentation for LIFTI binary serialized format

Indexing text from nested objects

Created from this initial discussion.

Description

Allow for nested objects to provide content for a field. Similar to indexing an array of strings from an object, but an additional delegate needs to be provided to read the text from each of the nested objects in turn.

.WithField(
    "Captions", // The name of the field to record all the nested object text under
    v => v.CaptionTracks, // The set of nested objects
    ct => ct.GetFullText()) // A delegate to read the text for each nested object

Add support for document scoring

Initial investigations will be around using Okapi BM25.

The net result will be a new Score field on SearchResult.

Information needed for BM25:

Total number of documents indexed
IFullTextIndex.Count

Number of documents containing a searched term
Should just be IntermediateQueryResult.Matches.Count but I'll need to double check this when implementing.

Average word count of documents
Total number of words in a document
Not currently stored anywhere.
Possibly add metadata against IdLookup for the word count in each field? Total document words is a sum of all of these.

Store total word count against index. Average word count can be calculated from there, and index mutations can add/subtract from this.

Number of times the term appears in a given document
QueryWordMatch.Locations.Sum(l=> l.FieldMatches.Count)

k1 / b
Free parameters - these should have sensible defaults but be overridable. ~~Should they be overridable at query time, or specified when the index is created?~~ For now they will only be specifiable when the index is created.

  • Add support for calculating and storing word count metadata, including updating the de/serialization logic.
  • Add default scoring implementation
  • Allow scoring implementation to be specified when building the index
  • Update binary serialization documentation
  • Update querying documentation

Pseudocode (well, PowerShell):

$DocumentCount = 100 #N
$DocumentsContainingWord = 10 # n(qi)

$AverageWordCountInDocuments = 20# avgdl

# Note: this is the standard Okapi IDF, which can go negative when a term appears
# in more than half the documents; adding 1 inside the log would keep it positive.
$IDF = [Math]::Log((($DocumentCount - $DocumentsContainingWord + 0.5) `
                    / ($DocumentsContainingWord + 0.5)))

$k1 = 1.2; #k1 (CONFIGURATION)
$b = 0.75; #b (CONFIGURATION)
$WordFrequencyInDocument = 10; #f(qi, D)
$WordsInDocument = 30; #|D|

$WordsInDocumentWeighting = $WordsInDocument / $AverageWordCountInDocuments

$TermScore = $IDF * (($WordFrequencyInDocument * ($k1 + 1)) `
                    / ($WordFrequencyInDocument + $k1 * (1 - $b + $b * $WordsInDocumentWeighting)))

$TermScore.ToString("0.0000000") # = 4.0675915
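
The same calculation in C#, using the values from the pseudocode above:

double documentCount = 100;               // N
double documentsContainingWord = 10;      // n(qi)
double averageWordCountInDocuments = 20;  // avgdl
double k1 = 1.2;                          // k1 (configuration)
double b = 0.75;                          // b (configuration)
double wordFrequencyInDocument = 10;      // f(qi, D)
double wordsInDocument = 30;              // |D|

double idf = Math.Log(
    (documentCount - documentsContainingWord + 0.5) / (documentsContainingWord + 0.5));

double lengthWeighting = wordsInDocument / averageWordCountInDocuments;

double termScore = idf *
    (wordFrequencyInDocument * (k1 + 1)) /
    (wordFrequencyInDocument + k1 * (1 - b + b * lengthWeighting));

Console.WriteLine(termScore.ToString("0.0000000")); // 4.0675915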

Use a builder method to construct an index

This will make it much easier to provide overrides for specific parts of an index and mean that once constructed the configuration of the index can be immutable. For example, Item index configuration would be done at builder time only.

Synonyms and related items

I am curious as to whether it would be appealing to you to add the capability to facilitate synonyms and/or related terms into the search. I would like to have the search engine return items for "manufacturing" and "mfg" automatically. We have a number of such abbreviations in our text. Another one that comes to mind is "California" and "CA", etc. Examples of related terms would be search term "sports" would yield "baseball", "football", etc.

Barring having Lifti do that natively, what would be your suggestion for implementing it outside of Lifti? The only thing that comes to mind for me is to parse the input search string, looking for synonyms and the like. I am hesitant to do that, however, because we're utilizing the Lifti query language and it seems like it might be problematic to directly parse a string looking for "mfg" and directly substituting in "(mfg | manufacturing)". Is there any better way to add new conditions to a query term? I looked at the section about manually creating queries, but doesn't that put me in the position of writing all the parsing from scratch?

Any guidance would be greatly appreciated.
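
For what it's worth, here is a sketch of the query-side expansion being discussed, assuming the user input is a list of plain terms rather than full LIFTI query syntax (the synonym map is illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

static class SynonymExpander
{
    static readonly Dictionary<string, string[]> Synonyms = new(StringComparer.OrdinalIgnoreCase)
    {
        ["mfg"] = new[] { "manufacturing" },
        ["ca"] = new[] { "california" },
    };

    public static string Expand(string userInput)
    {
        var terms = userInput
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Select(term => Synonyms.TryGetValue(term, out var alternatives)
                ? $"({string.Join(" | ", new[] { term }.Concat(alternatives))})"
                : term);

        return string.Join(" ", terms);
    }
}

// SynonymExpander.Expand("mfg plant") => "(mfg | manufacturing) plant"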

Allow for custom ITokenizers

Currently the index supports plain text and Xml content during tokenization. The TokenizationOptionsBuilder should allow for a custom ITokenizer to be provided, e.g.

TokenizationOptionsBuilder.WithCustomTokenizer<T>(Func<TokenizationOptions, T> constructionDelegate) where T : ITokenizer;
builder.WithCustomTokenizer(options => new CustomTokenizer(options));

Intra-node text containing surrogate pair characters breaks serialization

A variation of #30

The fix is similar, and will require a serialization format version bump to v4.

To keep serialization sizes down for the majority case, when writing intra-node text out, the structure will be:

1 byte: Boolean - true when no surrogates and data is written in a sequence of chars, false when the data is a sequence of shorts

Remove all punctuation from index and query

Hi, I asked a similar question recently, but I am struggling a bit with punctuation and looking for advice.

If I have something like this in my index
J.J. O'Hara's

I would like to match that on the query
jj oharas

Using fuzzy logic, I think this would involve allowing more edits than I would like. It may have an impact on the amount of irrelevant results I get.

Is there a way of just stripping out all punctuation in both the index and the query?
Is WithDefaultTokenization(options => options.SplitOnPunctuation(false)) ok for the index? Is there something similar for the query? Thanks
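
For what it's worth, a sketch of one possible setup - treat IgnoreCharacters as an assumption to verify against the tokenization docs:

var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(options => options
        .SplitOnPunctuation(false)     // don't treat . and ' as token separators
        .IgnoreCharacters('.', '\''))  // and drop them from the tokens entirely
    .Build();

await index.AddAsync(1, "J.J. O'Hara's");

// "J.J." and "O'Hara's" would then be indexed as "jj" and "oharas", and the same
// default tokenization should apply when the query text itself is parsed:
var results = index.Search("jj oharas");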

Configure maxEditDistance and maxSequentialEdits for fuzzy searches

The current values for the maxEditDistance (4) and maxSequentialEdits (1) arguments of the FuzzyMatchQueryPart constructor as used in QueryParser.CreateWordQueryPart() yield too many results that are nothing like the search term for my taste - especially for short search terms.

E.g. if I search an English text for ?term with maxEditDistance = 4 I get matches like very, here, were or seems which occur quite often - creating a lot of noise in the search result.

It would be nice to have a way of configuring those values:

  1. on a FullTextIndex level via the FullTextIndexBuilder using something like
    .WithQueryParser(o => o.FuzzyMaxEditDistance(2).FuzzyMaxSequentialEdits(0))

  2. on a Query level
    2.1 by either intercepting the query parsing using some hook, e.g.
    .WithQueryParser(o => o.FuzzyMaxEditDistance(someContext => 2).FuzzyMaxSequentialEdits(someContext => 0))
    2.2 and/or by supplying the values with the query (similar to the nearness syntax), e.g.
    2.2.1 ?2,0term (both integers comma-separated after the ? with the first int being maxEditDistance and the second one maxSequentialEdits) or
    2.2.2 2?0term or 2?term or ?0term (maxEditDistance before the ? and maxSequentialEdits after)

  3. dynamically based on the length of the search term to account for shorter words, e.g.

.WithQueryParser(o => o
     // a maximum of a fourth of the letters may differ
    .FuzzyMaxEditDistance(termLength => termLength / 4)
     // with a maximum of a tenth of the letters edited in sequence
    .FuzzyMaxSequentialEdits(termLength => termLength / 10))

Do you have any plans for implementing something like this - one way or the other?
Please consider my syntax suggestions above as just that; I'm not opinionated on them. I only wanted to get across how I'd like to use the API to get fewer results while expanding on the existing API where I would expect it.

Thank you for your work and compliments on the API design and documentation! This library is exceptionally easy to use for its complexity.
