Giter VIP home page Giter VIP logo

prefixtree's Introduction

Prefix Tree

PrefixTree

A C# Prefix Tree implementation with RTL language support and Unicode Normalization. The prefix tree is constructed via a builder and exposed to developers as an immutable data structure. The tree exposes only search semantics for runtime consumption.

Internally, the tree contains a node for each letter (UTF-32 codepoint) and termination conditions on words. For a small tree, this represents as:

image

Overview

Provides an implementation of a Prefix Tree API.

public List<string> FindMatches(string prefix)

This API takes a standard string and returns all strings that match the prefix.

In the tree pictured above, we would see the following results:

Search Returned Words
A A, AT, AM, ALL, ALT
AL ALL, ALT
C CAR

Interesting Aspects

  1. Immutable. The PrefixTree, once constructed, is immutable. This allows consumption of the tree to be done without concerns around concurrency or async patterns. All mutating operations happen at object construction, allowing for the immutable pattern.

  2. Unicode. The Prefix Tree is Unicode aware. All input strings are Unicode Form C normalized, and all comparisons are done with fully normalized strings. This is tested, both with and without normalization, using English, Russian and Arabic. Tests explicitly include Emoji's, UTF-16 Combining characters, and UTF-32 codepoints.

  3. Right-to-Left tested (RTL). The code is both Left-To-Right and Right-To-Left tested. I've never written RTL code before, and was looking to see what was needed. Plumbing that all the way through was very interesting! To confirm code, normalization and comparisons are working properly the Arabic word file is the top 1000 frequent Arabic words and the RTL nature of the tests appears to be working.

Emoji's

Emoji's can be treated as words in many scenarios. A complex emoji may have mulitple UTF-32 Codepoints and is subject to composistion rules. For use in this Prefix Tree, Emoji's are treated as words, and may be encoded into the Prefix Tree and searched by prefix. A sample enoding of 3 Emoji's sharing compound prefixes is seen here:

image

The Unit Tests validate Emoji support through theory testing. An understanding of UTF-32, UTF-16 Surrogate Pairs, and grapheme's may be helpful to understand these tests. The mechanism by which Emoji's are composed is fascinating. The applicability of the Prefix Tree on emoji selection seems clear, although a purpose build Emoji version that understands the Codepoint atributes (selectors, sequences, modifiers, qualified & fully-qualified & non-qualifiied combinations) would be better.

Future Considerations

  1. This code is inefficient. There's no caching, and memory use is unoptimized, and the obvious algorithmic changes around word frequency and common letter combinations is not examined. Nevertheless, I was able to load in the entire English Scrabble dictionary (it's in the tests) with no trouble.
  2. There are far more efficient representations of words than the "1 character per node" solution here (the ZIP / huffman coding algorithm comes immediately to mind). Given the public API surface and the immutable nature of the runtime, different implementations could be swapped out without impact.

Unit Tests

The majority of Unit Tests are XUnit Theory Tests. These allow the same Assemble/Action/Assert code to be used with multiple data set. An example is here, where you can see the data being tested is encapsulated in the Unit Test attribute.

Here we see two different tests run using the same code logic. In the first test, a search pattern of "aal" is run, and expected to return an exact result set. In the 2nd example a pattern of "zymologi" is run and expected to return 5 results.

image

This approach lets me write data oriented tests to validate logic, which increases the quality of the tests.

Testing in Codespaces

This project works in Github Codespaces. To use Codespaces, be sure to install the .Net Core Test Explorer into the VS Code container.

Once the Test Exporer is installed, the tests can be viewed, run, and debugged all in the browser. image

prefixtree's People

Contributors

cleemullins avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.