So I recently had to implement a fast text filter for work (in browser JavaScript, so no use for your library (yet), I'm afraid), and used the same-ish insight but inverted: given that nearby characters are strongly correlated, the first and last characters of a substring should be the least correlated. Put another way: suppose two (eventually) different words match on their first n characters so far. Of all the remaining characters, character n+1 is the most likely to also match, which makes it the worst character to test next.
Statistically speaking, comparing the first and last characters therefore has a higher chance of rejecting a mismatching string early, so fewer false positives survive the cheap test (of course, I'm well aware that linear memory access patterns might have a much bigger impact in practice than this statistic).
This was not my original insight: Wojciech Muła also does this in one of his SIMD string search algorithms: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd (without justification, though; he left that as an exercise for the reader).
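For illustration, here is a rough scalar sketch of that first-and-last filter in a plain substring search (this is not Muła's SIMD code, and find_first_last is just a name I made up; his version performs the same two comparisons on whole SIMD registers of candidate positions at once):

#include <stddef.h>
#include <string.h>

static char const *find_first_last(char const *hay, size_t hay_len,
                                   char const *ndl, size_t ndl_len) {
    if (ndl_len == 0 || hay_len < ndl_len) return NULL;
    char const first = ndl[0], last = ndl[ndl_len - 1];
    for (size_t i = 0; i + ndl_len <= hay_len; ++i) {
        // Cheap filter on the two least correlated characters...
        if (hay[i] != first || hay[i + ndl_len - 1] != last) continue;
        // ...and a full comparison of the middle only for the survivors.
        if (ndl_len <= 2 || memcmp(hay + i + 1, ndl + 1, ndl_len - 2) == 0)
            return hay + i;
    }
    return NULL;
}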
Looking through the code-base, I see that you first match on a four-character needle, then finish comparing the suffixes with this sz_equal function:
inline static sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t const a_end = a + length;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
My idea could be (relatively) quickly benchmarked by modifying this function. I haven't written C++ in a while, but I think this would be a correct "first test the last character, then test the remaining string characters" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    if (length == 0) return 1; // empty strings are trivially equal
    sz_string_start_t const a_end = a + length - 1;
    // Test the statistically least correlated character first: the last one.
    if (*a_end != b[length - 1]) return 0;
    // Then compare the rest front-to-back, as before.
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
Or a "compare back-to-front" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    if (length == 0) return 1; // empty strings are trivially equal
    sz_string_start_t a_end = a + length - 1;
    sz_string_start_t b_end = b + length - 1;
    // Walk from the last character towards the first.
    while (a_end != a && *a_end == *b_end) a_end--, b_end--;
    return *a_end == *b_end;
}
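To convince myself these rewrites still agree with the original, a quick check against memcmp along these lines should do (assuming the two variants above are renamed sz_equal_last_first and sz_equal_backward so they can coexist, and that sz_string_start_t is a plain char const * as in the header):

#include <assert.h>
#include <string.h>

int main(void) {
    // Equal-length pairs: exact matches, a last-character mismatch, a first-character mismatch.
    char const *pairs[][2] = {
        {"", ""}, {"a", "a"}, {"a", "b"},
        {"needle", "needle"}, {"needle", "needlx"}, {"xeedle", "needle"},
    };
    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; ++i) {
        sz_size_t len = strlen(pairs[i][0]);
        sz_bool_t expect = (sz_bool_t)(memcmp(pairs[i][0], pairs[i][1], len) == 0);
        assert(sz_equal_last_first(pairs[i][0], pairs[i][1], len) == expect);
        assert(sz_equal_backward(pairs[i][0], pairs[i][1], len) == expect);
    }
    return 0;
}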
I am aware that simple forward linear memory access is faster than random or backward access, so I would not be surprised if both of these functions end up slower in practice, regardless of what the statistics say. But we won't know until we try, right?
For completeness' sake, here is my JS data structure, although I don't think it is particularly relevant, as it solves a slightly different (but adjacent) problem than StringZilla does:
https://observablehq.com/@jobleonard/a-data-structure-for-table-row-filtering
The use-case is real-time filtering of a data table based on user input, keeping only the rows where any data cell contains the input string as a substring. So it is optimized more for "latency". It pre-processes the indices of all trigrams of the strings in the table and uses those as needles: for an input string it filters on the first trigram, then filters backwards from the last trigram of the input. Because trigrams are looked up from a hashmap, the forward-versus-backward access pattern concern does not apply here. It also uses previous results to narrow things down faster (e.g. if a user typed "the", then types a "y" to produce "they", the filter only searches the existing results for "the" instead of starting over).
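If anyone wants the flavour of it without clicking through, here is a loose C sketch of the idea (not the notebook's actual implementation: the notebook keeps a JS hashmap from trigram to row set, while this stand-in gives each row a sorted array of its packed trigrams, and every name below is made up):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t *tris;   // sorted, deduplicated packed trigrams of this row's text
    size_t count;
    char const *text;
} row_t;

// Pack three bytes into one 24-bit key.
static uint32_t pack3(char const *s) {
    return ((uint32_t)(unsigned char)s[0] << 16) |
           ((uint32_t)(unsigned char)s[1] << 8) |
            (uint32_t)(unsigned char)s[2];
}

static int cmp_u32(void const *a, void const *b) {
    uint32_t const x = *(uint32_t const *)a, y = *(uint32_t const *)b;
    return (x > y) - (x < y);
}

static int has_trigram(row_t const *r, uint32_t key) {
    return bsearch(&key, r->tris, r->count, sizeof key, cmp_u32) != NULL;
}

// Shrink `candidates` (indices into `rows`) to the rows that may contain `query`:
// the first and last trigrams act as a cheap prefilter, strstr confirms the match.
// Feeding the surviving indices back in on the next keystroke gives the
// incremental "only search the previous results" behaviour.
static size_t filter_rows(row_t const *rows, size_t *candidates, size_t n, char const *query) {
    size_t const query_len = strlen(query);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        row_t const *r = &rows[candidates[i]];
        if (query_len >= 3 &&
            (!has_trigram(r, pack3(query)) ||
             !has_trigram(r, pack3(query + query_len - 3)))) continue;
        if (strstr(r->text, query) == NULL) continue; // exact confirmation
        candidates[kept++] = candidates[i];
    }
    return kept;
}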