So I recently had to implement a fast text filter for work (in browser JavaScript, so no use for your library (yet), I'm afraid), and used the same-ish insight but inverted: given that nearby characters are strongly correlated, the first and last characters of a substring should be the least correlated. Put another way: suppose two (eventually) different words match on their first n characters so far. Of all the remaining characters, character n+1 is the most likely to also match, which makes it the worst character to test next.
Statistically speaking, comparing the first and last characters therefore has a higher chance of rejecting a mismatching string early, so fewer false positives survive the cheap test (of course, I'm well aware that linear memory access patterns might have a much bigger impact in practice than this statistic).
This was not my original insight: Wojciech Muła also does this in one of his SIMD string search algorithms: http://0x80.pl/articles/simd-strfind.html#algorithm-1-generic-simd (without justification, though; he left that as an exercise for the reader).
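For illustration, here is a rough scalar sketch of that first-and-last filter in a plain substring search (this is not Muła's SIMD code, and find_first_last is just a name I made up; his version performs the same two comparisons on whole SIMD registers of candidate positions at once):

#include <stddef.h>
#include <string.h>

static char const *find_first_last(char const *hay, size_t hay_len,
                                   char const *ndl, size_t ndl_len) {
    if (ndl_len == 0 || hay_len < ndl_len) return NULL;
    char const first = ndl[0], last = ndl[ndl_len - 1];
    for (size_t i = 0; i + ndl_len <= hay_len; ++i) {
        // Cheap filter on the two least correlated characters...
        if (hay[i] != first || hay[i + ndl_len - 1] != last) continue;
        // ...and a full comparison of the middle only for the survivors.
        if (ndl_len <= 2 || memcmp(hay + i + 1, ndl + 1, ndl_len - 2) == 0)
            return hay + i;
    }
    return NULL;
}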
Looking through the code-base, I see that you first match on a four-character needle, then finish comparing the suffixes with this sz_equal function:
inline static sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    sz_string_start_t const a_end = a + length;
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
My idea could be (relatively) quickly benchmarked by modifying this function. I haven't written C++ in a while, but I think this would be a correct "first test the last character, then test the remaining string characters" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    if (length == 0) return 1; // empty strings are trivially equal
    sz_string_start_t const a_end = a + length - 1;
    // Test the statistically least correlated character first: the last one.
    if (*a_end != b[length - 1]) return 0;
    // Then compare the rest front-to-back, as before.
    while (a != a_end && *a == *b) a++, b++;
    return a_end == a;
}
Or a "compare back-to-front" variation:
sz_bool_t sz_equal(sz_string_start_t a, sz_string_start_t b, sz_size_t length) {
    if (length == 0) return 1; // empty strings are trivially equal
    sz_string_start_t a_end = a + length - 1;
    sz_string_start_t b_end = b + length - 1;
    // Walk from the last character towards the first.
    while (a_end != a && *a_end == *b_end) a_end--, b_end--;
    return *a_end == *b_end;
}
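To convince myself these rewrites still agree with the original, a quick check against memcmp along these lines should do (assuming the two variants above are renamed sz_equal_last_first and sz_equal_backward so they can coexist, and that sz_string_start_t is a plain char const * as in the header):

#include <assert.h>
#include <string.h>

int main(void) {
    // Equal-length pairs: exact matches, a last-character mismatch, a first-character mismatch.
    char const *pairs[][2] = {
        {"", ""}, {"a", "a"}, {"a", "b"},
        {"needle", "needle"}, {"needle", "needlx"}, {"xeedle", "needle"},
    };
    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; ++i) {
        sz_size_t len = strlen(pairs[i][0]);
        sz_bool_t expect = (sz_bool_t)(memcmp(pairs[i][0], pairs[i][1], len) == 0);
        assert(sz_equal_last_first(pairs[i][0], pairs[i][1], len) == expect);
        assert(sz_equal_backward(pairs[i][0], pairs[i][1], len) == expect);
    }
    return 0;
}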
I am aware that simple forward linear memory access is faster than random or backward access, so I would not be surprised if both of these functions end up slower in practice, regardless of what the statistics say. But we won't know until we try, right?
For completeness' sake, here is my JS data structure, although I don't think it is particularly relevant, as it solves a slightly different (but adjacent) problem than StringZilla does:
https://observablehq.com/@jobleonard/a-data-structure-for-table-row-filtering
The use-case is real-time filtering of a data table based on user input, keeping only the rows where any data cell contains the input string as a substring. So it is optimized more for "latency". It pre-processes the indices of all trigrams of the strings in the table and uses those as needles: for an input string it filters on the first trigram, then filters backwards from the last trigram of the input. Because trigrams are looked up from a hashmap, the forward-versus-backward access pattern concern does not apply here. It also uses previous results to narrow things down faster (e.g. if a user typed "the", then types a "y" to produce "they", the filter only searches the existing results for "the" instead of starting over).
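If anyone wants the flavour of it without clicking through, here is a loose C sketch of the idea (not the notebook's actual implementation: the notebook keeps a JS hashmap from trigram to row set, while this stand-in gives each row a sorted array of its packed trigrams, and every name below is made up):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t *tris;   // sorted, deduplicated packed trigrams of this row's text
    size_t count;
    char const *text;
} row_t;

// Pack three bytes into one 24-bit key.
static uint32_t pack3(char const *s) {
    return ((uint32_t)(unsigned char)s[0] << 16) |
           ((uint32_t)(unsigned char)s[1] << 8) |
            (uint32_t)(unsigned char)s[2];
}

static int cmp_u32(void const *a, void const *b) {
    uint32_t const x = *(uint32_t const *)a, y = *(uint32_t const *)b;
    return (x > y) - (x < y);
}

static int has_trigram(row_t const *r, uint32_t key) {
    return bsearch(&key, r->tris, r->count, sizeof key, cmp_u32) != NULL;
}

// Shrink `candidates` (indices into `rows`) to the rows that may contain `query`:
// the first and last trigrams act as a cheap prefilter, strstr confirms the match.
// Feeding the surviving indices back in on the next keystroke gives the
// incremental "only search the previous results" behaviour.
static size_t filter_rows(row_t const *rows, size_t *candidates, size_t n, char const *query) {
    size_t const query_len = strlen(query);
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        row_t const *r = &rows[candidates[i]];
        if (query_len >= 3 &&
            (!has_trigram(r, pack3(query)) ||
             !has_trigram(r, pack3(query + query_len - 3)))) continue;
        if (strstr(r->text, query) == NULL) continue; // exact confirmation
        candidates[kept++] = candidates[i];
    }
    return kept;
}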