Giter VIP home page Giter VIP logo

nanderoo / stringzilla Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ashvardanian/stringzilla

0.0 0.0 0.0 6.24 MB

10x faster string sort, search, split, and shuffle for long strings and multi-gigabyte files in Python and C, leveraging SIMD with Arm Neon and x86 AVX2 & AVX-512 intrinsics ๐Ÿฆ–

Home Page: https://ashvardanian.com/posts/stringzilla/

License: Apache License 2.0

Shell 0.90% JavaScript 1.19% C++ 80.06% Python 11.53% CMake 5.86% Dockerfile 0.46%

stringzilla's Introduction

StringZilla ๐Ÿฆ–

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" ๐Ÿ˜…

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help ๐Ÿ˜Š

Substring Search

Backend \ Device IoT Laptop Server
Speed Comparison ๐Ÿ‡
Python for loop 4 MB/s 14 MB/s 11 MB/s
C++ for loop 520 MB/s 1.0 GB/s 900 MB/s
C++ string.find 560 MB/s 1.2 GB/s 1.3 GB/s
Scalar StringZilla 2 GB/s 3.3 GB/s 3.5 GB/s
Hyper-Scalar StringZilla 4.3 GB/s 12 GB/s 12.1 GB/s
Efficiency Metrics ๐Ÿ“Š
CPU Specs 8-core ARM, 0.5 W/core 8-core Intel, 5.6 W/core 22-core Intel, 6.3 W/core
Performance/Core 2.1 - 3.3 GB/s 11 GB/s 10.5 GB/s
Bytes/Joule 4.2 GB/J 2 GB/J 1.6 GB/J

Partition & Sort

Coming soon.

Quick Start: Python ๐Ÿ

1๏ธ. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> str

Advanced Operations

  • 'substring' in text -> bool
  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)

Quick Start: C ๐Ÿ› ๏ธ

There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Perform collection level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);

Contributing ๐Ÿ‘พ

Future development plans include:

  • Faster string sorting algorithm.
  • Bindings for JavaScript, Java, and Rust.
  • Support for reverse-order operations in Python.
  • Splitting CSV rows into columns.
  • Arm SVE backend.

Here's how to set up your dev environment and run some tests.

Development

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

Benchmarking

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License ๐Ÿ“œ

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.


If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack ๐Ÿค—

stringzilla's People

Contributors

ashvardanian avatar arman-ghazaryan avatar davvard avatar semantic-release-bot avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.