semchunk

semchunk is a fast and lightweight Python library for splitting text into semantically meaningful chunks.

Owing to its complex yet highly efficient chunking algorithm, semchunk is both more semantically accurate than langchain.text_splitter.RecursiveCharacterTextSplitter (see How It Works 🔍) and over 90% faster than semantic-text-splitter (see the Benchmarks 📊).

Installation 📦

semchunk may be installed with pip:

pip install semchunk

Usage 👩‍💻

The code snippet below demonstrates how text can be chunked with semchunk:

import semchunk
from transformers import AutoTokenizer # Neither `transformers` nor `tiktoken` is required;
import tiktoken                        # they are imported here only for demonstration purposes.

chunk_size = 2 # A low chunk size is used here for demonstration purposes. Keep in mind that
               # `semchunk` doesn't take special tokens into account unless you're using a
               # custom token counter, so you probably want to reduce your chunk size by the
               # number of special tokens added by your tokenizer.
text = 'The quick brown fox jumps over the lazy dog.'

# As you can see below, `semchunk.chunkerify` will accept the names of all OpenAI models, OpenAI
# `tiktoken` encodings and Hugging Face models (in that order of precedence), along with custom
# tokenizers that have an `encode()` method (such as `tiktoken`, `transformers` and `tokenizers`
# tokenizers) and finally any function that can take a text and return the number of tokens in it.
chunker = semchunk.chunkerify('umarbutler/emubert', chunk_size) or \
          semchunk.chunkerify('gpt-4', chunk_size) or \
          semchunk.chunkerify('cl100k_base', chunk_size) or \
          semchunk.chunkerify(AutoTokenizer.from_pretrained('umarbutler/emubert'), chunk_size) or \
          semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
          semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

# The resulting `chunker` can take and chunk a single text or a list of texts, returning a list of
# chunks or a list of lists of chunks, respectively.
assert chunker(text) == ['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']
assert chunker([text], progress = True) == [['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']]

# If you have a large number of texts to chunk and speed is a concern, you can also enable
# multiprocessing by setting `processes` to a number greater than 1.
assert chunker([text], processes = 2) == [['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']]
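
As the comments above note, semchunk does not account for special tokens unless you supply a custom token counter. A minimal sketch of one way to compensate, assuming a Hugging Face tokenizer and counting the tokens it adds to an empty string (the same trick the chunkerify documentation below describes); the chunk size of 512 is illustrative:

tokenizer = AutoTokenizer.from_pretrained('umarbutler/emubert')
special_tokens = len(tokenizer.encode('')) # Tokens added automatically, eg, [CLS] and [SEP].
chunker = semchunk.chunkerify(tokenizer, chunk_size = 512 - special_tokens)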

Chunkerify

def chunkerify(
    tokenizer_or_token_counter: str | tiktoken.Encoding | transformers.PreTrainedTokenizer | \
                                tokenizers.Tokenizer | Callable[[str], int],
    chunk_size: int = None,
    max_token_chars: int = None,
    memoize: bool = True,
) -> Callable[[str | Sequence[str], bool, bool], list[str] | list[list[str]]]:

chunkerify() constructs a chunker that splits one or more texts into semantically meaningful chunks of a specified size as determined by the provided tokenizer or token counter.

tokenizer_or_token_counter is either: the name of a tiktoken or transformers tokenizer (with priority given to the former); a tokenizer that possesses an encode attribute (eg, a tiktoken, transformers or tokenizers tokenizer); or a token counter that returns the number of tokens in an input.

chunk_size is the maximum number of tokens a chunk may contain. It defaults to None, in which case it will be set to the same value as the tokenizer's model_max_length attribute (less the number of tokens returned by attempting to tokenize an empty string) if possible; otherwise, a ValueError will be raised.
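
For example, the following sketch (assuming the tokenizer exposes a model_max_length attribute, as Hugging Face tokenizers do) lets chunkerify infer the chunk size:

chunker = semchunk.chunkerify(AutoTokenizer.from_pretrained('umarbutler/emubert')) # `chunk_size` is inferred.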

max_token_chars is the maximum number of characters a token may contain. It is used to significantly speed up the token counting of long inputs. It defaults to None, in which case it will either not be used or will, if possible, be set to the number of characters in the longest token in the tokenizer's vocabulary, as determined by the token_byte_values or get_vocab methods.

memoize flags whether to memoize the token counter. It defaults to True.

This function returns a chunker that takes either a single text or a sequence of texts. Given a single text, the chunker returns a list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed. Given multiple texts, it returns a list of lists of chunks, with each inner list corresponding to the chunks of one of the provided input texts.

The resulting chunker can be passed a processes argument that specifies the number of processes to be used when chunking multiple texts.

It is also possible to pass a progress argument which, if set to True and multiple texts are passed, will display a progress bar.

Technically, the chunker will be an instance of the semchunk.Chunker class to assist with type hinting, though this should have no impact on how it can be used.

Chunk

def chunk(
    text: str,
    chunk_size: int,
    token_counter: Callable,
    memoize: bool = True,
) -> list[str]

chunk() splits a text into semantically meaningful chunks of a specified size as determined by the provided token counter.

text is the text to be chunked.

chunk_size is the maximum number of tokens a chunk may contain.

token_counter is a callable that takes a string and returns the number of tokens in it.

memoize flags whether to memoize the token counter. It defaults to True.

This function returns a list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed.
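
For instance, here is a sketch mirroring the earlier usage example, but calling chunk() directly with a simple whitespace token counter:

import semchunk

chunks = semchunk.chunk(
    'The quick brown fox jumps over the lazy dog.',
    chunk_size = 2,
    token_counter = lambda text: len(text.split()),
)
# Expected to match the chunker output above:
# ['The quick', 'brown', 'fox', 'jumps', 'over the', 'lazy', 'dog.']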

How It Works 🔍

semchunk works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
  4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks.
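
To make steps 1 to 3 concrete, here is a toy sketch of the recursive split-and-merge approach. It is not semchunk's actual implementation: it uses only two splitters (newlines, then spaces) and omits the character-level fallback and splitter reattachment.

def toy_chunk(text: str, chunk_size: int, count) -> list[str]:
    if count(text) <= chunk_size:
        return [text]
    # 1. Split on the most semantically meaningful splitter available.
    splitter = '\n' if '\n' in text else ' '
    splits = text.split(splitter)
    # 2. Recursively split any piece that is still over the chunk size.
    pieces = []
    for split in splits:
        pieces.extend(toy_chunk(split, chunk_size, count) if count(split) > chunk_size else [split])
    # 3. Merge adjacent pieces back together while they fit within the chunk size.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current}{splitter}{piece}' if current else piece
        if count(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

assert toy_chunk('The quick brown fox', 2, lambda t: len(t.split())) == ['The quick', 'brown fox']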

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class);
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.

Benchmarks 📊

On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.11.9, it takes semchunk 6.69 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks with GPT-4's tokenizer (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter 116.48 seconds to chunk the same texts into 512-token-long chunks. Since 6.69 / 116.48 ≈ 0.0574, semchunk is roughly 94.26% faster.

The code used to benchmark semchunk and semantic-text-splitter is available here.

Licence 📄

This library is licensed under the MIT License.


semchunk's Issues

Problem with very large files

I am trying to chunk a huge document but it runs forever. Did I miss something in my code?

File here

import semchunk                  
import pandas as pd 

df = pd.read_parquet("free_trade_agreement.parquet")
chunk_size = 100 # A low chunk size is used here for demonstration purposes.

text = df.text.item()

# As you can see below, `semchunk.chunkerify` will accept the names of all OpenAI models, OpenAI
# `tiktoken` encodings and Hugging Face models (in that order of precedence), along with custom
# tokenizers that have an `encode()` method (such as `tiktoken`, `transformers` and `tokenizers`
# tokenizers) and finally any function that can take a text and return the number of tokens in it.
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

a = chunker(text)

Referencing benbrandt/text-splitter#184 in semantic-text-splitter where I can now chunk the same document in ~2s.

Option to have overlapping chunks

Hey,

First off, I wanna say this is a pretty cool library! Thank you for the amazing work!

I'm just curious if there is an option to have overlapping chunks as part of the splitting. For example, if we have 10 sentences, it would be nice to generate chunks of 3 sentences each with an overlap of 1 sentence. Obviously, I know we can do it by splicing the returned chunks and manipulating lists, but I just thought it might be a nice feature to have!

Let me know what you think!
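
A hedged sketch of the splicing approach the request mentions: build overlapping windows over the chunks semchunk returns. The window and step sizes here are illustrative, not part of semchunk's API.

def overlapping(chunks: list[str], window: int = 3, step: int = 2) -> list[str]:
    # `step = window - overlap`; eg, windows of 3 with an overlap of 1 use a step of 2.
    out = []
    for i in range(0, len(chunks), step):
        out.append(' '.join(chunks[i:i + window]))
        if i + window >= len(chunks): # The final window has been emitted.
            break
    return out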

Memoization destroys `semchunk.chunk()`'s signature

Memoization is destroying semchunk.chunk()'s signature as described in this issue. A fix is to run:

import functools

def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=True, _recursion_depth: int = 0) -> list[str]:
    ...

chunk = functools.wraps(chunk)(functools.cache(chunk))

Instead of:

@functools.cache
def chunk(text: str, chunk_size: int, token_counter: callable, memoize: bool=True, _recursion_depth: int = 0) -> list[str]:
    ...

Divide by zero error

It is possible for a token counter to return zero tokens, leading to a division by zero error in the merge_splits function:

  File "C:\Users\user\miniconda3\envs\project\Lib\site-packages\semchunk\semchunk.py", line 78, in merge_splits
    average = cumulative_lengths[midpoint] / tokens if cumulative_lengths[midpoint] else average
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
ZeroDivisionError: division by zero
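
Until this is fixed upstream, one hedged workaround (a sketch, not an official fix) is to wrap the token counter so it never reports zero tokens:

def at_least_one_token(count):
    return lambda text: max(1, count(text))

chunker = semchunk.chunkerify(at_least_one_token(lambda text: len(text.split())), chunk_size = 512)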

Request: offer generators and async generators

Hi there!

Thanks for this neat library. I'm giving it a go.

It would be great to have two variants of the chunkerify function, one returning a generator and one returning an async generator, as well as a version that is itself async.

Use cases:

  • async evaluation is good for non-blocking situations, for example, chunking dynamically inside a web request, which in a blocking (sync) scenario would impact the backend service as a whole in some cases. Furthermore, it could allow for creating a concurrent (though not parallel) version of chunking.
  • returning a generator allows evaluating in intervals and executing code in between, for example in a for loop.
  • returning an async generator offers the same, within an async context.

The simplest (but less performant) option for implementing the async logic is to execute the sync version using something like anyio.to_thread.run_sync: https://anyio.readthedocs.io/en/stable/threads.html.
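
A minimal sketch of that suggestion, assuming anyio is installed; the chunker itself is unchanged and simply runs in a worker thread:

import anyio
import semchunk

chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size = 512)

async def chunk_async(text: str) -> list[str]:
    # Run the synchronous chunker off the event loop so it doesn't block.
    return await anyio.to_thread.run_sync(chunker, text)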
