semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Owing to its complex yet highly efficient chunking algorithm, semchunk is both more semantically accurate than `langchain.text_splitter.RecursiveCharacterTextSplitter` (see How It Works 🔍) and over 70% faster than `semantic-text-splitter` (see the Benchmarks 📊).
semchunk may be installed with pip:
```bash
pip install semchunk
```

The code snippet below demonstrates how text can be chunked with semchunk:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool = True,
) -> list[str]
```

`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.
`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

`memoize` flags whether to memoise the token counter. It defaults to `True`.
This function returns a list of chunks, each up to `chunk_size` tokens long, with any whitespace used to split the text removed.
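Because any token counter is accepted, no tokeniser dependency is strictly required. For instance, a naive whitespace-based word counter works too (the counter below is purely illustrative, and the exact chunk boundaries shown are indicative, since they depend on the counter used):

```python
>>> import semchunk
>>> word_counter = lambda text: len(text.split()) # Counts whitespace-delimited words as tokens.
>>> semchunk.chunk('The quick brown fox jumps over the lazy dog.', chunk_size=4, token_counter=word_counter)
['The quick brown fox', 'jumps over the lazy', 'dog.']
```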
semchunk works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached; and
- Reattaches any non-whitespace splitters to the ends of chunks (barring the final chunk) if doing so does not push those chunks over the chunk size, and otherwise adds those splitters as their own chunks.
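The sketch below illustrates this split-and-merge recursion in miniature. It is not semchunk's implementation: it assumes a hypothetical four-level splitter hierarchy, falls back to naively halving the text, and omits the reattachment of non-whitespace splitters.

```python
from typing import Callable

# An illustrative splitter hierarchy, from most to least semantically meaningful.
# semchunk's real hierarchy is richer (see the list of splitters below).
SPLITTERS = ['\n\n', '\n', '. ', ' ']

def simple_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Recursively split `text` until every chunk fits within `chunk_size` tokens."""
    if token_counter(text) <= chunk_size or len(text) <= 1:
        return [text]

    # Split using the most semantically meaningful splitter present in the text.
    splitter = next((s for s in SPLITTERS if s in text), None)
    if splitter is None:
        # No splitter is present: fall back to halving the text.
        middle = len(text) // 2
        return (simple_chunk(text[:middle], chunk_size, token_counter)
                + simple_chunk(text[middle:], chunk_size, token_counter))

    # Recursively split any pieces that still exceed the chunk size.
    chunks: list[str] = []
    for piece in text.split(splitter):
        if piece:
            chunks.extend(simple_chunk(piece, chunk_size, token_counter))

    # Merge undersized neighbouring chunks back together up to the chunk size.
    merged: list[str] = []
    for chunk in chunks:
        if merged and token_counter(f'{merged[-1]} {chunk}') <= chunk_size:
            merged[-1] = f'{merged[-1]} {chunk}'
        else:
            merged.append(chunk)
    return merged
```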
To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:
- The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's `\s` character class);
- Sentence terminators (`.`, `?`, `!` and `*`);
- Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
- Sentence interrupters (`:`, `—` and `…`);
- Word joiners (`/`, `\`, `–`, `&` and `-`); and
- All other characters.
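To make the precedence rules concrete, the hypothetical helper below returns the highest-precedence splitter found in a text, preferring the largest sequence at each whitespace level. It is only an approximation of the behaviour described above, not semchunk's actual code.

```python
import re

def find_splitter(text: str) -> str | None:
    """Return the most semantically meaningful splitter present in `text`, if any."""
    # One pattern per level of the precedence hierarchy, highest first.
    patterns = [
        r'[\r\n]+',             # Newlines and/or carriage returns.
        r'\t+',                 # Tabs.
        r'\s+',                 # Other whitespace.
        r'[.?!*]',              # Sentence terminators.
        r'[;,()\[\]“”‘’\'"`]',  # Clause separators.
        r'[:—…]',               # Sentence interrupters.
        r'[/\\–&-]',            # Word joiners.
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            # Prefer the largest sequence at this level (e.g. '\n\n' over '\n').
            return max(matches, key=len)
    # No splitter found: the caller falls back to all other characters.
    return None
```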
semchunk also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
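The `memoize` flag handles this automatically, but the same technique can be applied by hand with `functools.lru_cache`, as in the sketch below (an illustration of the approach, not semchunk's internals):

```python
from functools import lru_cache

import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

@lru_cache(maxsize=None)
def token_counter(text: str) -> int:
    """Count tokens, caching results so repeated inputs are only encoded once."""
    return len(encoder.encode(text))

# Since the counter is already memoised, `memoize=False` avoids caching twice.
chunks = semchunk.chunk('The quick brown fox jumps over the lazy dog.', chunk_size=2, token_counter=token_counter, memoize=False)
```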
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes semchunk 24.41 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks, a difference of 77.35%.
The code used to benchmark semchunk and semantic-text-splitter is available here.
This library is licensed under the MIT License.