semchunk is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Owing to its complex yet highly efficient chunking algorithm, semchunk is both more semantically accurate than `langchain.text_splitter.RecursiveCharacterTextSplitter` (see How It Works 🔍) and over 70% faster than `semantic-text-splitter` (see the Benchmarks 📊).
semchunk may be installed with pip:
```bash
pip install semchunk
```

The code snippet below demonstrates how text can be chunked with semchunk:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool = True,
) -> list[str]
```

`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.
`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

`memoize` flags whether to memoise the token counter. It defaults to `True`.
This function returns a list of chunks, each up to `chunk_size` tokens long, with any whitespace used to split the text removed.
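Because any token counter is accepted, no tokeniser dependency is strictly required. For instance, a naive whitespace-based word counter works too (the counter below is purely illustrative, and the exact chunk boundaries shown are indicative, since they depend on the counter used):

```python
>>> import semchunk
>>> word_counter = lambda text: len(text.split()) # Counts whitespace-delimited words as tokens.
>>> semchunk.chunk('The quick brown fox jumps over the lazy dog.', chunk_size=4, token_counter=word_counter)
['The quick brown fox', 'jumps over the lazy', 'dog.']
```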
semchunk works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached; and
- Reattaches any non-whitespace splitters to the ends of chunks (barring the final chunk) if doing so does not push those chunks over the chunk size, and otherwise adds those splitters as their own chunks.
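The sketch below illustrates this split-and-merge recursion in miniature. It is not semchunk's implementation: it assumes a hypothetical four-level splitter hierarchy, falls back to naively halving the text, and omits the reattachment of non-whitespace splitters.

```python
from typing import Callable

# An illustrative splitter hierarchy, from most to least semantically meaningful.
# semchunk's real hierarchy is richer (see the list of splitters below).
SPLITTERS = ['\n\n', '\n', '. ', ' ']

def simple_chunk(text: str, chunk_size: int, token_counter: Callable[[str], int]) -> list[str]:
    """Recursively split `text` until every chunk fits within `chunk_size` tokens."""
    if token_counter(text) <= chunk_size or len(text) <= 1:
        return [text]

    # Split using the most semantically meaningful splitter present in the text.
    splitter = next((s for s in SPLITTERS if s in text), None)
    if splitter is None:
        # No splitter is present: fall back to halving the text.
        middle = len(text) // 2
        return (simple_chunk(text[:middle], chunk_size, token_counter)
                + simple_chunk(text[middle:], chunk_size, token_counter))

    # Recursively split any pieces that still exceed the chunk size.
    chunks: list[str] = []
    for piece in text.split(splitter):
        if piece:
            chunks.extend(simple_chunk(piece, chunk_size, token_counter))

    # Merge undersized neighbouring chunks back together up to the chunk size.
    merged: list[str] = []
    for chunk in chunks:
        if merged and token_counter(f'{merged[-1]} {chunk}') <= chunk_size:
            merged[-1] = f'{merged[-1]} {chunk}'
        else:
            merged.append(chunk)
    return merged
```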
To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:
- The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's `\s` character class);
- Sentence terminators (`.`, `?`, `!` and `*`);
- Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
- Sentence interrupters (`:`, `—` and `…`);
- Word joiners (`/`, `\`, `–`, `&` and `-`); and
- All other characters.
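To make the precedence rules concrete, the hypothetical helper below returns the highest-precedence splitter found in a text, preferring the largest sequence at each whitespace level. It is only an approximation of the behaviour described above, not semchunk's actual code.

```python
import re

def find_splitter(text: str) -> str | None:
    """Return the most semantically meaningful splitter present in `text`, if any."""
    # One pattern per level of the precedence hierarchy, highest first.
    patterns = [
        r'[\r\n]+',             # Newlines and/or carriage returns.
        r'\t+',                 # Tabs.
        r'\s+',                 # Other whitespace.
        r'[.?!*]',              # Sentence terminators.
        r'[;,()\[\]“”‘’\'"`]',  # Clause separators.
        r'[:—…]',               # Sentence interrupters.
        r'[/\\–&-]',            # Word joiners.
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            # Prefer the largest sequence at this level (e.g. '\n\n' over '\n').
            return max(matches, key=len)
    # No splitter found: the caller falls back to all other characters.
    return None
```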
semchunk also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
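The `memoize` flag handles this automatically, but the same technique can be applied by hand with `functools.lru_cache`, as in the sketch below (an illustration of the approach, not semchunk's internals):

```python
from functools import lru_cache

import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

@lru_cache(maxsize=None)
def token_counter(text: str) -> int:
    """Count tokens, caching results so repeated inputs are only encoded once."""
    return len(encoder.encode(text))

# Since the counter is already memoised, `memoize=False` avoids caching twice.
chunks = semchunk.chunk('The quick brown fox jumps over the lazy dog.', chunk_size=2, token_counter=token_counter, memoize=False)
```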
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes semchunk 24.41 seconds to split every sample in NLTK's Gutenberg Corpus into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes semantic-text-splitter 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks, a difference of 77.35%.
The code used to benchmark semchunk and semantic-text-splitter is available here.
This library is licensed under the MIT License.