Eliminate cleaners/core import time bottleneck by aseembits93 · Pull Request #4167 · Unstructured-IO/unstructured

aseembits93 · 2026-01-07T03:44:25Z

Previously, the code iterated over roughly 1.1 million Unicode code points (the full range up to sys.maxunicode) and checked whether each one was Unicode punctuation. This was expensive and took about 350 ms at import time.
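For readers who want to reproduce the cost, here is a minimal sketch of that kind of import-time scan. The names `build_punctuation_table`, `PUNCT_TABLE`, and `remove_punctuation` are illustrative assumptions, not necessarily the identifiers used in `cleaners/core`.

```python
import sys
import unicodedata


def build_punctuation_table() -> dict[int, None]:
    """Scan every code point and keep those whose category starts with 'P'."""
    return dict.fromkeys(
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("P")
    )


# Built at module import, so every importer pays for the full scan.
PUNCT_TABLE = build_punctuation_table()


def remove_punctuation(text: str) -> str:
    # str.translate deletes characters whose code point maps to None.
    return text.translate(PUNCT_TABLE)
```

Because the table is built at module level, the scan runs once per process whether or not punctuation removal is ever called.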

The scan yields only about 750 punctuation characters, so we hard-coded them in the source code itself. This eliminates the runtime calculation.

To ensure correctness, we added a test.
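A sketch of what such a test might look like, restricted to ASCII to keep the hard-coded set short. The names are hypothetical and this is not the test from the PR. It uses a subset check on the assumption that the hard-coded list may be a superset spanning several Python versions, as the commit title "unicode superset to cover python 3.10-3.12" suggests.

```python
import unicodedata

# Hard-coded ASCII code points whose Unicode category starts with "P".
# (Illustrative subset; a real list would cover the whole Unicode range.)
HARDCODED_ASCII_PUNCT = frozenset(map(ord, "!\"#%&'()*,-./:;?@[\\]_{}"))


def compute_ascii_punct() -> set[int]:
    """Recompute the ASCII punctuation set from unicodedata at runtime."""
    return {
        cp for cp in range(128) if unicodedata.category(chr(cp)).startswith("P")
    }


def test_hardcoded_covers_runtime() -> None:
    # Every code point the runtime scan finds must be in the hard-coded set.
    missing = compute_ascii_punct() - HARDCODED_ASCII_PUNCT
    assert not missing, f"missing from hard-coded set: {sorted(missing)}"
```

The scan still runs, but only inside the test suite, so the import path no longer pays for it.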

Before and after: [images not preserved]

qued · 2026-01-08T18:37:00Z

I don't think the gains here (1/3 second at import time) are worth putting data in a code file.

misrasaurabh1 · 2026-01-08T19:15:42Z

@qued OK, there are a couple of other approaches: storing the list in a separate data file, or reducing the sys.maxunicode range to a group of smaller ranges and checking for punctuation membership within those. Let me know which approach works better.
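To make the second option concrete, here is a rough sketch of scanning only selected blocks instead of the full range. `CANDIDATE_RANGES` is a hypothetical and deliberately incomplete list; a real version would need every block that contains punctuation.

```python
import unicodedata

# Hypothetical, incomplete list of half-open (start, stop) blocks to scan.
CANDIDATE_RANGES = [
    (0x0000, 0x0080),  # Basic Latin
    (0x2000, 0x2070),  # General Punctuation
    (0x3000, 0x3040),  # CJK Symbols and Punctuation
]


def build_table_from_ranges(ranges: list[tuple[int, int]]) -> dict[int, None]:
    """Keep category-'P' code points found inside the given ranges only."""
    return dict.fromkeys(
        cp
        for start, stop in ranges
        for cp in range(start, stop)
        if unicodedata.category(chr(cp)).startswith("P")
    )


RANGE_TABLE = build_table_from_ranges(CANDIDATE_RANGES)
```

This scans a few hundred code points rather than the whole range, at the cost of maintaining the range list as Unicode adds punctuation in new blocks.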

qued · 2026-01-08T23:34:29Z

@misrasaurabh1 My first instinct is to use a data file, but does the file IO to load it erode all the gains from not checking all the unicode characters?
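One way to answer that question empirically is to time both paths side by side. The sketch below assumes a JSON data file with a hypothetical name; it is not code from the PR, and no timing results are claimed here.

```python
import json
import sys
import tempfile
import time
import unicodedata
from pathlib import Path


def scan_punctuation() -> list[int]:
    """Full scan over all code points (the original, slow path)."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("P")
    ]


def load_punctuation(path: Path) -> list[int]:
    """Read a precomputed list of code points from a JSON data file."""
    return json.loads(path.read_text(encoding="utf-8"))


def compare_timings() -> tuple[float, float, bool]:
    """Return (scan seconds, load seconds, whether both results match)."""
    t0 = time.perf_counter()
    scanned = scan_punctuation()
    scan_seconds = time.perf_counter() - t0

    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "unicode_punctuation.json"  # hypothetical name
        data_file.write_text(json.dumps(scanned), encoding="utf-8")

        t0 = time.perf_counter()
        loaded = load_punctuation(data_file)
        load_seconds = time.perf_counter() - t0

    return scan_seconds, load_seconds, loaded == scanned
```

Note that the file is read immediately after being written, so this measures a warm-cache load; a cold start on a slow disk could look different.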

misrasaurabh1 · 2026-01-08T23:38:43Z

We can experiment with the other approaches and check.

aseembits93 added 2 commits January 6, 2026 19:25

cache unicode punctuation indices

2363c2e

linter fixes

3d3de3e

aseembits93 changed the title ~~Eliminate cleaners/core import time bottlenec~~ Eliminate cleaners/core import time bottleneck Jan 7, 2026

aseembits93 added 4 commits January 6, 2026 19:51

Merge branch 'main' into unicode-punct-indices-optimization

f469610

changelog and version

d45c4f3

formatting fix

9059892

unicode superset to cover python 3.10-3.12

b3a2636

aseembits93 wants to merge 6 commits into Unstructured-IO:main from misrasaurabh1:unicode-punct-indices-optimization

3 participants
