Add Smart Quotes canonicizer #143

Open: taserz wants to merge 1 commit into evllabs:master from taserz:feat/smart-quotes-canonicizer

taserz commented May 12, 2026

Closes #112

Adds a canonicizer that replaces Unicode smart/curly quotes with plain ASCII equivalents. Word processors automatically substitute typographic quotes for straight ones, so the same phrase can tokenize differently depending on where the text came from. The canonicizer normalizes these quotes before feature extraction.

Characters covered: curly single and double quotes (U+2018, U+2019, U+201B, U+201C, U+201D, U+201E, U+201F), double angle quotation marks (U+00AB, U+00BB), single angle quotation marks (U+2039, U+203A), and prime/double prime (U+2032, U+2033). The canonicizer shows up in the GUI, and unit tests are included.

Closes evllabs#112. Adds a canonicizer that replaces Unicode smart/curly quotes
with plain ASCII equivalents. Word processors automatically substitute
typographic quotes for straight ones, so the same phrase can tokenize
differently depending on where the text came from. This normalizes that
before feature extraction.

Characters handled:
- U+2018, U+2019, U+201B (curly single quotes, high-reversed-9) -> '
- U+201C, U+201D, U+201E, U+201F (curly double quotes, low/high-9)  -> "
- U+00AB, U+00BB (double angle quotation marks) -> "
- U+2039, U+203A (single angle quotation marks) -> '
- U+2032 (prime) -> '
- U+2033 (double prime) -> "
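
The mapping listed above can be sketched as a single pass over the characters. This is only an illustration of the technique, not the code in this PR: the class name `SmartQuotesSketch` and the method `normalize` are hypothetical, and the sketch deliberately does not extend or reproduce the project's canonicizer base class.

```java
// Hypothetical standalone sketch of the mapping described in the commit message.
// The names SmartQuotesSketch and normalize are assumptions for illustration only.
final class SmartQuotesSketch {

    private SmartQuotesSketch() {}

    // Returns a new array with every listed quote-like character replaced
    // by its ASCII equivalent; all other characters are copied unchanged.
    static char[] normalize(char[] text) {
        char[] out = new char[text.length];
        for (int i = 0; i < text.length; i++) {
            out[i] = toAscii(text[i]);
        }
        return out;
    }

    // Convenience overload for String input.
    static String normalize(String text) {
        return new String(normalize(text.toCharArray()));
    }

    private static char toAscii(char c) {
        switch (c) {
            case '\u2018': // left single quotation mark
            case '\u2019': // right single quotation mark
            case '\u201B': // single high-reversed-9 quotation mark
            case '\u2039': // single left-pointing angle quotation mark
            case '\u203A': // single right-pointing angle quotation mark
            case '\u2032': // prime
                return '\'';
            case '\u201C': // left double quotation mark
            case '\u201D': // right double quotation mark
            case '\u201E': // double low-9 quotation mark
            case '\u201F': // double high-reversed-9 quotation mark
            case '\u00AB': // left-pointing double angle quotation mark
            case '\u00BB': // right-pointing double angle quotation mark
            case '\u2033': // double prime
                return '"';
            default:
                return c;
        }
    }
}
```

Every character in the list sits in the Basic Multilingual Plane, so a one-to-one `char` replacement preserves the text length and needs no surrogate-pair handling.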

Shows up in the GUI. Unit tests included.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
taserz force-pushed the feat/smart-quotes-canonicizer branch from 16dc594 to fe1ec51 on May 12, 2026 23:48
Development

Successfully merging this pull request may close these issues.

Add Smart Quote Canonicizer
