Version 1.0.0

@santhoshtr santhoshtr released this 09 Dec 08:58
a257906

Highlights

  • This project is now fully rewritten in Rust. Bindings for Python, Node.js, and WASM are also available.
  • The rewrite keeps the API the same as before. An additional API, get_sentence_boundaries, provides low-level information about segmentation. See README.md for more details.
  • Various performance optimizations, along with the Rust rewrite, make this library faster than previous versions. Preliminary benchmarks indicate that sentencex might be the fastest rule-based sentence segmentation library with wide language support. More benchmark results will be published soon.
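
To give a feel for what "low-level information about segmentation" can mean, here is a minimal, self-contained sketch of a toy rule-based segmenter that reports each sentence as a pair of byte offsets plus character indices. The `Boundary` struct, its field names, and `toy_boundaries` are hypothetical and are not the sentencex API; README.md documents the real return type of get_sentence_boundaries.

```rust
/// Hypothetical boundary record. The field names are illustrative only
/// and are not taken from the sentencex crate.
#[derive(Debug, PartialEq)]
struct Boundary {
    start_byte: usize, // byte offset where the sentence starts
    end_byte: usize,   // byte offset one past the end of the sentence
    start_char: usize, // character index where the sentence starts
    end_char: usize,   // character index one past the end of the sentence
}

/// Toy rule: a sentence ends at '.', '!' or '?' when followed by whitespace,
/// or at the end of the text. Real rule-based segmenters also handle
/// abbreviations, quotes, references and language-specific punctuation.
fn toy_boundaries(text: &str) -> Vec<Boundary> {
    let mut out = Vec::new();
    let mut start_byte = 0;
    let mut start_char = 0;
    // enumerate() yields the character index, char_indices() the byte offset.
    let mut chars = text.char_indices().enumerate().peekable();
    while let Some((char_idx, (byte_idx, ch))) = chars.next() {
        let is_terminator = matches!(ch, '.' | '!' | '?');
        let at_end = chars.peek().is_none();
        let next_is_space = chars
            .peek()
            .map_or(false, |&(_, (_, c))| c.is_whitespace());
        if at_end || (is_terminator && next_is_space) {
            // Advance by the full UTF-8 width so the offset stays on a char boundary.
            let end_byte = byte_idx + ch.len_utf8();
            out.push(Boundary {
                start_byte,
                end_byte,
                start_char,
                end_char: char_idx + 1,
            });
            start_byte = end_byte;
            start_char = char_idx + 1;
        }
    }
    out
}
```

Slicing the input with `&text[b.start_byte..b.end_byte]` for each boundary and concatenating the slices reproduces the original text, because in this sketch the whitespace between sentences stays attached to the following sentence. Byte offsets and character indices diverge as soon as the text contains multibyte characters, which is why it is useful to expose both.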

What's Changed

  • ci: publish rust crates using OIDC-(985408c)
  • en: Remove abbreviations that are complete words-(ae36abf)
  • Fix offset calculation bug-(21f600c)
  • Fix char boundary bugs-(d40bece)
  • Expose char index to boundary struct-(6202c93)
  • ack tqsm-(42d8fce)
  • Version bump-(7494e30)
  • Fix boundary advance check for multibyte chars-(215e172)
  • Add abbreviations for Greek-(d1b3564)
  • Add tests for regexes-(23d6969)
  • Remove serde by defining fallbacks in rust-(9835b2c)
  • replace lazy_static with LazyLock-(18fbd38)
  • Add some more test cases of specific sentences-(5dcdf72)
  • major: remove is_punctuation_between_quotes and implement a boundary extension logic-(5a1156c)
  • refactor: Define type for Skippable Range-(d215d12)
  • Add test for nested range skip-(ee1840d)
  • Use ceil_char_boundary from rust 1.91 to find nearest char boundary-(48b964f)
  • Indicate sentence beginning with a *-(601d855)
  • fix: Avoid ignoring trailing whitespaces in lines-(3869889)
  • Allow spaces inside numbered reference sequences-(d623839)
  • Add script to publish python binding-(a59439d)
  • doc: Add comment for Global boundaries-(1a740ef)
  • tests: add coverage for bracketed punctuation and optional-space references; broaden parens regex to avoid false splits-(7fe3a82)
  • regex: allow optional leading space before numbered references (e.g., ' [1]')-(5adb928)
  • Add chunking logic for processing large texts-(c2ee9b8)
  • Optimize skippable range processing-(22bb619)
  • Use char_indices for more efficient character iteration-(9973a22)
  • Sort ranges by start position for more efficient lookups-(a3567cc)
  • performance: Eliminate string allocation overhead-(9be16d3)
  • Pre-allocate sentence_boundaries once and reuse for all paragraphs-(18df36a)
  • performance: improve get_last_word-(5ea3bc9)
  • performance: cache regex-(afa5ba5)
  • performance: improve boundary symbol detection-(e88fd46)
  • pre-calculate paragraph offsets in one pass-(7434bd8)
  • performance: cached regex, preallocations-(8372b5c)
  • perf: Avoid unnecessary clone for abbreviations-(9220f5a)
  • Remove warnings about deprecation-(e5454ac)
  • handle char boundary-(0f1fb0e)
  • expose low level api for boundary information-(32a738f)
  • Read content from stdin too-(32fc18c)
  • dependency updates-(92f840f)
  • minor-(3d4ed2a)
  • formatting-(c9d362a)
  • Documentation fix-(35653ae)
  • Build wasm for both web and node-(ca725d8)
  • Add wasm binding-(96e9e37)
  • Use requirejs format for node binding-(7b5113e)
  • feat: add test.html to demonstrate WASM binding usage for the library-(93cf64c)
  • feat: update Cargo.toml to include wasm bindings in workspace members-(ba1abef)
  • feat: add WASM binding for sentence segmentation library-(da62240)
  • Documentation-(81e7124)
  • feat: create examples folder with Rust, Python, and Node.js examples-(0338c1f)
  • docs: Add documentation with example usage for segment function-(9c083af)
  • docs: add documentation with example usage for segment function-(54a68eb)
  • Fix unused variable warning-(c1040eb)
  • Add nodejs binding-(217d463)
  • feat: create Node.js binding for segment functionality using Neon-(02106ba)
  • refactor: replace character check with regex in continue_in_next_word function-(dfc33bd)
  • refactor: rewrite JS regex logic in Rust for continue_in_next_word function-(902c7b1)
  • feat: add benchmarking setup for sentence segmentation in English text-(f56cb03)
  • Lazy load abbreviations - only once-(9051f09)
  • refactor: Move language mapping to a static HashMap in language_factory-(24e9af5)
  • refactor: simplify language_factory function by removing unnecessary condition-(b514a44)
  • fix: fix formatting by adding a newline in lib.rs-(0c54144)
  • refactor: simplify language_factory function by removing loop structure-(344bd2e)
  • fix: correct if condition in language_factory function-(bd8ad31)
  • refactor: replace fallback handling with recursive language_factory calls-(e9c6623)
  • refactor: update comment to clarify fallback logic in language_factory-(2c7f06e)
  • feat: add support for Telugu language in language factory-(df77f68)
  • Add fallbacks support-(de17aaf)
  • feat: add LANGUAGE_FALLBACKS from YAML file using include! macro-(9750a20)
  • refactor: simplify language segmentation by removing SentenceSegmenter struct-(57d4f81)
  • feat: export Rust function to Python with proper struct and methods-(e36a0a7)
  • refactor: simplify Python binding for sentence segmentation function-(8cdc2f1)
  • feat: implement LanguageFactory for language code resolution and fallback-(a66880a)
  • feat: add LanguageFactory for language code resolution in SentenceSegmenter-(4d22499)
  • refactor: Rewrite LANGUAGE_FALLBACKS as a Rust static variable-(8442818)
  • feat: add fallback languages support in src/languages/fallbacks.rs-(b90c506)
  • feat: add Kazakh language support file-(7b7f7e6)
  • feat: append vowels, consonants, and consonant_vowels to abbreviations-(c574419)
  • feat: add method to check punctuation between quotes in Tamil language-(529b278)
  • refactor: rewrite consonant-vowel loop in Rust for clarity and efficiency-(0be342b)
  • feat: add Tamil language support in src/languages/ta.rs-(9a62a03)
  • Refactor Language trait-(6af3d58)
  • Rename tests-(583ae9f)
  • Refactor python binding-(8c32d0f)
  • feat: include abbreviations from both ml and en files in Malayalam language-(e8311e1)
  • docs: add comment for abbreviation file usage in Malayalam struct-(877d332)
  • feat: add Python bindings for SentenceSegmenter using pyo3-(074b2a5)
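
Several of the changes above deal with UTF-8 char boundaries and with chunking large texts. As a rough illustration of why the two are related, the sketch below splits a string into chunks of roughly `max_bytes` bytes without ever cutting through a multibyte character. The helper names `next_char_boundary` and `chunk_text` are hypothetical and do not come from the sentencex source, and the loop uses the long-stable `str::is_char_boundary` rather than `ceil_char_boundary`.

```rust
/// Smallest index >= `index` that lies on a UTF-8 char boundary of `text`.
/// Indices past the end are clamped to `text.len()`.
fn next_char_boundary(text: &str, index: usize) -> usize {
    if index >= text.len() {
        return text.len();
    }
    let mut i = index;
    // A UTF-8 character is at most 4 bytes, so this loops at most 3 times.
    while !text.is_char_boundary(i) {
        i += 1;
    }
    i
}

/// Split `text` into consecutive chunks of about `max_bytes` bytes each.
/// A chunk may exceed `max_bytes` by up to 3 bytes so that it never ends
/// in the middle of a character.
fn chunk_text(text: &str, max_bytes: usize) -> Vec<&str> {
    assert!(max_bytes > 0, "max_bytes must be positive");
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let end = next_char_boundary(text, start + max_bytes);
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

Slicing a `&str` at an index that is not a char boundary panics at runtime, so any chunking or offset arithmetic on byte positions needs a rounding step like this one. A real segmenter would most likely also prefer to end chunks at paragraph or sentence breaks; this sketch does not attempt that.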

Full Changelog: v0.6.1...v1.0.0