Handle irregularities between pySBD & pySBD + spaCy sentence output

The pySBD spaCy pipeline component uses a token-based approach: it sets `is_sent_start` to `True` or `False` on each token, based on the `Span`s obtained from pySBD character offsets. For each sentence we call `doc.char_span(start, end)`, which corresponds to the slice `doc.text[start:end]`, and the first `Token` of the resulting sentence span gets `is_sent_start` set to `True`. However, if the character indices don't map to a valid span (that is, they don't fall on token boundaries), `doc.char_span` returns `None` and no sentence start can be set for that sentence. Hence we get irregularities between pySBD and pySBD + spaCy sentence output.
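To make the failure mode concrete, here is a minimal, library-free sketch of strict character-to-token alignment. It does not call spaCy or pySBD: the whitespace tokenizer, the helper names `token_offsets` and `strict_char_span`, and the sentence offsets are all assumptions chosen for illustration.

```python
import re

def token_offsets(text):
    """(start, end) character offsets of whitespace-separated tokens.
    Stand-in tokenizer for illustration, not spaCy's tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def strict_char_span(offsets, start, end):
    """Return (first_token, last_token_exclusive), or None when the
    character range does not begin and end exactly on token boundaries."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i for i, (_, e) in enumerate(offsets)}
    if start not in starts or end not in ends:
        return None
    return starts[start], ends[end] + 1

text = "Hello world. My name is Jonas."
offsets = token_offsets(text)

# Hypothetical sentence range that includes the trailing space after "world."
misaligned = strict_char_span(offsets, 0, 13)
# The same sentence without the trailing space ends on a token boundary
aligned = strict_char_span(offsets, 0, 12)

print(misaligned)  # None
print(aligned)     # (0, 2)
```

A single extra character at the end of the range is enough to lose the whole sentence span under strict alignment.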

The inability to get a `Span` object from pySBD character offsets could be tackled by deconstructing the `Doc` object, the way the authors of [PKSHATechnology-Research/camphr](https://github.com/PKSHATechnology-Research/camphr) have done in [`get_doc_char_span`](https://github.com/PKSHATechnology-Research/camphr/blob/b00a136e96775b3aef4fa9a91fa5729308569dd0/camphr/utils.py#L66), which uses [`destruct_token`](https://github.com/PKSHATechnology-Research/camphr/blob/b00a136e96775b3aef4fa9a91fa5729308569dd0/camphr/utils.py#L57).
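As a rough illustration of recovering a span from misaligned offsets, the sketch below snaps each character range outward to the tokens it overlaps and marks the first of those tokens as a sentence start. This is not camphr's implementation; it is a simpler, swapped-in technique, and the helper names (`token_offsets`, `expand_to_tokens`, `sentence_starts`), the whitespace tokenizer, and the sentence offsets are assumptions for illustration. Recent spaCy versions appear to offer comparable behaviour through the `alignment_mode` argument of `Doc.char_span`, which is worth checking against the installed version.

```python
import re

def token_offsets(text):
    """(start, end) character offsets of whitespace-separated tokens.
    Stand-in tokenizer for illustration, not spaCy's tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def expand_to_tokens(offsets, start, end):
    """Snap a character range outward to every token it overlaps.
    Returns (first_token, last_token_exclusive), or None if it overlaps none."""
    covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    if not covered:
        return None
    return covered[0], covered[-1] + 1

def sentence_starts(offsets, sentence_ranges):
    """One boolean per token: True where a sentence begins."""
    flags = [False] * len(offsets)
    for start, end in sentence_ranges:
        span = expand_to_tokens(offsets, start, end)
        if span is not None:
            flags[span[0]] = True
    return flags

text = "Hello world. My name is Jonas."
offsets = token_offsets(text)

# Hypothetical sentence ranges; the first one includes a trailing space,
# so it would not align under strict matching.
ranges = [(0, 13), (13, 30)]
flags = sentence_starts(offsets, ranges)
print(flags)  # [True, False, True, False, False, False]
```

Snapping only helps when the sentence boundary falls between tokens. If two sentences share a single token, the token itself would have to be split, which this sketch does not attempt.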
