☂ Investigate alternative segmenter
Closed, Resolved | Public | 4 Estimated Story Points

Description

The segmenter currently used is ad hoc and not very sophisticated. Investigate what segmenters are available and what would be needed to use them in Wikispeech. The current method should remain as a fallback.

Alternatives

OpenNLP

This was looked at in T286984. It's written in Java, which may make it a bit harder for us to work with. There's also a note about the license of the Swedish model that makes it sound like anything generated using it needs to include a copyright notice. I'm not sure if that's correct.

sentencex

Developed by WMF. Uses language-specific rules to some extent. Uses a fallback when a language doesn't have its own implementation.

Event Timeline

Sebastian_Berlin-WMSE changed the point value for this task from 8 to 4.

After looking around for a bit, I couldn't find any segmenter that we could easily add to the extension. The ones in the description could be added as services, but then we need to take into consideration how much that would complicate development, maintenance and setup.

In general, it doesn't seem like segmenters are very complicated. They are usually some mix of searching for punctuation and checking abbreviation lists. It may be better for us to just extend the segmenter we have at the moment to cover the cases we're currently missing.
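To illustrate the "punctuation plus abbreviation list" approach, here is a minimal sketch in Python. It is not the Wikispeech segmenter or any of the segmenters mentioned in the description; the function name `segment`, the `ABBREVIATIONS` set and its contents are hypothetical, and a real list would be per language and much longer.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr.", "etc."}

# Candidate boundary: sentence-final punctuation followed by whitespace.
BOUNDARY = re.compile(r"[.!?]+(?=\s)")


def segment(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, skipping boundaries that end an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # The token that ends at the candidate boundary, e.g. "Dr." or "home."
        last_token = text[start:end].split()[-1].lower()
        if last_token in abbreviations:
            continue  # Not treated as a boundary; keep scanning.
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(segment("Dr. Smith went home. He slept, i.e. he rested! Done"))
```

A sketch like this never splits after a listed abbreviation, so a sentence that actually ends with "etc." is merged with the next one. Cases like that are the kind of thing that would need extra rules.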

I agree that using external services can easily complicate things. Do we have a list of the common segmentation issues we're currently seeing that we could start addressing?

No, but we probably should start making note of them. I think there were one or two during the user test.

Sounds good. We'll keep that in mind during further development, or perhaps create a new task for this, maybe a sort of umbrella task?

It may make more sense to add a column to Wikispeech-Text-to-Speech for the segmenter. This will be one of those ongoing things that we'll realistically not finish (i.e. create a perfect segmenter).

Sounds reasonable, I'll add that extra column.