Tags: wikimedia/wikimedia-textcat
Update PHP TextCat Models to 10K n-grams
Update all LM/ and LM-query/ models to 10K n-grams. The number of spaces
('_') counted in the LM models has gone down by 2 for every model, but
this does not change the rank statistics for any model.
Update lm2php.php to handle the slightly changed Perl model format (a
stray space was removed).
Add a couple of test cases that differ by model size above 5K (previous
max).
Bug: T155672
Change-Id: If35912574e833a677459531f994ae95f314b042d
Add newly validated query-based language models to TextCat
Add validated models for language ID of queries that are not already
present. Includes Czech, Indonesian, Italian, Japanese, Dutch, Polish,
Portuguese, Swedish, Turkish, Ukrainian, and Vietnamese.
Bug: T121539
Change-Id: I44cb67fe411de32c9b0848058ef18cc95e83231f
Create Wiki-Text-based language models for TextCat
Moved existing query-based models to LM-query/. Created 70 new models
based on random articles from the relevant Wikipedia. Minor updates to
PHP code, including changing the output join text to "OR" so as not to
conflict with the language model "or.lm". Major updates to README.md.
These models have not been evaluated (see T121539), but are made
available as is.
BUG: T121545
Change-Id: I772670f2fa97dfe3981fd139ea40c62f921ccda7