
Commit 18be4d4

Added files via upload
1 parent 78428b9 commit 18be4d4


1 file changed: +9 -0 lines changed


README.md

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,15 @@ See [http://odur.let.rug.nl/~vannoord/TextCat/](http://odur.let.rug.nl/~vannoord

 See [https://github.com/wikimedia/wikimedia-textcat](https://github.com/wikimedia/wikimedia-textcat) for a PHP port.

+## Updates
+
+Updates from the original version include:
+
+* updated to handle Unicode characters
+* modified the output to include scores (in case we want to limit based on the score)
+* pre-loaded all language models so that when processing line by line it is many times faster (a known deficiency mentioned in the comments of the original)
+* put in an alphabetic sub-sort after frequency sorting of n-grams (as noted in the comments of the original, not having this is faster, but without it, results are not unique, and can vary from run to run on the same input!), and similarly sorted outputs with the same score alphabetically
+* removed the benchmark timers (after re-shuffling some parts of the code, they weren't in a convenient location anymore, so I just took them out)

 ## Classification and Model Generation
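
The alphabetic sub-sort described in the Updates list can be illustrated with a short Python sketch. This is not the code from this commit: the function names, the `_` word padding, the `max_n=5` and `limit=400` defaults, and the assumption that a lower score is better are all hypothetical choices made for illustration. "Alphabetic" here means Python's default string ordering (by Unicode code point). The point is only that a secondary sort key makes the ranking independent of the order in which n-grams or languages happen to be stored.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count character n-grams of length 1..max_n for each whitespace-separated token.

    Tokens are padded with '_' on both sides (an assumption for this sketch).
    """
    counts = Counter()
    for token in text.split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def ranked_ngrams(counts, limit=400):
    """Rank n-grams by descending frequency, breaking ties alphabetically.

    Without the second key, n-grams with equal counts could come out in any
    order, so the top `limit` entries could differ from run to run.
    """
    return sorted(counts, key=lambda gram: (-counts[gram], gram))[:limit]


def ranked_languages(scores):
    """Order (language, score) pairs by score, then by language name.

    Assumes a lower score means a closer match (hypothetical for this sketch).
    """
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Same counts inserted in a different order give the same ranking.
    first = Counter({"b": 2, "a": 2, "c": 5})
    second = Counter({"a": 2, "c": 5, "b": 2})
    print(ranked_ngrams(first) == ranked_ngrams(second))
    print(ranked_languages({"nl": 10, "en": 10, "de": 7}))
```

The extra key costs a little sorting time, which matches the trade-off noted in the list: sorting by frequency alone is faster, but ties are then resolved arbitrarily.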

0 commit comments
