Add CJK datasets #24
I did:
Codecov Report
@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   36.87%   37.11%   +0.24%
==========================================
  Files          17       17
  Lines         518      520       +2
==========================================
+ Hits          191      193       +2
  Misses        327      327
Continue to review full report at Codecov.
I just rebased and pushed changes for this. I'll rebuild the Chinese, Japanese, and Korean vectors. If everything goes as expected, I'll merge.
@halfak just a thought before merging:
version=$(python -c "import mwtext; print(mwtext.__version__)")
learned_vectors: \
    datasets/arwiki-$(dump_date)-learned_vectors.$(vector_dimensions)_cell_v$(version).vec.bz2
...
I ended up adding "cjk_words" to the vector file name. I think that will work for now. But I do like the version # strategy. @5uperpalo would you submit a follow-up PR with that implemented?
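For reference, the version-number naming strategy discussed above could be sketched roughly as follows. This is only an illustration, not code from this PR: the helper name `learned_vectors_path` and the example dump date, dimension count, and version string are hypothetical, and the file name pattern is copied from the Makefile snippet in the comment above.

```python
def learned_vectors_path(wiki, dump_date, dimensions, version, directory="datasets"):
    """Build a vector file path that embeds the library version.

    Hypothetical helper: mirrors the pattern
    datasets/<wiki>-<dump_date>-learned_vectors.<dims>_cell_v<version>.vec.bz2
    """
    return (
        f"{directory}/{wiki}-{dump_date}-learned_vectors."
        f"{dimensions}_cell_v{version}.vec.bz2"
    )


# Example values are made up for illustration only.
example_path = learned_vectors_path("arwiki", "20200501", 50, "0.0.1")
print(example_path)
```

In practice the `version` argument would presumably come from `mwtext.__version__`, as in the suggestion above, so that vectors built with a different tokenizer version end up under a different file name.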