Add CJK datasets #24

5uperpalo · 2021-02-11T16:16:39Z

No description provided.

5uperpalo · 2021-02-14T01:09:09Z

I made the following changes:

- a small cleanup in the Makefile
- added traditional -> simplified Chinese character conversion
- added the tok_strategy parameter to the CJK models in the Makefile

@halfak, please take a look and let me know if I can merge it.
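For readers unfamiliar with the traditional -> simplified conversion mentioned above, here is a minimal sketch of the idea. It is not the code from this PR: the mapping table and the `to_simplified` name are hypothetical and cover only a handful of characters, whereas a real conversion would rely on a complete table.

```python
# Hypothetical, deliberately tiny mapping for illustration only.
# A real conversion needs a complete traditional -> simplified table.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({
    "漢": "汉",
    "語": "语",
    "學": "学",
    "體": "体",
    "國": "国",
})


def to_simplified(text: str) -> str:
    """Replace each mapped traditional character; leave all others unchanged."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


print(to_simplified("漢語 and 國學"))  # prints: 汉语 and 国学
```

Normalizing to one script before tokenization means that both spellings of a word end up as the same token, which is presumably why the tests were adjusted to expect simplified characters.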

codecov-io · 2021-02-14T01:09:19Z

Codecov Report

Merging #24 (9e29955) into master (ae94aa4) will increase coverage by 0.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   36.87%   37.11%   +0.24%     
==========================================
  Files          17       17              
  Lines         518      520       +2     
==========================================
+ Hits          191      193       +2     
  Misses        327      327

Impacted Files	Coverage Δ
mwtext/content_transformers/wikitext2words.py	`95.00% <100.00%> (+0.26%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae94aa4...9e29955.
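The percentages in the report above follow directly from the hit and line counts. The sketch below reproduces them. It assumes the percentages are truncated to two decimals rather than rounded, which is an inference from the displayed 37.11% (rounding 193/520 would give 37.12%) and not something the report states. The helper names are hypothetical.

```python
def coverage_bp(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (assumption: not rounded)."""
    return hits * 10000 // lines


def fmt(bp: int) -> str:
    """Format hundredths of a percent as a percentage string."""
    return f"{bp // 100}.{bp % 100:02d}%"


before = coverage_bp(191, 518)  # master: 191 hits out of 518 lines
after = coverage_bp(193, 520)   # this PR: 193 hits out of 520 lines

print(fmt(before), fmt(after), fmt(after - before))
```

Integer arithmetic is used so the truncation is exact and no floating-point rounding is involved.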

halfak · 2021-02-28T00:46:50Z

I just rebased and pushed changes for this. I'll rebuild Chinese, Japanese, and Korean vectors. If everything goes as expected, I'll merge.

5uperpalo · 2021-02-28T19:40:58Z

@halfak just a thought before merging:
You mentioned on the call that it would be nice to version the learned vectors, so that we have some idea of when and how they were made. How about adding the following lines to the Makefile? They extract the module version and add it to the name of the file:

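# Use make's $(shell ...) function; a bare $(python ...) is expanded as an (empty) make variable.
version := $(shell python -c "import mwtext; print(mwtext.__version__)")

learned_vectors: \
	datasets/arwiki-$(dump_date)-learned_vectors.$(vector_dimensions)_cell_v$(version).vec.bz2
...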
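To make the proposed naming scheme concrete, here is a small sketch of how the versioned file name is assembled. The function name and all argument values below are hypothetical placeholders; only the file name pattern comes from the Makefile lines above.

```python
def versioned_vectors_path(wiki: str, dump_date: str, dimensions: int, version: str) -> str:
    """Build a learned-vectors path that embeds the module version."""
    return (
        f"datasets/{wiki}-{dump_date}-learned_vectors."
        f"{dimensions}_cell_v{version}.vec.bz2"
    )


# Placeholder values for illustration; in the Makefile these come from
# dump_date, vector_dimensions, and the installed mwtext version.
print(versioned_vectors_path("arwiki", "20210201", 50, "0.0.1"))
```

With the version in the file name, vectors built by different releases no longer overwrite each other, and the name alone shows which code produced them.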

halfak · 2021-03-02T21:52:26Z

I ended up adding "cjk_words" to the vector file name. I think that will work for now. But I do like the version # strategy. @5uperpalo would you submit a follow-up PR with that implemented?

5uperpalo self-assigned this Feb 14, 2021

5uperpalo requested a review from halfak February 14, 2021 01:09

Pavol86 added 5 commits February 28, 2021 00:45

initial commit

732f030

adjusted test with simpliefied symbols

9a7701e

updated tests

233c5f2

flake 8 adjustment

a31c16c

flake 8 adjustment

d0cdda7

halfak force-pushed the CJK_datasets branch from 9e29955 to d0cdda7 Compare February 28, 2021 00:46

5uperpalo mentioned this pull request Mar 2, 2021

CJK_models wikimedia/drafttopic#57

Open

Some makefile cleanup.

82be65f

halfak merged commit cd71adc into master Mar 2, 2021

halfak deleted the CJK_datasets branch March 2, 2021 22:04


4 participants
