Skip to content

Conversation

@5uperpalo
Copy link
Contributor

No description provided.

@5uperpalo
Copy link
Contributor Author

I did :

  • a small cleanup in Makefile
  • added traditional -> simplified Chinese symbols conversion
  • added tok_strategy parameter to CJK models in Makefile
    @halfak take a look if I can merge it...

@5uperpalo 5uperpalo self-assigned this Feb 14, 2021
@codecov-io
Copy link

codecov-io commented Feb 14, 2021

Codecov Report

Merging #24 (9e29955) into master (ae94aa4) will increase coverage by 0.24%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #24      +/-   ##
==========================================
+ Coverage   36.87%   37.11%   +0.24%     
==========================================
  Files          17       17              
  Lines         518      520       +2     
==========================================
+ Hits          191      193       +2     
  Misses        327      327              
Impacted Files Coverage Δ
mwtext/content_transformers/wikitext2words.py 95.00% <100.00%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae94aa4...9e29955. Read the comment docs.

@5uperpalo 5uperpalo requested a review from halfak February 14, 2021 01:09
@halfak
Copy link
Member

halfak commented Feb 28, 2021

I just rebased and pushed changes for this. I'll rebuild Chinese, Japanese, and Korean vectors. If everything goes as expected, I'll merge.

@5uperpalo
Copy link
Contributor Author

@halfak just a thought before merging:
you mentioned on the call that it would be nice to have versions of the learned vectors to have some idea when/how they were made, how about adding following lines to makefile; it extracts the module version and adds it to the name of the file

version=$(python -c"import mwtext; print(mwtext.__version__)")

learned_vectors: \
datasets/arwiki-$(dump_date)-learned_vectors.$(vector_dimensions)_cell_v$(version).vec.bz2
...

@halfak
Copy link
Member

halfak commented Mar 2, 2021

I ended up adding "cjk_words" to the vector file name. I think that will work for now. But I do like the version # strategy. @5uperpalo would you submit a follow-up PR with that implemented?

@halfak halfak merged commit cd71adc into master Mar 2, 2021
@halfak halfak deleted the CJK_datasets branch March 2, 2021 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants