Closed
Hey team,
I trained a model on the latest enwiki dump with the following command:
wikipedia2vec train enwiki-20190701-pages-articles.xml.bz2 enwiki-20190701-300d \
--dim-size=300 \
--no-lowercase \
--min-word-count=30 \
--min-entity-count=10
As I understand it, the category flag defaults to False, so all Wikipedia category pages should be filtered out.
However, when examining the titles in the trained wikipedia2vec model, I found the following:
:Category:American actors
:Category:American architects
:Category:American film actors
...
So categories are not fully filtered.
We could change the code here by adding a filter that checks whether the title starts with :Category:.
Similarly, you might also want to filter out non-entity titles that start with:
:wikt:
:Category:
:Image:
:category:
:commons:
:Template:
:File:
...
There are quite a few suspicious non-entity titles that start with a colon (:).
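To make the suggestion concrete, here is a minimal sketch of the kind of title filter I have in mind. It is not taken from the wikipedia2vec code base: the names `NON_ENTITY_PREFIXES`, `is_non_entity_title`, and `filter_entity_titles` are hypothetical, and the prefix list only covers the examples above, so it is likely incomplete.

```python
# Hypothetical prefix list built from the examples in this issue (not exhaustive).
# Stored in lowercase so that ":Category:" and ":category:" are both matched.
NON_ENTITY_PREFIXES = (
    ":wikt:",
    ":category:",
    ":image:",
    ":commons:",
    ":template:",
    ":file:",
)


def is_non_entity_title(title: str) -> bool:
    """Return True if the title starts with one of the known non-entity prefixes."""
    # str.startswith accepts a tuple and checks every prefix in it.
    return title.strip().lower().startswith(NON_ENTITY_PREFIXES)


def filter_entity_titles(titles):
    """Keep only the titles that do not look like non-entity link targets."""
    return [t for t in titles if not is_non_entity_title(t)]


if __name__ == "__main__":
    sample = [
        ":Category:American actors",
        ":category:American architects",
        ":File:Example.jpg",
        "Barack Obama",
    ]
    print(filter_entity_titles(sample))
```

A stricter alternative would be to drop every title that starts with a colon, but I have not checked whether that would also remove legitimate entities.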