DeepKE is a knowledge extraction toolkit supporting low-resource, document-level and multimodal scenarios for entity, relation and attribute extraction. We provide documentation, Google Colab tutorials, an online demo, and slides for beginners.
To promote efficient Chinese knowledge graph construction, we provide DeepKE-cnSchema, a specific version of DeepKE that contains off-the-shelf models based on cnSchema. DeepKE-cnSchema supports multiple tasks, such as Chinese entity extraction and relation extraction. It can extract 50 relation types and 28 entity types, including common entity types such as person, location, city and institution, and common relation types such as ancestral home, birthplace and nationality.
For both the entity extraction and relation extraction tasks, we provide models based on RoBERTa-wwm-ext, Chinese and BERT-wwm, Chinese.
| Model | Task | Google Download | Baidu Netdisk Download |
|---|---|---|---|
| DeepKE(NER), RoBERTa-wwm-ext, Chinese | entity extraction | PyTorch | PyTorch (password: u022) |
| DeepKE(NER), BERT-wwm, Chinese | entity extraction | PyTorch | PyTorch (password: 1g0t) |
| DeepKE(NER), BiLSTM-CRF, Chinese | entity extraction | PyTorch | PyTorch (password: my4x) |
| DeepKE(RE), RoBERTa-wwm-ext, Chinese | relation extraction | PyTorch | PyTorch (password: 78pq) |
| DeepKE(RE), BERT-wwm, Chinese | relation extraction | PyTorch | PyTorch (password: 6psm) |
Users in the Chinese mainland are advised to use the Baidu Netdisk download; overseas users are advised to use the Google download.
For the entity extraction model, take the PyTorch version of DeepKE(NER), RoBERTa-wwm-ext, Chinese as an example. After downloading, the following model files are obtained:
checkpoints_robert
|- added_tokens.json # added tokens
|- config.json # config
|- eval_results.txt # evaluation results
|- model_config.json # model config
|- pytorch_model.bin # model
|- special_tokens_map.json # special tokens map
|- tokenizer_config.bin # tokenizer config
|- vocab.txt # vocabulary
Here, config.json and vocab.txt are identical to those of the original Google RoBERTa-wwm-ext, Chinese. The PyTorch version contains the files pytorch_model.bin, config.json and vocab.txt.
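As a quick sanity check after unpacking, the three files named above can be verified with a few lines of Python. This is a minimal sketch, not part of DeepKE: the helper name `missing_checkpoint_files` is hypothetical, and the directory name is simply the one shown in the listing.

```python
from pathlib import Path

# File names taken from the PyTorch checkpoint description above.
REQUIRED_FILES = ("pytorch_model.bin", "config.json", "vocab.txt")


def missing_checkpoint_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "checkpoints_robert" is the directory name from the listing above;
    # adjust the path to wherever the archive was unpacked.
    missing = missing_checkpoint_files("checkpoints_robert")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Checkpoint directory looks complete.")
```

An empty result means the directory contains everything the PyTorch version is described as shipping; it does not validate the file contents.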
For the relation extraction model, take the PyTorch version of DeepKE(RE), RoBERTa-wwm-ext, Chinese as an example. The downloaded model is a single pth file.
After downloading the models, users can load them directly to extract entities and relations.
We have conducted experiments on Chinese named entity recognition and relation extraction datasets. The experimental results are as follows:
DeepKE leverages chinese-bert-wwm and chinese-roberta-wwm-ext to train and obtain the DeepKE-cnSchema(NER) model. The hyper-parameters used in the model are predefined. After training, we obtain the following results:
| Model | P | R | F1 |
|---|---|---|---|
| DeepKE(NER), RoBERTa-wwm-ext, Chinese | 0.8028 | 0.8612 | 0.8310 |
| DeepKE(NER), BERT-wwm, Chinese | 0.7841 | 0.8587 | 0.8197 |
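The F1 column of the NER table can be cross-checked against the P and R columns. The sketch below assumes F1 here is the usual harmonic mean of precision and recall, which the reported NER numbers are consistent with; the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the NER results table above.
ner_results = {
    "DeepKE(NER), RoBERTa-wwm-ext, Chinese": (0.8028, 0.8612),
    "DeepKE(NER), BERT-wwm, Chinese": (0.7841, 0.8587),
}

for model, (p, r) in ner_results.items():
    print(f"{model}: F1 = {f1_score(p, r):.4f}")
```

Rounded to four decimals, this gives 0.8310 and 0.8197, matching the table.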
DeepKE leverages chinese-bert-wwm and chinese-roberta-wwm-ext to train and obtain the DeepKE-cnSchema(RE) model. The hyper-parameters used in the model are predefined. After training, we obtain the following results:
| Model | P | R | F1 |
|---|---|---|---|
| DeepKE(RE), RoBERTa-wwm-ext, Chinese | 0.7890 | 0.7370 | 0.7327 |
| DeepKE(RE), BERT-wwm, Chinese | 0.7861 | 0.7506 | 0.7473 |
DeepKE-cnSchema is an off-the-shelf version that supports Chinese knowledge graph construction. cnSchema is developed for Chinese information processing and uses advanced knowledge graph, natural language processing and machine learning technologies. It integrates structured text data, supports rapid domain knowledge modeling and automatic processing of open data across data sources, domains and languages, and provides schema-level support and services for emerging application markets such as intelligent robots, semantic search and intelligent computing. Currently, the schema types supported by DeepKE-cnSchema are as follows:
After the aforementioned trained models are downloaded, entities and their relations in a text can be extracted together. If there are more than two entities in one sentence, some predicted entity pairs may be incorrect because these entity pairs do not appear in the training sets and need further extraction. The detailed steps are as follows:
- In `conf`, modify `text` in `predict.yaml` to the sentence to be predicted, `nerfp` to the directory of the trained NER model, and `refp` to the directory of the trained RE model.
- Predict by running `python predict.py`.
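To make the first step concrete, the sketch below prints the three keys as YAML lines that could be copied into `predict.yaml`. It is an illustration under assumptions: the helper `render_predict_conf` is hypothetical, the two model paths are placeholders, and the real `predict.yaml` may contain other keys that should be left untouched.

```python
import json


def render_predict_conf(text, nerfp, refp):
    """Render the three keys named above as YAML 'key: value' lines."""
    values = {"text": text, "nerfp": nerfp, "refp": refp}
    # JSON double-quoted strings are also valid YAML scalars, so json.dumps
    # provides safe quoting; ensure_ascii=False keeps Chinese text readable.
    return "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in values.items()
    )


print(render_predict_conf(
    text="此外网易云平台还上架了一系列歌曲,其中包括田馥甄的《小幸运》等",
    nerfp="/path/to/ner_model",  # placeholder path, not from the project
    refp="/path/to/re_model",    # placeholder path, not from the project
))
```

Printing the fragment instead of overwriting the file is deliberate, so that any other settings already present in the configuration are preserved.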
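Many results will be output. Take the input text 此外网易云平台还上架了一系列歌曲,其中包括田馥甄的《小幸运》等 ("In addition, the NetEase Cloud platform has also released a series of songs, including Hebe Tien's 'A Little Happiness'") as an example.

(1) Output the result of NER:

    [('田', 'B-YAS'), ('馥', 'I-YAS'), ('甄', 'I-YAS'), ('小', 'B-QEE'), ('幸', 'I-QEE'), ('运', 'I-QEE')]

(2) Output the processed result (人物 means person, 歌曲 means song):

    {'田馥甄': '人物', '小幸运': '歌曲'}

(3) Output the result of RE (the relation between "田馥甄" and "小幸运" in the sentence is "歌手", i.e. singer, with a confidence of 0.92):

    "田馥甄" 和 "小幸运" 在句中关系为:"歌手",置信度为0.92。

(4) Output the result as jsonld:

    {
      "@context": {
        "歌手": "https://cnschema.openkg.cn/item/%E6%AD%8C%E6%89%8B/16693#viewPageContent"
      },
      "@id": "田馥甄",
      "歌手": {
        "@id": "小幸运"
      }
    }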
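The post-processing between these outputs can be sketched in Python. This is not DeepKE's own code: `merge_bio` and `to_jsonld` are hypothetical helpers, and the tag-to-type mapping contains only the two tags visible in the example, whereas the full cnSchema tag set is larger.

```python
import json

# Mapping read off the example above (assumption: only these two tags).
TAG_TO_TYPE = {"YAS": "人物", "QEE": "歌曲"}


def merge_bio(tagged_chars):
    """Merge (character, BIO tag) pairs into an {entity text: type} dict."""
    entities = {}
    text, tag = "", None
    for char, label in tagged_chars:
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == tag and text:
            text += char  # continue the current entity
            continue
        if text:  # the current entity ends here, so store it
            entities[text] = TAG_TO_TYPE.get(tag, tag)
        if prefix == "B":
            text, tag = char, name  # start a new entity
        else:
            text, tag = "", None  # "O" or a stray "I-" tag
    if text:
        entities[text] = TAG_TO_TYPE.get(tag, tag)
    return entities


def to_jsonld(head, relation, tail, relation_url):
    """Build a JSON-LD document for one (head, relation, tail) triple."""
    return {
        "@context": {relation: relation_url},
        "@id": head,
        relation: {"@id": tail},
    }


ner_output = [("田", "B-YAS"), ("馥", "I-YAS"), ("甄", "I-YAS"),
              ("小", "B-QEE"), ("幸", "I-QEE"), ("运", "I-QEE")]
print(merge_bio(ner_output))

doc = to_jsonld(
    "田馥甄", "歌手", "小幸运",
    "https://cnschema.openkg.cn/item/%E6%AD%8C%E6%89%8B/16693#viewPageContent",
)
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

With the example input, `merge_bio` reproduces the processed result in (2), and `to_jsonld` reproduces the structure shown in (4).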
If the resources or technologies in this project are helpful to your research, you are welcome to cite the following paper:
@inproceedings{DBLP:conf/emnlp/ZhangXTYYQXCLL22,
author = {Ningyu Zhang and
Xin Xu and
Liankuan Tao and
Haiyang Yu and
Hongbin Ye and
Shuofei Qiao and
Xin Xie and
Xiang Chen and
Zhoubo Li and
Lei Li},
editor = {Wanxiang Che and
Ekaterina Shutova},
title = {DeepKE: {A} Deep Learning Based Knowledge Extraction Toolkit for Knowledge
Base Population},
booktitle = {Proceedings of the The 2022 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2022 - System Demonstrations, Abu Dhabi,
UAE, December 7-11, 2022},
pages = {98--108},
publisher = {Association for Computational Linguistics},
year = {2022},
url = {https://aclanthology.org/2022.emnlp-demos.10},
timestamp = {Thu, 23 Mar 2023 16:56:00 +0100},
biburl = {https://dblp.org/rec/conf/emnlp/ZhangXTYYQXCLL22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

The contents of this project are only for technical research reference and shall not be used as any conclusive basis. Users can freely use the model within the scope of the license, but we are not responsible for the direct or indirect losses caused by the use of the project.
If you have any questions, please submit them as a GitHub issue.
