Introduction

DeepKE is a knowledge extraction toolkit that supports low-resource, document-level and multimodal scenarios for entity, relation and attribute extraction. We provide documentation, Google Colab tutorials, an online demo, and slides for beginners.

To promote efficient Chinese knowledge graph construction, we provide DeepKE-cnSchema, a specific version of DeepKE that contains off-the-shelf models based on cnSchema. DeepKE-cnSchema supports multiple tasks, such as Chinese entity extraction and relation extraction. It can extract 50 relation types and 28 entity types, including common entity types such as person, location, city and institution, and common relation types such as ancestral home, birthplace and nationality.

Chinese Model Download

For both the entity extraction and relation extraction tasks, we provide models based on RoBERTa-wwm-ext, Chinese and on BERT-wwm, Chinese.

| Model | Task | Google Download | Baidu Netdisk Download |
| --- | --- | --- | --- |
| DeepKE(NER), RoBERTa-wwm-ext, Chinese | entity extraction | PyTorch | PyTorch (password: u022) |
| DeepKE(NER), BERT-wwm, Chinese | entity extraction | PyTorch | PyTorch (password: 1g0t) |
| DeepKE(NER), BiLSTM-CRF, Chinese | entity extraction | PyTorch | PyTorch (password: my4x) |
| DeepKE(RE), RoBERTa-wwm-ext, Chinese | relation extraction | PyTorch | PyTorch (password: 78pq) |
| DeepKE(RE), BERT-wwm, Chinese | relation extraction | PyTorch | PyTorch (password: 6psm) |

Instructions

We recommend the Baidu Netdisk download for users in mainland China and the Google download for users elsewhere.

As for the entity extraction model, take the PyTorch version of DeepKE(NER), RoBERTa-wwm-ext, Chinese as an example. After downloading, the following model files are obtained:

checkpoints_robert
    |- added_tokens.json          # added tokens
    |- config.json                # config
    |- eval_results.txt           # evaluation results
    |- model_config.json          # model config
    |- pytorch_model.bin          # model
    |- special_tokens_map.json    # special tokens map
    |- tokenizer_config.bin       # tokenizer config
    |- vocab.txt                  # vocabulary

where config.json and vocab.txt are identical to those of the original RoBERTa-wwm-ext, Chinese. The PyTorch version contains the files pytorch_model.bin, config.json and vocab.txt.

As for the relation extraction model, take the PyTorch version of DeepKE(RE), RoBERTa-wwm-ext, Chinese as an example. The downloaded model is a single .pth file.

After downloading the models, users can load them directly to extract entities and relations, as described under Quick Load.
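Before loading the model, it can help to confirm that the unpacked directory is complete. The following is a minimal sketch, not part of DeepKE: the helper name `missing_checkpoint_files` is hypothetical, and the file names are simply copied from the listing above.

```python
from pathlib import Path

# File names copied from the directory listing above. The helper itself is
# a hypothetical convenience and is not provided by DeepKE.
EXPECTED_FILES = [
    "added_tokens.json",
    "config.json",
    "eval_results.txt",
    "model_config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.bin",
    "vocab.txt",
]


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the archive was unpacked into ./checkpoints_robert.
    missing = missing_checkpoint_files("checkpoints_robert")
    print("missing files:", missing if missing else "none")
```

An empty list means every file from the listing is present; any names returned point to an incomplete download or extraction.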

Datasets and Chinese Baseline Performance

Datasets

We have conducted experiments on Chinese named entity recognition and relation extraction datasets. The experimental results are as follows:

Named Entity Recognition(NER)

DeepKE leverages chinese-bert-wwm and chinese-roberta-wwm-ext to train the DeepKE-cnSchema(NER) model. The hyper-parameters used in the model are predefined. Training yields the following results:

| Model | P | R | F1 |
| --- | --- | --- | --- |
| DeepKE(NER), RoBERTa-wwm-ext, Chinese | 0.8028 | 0.8612 | 0.8310 |
| DeepKE(NER), BERT-wwm, Chinese | 0.7841 | 0.8587 | 0.8197 |
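As a quick sanity check on the NER table, F1 can be recomputed from the reported precision and recall. This sketch assumes that the NER F1 is the harmonic mean of the listed P and R; the README does not state the averaging scheme, so treat that as an assumption. The function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the NER table above.
ner_rows = {
    "DeepKE(NER), RoBERTa-wwm-ext, Chinese": (0.8028, 0.8612),
    "DeepKE(NER), BERT-wwm, Chinese": (0.7841, 0.8587),
}

for model, (p, r) in ner_rows.items():
    print(f"{model}: F1 = {f1_score(p, r):.4f}")
# DeepKE(NER), RoBERTa-wwm-ext, Chinese: F1 = 0.8310
# DeepKE(NER), BERT-wwm, Chinese: F1 = 0.8197
```

Both values agree with the F1 column of the NER table to four decimal places.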

Relation Extraction(RE)

DeepKE leverages chinese-bert-wwm and chinese-roberta-wwm-ext to train the DeepKE-cnSchema(RE) model. The hyper-parameters used in the model are predefined. Training yields the following results:

| Model | P | R | F1 |
| --- | --- | --- | --- |
| DeepKE(RE), RoBERTa-wwm-ext, Chinese | 0.7890 | 0.7370 | 0.7327 |
| DeepKE(RE), BERT-wwm, Chinese | 0.7861 | 0.7506 | 0.7473 |

Supported Knowledge Schema Types

DeepKE-cnSchema is an off-the-shelf version that supports Chinese knowledge graph construction. CnSchema is developed for Chinese information processing and uses advanced knowledge graph, natural language processing and machine learning technologies. It integrates structured text data, supports rapid domain knowledge modeling and automatic processing of open data across data sources, domains and languages, and provides schema-level support and services for emerging application markets such as intelligent robots, semantic search and intelligent computing. Currently, the schema types supported by DeepKE-cnSchema are as follows:

Quick Load

After the trained models above are downloaded, the entities in a text and the relations between them can be extracted together. If a sentence contains more than two entities, some predicted entity pairs may be incorrect, because these pairs do not appear in the training sets and need to be extracted further. The detailed steps are as follows:

  1. In conf, edit predict.yaml: set text to the sentence to be predicted, nerfp to the directory of the trained NER model, and refp to the directory of the trained RE model.

  2. Predict

    python predict.py

    The script prints several results. Take the input text 此外网易云平台还上架了一系列歌曲,其中包括田馥甄的《小幸运》等 as an example.

    (1) Output the result of NER: [('田', 'B-YAS'), ('馥', 'I-YAS'), ('甄', 'I-YAS'), ('小', 'B-QEE'), ('幸', 'I-QEE'), ('运', 'I-QEE')]

    (2) Output the processed result: {'田馥甄': '人物', '小幸运': '歌曲'}

    (3) Output the result of RE: "田馥甄" 和 "小幸运" 在句中关系为:"歌手",置信度为0.92。

    (4) Output the result as JSON-LD:

       {
         "@context": {
           "歌手": "https://cnschema.openkg.cn/item/%E6%AD%8C%E6%89%8B/16693#viewPageContent"
         },
         "@id": "田馥甄",
         "歌手": {
           "@id": "小幸运"
         }
       }
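The post-processing between outputs (1), (2) and (4) can be sketched in a few lines. This is a minimal illustration and not DeepKE's own code: `merge_bio_tags` and `to_jsonld` are hypothetical helpers, and the tag-to-type mapping contains only the two labels that can be inferred from the sample output above (YAS as 人物, QEE as 歌曲).

```python
import json

# Inferred from the sample output above; the full cnSchema label set is
# larger and is not reproduced here (assumption).
TAG_TO_TYPE = {"YAS": "人物", "QEE": "歌曲"}


def merge_bio_tags(tagged_chars):
    """Group (character, B-/I- tag) pairs into {entity_text: entity_type}."""
    entities = {}
    text, label = "", None

    def flush():
        # Store the entity collected so far, if any.
        if text:
            entities[text] = TAG_TO_TYPE.get(label, label)

    for char, tag in tagged_chars:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B":
            flush()
            text, label = char, tag_label
        elif prefix == "I" and tag_label == label:
            text += char
        else:
            # "O" or an inconsistent tag ends the current entity.
            flush()
            text, label = "", None
    flush()
    return entities


def to_jsonld(head, relation, tail, relation_url):
    """Build a JSON-LD object shaped like output (4) above."""
    return {
        "@context": {relation: relation_url},
        "@id": head,
        relation: {"@id": tail},
    }


if __name__ == "__main__":
    ner_output = [("田", "B-YAS"), ("馥", "I-YAS"), ("甄", "I-YAS"),
                  ("小", "B-QEE"), ("幸", "I-QEE"), ("运", "I-QEE")]
    print(merge_bio_tags(ner_output))
    url = "https://cnschema.openkg.cn/item/%E6%AD%8C%E6%89%8B/16693#viewPageContent"
    doc = to_jsonld("田馥甄", "歌手", "小幸运", url)
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps the Chinese entity and relation names readable in the serialized JSON-LD instead of escaping them.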

Citation

If the resources or technologies in this project are helpful to your research, please cite the following paper:

@inproceedings{DBLP:conf/emnlp/ZhangXTYYQXCLL22,
  author    = {Ningyu Zhang and
               Xin Xu and
               Liankuan Tao and
               Haiyang Yu and
               Hongbin Ye and
               Shuofei Qiao and
               Xin Xie and
               Xiang Chen and
               Zhoubo Li and
               Lei Li},
  editor    = {Wanxiang Che and
               Ekaterina Shutova},
  title     = {DeepKE: {A} Deep Learning Based Knowledge Extraction Toolkit for Knowledge
               Base Population},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural
               Language Processing, {EMNLP} 2022 - System Demonstrations, Abu Dhabi,
               UAE, December 7-11, 2022},
  pages     = {98--108},
  publisher = {Association for Computational Linguistics},
  year      = {2022},
  url       = {https://aclanthology.org/2022.emnlp-demos.10},
  timestamp = {Thu, 23 Mar 2023 16:56:00 +0100},
  biburl    = {https://dblp.org/rec/conf/emnlp/ZhangXTYYQXCLL22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Disclaimers

The contents of this project are only for technical research reference and shall not be used as any conclusive basis. Users can freely use the model within the scope of the license, but we are not responsible for the direct or indirect losses caused by the use of the project.

Problem Feedback

If you have any questions, please submit them as a GitHub issue.