English | 简体中文
IFAformer (AAAI SA'23) is a novel dual Multimodal Transformer model with implicit feature alignment, which uses the Transformer structure uniformly in both the visual and textual modalities without explicitly designing a modal alignment structure (details in https://arxiv.org/pdf/2211.07504.pdf).
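For intuition only, here is a toy sketch of the implicit-alignment idea, not the authors' implementation: each modality attends over the concatenation of both modalities' token sequences, so alignment emerges from attention itself rather than from a dedicated alignment module. All class names, dimensions and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImplicitAlignmentLayer(nn.Module):
    """Toy cross-modal attention layer (illustrative, not IFAformer itself)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        # nn.MultiheadAttention expects (seq_len, batch, dim) tensors here.
        self.text_attn = nn.MultiheadAttention(dim, heads)
        self.vis_attn = nn.MultiheadAttention(dim, heads)

    def forward(self, text_tokens, visual_tokens):
        # Each modality queries the concatenation of both token sequences,
        # so no explicit alignment structure is needed.
        mixed = torch.cat([text_tokens, visual_tokens], dim=0)
        text_tokens = text_tokens + self.text_attn(text_tokens, mixed, mixed)[0]
        visual_tokens = visual_tokens + self.vis_attn(visual_tokens, mixed, mixed)[0]
        return text_tokens, visual_tokens

# Example: 32 text tokens and 49 visual patch tokens, batch size 2.
layer = ImplicitAlignmentLayer()
t, v = layer(torch.randn(32, 2, 768), torch.randn(49, 2, 768))
```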
The overall experimental results of IFAformer on the Multi-Modal RE task are as follows:
| Modality | Methods | Acc | Precision | Recall | F1 |
|---|---|---|---|---|---|
| text | PCNN* | 73.36 | 69.14 | 43.75 | 53.59 |
| text | BERT* | 71.13 | 58.51 | 60.16 | 59.32 |
| text | MTB* | 75.34 | 63.28 | 65.16 | 64.20 |
| text+image | BERT+SG+Att | 74.59 | 60.97 | 66.56 | 63.64 |
| text+image | ViLBERT | 74.89 | 64.50 | 61.86 | 63.61 |
| text+image | MEGA | 76.15 | 64.51 | 68.44 | 66.41 |
| Ours | Vanilla IFAformer | 87.75 | 69.90 | 68.11 | 68.99 |
| Ours | w/o Text Attn. | 76.21 | 66.95 | 61.72 | 64.23 |
| Ours | w/ Visual Objects | 92.38 | 82.59 | 80.78 | 81.67 |
- python == 3.8
- torch == 1.5
- transformers == 3.4.0
- hydra-core == 1.0.6
- deepke
**Attention!**
Here `transformers == 3.4.0` is the environment requirement of DeepKE as a whole, but loading the `openai/clip-vit-base-patch32` model used in the multimodal parts actually requires `transformers == 4.11.3`. We therefore recommend downloading the pretrained model from Hugging Face and loading it from a local path, as in the sketch below.
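For example, a minimal sketch of loading the model from a local copy (the directory path below is a hypothetical placeholder; download `openai/clip-vit-base-patch32` from Hugging Face first):

```python
from transformers import CLIPModel, CLIPProcessor

# Hypothetical local directory containing a pre-downloaded copy of
# openai/clip-vit-base-patch32 (loading it requires transformers 4.x).
LOCAL_CLIP_PATH = "./pretrained/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(LOCAL_CLIP_PATH)
processor = CLIPProcessor.from_pretrained(LOCAL_CLIP_PATH)
```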
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/multimodal
```

- Create and enter a Python virtual environment.
- Install dependencies: `pip install -r requirements.txt`
## Dataset

- Download the dataset to this directory. The MNRE dataset comes from https://github.com/thecharm/Mega, many thanks.

  You can download the MNRE dataset with detected visual objects using the following commands:

  ```bash
  wget 121.41.117.246:8080/Data/re/multimodal/data.tar.gz
  tar -xzvf data.tar.gz
  ```

- The MNRE dataset with detected visual objects is stored in `data`:
  - `img_detect`: objects detected using RCNN
  - `img_vg`: objects detected using visual grounding
  - `img_org`: original images
  - `txt`: text set
  - `vg_data`: bounding images corresponding to `img_vg`
  - `ours_rel2id.json`: relation set (inspected in the sketch after this list)

- We use the RCNN-detected objects and the visual grounding objects as visual local information, where RCNN detection is done via faster_rcnn and visual grounding via onestage_grounding.
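As a quick sanity check after extraction, here is a minimal sketch for inspecting the relation set; it assumes `data.tar.gz` was unpacked into `./data` and that `ours_rel2id.json` maps relation names to integer ids:

```python
import json

# Load the MNRE relation-to-id mapping (assumed to be a name -> id dict).
with open("data/ours_rel2id.json", "r", encoding="utf-8") as f:
    rel2id = json.load(f)

print(f"{len(rel2id)} relations, e.g. {list(rel2id)[:3]}")
```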
## Training

- Parameters, model paths and configuration for training are in the `conf` folder; users can modify them before training (see the config-loading sketch after this list).

- Training on MNRE:

  ```bash
  python run.py
  ```

- The trained model is stored in the `checkpoint` directory by default; you can change this by modifying `save_path` in `train.yaml`.

- To start training from the last-trained model, modify `load_path` in `train.yaml` to the path of the last-trained model.

- Logs for training are stored in the current directory by default; the path can be configured by modifying `log_dir` in `.yaml`.
## Prediction

- Modify `load_path` in `predict.yaml` to the trained model path. In addition, we provide a model trained on the MNRE dataset for users to predict with directly.

  ```bash
  python predict.py
  ```
If you use or extend our work, please cite the following paper:
```bibtex
@article{DBLP:journals/corr/abs-2211-07504,
  author     = {Lei Li and
                Xiang Chen and
                Shuofei Qiao and
                Feiyu Xiong and
                Huajun Chen and
                Ningyu Zhang},
  title      = {On Analyzing the Role of Image for Visual-enhanced Relation Extraction},
  journal    = {CoRR},
  volume     = {abs/2211.07504},
  year       = {2022},
  url        = {https://doi.org/10.48550/arXiv.2211.07504},
  doi        = {10.48550/arXiv.2211.07504},
  eprinttype = {arXiv},
  eprint     = {2211.07504},
  timestamp  = {Tue, 27 Dec 2022 08:22:45 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2211-07504.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```