This repo provides the source code & data of our paper "Learn Together: Joint Multitask Finetuning of Pretrained KG-enhanced LLM for Downstream Tasks".
Installation guide for training RoBERTa on H100/A100 GPUs.
Requirements: Python 3.8, CUDA 11.8, PyTorch 2.0.1, PyG 2.4, Transformers 4.10
Run the following commands to create a conda environment:

```bash
mamba create -y -n multitask_finetune python=3.8
mamba activate multitask_finetune
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 pyg=2.4 pytorch-sparse=0.6.17 -c pytorch -c nvidia -c pyg
pip install transformers==4.10 wandb nltk spacy==2.1.6
python -m spacy download en
pip install scispacy==0.3.0
pip install sentencepiece
```
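Optionally, run a quick sanity check (a minimal sketch; the printed versions should match the requirements above):

```bash
# Optional sanity check: confirm the key packages import and the GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch_geometric; print(torch_geometric.__version__)"
python -c "import transformers; print(transformers.__version__)"
```

Download pretrained models and place the files under `./models`: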
| Model | Size | Pretraining Text | Pretraining Knowledge Graph | Download Link |
|---|---|---|---|---|
| RoBERTa | 360M parameters | BookCorpus (filtered) | ConceptNet | link (will be updated) |
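As an illustration, assuming the checkpoint downloads as a single file named `roberta_conceptnet.pt` (a hypothetical filename; the actual name depends on the release), placing it would look like:

```bash
# Hypothetical filename -- adjust to whatever the download link provides
mkdir -p models
mv ~/Downloads/roberta_conceptnet.pt models/
```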
Download all the preprocessed data from here (596 GB) and the KGQA data from here (2 GB).
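For example, assuming the downloads are tar archives named `data_preprocessed.tar.gz` and `kgqa_data.tar.gz` that should be extracted under `./data` (hypothetical names and layout; adjust to the actual files), extraction might look like:

```bash
# Hypothetical archive names and target directory -- adjust to the actual downloads
mkdir -p data
tar -xzf data_preprocessed.tar.gz -C data
tar -xzf kgqa_data.tar.gz -C data
```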
If you would like to train the model on a single task, run:

```bash
scripts/run_train__{qa/mrc/kgqa_dataset_name}.sh
```
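For example, assuming `csqa` is one of the available QA datasets (a hypothetical name; list `scripts/` to see the actual `run_train__*.sh` files), a single-task run would be:

```bash
# Hypothetical dataset name -- check scripts/ for the available run_train__*.sh files
ls scripts/run_train__*.sh
bash scripts/run_train__csqa.sh
```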
For joint training, run (don't forget to specify the task flags):

```bash
scripts/run_train__joint.sh
```
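The exact task flags depend on the script; a hypothetical invocation, assuming it accepts a comma-separated task list via a `--tasks` flag, might look like:

```bash
# Hypothetical flag and task names -- inspect scripts/run_train__joint.sh for the real interface
bash scripts/run_train__joint.sh --tasks csqa,squad
```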
(Optional) To pretrain the model on your own data, run:

```bash
scripts/run_pretrain.sh
```
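As a sketch, assuming the corpus location is configured inside the script (the actual mechanism depends on `run_pretrain.sh`; inspect it before running):

```bash
# Point the script at your own corpus first (how to do so depends on the script itself)
bash scripts/run_pretrain.sh
```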
This repo is built upon the following works:
DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining
[https://github.com/michiyasunaga/dragon](https://github.com/michiyasunaga/dragon)
