We are releasing all the details of our experiments, including code, data, model weights, and guides to reproduce the experiments in the paper.
In Section 4.2 of our paper, we train models on test sets of benchmark datasets to simulate different data contamination scenarios:
- Supervised fine-tuning (SFT): Problem texts and answers are formatted as input-output pairs.
- Continual pre-training (PT): Problem texts and answers are concatenated as unlabeled data.
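As a sketch, the two contamination formats above might be constructed like this (the variable names and JSON fields are illustrative assumptions, not the exact format used in the paper):

```shell
# Sketch: turn one benchmark item into the two contamination formats.
# Field names ("instruction"/"output") are assumptions for illustration.
problem="What is 2 + 2?"
answer="4"

# SFT: problem and answer formatted as an input-output pair (one JSON record).
sft_record=$(printf '{"instruction": "%s", "output": "%s"}' "$problem" "$answer")
echo "$sft_record"

# PT: problem and answer simply concatenated as a single unlabeled text.
pt_record="$problem $answer"
echo "$pt_record"
```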
We use the same base model (Llama-2-7b-hf) without instruction-following capability for all experiments. To enable instruction-following, we instruction-tune all models on the same instruction dataset (which does not overlap with the benchmark datasets).
Here's an illustration of the training process:
For all experiments requiring model training, we use LLaMA-Factory and DeepSpeed. We provide:
- Training scripts (`*.sh`)
- DeepSpeed configuration (`ds_config.json`)
- Model weights for all experiments
To train the models yourself:
- Set up the environment for LLaMA-Factory and DeepSpeed.
- Replace the `data` directory in LLaMA-Factory with the one in this repository.
- Put the training scripts and DeepSpeed configuration in the root directory of LLaMA-Factory.
- Run the training scripts to reproduce the experiments.
Since we perform full-parameter training for all models, we recommend using a machine with at least 4 × 80GB GPUs.
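The setup steps above can be sketched as a shell session (the LLaMA-Factory checkout location, file layout, and the training script name are assumptions; adapt them to your environment). Here `run` only prints each command, so the sketch is safe to dry-run before executing anything:

```shell
# Dry-run sketch of the training setup; replace `run` with direct execution once verified.
run() { echo "+ $*"; }

# 1. Set up the environment for LLaMA-Factory (DeepSpeed comes in via its requirements).
run git clone https://github.com/hiyouga/LLaMA-Factory.git
run pip install -e LLaMA-Factory

# 2. Replace LLaMA-Factory's data directory with the one from this repository.
run rm -rf LLaMA-Factory/data
run cp -r data LLaMA-Factory/data

# 3. Put the training scripts and DeepSpeed configuration in LLaMA-Factory's root.
run cp ./*.sh ds_config.json LLaMA-Factory/

# 4. Launch one of the training scripts (the name here is a placeholder).
run bash train_sft_cheater.sh
```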
If you prefer to reproduce the experiments without training the models, we have provided the model weights on HuggingFace:
- `llama-2-7b-hf` (base model): https://huggingface.co/meta-llama/Llama-2-7b-hf
- Normal: https://huggingface.co/zhuohaoyu/KIEval-Experiments-Normal-Model
- SFT-Cheater: https://huggingface.co/zhuohaoyu/KIEval-Experiments-SFT-Cheater
- PT-Cheater: https://huggingface.co/zhuohaoyu/KIEval-Experiments-PT-Cheater
To download and deploy the model weights on your local machine (requires Docker and at least a single GPU with 24GB VRAM), use this script:
```shell
model=zhuohaoyu/KIEval-Experiments-SFT-Cheater
# Share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --rm -d --gpus "device=0" --shm-size 32g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model --sharded false --trust-remote-code
```

After deploying the model, you can:
- Interact with your model using the HuggingFace Inference API, or
- Run our experiments by replacing the TGI URL and port in the config templates (e.g. `config/template-basic.json`) and executing our launch script.
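For example, once the TGI container is up, a quick sanity check against its `/generate` endpoint might look like the following (the prompt and generation parameters are illustrative; the URL assumes the port mapping from the `docker run` command above):

```shell
# Query the deployed TGI server on localhost:8080.
url=http://localhost:8080/generate
payload='{"inputs": "Question: What is 2 + 2?\nAnswer:", "parameters": {"max_new_tokens": 32}}'

# Only issue the request if the server is actually reachable.
if curl -s -o /dev/null "$url"; then
    curl -s "$url" -X POST -H 'Content-Type: application/json' -d "$payload"
fi
```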
