We propose prompt augmentation, a data augmentation technique for RL post-training on mathematical reasoning that substantially extends training duration, stabilizes training in low-entropy regimes, and improves reasoning performance.

Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

News

2026.02.03 The paper is posted on arXiv! We are actively working on applying prompt augmentation to 7B models.

2026.02.04 The results are still improving in our reproduced 1.5B runs; stay tuned for updates on an even higher SOTA. 🚀

Introduction

Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5–20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen-Math-1.5B model trained with prompt augmentation on the MATH Level 3–5 dataset achieves state-of-the-art performance, reaching 44.5 per-benchmark accuracy and 51.3 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench.
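Concretely, one plausible instantiation of prompt augmentation looks like the sketch below (illustrative only; the template pool and the helper `augment_prompt` are hypothetical, not our exact training code): each rollout draws its own reasoning template, so responses to the same question are generated under diverse formats.

```python
import random

# Hypothetical pool of reasoning templates; the actual set used in training
# lives in this repo's data-preprocessing code.
TEMPLATES = [
    "{question}\nPlease reason step by step, and put your final answer within \\boxed{{}}.",
    "Solve the following problem. Show your work, then give the final answer in \\boxed{{}}.\n{question}",
    "{question}\nThink carefully and conclude with 'The answer is \\boxed{{...}}'.",
]

def augment_prompt(question: str) -> str:
    """Wrap a question in a randomly sampled reasoning template (prompt augmentation)."""
    template = random.choice(TEMPLATES)
    return template.format(question=question)

# During rollout generation, each of the n samples for a question can draw its own
# template, which increases rollout diversity compared to a single fixed prompt.
prompts = [augment_prompt("What is 2 + 2?") for _ in range(8)]
```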

Results

Weights 🏋🏻‍♂️

Training

Wandb training logs for our experiments: here.

First, install the required packages for verl by following their installation instructions. Our project is developed on top of this version of verl using a conda environment; we have provided our environment.yml.

To train the model with prompt augmentation:

sbatch examples/grpo_trainer/run_qwen2_5_math_1_5b.sh
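For context, the underlying objective is standard GRPO without a KL term: rewards within each group of rollouts for the same prompt are normalized into group-relative advantages. Below is a minimal NumPy sketch of that advantage computation, for orientation only; it is not the verl implementation invoked by the script above.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize rewards within each group of rollouts.

    rewards: shape (num_prompts, group_size), one scalar reward per rollout.
    Returns advantages of the same shape; no KL penalty is added here,
    matching the no-KL setting described in the introduction.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 rollouts each, binary correctness rewards.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```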

To convert checkpoints to Hugging Face format:

sbatch convert_model.sh
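Once converted, the checkpoint directory can be loaded like any Hugging Face causal LM. A quick sanity-check sketch (the checkpoint path below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the directory produced by convert_model.sh.
ckpt_path = "path/to/converted_checkpoint"

tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto", device_map="auto")

prompt = "What is 1 + 1? Please reason step by step, and put your final answer within \\boxed{}."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```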

Evaluation

To evaluate the trained model:

sbatch eval/eval_iterative.sh

We use greedy decoding and report the average over 5 inference runs as our final result, due to the stochasticity of the vLLM inference engine.
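A minimal sketch of that evaluation protocol with the vLLM Python API (the answer checker and the max_tokens value below are placeholders; the actual logic lives in eval/):

```python
from vllm import LLM, SamplingParams

def is_correct(output_text: str, gold_answer: str) -> bool:
    # Placeholder: the real evaluation extracts the \boxed{} answer and
    # compares it to the ground truth with a math-aware checker.
    return gold_answer in output_text

def evaluate(model_path: str, prompts: list[str], answers: list[str], n_runs: int = 5) -> float:
    llm = LLM(model=model_path)
    greedy = SamplingParams(temperature=0.0, max_tokens=3072)  # greedy decoding
    accs = []
    for _ in range(n_runs):  # average over 5 runs due to vLLM stochasticity
        outputs = llm.generate(prompts, greedy)
        correct = sum(is_correct(o.outputs[0].text, a) for o, a in zip(outputs, answers))
        accs.append(correct / len(prompts))
    return sum(accs) / len(accs)
```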

To evaluate our provided checkpoint, first fill in the downloaded checkpoint path in eval/eval_checkpoint.sh, then run:

sbatch eval/eval_checkpoint.sh

Important Q&A

  • Why do other papers not show collapse in their graphs?
  • You might wonder why other papers do not show the collapse in their graphs. My take is that if a paper only shows up to 500 steps, then what happens at 1500 steps (or 5000 steps) is simply not shown. If you could keep a model training stably on a dataset for an unlimited number of steps in a reinforcement learning setting without performance degradation, I think that would be a major breakthrough (very continual-learning, AGI-vibe 🤖🧠). That's why I think scaling up the training duration is important to extract the most juice out of a given dataset.

  • How reproducible are the results?
  • I would say they are fairly reproducible, but as always with RL, things can be uncertain. If your run collapses a bit early, give it another try; I bet it will run for much longer. In our experiments, we did not do any cherry-picking across runs (e.g., we did not run things three times and pick the best one); all results reported in the paper are first-run results. I think the key insight is that as long as you can keep the model alive a bit longer, it can improve a bit further and reach higher performance.

    Update: Our second-run result already surpasses the reported first-run result, so I am now fairly confident about reproducibility.

Acknowledgements

Our codebase builds heavily on verl, understand-r1-zero, and TreePO. Thanks for their great work!

Citation

If you find this repository helpful, please consider giving this repo a star ⭐ and citing:

@misc{lu2026promptaugmentationscalesgrpo,
      title={Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning},
      author={Wenquan Lu and Hai Huang and Randall Balestriero},
      journal={arXiv preprint arXiv:2602.03190},
      year={2026},
}
