Check out our paper for more details: [Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning](https://arxiv.org/abs/2502.13820)
- HE-R
- HE-R+
- MBPP-R
- MBPP-R+
We provide the scripts we used to generate the scoring- and ranking-augmented benchmarks in our paper:
- `generate_solutions.py` makes inference requests to OpenAI and executes the solutions to determine their ground-truth fraction of predefined tests passed.
- `combine_solutions.py` aggregates all of the candidate solutions generated in the `exec_{}.jsonl` files for each sample into one file.
- `filter_solutions.py` filters the solutions down to k evenly spaced solutions for each sample (see the sketch below).
- `evaluate.py` evaluates Top-1, Bottom-1, Spearman's, Kendall's Tau, MAE, and R^2 for a target file.
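
As an illustration of the filtering step, here is a minimal sketch that keeps k solutions evenly spaced by their ground-truth test score. This is one plausible reading of "evenly spaced"; the actual `filter_solutions.py` may select differently, and the helper `pick_evenly_spaced` below is hypothetical.

```python
# Illustrative sketch only -- not the actual filter_solutions.py implementation.
from typing import Dict, List


def pick_evenly_spaced(solutions: List[Dict], k: int,
                       score_key: str = "ground_average_test_score") -> List[Dict]:
    """Keep k solutions whose test scores are evenly spread from worst to best."""
    ranked = sorted(solutions, key=lambda s: s[score_key])
    if k >= len(ranked):
        return ranked
    step = (len(ranked) - 1) / max(k - 1, 1)   # index spacing between kept solutions
    return [ranked[round(i * step)] for i in range(k)]
```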
The file used for evaluation should include both the original ranks at key `rank` and the expected test score at `ground_average_test_score`, as found in the ranking datasets. Each solution is separated into its own entry and grouped by the keys `dataset` and `task_id`.
To compare test case generation, set `method` to `utg` and store the generated test scores at `average_test_score`.
To compare reward scoring, set `method` to `reward` and store the generated test scores at `reward`, `reward_score`.
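
For reference, here is a minimal sketch of the rank-correlation and error metrics `evaluate.py` reports, assuming a JSONL file with one solution per line, the keys described above, and the generated score stored under a single top-level key. The actual script interface may differ; `load_groups`, `evaluate`, and the path `results.jsonl` below are hypothetical.

```python
# Illustrative sketch only -- not the actual evaluate.py implementation.
import json
from collections import defaultdict

from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import mean_absolute_error, r2_score


def load_groups(path, score_key="average_test_score"):
    """Group solutions by (dataset, task_id), pairing ground-truth and generated scores."""
    groups = defaultdict(list)
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            groups[(entry["dataset"], entry["task_id"])].append(
                (entry["ground_average_test_score"], entry[score_key])
            )
    return groups


def evaluate(path, score_key="average_test_score"):
    """Average per-task Spearman/Kendall correlations and compute global MAE and R^2."""
    spearman_vals, kendall_vals, truths, preds = [], [], [], []
    for pairs in load_groups(path, score_key).values():
        gt, pred = zip(*pairs)
        spearman_vals.append(spearmanr(gt, pred).correlation)
        kendall_vals.append(kendalltau(gt, pred).correlation)
        truths.extend(gt)
        preds.extend(pred)
    return {
        "spearman": sum(spearman_vals) / len(spearman_vals),
        "kendall": sum(kendall_vals) / len(kendall_vals),
        "mae": mean_absolute_error(truths, preds),
        "r2": r2_score(truths, preds),
    }


# "results.jsonl" is a placeholder path; swap score_key for reward scoring.
print(evaluate("results.jsonl", score_key="average_test_score"))
```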
`pip install -r requirements.txt`
@misc{ficek2025scoringverifiersevaluatingsynthetic,
title={Scoring Verifiers: Evaluating Synthetic Verification in Code and Reasoning},
author={Aleksander Ficek and Somshubra Majumdar and Vahid Noroozi and Boris Ginsburg},
year={2025},
eprint={2502.13820},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.13820},
}