Releases: evalplus/evalplus
EvalPlus v0.3.1
For the past 6+ months, we have been actively maintaining and improving the EvalPlus repository. Now we are thrilled to announce a new release!
🔥 EvalPerf for Code Efficiency Evaluation
Based on our COLM'24 paper, we integrated the EvalPerf dataset into the EvalPlus repository.
EvalPerf is a dataset curated using the Differential Performance Evaluation methodology proposed by the paper, which argues that effective code efficiency evaluation requires:
- Performance-exercising tasks -- our tasks are verified to be challenging in terms of code efficiency!
- Performance-exercising inputs -- for each task, we generate a performance-challenging test input!
- Compound metric: Differential Performance Score (DPS) -- inspired by LeetCode's efficiency ranking of submissions, it yields conclusions like "your submission can outperform 80% of LLM solutions..."
The EvalPerf dataset initially has 118 coding tasks^ (a subset of the latest HumanEval+ and MBPP+) -- running EvalPerf is as simple as running the following commands:
pip install "evalplus[perf,vllm]" --upgrade
# Or: pip install "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus" --upgrade
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
At evaluation time we by default perform the following steps:
- Correctness sampling: we sample the LLM for 100 solutions per task (n_samples) and perform correctness checking
- Efficiency evaluation: for tasks with 10+ passing solutions, we evaluate the code efficiency of (at most 20) passing solutions:
  - Primitive metric: # CPU instructions
  - We profile the # CPU instructions of (i) new LLM solutions and (ii) representative performance reference solutions, running them over the performance-challenging test input
  - We match each profiled new solution to the reference solution with comparable code efficiency to calculate $DPS$ and $DPS_{norm}$
    - e.g., given 10 reference samples in 4 clusters of sizes [3, 2, 3, 2], matching the 3rd cluster leads to $DPS = \frac{sample\ rank}{total\ samples} = \frac{3+2+3}{10} = 80\%$ and $DPS_{norm} = \frac{cluster\ rank}{total\ clusters} = \frac{1+1+1}{4} = 75\%$
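To make the DPS computation concrete, here is a small illustrative Python sketch (the function is ours for illustration, not part of the evalplus API; it assumes reference clusters are ordered from least to most efficient):
from typing import List

def dps_and_dps_norm(cluster_sizes: List[int], matched_cluster: int):
    # Reference solutions are grouped into efficiency clusters, ordered from
    # least to most efficient; `matched_cluster` is the 0-based index of the
    # cluster whose efficiency the new solution matches.
    matched_and_slower = cluster_sizes[: matched_cluster + 1]
    dps = 100 * sum(matched_and_slower) / sum(cluster_sizes)
    dps_norm = 100 * len(matched_and_slower) / len(cluster_sizes)
    return dps, dps_norm

print(dps_and_dps_norm([3, 2, 3, 2], 2))  # (80.0, 75.0)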
Work done in collaboration with @soryxie and @FatPigeorz!
🔥 Command-line Interface (CLI) Simplification
We largely simplified the evaluation pipelines:
- Previously: run evalplus.codegen, and then evalplus.sanitize, and then evalplus.evaluate with different parameters:
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
--greedy
evalplus.sanitize --samples [path/to/samples]
evalplus.evaluate --samples [path/to/samples]
- Now:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
--dataset [humaneval|mbpp] \
--backend vllm \
    --greedy
Other notable updates
- Sanitizer improvements (#189, #190) -- thanks to @Co1lin
- Fixing an edge case of is_float (#196)
- HumanEval+ maintenance: v0.1.10 by improving contracts & oracles (#186, #201) -- thanks to @Kristoff-starling @Co1lin
- MBPP+ maintenance: v0.2.1 by improving contracts & oracles (#211, #212)
- Default behavior change: code generation results are saved as .jsonl rather than massive individual files and folders
- Using the official tree-sitter package and its Python binding
- Prompt: adding a newline after stripping the prompt, as some models are more familiar with """\n over """
- Configurable maximum evaluation-process memory via the environment variable EVALPLUS_MAX_MEMORY_BYTES (see the sketch after this list)
- When the sampling size per task is > 1, the batch size is automatically set to min(n_samples, 32) if --bs is not set
- Sanitizer behavior: when the code is too broken to be sanitized, return the broken code rather than an empty string for debuggability
- vLLM: automatic prefix caching is enabled to accelerate sampling (hopefully)
- Setting top_p = 0.95 for the OpenAI, Google, and Anthropic backends
- New arguments: --trust-remote-code
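As an illustrative sketch of the memory cap (the 4 GiB value, the script, and the subprocess invocation are our own example, not an official recipe):
import os
import subprocess

# Export the cap so the spawned evaluation processes inherit it (4 GiB here is arbitrary)
env = dict(os.environ, EVALPLUS_MAX_MEMORY_BYTES=str(4 * 1024**3))
subprocess.run(
    ["evalplus.evaluate", "--dataset", "humaneval", "--samples", "samples.jsonl"],
    env=env,
    check=True,
)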
PyPI: https://pypi.org/project/evalplus/0.3.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.3.1/images/sha256-26b118098bef281fe8dfe999bf05f1d5b45374b4e6c00161ec0f30592aef4740
^In our COLM paper, we presented 121 tasks based on the February version of MBPP+ (v0.1.0), which at the time had 399 MBPP+ tasks -- in MBPP+ (v0.2.0) we removed some broken tasks (399 -> 378), leading to a slight cut in the number of EvalPerf tasks as well.
^We skipped/yanked the release of v0.3.0 and directly released v0.3.1 due to a broken dependency in v0.3.0.
EvalPlus v0.2.1
Main updates
- HumanEval+ and MBPP+ datasets are now on the Hugging Face Hub
- HumanEval+ is ported to the original HumanEval format; release files have a new home now
- You can use EvalPlus through bigcode-evaluation-harness now
- The Docker image now uses Python 3.10, since some models might generate Python code using the latest syntax, leading to false positives with older Python versions
- The sanitizer is now merged into the package
- Several improvements and bug fixes to the sanitizer
- Test-suite reduction is now moved to tools
- Fixes the CACHE_DIR nonexistence issue
- Simplified the format of eval_results.json for readability
- Use the EVALPLUS_TIMEOUT_PER_TASK env var to set the maximum testing time for each task
- Timeout per test is set to 0.5s by default
- Fixes argument validity checking for inputgen.py
Dataset maintenance
HumanEval/32: fixes the oracle
Supported codegen models
- Now EvalPlus leaderboard lists 82 models
- WizardCoders
- Stable Code
- OpenCodeInterpreter
- Anthropic API
- Mistral API
- CodeLlama instruct
- Phi-2
- Solar
- Dolphin
- OpenChat
- CodeMillenials
- Speechless
- xdan-l1-chat
- etc.
PyPI: https://pypi.org/project/evalplus/0.2.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.1/images/sha256-2bb315e40ea502b4f47ebf1f93561ef88280d251bdc6f394578c63d90e1825d7
EvalPlus v0.2.0
🔥 Announcing MBPP+
MBPP is a dataset curated by Google. Its full set includes around 1,000 crowd-sourced Python programming problems. However, a certain number of problems can be noisy (e.g., prompts that make no sense or tests that are broken). Consequently, a subset (~427 problems) of the data has been hand-verified by the original authors -- MBPP-sanitized.
MBPP+ improves MBPP based on its sanitized version (MBPP-sanitized):
- We further hand-verify the problems, trimming ill-formed ones to keep 399 problems
- We also fix the problems whose implementation is wrong (more details can be found here)
- We perform test augmentation to increase the number of tests by 35x (on average, from 3.1 to 108.5)
- We maintain scripting compatibility with HumanEval+, where one simply toggles the switch via --dataset mbpp for evalplus.evaluate, codegen/generate.py, tools/checker.py, as well as tools/sanitize.py
- The initial leaderboard is available at https://evalplus.github.io/leaderboard.html and we will keep updating it
A typical workflow to use MBPP+:
# Step 1: Generate MBPP solutions
from evalplus.data import get_mbpp_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # The LLM produces the whole solution based on the prompt
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_mbpp_plus().items()
]
write_jsonl("samples.jsonl", samples)
# One may perform some post-processing to sanitize the LLM-produced code,
# e.g., https://github.com/evalplus/evalplus/blob/master/tools/sanitize.py

# Step 2: Evaluation on MBPP+
docker run -v $(pwd):/app ganler/evalplus:latest --dataset mbpp --samples samples.jsonl
# STDOUT will display the scores for "base" (with MBPP tests) and "base + plus" (with additional MBPP+ tests)
🔥 HumanEval+ Maintenance
- Leaderboard updates (now 41 models!): https://evalplus.github.io/leaderboard.html
- DeepSeek Coder series
- Phind-CodeLlama
- Mistral and Zephyr series
- Smaller StarCoders
- HumanEval+ now upgrades from v0.1.6 to v0.1.9
  - Test-case fixes: 0, 3, 9, 148
  - Prompt fixes: 114
  - Contract fixes: 1, 2, 99, 35, 28, 32, 160
PyPI: https://pypi.org/project/evalplus/0.2.0/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.0/images/sha256-6f1b9bd13930abfb651a99d4c6a55273271f73e5b44c12dcd959a00828782dd6
EvalPlus v0.1.7
- EvalPlus leaderboard: https://evalplus.github.io/leaderboard.html
- Evaluated CodeLlama, CodeT5+ and WizardCoder
- Fixed contract (HumanEval+): 116, 126, 006
- Removed extreme inputs (HumanEval+): 32
- Established HUMANEVAL_OVERRIDE_PATH, which allows overriding the original dataset with a customized dataset
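A minimal sketch of using the override (the path is a placeholder, and we assume the file follows the HumanEval+ .jsonl format and is picked up when the dataset is loaded):
import os

# Hypothetical path; point it at your own HumanEval+-format .jsonl file
os.environ["HUMANEVAL_OVERRIDE_PATH"] = "/path/to/custom_humaneval_plus.jsonl"

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()  # now loaded from the override file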
PyPI: https://pypi.org/project/evalplus/0.1.7/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.7/images/sha256-69fe87df89b8c1545ff7e3b20232ac6c4841b43c20f22f4a276ba03f1b0d79ae
EvalPlus v0.1.6
- Supporting configurable timeouts $T = \max(T_{base}, T_{gt} \times k)$ (see the sketch after this list), where:
  - $T_{base}$ is the minimal timeout (configurable via --min-time-limit; defaults to 0.2s);
  - $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
  - $k$ is a configurable factor --gt-time-limit-factor (defaults to 4).
- Using a more conservative timeout setting to mitigate test-beds with weak performance ($T_{base}: 0.05s \to 0.2s$ and $k: 2 \to 4$).
- HumanEval+ dataset bug fixes
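A minimal sketch of the timeout rule above (the function is illustrative, not part of the evalplus API):
def per_test_timeout(t_gt: float, t_base: float = 0.2, k: float = 4.0) -> float:
    # T = max(T_base, T_gt * k)
    return max(t_base, t_gt * k)

# A ground-truth runtime of 0.01s still gets the 0.2s floor,
# while 0.1s gets max(0.2, 0.1 * 4) = 0.4s
print(per_test_timeout(0.01), per_test_timeout(0.1))  # 0.2 0.4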
PyPI: https://pypi.org/project/evalplus/0.1.6/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.6/images/sha256-5913b95172962ad61e01a5d5cf63b60e1140dd547f5acc40370af892275e777c
EvalPlus v0.1.5
🚀 HumanEval+[mini] -- 47x smaller yet as effective as HumanEval+
- Add --mini to evalplus.evaluate ... and you can use a minimal, best-quality set of extra tests to accelerate evaluation! HumanEval+[mini] (avg 16.5 tests) is 47x smaller than HumanEval+ (avg 774.8 tests).
- This is achieved via test-suite reduction -- we run a set covering algorithm to preserve the same coverage (coverage analysis), mutant killings (mutation analysis), and sample killings (pass-fail status of each sample-test pair); a sketch of the idea follows.
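As a rough illustration of the reduction idea only (the actual implementation lives under tools/ and may differ), a greedy set-cover sketch:
from typing import Dict, Set

def reduce_tests(covers: Dict[str, Set[str]]) -> Set[str]:
    # `covers` maps each test ID to the set of requirements it satisfies
    # (covered branches, killed mutants, killed samples, ...)
    uncovered: Set[str] = set().union(*covers.values()) if covers else set()
    selected: Set[str] = set()
    while uncovered:
        # Greedily pick the test that satisfies the most still-uncovered requirements
        best = max(covers, key=lambda t: len(covers[t] & uncovered))
        selected.add(best)
        uncovered -= covers[best]
    return selected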
PyPI: https://pypi.org/project/evalplus/0.1.5/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.5/images/sha256-01ef3275ab02776e94edd4a436a3cd33babfaaf7a81e7ae44f895c2794f4c104
EvalPlus v0.1.4
- Performance:
- Lazy loading of the cache of evaluation results
- Use ProcessPoolExecutor over ThreadPoolExecutor
- Caching ground-truth outputs
- Observability:
- Concurrent logger when a task gets stuck for 10s
- New models:
- CodeGen2 (infill)
- StarCoder (infill)
- Fixes:
- Deterministic hashing of input problem
PyPI: https://pypi.org/project/evalplus/0.1.4/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.4/images/sha256-a0ea8279c71afa9418808326412b1e5cd11f44b3b59470477ecf4ba999d4b73a
EvalPlus v0.1.3
- Fixes evaluation when input sample format is .jsonl
PyPI: https://pypi.org/project/evalplus/0.1.3/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.3/images/sha256-fd13ab6ee2aa313eb160fc29debe8c761804cb6af7309280b4e200b6549bd75a
EvalPlus v0.1.2
- Fix the bug induced by using --base-only
- Build docker image locally instead of simply doing a pip install
PyPI: https://pypi.org/project/evalplus/0.1.2/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.2/images/sha256-747ae02f0bfbd300c0205298113006203d984373e6ab6b8fb3048626f41dbe08
EvalPlus v0.1.1
In this version, efforts are mainly made to sanitize and standardize code in evalplus. Most importantly, evalplus strictly follows the dataset usage style of HumanEval. As a result, users can use evalplus in this way:
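For example, a minimal sketch mirroring the MBPP+ workflow shown above (GEN_SOLUTION is a placeholder for your own model call):
from evalplus.data import get_human_eval_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # Placeholder: query your LLM for a complete solution to `prompt`
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)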
For more details, the main changes are (tracked in #1):
- Package build and pypi setup
- (HumanEval Compatibility) Support sample files as .jsonl
- (HumanEval Compatibility) get_human_eval_plus() returns a dict instead of a list
- (HumanEval Compatibility) Use the HumanEval task ID splitter "/" over "_"
- Optimize the evaluation parallelism scheme to the sample-level granularity (original: task level)
- Optimize IPC via shared memory
- Remove groundtruth solutions to avoid data leakage
- Use docker as the sandboxing mechanism
- Support Codegen2 in generation
- Split dependency into multiple categories
PyPI: https://pypi.org/project/evalplus/0.1.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.1/images/sha256-4993a0dc0ec13d6fe88eb39f94dd0a927e1f26864543c8c13e2e8c5d5c347af0

