
Releases: evalplus/evalplus

EvalPlus v0.3.1

20 Oct 21:59


For the past 6+ months, we have been actively maintaining and improving the EvalPlus repository. Now we are thrilled to announce a new release!

🔥 EvalPerf for Code Efficiency Evaluation

Based on our COLM'24 paper, we integrated the EvalPerf dataset into the EvalPlus repository.
EvalPerf is a dataset curated using the Differential Performance Evaluation methodology proposed by the paper, which argues that effective code efficiency evaluation requires:

  • Performance-exercising tasks -- our tasks are verified to be challenging in terms of code efficiency!
  • Performance-exercising inputs -- for each task, we generate a performance-challenging test input!
  • Compound metric: Differential Performance Score (DPS) -- inspired by LeetCode's efficiency ranking of submissions, it supports conclusions like "your submission outperforms 80% of LLM solutions..."

The EvalPerf dataset initially has 118 coding tasks^ (a subset of the latest HumanEval+ and MBPP+) -- running EvalPerf is as simple as running the following commands:

pip install "evalplus[perf,vllm]" --upgrade
# Or: pip install "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus" --upgrade

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm

At evaluation time, we perform the following steps by default:

  1. Correctness sampling: We sample 100 solutions (n_samples) from the LLM and check their correctness
  2. Efficiency evaluation: For tasks with 10+ passing solutions, we evaluate the code efficiency of (at most 20) passing solutions:
    • Primitive metric: # CPU instructions
    • We profile the # of CPU instructions for (i) new LLM solutions and (ii) representative performance reference solutions, each run on the performance-challenging test input
    • We match each profiled new solution to the reference solution with comparable code efficiency to calculate $DPS$ and $DPS_{norm}$ (a sketch follows this list)
    • e.g., given 10 reference samples in 4 clusters of sizes [3, 2, 3, 2], matching the 3rd cluster leads to $DPS = \frac{sample\ rank}{total\ samples} = \frac{3+2+3}{10} = 80\%$ and $DPS_{norm} = \frac{cluster\ rank}{total\ clusters} = \frac{1+1+1}{4} = 75\%$
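
To make the matching arithmetic concrete, here is a minimal, self-contained sketch (not EvalPlus's internal code) that reproduces the example above, assuming the reference clusters are ordered from least to most efficient:

# Hypothetical illustration of the DPS math above; the cluster sizes and the
# matched cluster index come directly from the example in the list.
clusters = [3, 2, 3, 2]   # sizes of the 4 reference-efficiency clusters
matched = 3               # the new solution matches the 3rd cluster

total_samples = sum(clusters)           # 10 reference samples
total_clusters = len(clusters)          # 4 clusters
sample_rank = sum(clusters[:matched])   # 3 + 2 + 3 = 8
cluster_rank = matched                  # 1 + 1 + 1 = 3

dps = 100 * sample_rank / total_samples         # 80.0
dps_norm = 100 * cluster_rank / total_clusters  # 75.0
print(f"DPS = {dps:.0f}%, DPS_norm = {dps_norm:.0f}%")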

This is collaborative work with @soryxie and @FatPigeorz!

🔥 Command-line Interface (CLI) Simplification

We have greatly simplified the evaluation pipeline:

  • Previously: run evalplus.codegen, then evalplus.sanitize, and finally evalplus.evaluate, each with its own parameters:
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]            \
                  --backend vllm                        \
                  --greedy

evalplus.sanitize --samples [path/to/samples]

evalplus.evaluate --samples [path/to/samples]
  • Now:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy

Other notable updates

  • Sanitizer improvements (#189, #190) -- thanks to @Co1lin
  • Fixing an edge case of is_float (#196)
  • HumanEval+ maintenance: v0.1.10 by improving contracts & oracle (#186, #201) -- thanks to @Kristoff-starling @Co1lin
  • MBPP+ maintenance: v0.2.1 by improving contracts & oracle (#211, #212)
  • Default behavior change: code generation results are saved as .jsonl rather than as massive numbers of individual files and folders
  • Using the official tree-sitter package and its Python binding
  • Prompt: adding a newline after stripping the prompt, as some models are more familiar with """\n than with """
  • Configurable maximum evaluation process memory via the environment variable EVALPLUS_MAX_MEMORY_BYTES
  • When the sampling size per task is > 1, the batch size is automatically set to min(n_samples, 32) if --bs is not set
  • Sanitizer behavior: when the code is too broken to be sanitized, return the broken code rather than an empty string for debuggability.
  • vLLM: automatic prefix caching is enabled to accelerate sampling (hopefully)
  • Setting top_p = 0.95 for OpenAI, Google, and Anthropic backends
  • New argument: --trust-remote-code

PyPI: https://pypi.org/project/evalplus/0.3.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.3.1/images/sha256-26b118098bef281fe8dfe999bf05f1d5b45374b4e6c00161ec0f30592aef4740

^In our COLM paper, we presented 121 tasks based on the February version of MBPP+ (v0.1.0), which at that time contained 399 tasks -- in MBPP+ v0.2.0 we removed some broken tasks (399 -> 378), leading to a slight cut in the number of EvalPerf tasks as well.

^We skipped/yanked the release of v0.3.0 and directly released v0.3.1 due to a broken dependency in v0.3.0.

EvalPlus v0.2.1

02 Apr 22:40
ae74712


Main updates

Dataset maintenance

  • HumanEval/32: fixes the oracle

Supported codegen models

  • The EvalPlus leaderboard now lists 82 models
  • WizardCoders
  • Stable Code
  • OpenCodeInterpreter
  • Anthropic API
  • Mistral API
  • CodeLlama instruct
  • Phi-2
  • Solar
  • Dolphin
  • OpenChat
  • CodeMillenials
  • Speechless
  • xdan-l1-chat
  • etc.

PyPI: https://pypi.org/project/evalplus/0.2.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.1/images/sha256-2bb315e40ea502b4f47ebf1f93561ef88280d251bdc6f394578c63d90e1825d7

EvalPlus v0.2.0

24 Nov 22:12


🔥 Announcing MBPP+

MBPP is a dataset curated by Google. Its full set includes around 1,000 crowd-sourced Python programming problems. However, a certain number of problems are noisy (e.g., prompts that make no sense or tests that are broken). Consequently, a subset (~427 problems) of the data has been hand-verified by the original authors -- MBPP-sanitized.

MBPP+ improves MBPP based on its sanitized version (MBPP-sanitized):

  • We further hand-verify the problems, trimming ill-formed ones to keep 399 problems
  • We also fix the problems whose implementation is wrong (more details can be found here)
  • We perform test augmentation to increase the number of tests by 35x (on average from 3.1 to 108.5)
  • We maintain scripting compatibility with HumanEval+: simply toggle --dataset mbpp for evalplus.evaluate, codegen/generate.py, tools/checker.py, and tools/sanitize.py
  • An initial leaderboard is available at https://evalplus.github.io/leaderboard.html and we will keep updating it

A typical workflow to use MBPP+:

# Step 1: Generate MBPP solutions
from evalplus.data import get_mbpp_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # The LLM produces the whole solution based on the prompt
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_mbpp_plus().items()
]
write_jsonl("samples.jsonl", samples)
# May perform some post-processing to sanitize LLM produced code
# e.g., https://github.com/evalplus/evalplus/blob/master/tools/sanitize.py
# Step 2: Evaluation on MBPP+
docker run -v $(pwd):/app ganler/evalplus:latest --dataset mbpp --samples samples.jsonl
# STDOUT will display the scores for "base" (with MBPP tests) and "base + plus" (with additional MBPP+ tests)

🔥 HumanEval+ Maintenance

  • Leaderboard updates (now 41 models!): https://evalplus.github.io/leaderboard.html
    • DeepSeek Coder series
    • Phind-CodeLlama
    • Mistral and Zephyr series
    • Smaller StarCoders
  • HumanEval+ is now upgraded from v0.1.6 to v0.1.9
    • Test-case fixes: 0, 3, 9, 148
    • Prompt fixes: 114
    • Contract fixes: 1, 2, 99, 35, 28, 32, 160

PyPI: https://pypi.org/project/evalplus/0.2.0/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.0/images/sha256-6f1b9bd13930abfb651a99d4c6a55273271f73e5b44c12dcd959a00828782dd6

EvalPlus v0.1.7

06 Sep 18:35


  • EvalPlus leader board: https://evalplus.github.io/leaderboard.html
  • Evaluated CodeLlama, CodeT5+ and WizardCoder
  • Fixed contract (HumanEval+): 116, 126, 006
  • Removed extreme inputs (HumanEval+): 32
  • Established HUMANEVAL_OVERRIDE_PATH, which allows overriding the original dataset with a customized one (see the sketch below)
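
A minimal sketch of how the override might be used, assuming the variable is read when the dataset is loaded; the file path is hypothetical, and exporting the variable in the shell before calling evalplus.evaluate works the same way:

import os

# Hypothetical path to a customized dataset in the HumanEval(+) JSONL format;
# set the override before the dataset module loads the data.
os.environ["HUMANEVAL_OVERRIDE_PATH"] = "/path/to/custom_humaneval_plus.jsonl"

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()  # now backed by the overridden file
print(len(problems))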

PyPI: https://pypi.org/project/evalplus/0.1.7/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.7/images/sha256-69fe87df89b8c1545ff7e3b20232ac6c4841b43c20f22f4a276ba03f1b0d79ae

EvalPlus v0.1.6

26 Jun 06:57


  • Supporting configurable timeouts $T=\max(T_{base}, T_{gt}\times k)$ (a sketch follows this list), where:
    • $T_{base}$ is the minimal timeout (configurable by --min-time-limit; defaults to 0.2s);
    • $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
    • $k$ is a configurable factor --gt-time-limit-factor (defaults to 4);
  • Using a more conservative timeout setting to mitigate test-beds with weak performance ($T_{base}: 0.05s \to 0.2s$ and $k: 2\to 4$).
  • HumanEval+ dataset bug fixes:
    • Medium contract fixes: P129 (#4), P148 (self-identified)
    • Minor contract fixes: P75 (#4), P53 (#8), P0 (self-identified), P3 (self-identified), P9 (self-identified)
    • Minor GT fixes: P140 (#3)
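
For reference, a minimal sketch (not the actual implementation) of how the per-test timeout above is derived from the two knobs:

# --min-time-limit corresponds to T_base; --gt-time-limit-factor corresponds to k.
def per_test_timeout(gt_runtime: float,
                     min_time_limit: float = 0.2,
                     gt_time_limit_factor: float = 4.0) -> float:
    """T = max(T_base, T_gt * k)."""
    return max(min_time_limit, gt_runtime * gt_time_limit_factor)

print(per_test_timeout(0.01))  # fast ground truth -> clamped to 0.2s
print(per_test_timeout(0.30))  # slow ground truth -> 1.2s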

PyPI: https://pypi.org/project/evalplus/0.1.6/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.6/images/sha256-5913b95172962ad61e01a5d5cf63b60e1140dd547f5acc40370af892275e777c

EvalPlus v0.1.5

02 Jun 08:19


🚀 HumanEval+[mini] -- 47x smaller while as effective as HumanEval+

  • Add --mini to evalplus.evaluate ... to use a minimal, best-quality set of extra tests and accelerate evaluation!
  • HumanEval+[mini] (avg 16.5 tests) is smaller than HumanEval+ (avg 774.8 tests) by 47x.
  • This is achieved via test-suite reduction -- we run a set-covering algorithm to preserve the same coverage (coverage analysis), mutant kills (mutation analysis), and sample kills (the pass-fail status of each sample-test pair); a toy sketch follows this list.
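
As an illustration only (not the actual reduction pipeline), a greedy set-cover pass over per-test "signals" (covered code, killed mutants, killed samples) looks roughly like this; the test ids and signal labels are made up:

def greedy_reduce(test_signals):
    """test_signals maps each test id to the set of signals it preserves."""
    needed = set().union(*test_signals.values())
    kept, covered = [], set()
    while covered != needed:
        # pick the test that adds the most not-yet-covered signals
        best = max(test_signals, key=lambda t: len(test_signals[t] - covered))
        kept.append(best)
        covered |= test_signals[best]
    return kept

print(greedy_reduce({
    "t1": {"branch:1", "mutant:a"},
    "t2": {"branch:1", "sample:x"},
    "t3": {"mutant:a", "sample:x", "branch:2"},
}))  # ['t3', 't1'] -- two tests preserve every signal of the full suite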


PyPI: https://pypi.org/project/evalplus/0.1.5/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.5/images/sha256-01ef3275ab02776e94edd4a436a3cd33babfaaf7a81e7ae44f895c2794f4c104

EvalPlus v0.1.4

12 May 05:02


  • Performance:
    • Lazy loading of the cache of evaluation results
    • Use ProcessPoolExecutor over ThreadPoolExecutor
    • Caching ground-truth outputs
  • Observability:
    • A concurrent logger reports when a task gets stuck for 10s
  • New models:
    • CodeGen2 (infill)
    • StarCoder (infill)
  • Fixes:
    • Deterministic hashing of input problem

PyPI: https://pypi.org/project/evalplus/0.1.4/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.4/images/sha256-a0ea8279c71afa9418808326412b1e5cd11f44b3b59470477ecf4ba999d4b73a

EvalPlus v0.1.3

09 May 18:33


EvalPlus v0.1.2

07 May 05:24


  • Fix a bug induced by using --base-only
  • Build the Docker image locally instead of simply doing a pip install

PyPI: https://pypi.org/project/evalplus/0.1.2/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.2/images/sha256-747ae02f0bfbd300c0205298113006203d984373e6ab6b8fb3048626f41dbe08

EvalPlus v0.1.1

07 May 05:21


In this version, efforts mainly went into sanitizing and standardizing the code in evalplus. Most importantly, evalplus now strictly follows the dataset usage style of HumanEval. As a result, users can use evalplus like this:

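The screenshot that originally appeared here showed the HumanEval-style workflow; a sketch along the lines of the MBPP+ example in the v0.2.0 notes above, where GEN_SOLUTION is a hypothetical stand-in for your model call and the exact sample field name may differ by version:

from evalplus.data import get_human_eval_plus, write_jsonl

def GEN_SOLUTION(prompt: str) -> str:
    # Hypothetical: query your LLM to produce a solution for the prompt
    ...

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)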

For more details, the main changes are (tracked in #1):

  • Package build and pypi setup
  • (HumanEval Compatibility) Support sample files as .jsonl
  • (HumanEval Compatibility) get_human_eval_plus() returns a dict instead of list
  • (HumanEval Compatibility) Use HumanEval task ID splitter "/" over "_"
  • Optimize the evaluation parallelism scheme to the sample-level granularity (original: task level)
  • Optimize IPC via shared memory
  • Remove ground-truth solutions to avoid data leakage
  • Use Docker as the sandboxing mechanism
  • Support CodeGen2 in generation
  • Split dependency into multiple categories

PyPI: https://pypi.org/project/evalplus/0.1.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.1.1/images/sha256-4993a0dc0ec13d6fe88eb39f94dd0a927e1f26864543c8c13e2e8c5d5c347af0