A repository for community-driven evaluation results for open source models on HuggingFace Hub.
This project adds structured evaluation results to model repositories using the new .eval_results/ format, enabling:
- Benchmark Leaderboards: Results appear on both model pages and benchmark dataset leaderboards
- Community Contributions: Anyone can submit evaluation results via Pull Request
- Verified Results: Peer-reviewed and verified evaluation results
| Benchmark | Hub Dataset ID |
|---|---|
| HLE | cais/hle |
| GPQA | Idavidrein/gpqa |
| MMLU-Pro | TIGER-Lab/MMLU-Pro |
| GSM8K | openai/gsm8k |
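For scripting against these benchmarks, the same mapping can be kept as a plain dictionary (a hypothetical helper shown for illustration; it is not part of the repository's tooling):

```python
# Benchmark name -> Hub dataset ID, mirroring the table above.
BENCHMARK_DATASETS = {
    "HLE": "cais/hle",
    "GPQA": "Idavidrein/gpqa",
    "MMLU-Pro": "TIGER-Lab/MMLU-Pro",
    "GSM8K": "openai/gsm8k",
}
```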
# Set environment variables (or add to .env file)
export HF_TOKEN="your-huggingface-token"  # Required: write access for PRs
# Preview (dry run)
uv run scripts/evaluation_manager.py add-eval \
--benchmark HLE \
--repo-id "moonshotai/Kimi-K2-Thinking"
# Create PR
uv run scripts/evaluation_manager.py add-eval \
--benchmark HLE \
--repo-id "moonshotai/Kimi-K2-Thinking" \
--create-pr

# Preview top 10 trending LLMs
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --dry-run
# Create PRs for models with scores
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE

# From model card (default)
uv run scripts/evaluation_manager.py add-eval \
--benchmark HLE \
--repo-id "model/name"
# From Artificial Analysis
uv run scripts/evaluation_manager.py add-eval \
--benchmark GPQA \
--repo-id "model/name" \
--source aa

Sources:
- model_card (default): Extract from README tables
- aa: Query Artificial Analysis API (requires AA_API_KEY)
- papers: HuggingFace Papers (not yet implemented)
Always check for existing PRs before creating new ones:
uv run scripts/evaluation_manager.py get-prs --repo-id "model/name"
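The same check can also be done directly with the huggingface_hub library, which exposes repo discussions (pull requests included). This is a sketch alongside the get-prs subcommand, not a replacement for it:

```python
from huggingface_hub import HfApi

api = HfApi()  # reads HF_TOKEN from the environment

# Hub discussions cover both issues and pull requests; keep open PRs only.
open_prs = [
    d
    for d in api.get_repo_discussions(repo_id="model/name", repo_type="model")
    if d.is_pull_request and d.status == "open"
]
for pr in open_prs:
    print(f"#{pr.num}: {pr.title}")
```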
# Create PR (someone else's model)
uv run scripts/evaluation_manager.py add-eval \
--benchmark HLE \
--repo-id "other-user/their-model" \
--create-pr
# Push directly (model authors)
uv run scripts/evaluation_manager.py add-eval \
--benchmark HLE \
--repo-id "your-username/your-model" \
--apply

For models with evaluation tables in their README:
# 1. Inspect available tables
uv run scripts/evaluation_manager.py inspect-tables --repo-id "model/name"

This will list the available tables and their indices, which you can then use to extract the evaluations.
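As a rough illustration of what table extraction involves, the sketch below groups pipe-delimited markdown rows from a README into tables; the actual inspect-tables implementation may parse them differently:

```python
import re

def collect_markdown_tables(readme_text: str) -> list[list[list[str]]]:
    """Group consecutive pipe-delimited rows into tables of cell lists."""
    tables, current = [], []
    for line in readme_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip separator rows such as |---|---|.
            if not all(re.fullmatch(r":?-+:?", c) for c in cells):
                current.append(cells)
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

for i, table in enumerate(collect_markdown_tables(open("README.md").read())):
    print(f"Table {i}: header = {table[0]}")
```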
This repository includes a Claude Code skill at .claude/skills/hugging-face-evaluation/ for automated evaluation management.
Add specific benchmark to a model:
Add the HLE evaluation results to moonshotai/Kimi-K2-Thinking from the model card
Batch process trending models:
Find the top 20 trending text-generation models and add HLE scores from Model Cards. Do a dry run first.
Find the top 20 trending text-generation models and add HLE scores from Artificial Analysis. Do a dry run first.
Check and add multiple benchmarks:
For meta-llama/Llama-3.1-8B-Instruct, check the model card for GPQA and MMLU-Pro scores and create PRs if found.
Extract from README tables:
Inspect the evaluation tables in deepseek-ai/DeepSeek-V3 and extract all benchmark scores to .eval_results/ format
- Check for existing PRs before creating new ones
- Preview first (dry run) to verify scores are found
- Use the appropriate source (model_card for README tables, aa for Artificial Analysis)
- Create PRs for models you don't own, push directly for your own
User: Add HLE score to MiniMaxAI/MiniMax-M2.1 from Artificial Analysis
Agent:
1. Checks for existing open PRs on MiniMaxAI/MiniMax-M2.1
2. Looks up HLE score from Artificial Analysis API
3. Finds: HLE = 22.2%
4. Creates PR with .eval_results/hle.yaml containing the score
5. Returns PR URL to user
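Under the hood, step 4 maps onto the Hub commit API. The sketch below assumes the script uses huggingface_hub (an assumption; the actual implementation may differ) and mirrors the YAML format documented in the next section:

```python
from huggingface_hub import CommitOperationAdd, HfApi

yaml_body = """\
- dataset:
    id: cais/hle
  value: 22.2
  date: "2026-01-14"
  source:
    url: https://artificialanalysis.ai
    name: Artificial Analysis
"""

api = HfApi()  # reads HF_TOKEN from the environment
commit = api.create_commit(
    repo_id="MiniMaxAI/MiniMax-M2.1",
    repo_type="model",
    operations=[
        CommitOperationAdd(
            path_in_repo=".eval_results/hle.yaml",
            path_or_fileobj=yaml_body.encode(),
        )
    ],
    commit_message="Add HLE evaluation results",
    create_pr=True,  # open a PR instead of pushing to main
)
print(commit.pr_url)  # step 5: the PR URL returned to the user
```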
Results are stored as YAML files in .eval_results/:
# .eval_results/hle.yaml
- dataset:
    id: cais/hle      # Hub Benchmark dataset ID
    task_id: default  # Optional: specific task/leaderboard
  value: 22.2         # Metric value
  date: "2026-01-14"  # ISO-8601 date
  source:             # Attribution
    url: https://artificialanalysis.ai
    name: Artificial Analysis

See the HuggingFace Eval Results Documentation for the full format specification.
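As a quick sanity check of the fields above, the file can be loaded with PyYAML (an illustrative snippet, not part of the repository's tooling):

```python
import yaml

with open(".eval_results/hle.yaml") as f:
    entries = yaml.safe_load(f)

for entry in entries:
    # dataset.id and value carry the benchmark identity and score shown above.
    assert "id" in entry["dataset"], "missing Hub benchmark dataset ID"
    assert isinstance(entry["value"], (int, float)), "metric value must be numeric"
    print(entry["dataset"]["id"], entry["value"])
```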
Results from batch runs are saved to the runs/ directory:
runs/
├── hle_20260114_abc123.json
├── gpqa_20260114_def456.json
└── ...
Each file contains:
{
  "benchmark": "HLE",
  "source": "aa",
  "source_url": "https://artificialanalysis.ai",
  "created": "2026-01-14T08:00:00Z",
  "results": [
    {
      "repo_id": "MiniMaxAI/MiniMax-M2.1",
      "value": 22.2,
      "status": "pr_created",
      "source_url": "https://artificialanalysis.ai"
    }
  ]
}

Status values: pr_created, uploaded, not_found, dry_run, error
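To summarize past batch runs, the status field can be tallied across the saved files (a small sketch that assumes the filenames and fields shown above):

```python
import json
from collections import Counter
from pathlib import Path

# Tally result statuses (pr_created, uploaded, not_found, dry_run, error)
# across every saved batch run.
counts = Counter()
for run_file in Path("runs").glob("*.json"):
    run = json.loads(run_file.read_text())
    counts.update(result["status"] for result in run["results"])

for status, n in counts.most_common():
    print(f"{status}: {n}")
```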
