
Tables & Resources

This page contains statistical tables and resources from our comprehensive survey on Issue Resolution in Software Engineering.


Evaluation & Training Datasets

A comprehensive survey and statistical overview of issue resolution datasets. We categorize each dataset by programming language, modality support, number of source repositories (Repos), data scale (Amount), and the availability of a reproducible execution environment.

| Dataset | Language | Multimodal | Repos | Amount | Environment | Link |
| --- | --- | --- | --- | --- | --- | --- |
| **Single-PL Datasets** | | | | | | |
| SWE-Fixer | Python | No | 856 | 115,406 | No | GitHub, HuggingFace, HuggingFace |
| SWE-smith | Python | No | 128 | 50k | Yes | GitHub, HuggingFace |
| SWE-Lego | Python | No | 3,251 | 32,119 | Yes | GitHub, HuggingFace |
| SWE-rebench | Python | No | 3,468 | 21,336 | Yes | GitHub, HuggingFace |
| SWE-bench-train | Python | No | 37 | 19k | No | GitHub, HuggingFace |
| SWE-Flow | Python | No | 74 | 18,081 | Yes | GitHub |
| Skywork-SWE | Python | No | 2,531 | 10,169 | Yes | - |
| R2E-Gym | Python | No | 10 | 8,135 | Yes | GitHub, HuggingFace |
| RepoForge | Python | No | - | 7.3k | Yes | - |
| SWE-bench-extra | Python | No | 2k | 6.38k | Yes | HuggingFace |
| SWE-Gym | Python | No | 11 | 2,438 | Yes | GitHub, HuggingFace |
| SWE-bench | Python | No | 12 | 2,294 | Yes | GitHub, HuggingFace |
| SWE-bench-java | Java | No | 19 | 1,797 | Yes | GitHub, HuggingFace |
| FEA-bench | Python | No | 83 | 1,401 | Yes | GitHub, HuggingFace |
| SWE-bench-Live | Python | No | 164 | 1,565 | Yes | GitHub, HuggingFace |
| Loc-Bench | Python | No | - | 560 | No | GitHub, HuggingFace |
| SWE-bench Verified | Python | No | - | 500 | Yes | GitHub, HuggingFace |
| SWE-bench Lite | Python | No | 12 | 300 | Yes | GitHub, HuggingFace |
| SWE-MERA | Python | No | 200 | 300 | Yes | GitHub, HuggingFace |
| SWE-Bench-CL | Python | No | 8 | 273 | Yes | GitHub |
| SWE-Sharp-Bench | C# | No | 17 | 150 | Yes | GitHub, HuggingFace |
| SWE-Perf | Python | No | 12 | 140 | Yes | GitHub, HuggingFace |
| Visual SWE-bench | Python | Yes | 11 | 133 | Yes | GitHub, HuggingFace |
| SWE-EVO | Python | No | 7 | 48 | Yes | GitHub |
| **Multi-PL Datasets** | | | | | | |
| SWE-Mirror | Python, Rust, Go | No | 40 | 60k | Yes | - |
| Multi-SWE-bench | Java, JS, TS, Go, Rust, C, C++ | No | 76 | 4,723 | Yes | GitHub, HuggingFace |
| Swing-Bench | Python, Go, C++, Rust | No | 400 | 2,300 | Yes | - |
| SWE-PolyBench | Python, Java, JS, TS | No | 21 | 2,110 | Yes | GitHub, HuggingFace, HuggingFace |
| SWE-Compass | Python, JS, TS, Java, C, C++, Go, Rust, Kotlin, C# | No | - | 2,000 | Yes | GitHub, HuggingFace |
| SWE-Bench Pro | Python, Go, TS | No | 41 | 1,865 | Yes | GitHub, HuggingFace |
| SWE-bench++ | Python, Go, TS, JS, Ruby, PHP, Java, Rust, C++, C#, C | No | 3,971 | 1,782 | Yes | GitHub, HuggingFace |
| SWE-Lancer | JS, TS | No | - | 1,488 | Yes | GitHub |
| OmniGIRL | Python, TS, Java, JS | Yes | 15 | 959 | Yes | GitHub, HuggingFace |
| SWE-bench Multimodal | JS, TS, HTML, CSS | Yes | 17 | 619 | Yes | GitHub, HuggingFace |
| SWE-fficiency | Python, Cython | No | 9 | 498 | Yes | GitHub |
| SWE-Factory | Python, Java, JS, TS | No | 12 | 430 | Yes | GitHub, HuggingFace |
| SWE-bench-Live-MultiLang & Windows | Python, JS, TS, C, C++, C#, Java, Go, Rust | No | 238 | 418 | Yes | GitHub, HuggingFace, HuggingFace |
| SWE-bench Multilingual | C, C++, Go, Java, JS, TS, Rust, Python, Ruby, PHP | No | 42 | 300 | Yes | GitHub, HuggingFace |
| SWE-InfraBench | Python, TS | No | - | 100 | Yes | - |
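The Amount column mixes exact counts (e.g., 115,406) with rounded figures (e.g., 50k, 6.38k). When aggregating these tables programmatically, a small normalizer avoids comparing strings; a minimal sketch (the helper name is ours, not from any dataset's tooling):

```python
def parse_amount(s: str) -> int:
    """Convert amount strings like '115,406', '50k', or '6.38k' to integers."""
    s = s.strip().lower().replace(",", "")
    if s.endswith("k"):
        # Rounded figures such as '6.38k' become approximate integer counts.
        return int(float(s[:-1]) * 1000)
    return int(s)

# Examples drawn from the table above.
print(parse_amount("115,406"))  # 115406
print(parse_amount("50k"))      # 50000
print(parse_amount("6.38k"))    # 6380
```

Note that "k"-suffixed values are approximations reported by the original papers, so totals computed this way are only indicative.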

Training Trajectory Datasets

A survey of trajectory datasets used for agent training or analysis. We list the programming language, number of source repositories, and total trajectories for each dataset.

| Dataset | Language | Repos | Amount | Link |
| --- | --- | --- | --- | --- |
| SWE-Fixer | Python | 856 | 69,752 | GitHub, HuggingFace |
| SWE-rebench | Python | 1,823 | 67,074 | HuggingFace |
| R2E-Gym | Python | 10 | 3,321 | GitHub, HuggingFace |
| SWE-Synth | Python | 11 | 3,018 | GitHub, HuggingFace |
| SWE-Factory | Python | 10 | 2,809 | GitHub, HuggingFace |
| SWE-Gym | Python | 11 | 491 | GitHub, HuggingFace |
| SWE-Lego | Python | 3,251 | 14.6k | GitHub |

SFT-based Methods

Overview of SFT-based methods for issue resolution. This table categorizes models by their base model and training scaffold, sorted by resolution rate (Res. %).

| Model Name | Base Model | Size | Arch. | Training Scaffold | Res. (%) | Code | Data | Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-rebench-openhands-Qwen3-235B-A22B | Qwen3-235B-A22B | 235B-A22B | MoE | OpenHands | 59.9 | - | HuggingFace | HuggingFace |
| SWE-Lego-Qwen3-32B | Qwen3-32B | 32B | Dense | OpenHands | 57.6 | GitHub | HuggingFace | HuggingFace |
| CGM-SWE-PY | Qwen2.5-Coder-72B | 72B | Dense | Graph RAG | 50.4 | GitHub | - | HuggingFace |
| SWE-rebench-openhands-Qwen3-30B-A3B | Qwen3-30B-A3B | 30B-A3B | MoE | OpenHands | 49.7 | - | HuggingFace | HuggingFace |
| Devstral | Mistral Small 3 | 22B | Dense | OpenHands | 46.8 | - | - | HuggingFace |
| Co-PatcheR | Qwen2.5-Coder-14B | 3×14B | Dense | PatchPilot-mini | 46.0 | GitHub | - | HuggingFace |
| SWE-Swiss-32B | Qwen2.5-32B-Instruct | 32B | Dense | Agentless | 45.0 | GitHub | HuggingFace | HuggingFace |
| SWE-Lego-Qwen3-8B | Qwen3-8B | 8B | Dense | OpenHands | 44.4 | GitHub | HuggingFace | HuggingFace |
| Lingma SWE-GPT | Qwen2.5-72B-Instruct | 72B | Dense | SWESynInfer | 30.2 | GitHub | - | - |
| SWE-Gym-Qwen-32B | Qwen2.5-Coder-32B | 32B | Dense | OpenHands, MoatlessTools | 20.6 | GitHub | - | HuggingFace |
| SWE-Gym-Qwen-14B | Qwen2.5-Coder-14B | 14B | Dense | OpenHands, MoatlessTools | 16.4 | GitHub | - | HuggingFace |
| SWE-Gym-Qwen-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands, MoatlessTools | 10.6 | GitHub | - | HuggingFace |

RL-based Methods

A comprehensive overview of RL-trained models for issue resolution, grouped by parameter size. The table lists each model's base architecture, the training scaffold used for rollouts, the type of reward signal employed (Outcome vs. Process), and performance (Res. %) on issue resolution benchmarks.

| Model Name | Base Model | Size | Arch. | Train. Scaffold | Reward | Res. (%) | Code | Data | Model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **560B Models (MoE)** | | | | | | | | | |
| LongCat-Flash-Think | LongCat-Flash-Base | 560B-A27B | MoE | R2E-Gym | Outcome | 60.4 | GitHub | - | HuggingFace |
| **72B Models** | | | | | | | | | |
| Kimi-Dev | Qwen2.5-72B-Base | 72B | Dense | BugFixer + TestWriter | Outcome | 60.4 | GitHub | - | HuggingFace |
| SWE-RL | Llama-3.3-70B-Instruct | 70B | Dense | Agentless-mini | Outcome | 41.0 | GitHub | - | - |
| Multi-turn RL (Nebius) | Qwen2.5-72B-Instruct | 72B | Dense | SWE-agent | Outcome | 39.0 | - | - | - |
| Agent-RLVR-RM-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 27.8 | - | - | - |
| Agent-RLVR-72B | Qwen2.5-Coder-72B | 72B | Dense | Localization + Repair | Outcome | 22.4 | - | - | - |
| **32B Models** | | | | | | | | | |
| OpenHands Critic | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | - | 66.4 | GitHub | - | HuggingFace |
| KAT-Dev-32B | Qwen3-32B | 32B | Dense | - | - | 62.4 | - | - | HuggingFace |
| SWE-Swiss-32B | Qwen2.5-32B-Instruct | 32B | Dense | - | Outcome | 60.2 | GitHub | HuggingFace | HuggingFace |
| FoldAgent | Seed-OSS-36B-Instruct | 36B | Dense | FoldAgent | Process | 58.0 | GitHub | - | - |
| SeamlessFlow-32B | Qwen3-32B | 32B | Dense | SWE-agent | Outcome | 45.8 | GitHub | - | - |
| DeepSWE | Qwen3-32B | 32B | Dense | R2E-Gym | Outcome | 42.2 | GitHub | HuggingFace | HuggingFace |
| SA-SWE-32B | - | 32B | Dense | SkyRL-Agent | - | 39.4 | - | - | - |
| OpenHands LM v0.1 | Qwen2.5-Coder-32B | 32B | Dense | SWE-Gym | - | 37.2 | GitHub | - | HuggingFace |
| SWE-Dev-32B | Qwen2.5-Coder-32B | 32B | Dense | OpenHands | Outcome | 36.6 | GitHub | - | HuggingFace |
| Satori-SWE | Qwen2.5-Coder-32B | 32B | Dense | Retriever + Code editor | Outcome | 35.8 | GitHub | HuggingFace | HuggingFace |
| SoRFT-32B | Qwen2.5-Coder-32B | 32B | Dense | Agentless | Outcome | 30.8 | - | - | - |
| Agent-RLVR-32B | Qwen2.5-Coder-32B | 32B | Dense | Localization + Repair | Outcome | 21.6 | - | - | - |
| **14B Models** | | | | | | | | | |
| Agent-RLVR-14B | Qwen2.5-Coder-14B | 14B | Dense | Localization + Repair | Outcome | 18.0 | - | - | - |
| SEAlign-14B | Qwen2.5-Coder-14B | 14B | Dense | OpenHands | Process | 17.7 | - | - | - |
| **7–9B Models** | | | | | | | | | |
| SeamlessFlow-8B | Qwen3-8B | 8B | Dense | SWE-agent | Outcome | 27.4 | GitHub | - | - |
| SWE-Dev-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Outcome | 23.4 | GitHub | - | HuggingFace |
| SoRFT-7B | Qwen2.5-Coder-7B | 7B | Dense | Agentless | Outcome | 21.4 | - | - | - |
| SWE-Dev-8B | Llama-3.1-8B | 8B | Dense | OpenHands | Outcome | 18.0 | GitHub | - | HuggingFace |
| SEAlign-7B | Qwen2.5-Coder-7B | 7B | Dense | OpenHands | Process | 15.0 | - | - | - |
| SWE-Dev-9B | GLM-4-9B | 9B | Dense | OpenHands | Outcome | 13.6 | GitHub | - | HuggingFace |
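The Reward column distinguishes outcome supervision (a single sparse signal per trajectory, typically based on whether the final patch makes the repository's tests pass) from process supervision (dense per-step signals over the agent's trajectory). A minimal sketch of the distinction, with illustrative function names and a mean aggregation of our own choosing (actual methods differ in how they score and aggregate steps):

```python
def outcome_reward(tests_passed: bool) -> float:
    """Outcome supervision: one sparse, binary signal per trajectory."""
    return 1.0 if tests_passed else 0.0

def process_reward(step_scores: list[float]) -> float:
    """Process supervision: dense per-step scores, aggregated here by mean."""
    return sum(step_scores) / len(step_scores) if step_scores else 0.0

print(outcome_reward(True))             # 1.0
print(process_reward([0.2, 0.8, 1.0]))  # ≈ 0.67
```

Outcome rewards are easy to verify but sparse; process rewards give denser credit assignment at the cost of needing a per-step scoring model or heuristic.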

General Foundation Models

Overview of general foundation models evaluated on issue resolution. The table lists the inference scaffold (e.g., OpenHands, Agentless) used during evaluation to obtain each reported result.

| Model Name | Size | Arch. | Inf. Scaffold | Reward | Res. (%) | Code | Model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KAT-Coder | - | - | Claude Code | Outcome | 73.4 | - | Website |
| MiMo-V2-Flash | 309B-A15B | MoE | Agentless | Outcome | 73.4 | GitHub | HuggingFace |
| DeepSeek V3.2 | 671B-A37B | MoE | Claude Code, RooCode | - | 73.1 | GitHub | HuggingFace |
| Kimi-K2-Instruct | 1T | MoE | Agentless | Outcome | 71.6 | - | HuggingFace |
| Qwen3-Coder | 480B-A35B | MoE | OpenHands | Outcome | 69.6 | GitHub | HuggingFace |
| GLM-4.6 | 355B-A32B | MoE | OpenHands | Outcome | 68.0 | - | HuggingFace |
| gpt-oss-120b | 116.8B-A5.1B | MoE | Internal tool | Outcome | 62.0 | GitHub | HuggingFace |
| MiniMax M2 | 230B-A10B | MoE | R2E-Gym | Outcome | 61.0 | GitHub | HuggingFace |
| gpt-oss-20b | 20.9B-A3.6B | MoE | Internal tool | Outcome | 60.0 | GitHub | HuggingFace |
| GLM-4.5-Air | 106B-A12B | MoE | OpenHands | Outcome | 57.6 | - | - |
| MiniMax M1-80k | 456B-A45.9B | MoE | Agentless | Outcome | 56.0 | GitHub | Website |
| MiniMax M1-40k | 456B-A45.9B | MoE | Agentless | Outcome | 55.6 | GitHub | Website |
| Seed1.5-Thinking | 200B-A20B | MoE | - | Outcome | 47.0 | GitHub | - |
| Llama 4 Maverick | 400B-A17B | MoE | mini-SWE-agent | Outcome | 21.0 | GitHub | HuggingFace |
| Llama 4 Scout | 109B-A17B | MoE | mini-SWE-agent | Outcome | 9.1 | GitHub | HuggingFace |
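Across all of the method tables above, Res. (%) is the resolved rate: the percentage of benchmark instances for which the generated patch passes the held-out tests. A minimal sketch of the computation (the function name and the rounding to one decimal are our conventions, not prescribed by any benchmark):

```python
def resolved_rate(n_resolved: int, n_total: int) -> float:
    """Percentage of benchmark instances resolved, rounded to one decimal."""
    return round(100 * n_resolved / n_total, 1)

# e.g., a hypothetical run resolving 300 of 500 SWE-bench Verified instances:
print(resolved_rate(300, 500))  # 60.0
```

Because benchmarks differ in size (SWE-bench Lite has 300 instances, SWE-bench Verified 500), rates from different tables are not directly comparable unless they were measured on the same benchmark.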