Agent Benchmarks
Standardized benchmarks for measuring AI agent capabilities. From software engineering to web navigation, these benchmarks define the state of the art.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Agent Benchmark Landscape                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE               WEB & NAVIGATION                             │
│  ┌────────────────────┐        ┌────────────────────┐                       │
│  │ SWE-bench          │        │ WebArena           │                       │
│  │ HumanEval          │        │ MiniWoB++          │                       │
│  │ MBPP               │        │ Mind2Web           │                       │
│  └────────────────────┘        └────────────────────┘                       │
│                                                                             │
│  GENERAL ASSISTANT             TOOL & FUNCTION                              │
│  ┌────────────────────┐        ┌────────────────────┐                       │
│  │ GAIA               │        │ τ-bench            │                       │
│  │ AgentBench         │        │ API-Bank           │                       │
│  │ AssistantBench     │        │ ToolBench          │                       │
│  └────────────────────┘        └────────────────────┘                       │
│                                                                             │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│  Human Baseline  ████████████████████████████████████████████ 78-95%       │
│  Current SOTA    ██████████████████████ 40-72%                              │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Current State of the Art (2025)
Human vs AI Gap
| Benchmark | Top Score | Top Agent | Human Baseline | Gap |
|---|---|---|---|---|
| SWE-bench Verified | 72% | Claude 3.5 + Devin | ~95% | 23% |
| WebArena | 42% | GPT-4V + SoM | 78% | 36% |
| GAIA Level 1 | 75% | Various | 92% | 17% |
| GAIA Level 2 | 60% | Various | 92% | 32% |
| GAIA Level 3 | 40% | Various | 92% | 52% |
| τ-bench Retail | 45% | GPT-4 | ~94% | 49% |
| HumanEval | 95%+ | Various | ~95% | ~0% |
Current state of the art vs human baseline
Benchmark Details
SWE-bench Verified
The gold standard for code agents: 500 human-verified GitHub issues drawn from 12 Python repositories. The agent must generate a patch that resolves the issue, judged by running the repository's test suite.
Metrics: resolved rate, the percentage of issues where the generated patch makes the failing tests pass without breaking existing tests.
SWE-bench Full
Complete dataset of 2,294 real GitHub issues. More diverse but noisier than the Verified subset.
Metrics: resolved rate, as for the Verified subset.
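Both SWE-bench variants are scored by applying the agent's patch in a clean environment and running the repository's tests. Below is a minimal sketch of the predictions file the official harness consumes, with the harness invocation shown as a comment; field and flag names follow recent swebench releases, so verify them against the version you install.
# Sketch: write predictions in the JSONL format the SWE-bench harness expects.
# Field names follow recent swebench releases; verify against your installed version.
import json

predictions = [
    {
        "instance_id": "django__django-11039",   # task id from the dataset
        "model_name_or_path": "my-agent-v1",      # free-form label for your agent
        "model_patch": "diff --git a/...",        # unified diff produced by the agent
    },
]

with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# Then evaluate with the official harness (exact flags vary by version):
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl --max_workers 8 --run_id my_run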
WebArena
Web navigation benchmark with 812 tasks across 5 self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia).
Metrics: task success rate, judged programmatically from the final site state or the agent's answer rather than from the action sequence.
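The checker below is purely illustrative of that execution-based idea; it is not WebArena's actual task-config format, and the URL fragment and order-number pattern are invented.
# Illustrative execution-based check, not WebArena's real evaluation config.
import re

def check_shopping_task(final_url: str, final_page_text: str, agent_answer: str) -> bool:
    # Hypothetical task: "place an order and report the order number"
    reached_confirmation = "/checkout/success" in final_url     # state check
    match = re.search(r"Order #(\d+)", final_page_text)         # content check
    return reached_confirmation and bool(match) and match.group(1) in agent_answer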
GAIA
General AI Assistant benchmark. 466 questions requiring web search, file processing, and multi-step reasoning, split across three difficulty levels.
Metrics: accuracy via exact match against ground-truth answers, reported per difficulty level.
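A simplified sketch of that normalized exact-match scoring; GAIA's official scorer handles lists and number formats more carefully.
# Simplified sketch of normalized exact-match scoring; the official GAIA scorer
# is stricter (lists, numeric tolerance, punctuation rules).
def normalize(answer: str) -> str:
    return " ".join(answer.strip().lower().replace(",", "").split())

def score_answer(model_answer: str, ground_truth: str) -> bool:
    return normalize(model_answer) == normalize(ground_truth)

assert score_answer(" 1,234 ", "1234")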
τ-bench (Tau-bench)
Multi-turn conversations requiring tool use across airline and retail customer-service domains, with simulated APIs and a simulated user.
Metrics: task success rate and pass^k, the probability that the agent solves the same task in all of k independent trials.
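Because agents are stochastic, pass^k rewards reliability rather than one lucky run. A sketch of the standard estimator from repeated trials, assuming n trials per task with c successes:
# Sketch: estimate pass^k (all k independent trials succeed) from n trials per task.
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of P(k out of k trials succeed), given c successes in n >= k trials."""
    if k > n:
        raise ValueError("need at least k trials per task")
    return comb(c, k) / comb(n, k)

print(pass_hat_k(n=4, c=3, k=1))  # 0.75 -- ordinary success rate
print(pass_hat_k(n=4, c=3, k=2))  # 0.50 -- reliability drops when requiring 2 for 2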
AgentBench
Multi-dimensional evaluation across 8 environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing.
Metrics: per-environment success rate or reward, combined into an overall score.
HumanEval
Code generation benchmark. 164 Python programming problems testing functional correctness.
Metrics: pass@k (most commonly pass@1), measuring functional correctness against the provided unit tests.
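pass@k is usually computed with the unbiased estimator from the original Codex paper: sample n completions per problem, count the c that pass, and estimate the chance that at least one of k samples would pass.
# Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k)
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n samples pass."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=3, k=1))  # 0.15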
MBPP
Mostly Basic Python Problems. 974 entry-level programming tasks.
Metrics: pass@1 against the included test cases.
Evaluation Methodology
Proper benchmarking requires reproducible methodology. Key principles:
Environment Isolation
Run evaluations in containerized environments (Docker) with pinned dependencies. This ensures reproducibility across runs and machines.
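A minimal sketch of what that looks like in practice, launching each task's evaluation in a disposable container; the image tag, mount path, and entry script are placeholders.
# Sketch: evaluate one task inside a throwaway container with pinned dependencies.
# Image tag, mount path, and entry script are placeholders for your own harness.
import subprocess

def run_in_container(task_id: str, workdir: str) -> int:
    cmd = [
        "docker", "run", "--rm",          # container is discarded after the run
        "--network", "none",              # no outside network unless the task needs it
        "--memory", "4g", "--cpus", "2",  # bounded resources for comparable runs
        "-v", f"{workdir}:/workspace",    # mount the task workspace
        "agent-eval:pinned-2025-01",      # placeholder image with pinned dependencies
        "python", "/workspace/run_task.py", task_id,
    ]
    return subprocess.run(cmd).returncode  # non-zero exit => evaluation failed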
Deterministic Settings
Use temperature=0 and fixed seeds where possible, and document any settings that cannot be made deterministic.
Confidence Intervals
Report 95% confidence intervals, not just point estimates. Small test sets can have wide intervals.
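For intuition, here is the same Wilson score interval used in the runner implementation below, applied to an observed 52% accuracy on 100 versus 500 tasks:
# Wilson score interval (95%) -- the same formula the runner code below uses.
from math import sqrt

def wilson_ci(p: float, n: int, z: float = 1.96) -> tuple:
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    spread = z * sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denom
    return centre - spread, centre + spread

print(wilson_ci(0.52, 100))  # roughly (0.42, 0.62): a ~20-point-wide interval
print(wilson_ci(0.52, 500))  # roughly (0.48, 0.56): matches the schema example below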
Category Breakdown
Report performance by task category/difficulty. Aggregate scores hide important patterns.
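A small sketch of computing that breakdown from per-task results; the field names mirror the results schema shown later.
# Sketch: per-category accuracy, matching category_breakdown in the results schema.
from collections import defaultdict

def breakdown_by_category(results: list) -> dict:
    buckets = defaultdict(list)
    for r in results:  # each result: {"category": ..., "passed": ...}
        buckets[r["category"]].append(r["passed"])
    return {
        cat: {"passed": sum(v), "total": len(v), "accuracy": sum(v) / len(v)}
        for cat, v in buckets.items()
    }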
// Benchmark Evaluation Methodology
// Standard approach for running agent benchmarks
1. ENVIRONMENT SETUP
// Isolated, reproducible environment
create_docker_container(benchmark_image)
setup_dependencies(benchmark.requirements)
initialize_test_harness()
2. AGENT CONFIGURATION
configure_agent(
model = "agent_model_name",
max_tokens = 4096,
temperature = 0, // Deterministic for reproducibility
tools = benchmark.available_tools
)
3. EVALUATION LOOP
for each task in benchmark.tasks:
// Record start time
start_time = now()
// Agent attempts task
result = agent.execute(task.input)
// Capture metrics
metrics = {
"latency": now() - start_time,
"tokens_used": result.total_tokens,
"tool_calls": result.tool_call_count,
"trajectory": result.reasoning_trace
}
// Evaluate result
passed = benchmark.evaluate(result.output, task.expected)
record_result(task.id, passed, metrics)
4. AGGREGATE RESULTS
compute_accuracy()
compute_confidence_intervals()
generate_breakdown_by_category()
export_results(format="json")

Python implementation
import json
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional
from datetime import datetime
import numpy as np
from scipy import stats
@dataclass
class BenchmarkConfig:
name: str
version: str
agent_model: str
temperature: float = 0.0
max_tokens: int = 4096
timeout_seconds: int = 1800
@dataclass
class TaskResult:
task_id: str
passed: bool
latency_ms: float
tokens_used: int
tool_calls: int
trajectory: List[str]
output: str
error: Optional[str] = None
@dataclass
class BenchmarkResults:
config: BenchmarkConfig
results: List[TaskResult]
timestamp: str
total_time_seconds: float
@property
def accuracy(self) -> float:
return sum(r.passed for r in self.results) / len(self.results)
@property
def confidence_interval(self) -> tuple:
"""95% confidence interval using Wilson score."""
n = len(self.results)
p = self.accuracy
z = 1.96 # 95% CI
denominator = 1 + z**2/n
centre = (p + z**2/(2*n)) / denominator
spread = z * np.sqrt((p*(1-p) + z**2/(4*n))/n) / denominator
return (centre - spread, centre + spread)
def to_json(self) -> str:
return json.dumps({
"config": asdict(self.config),
"summary": {
"accuracy": self.accuracy,
"confidence_interval": self.confidence_interval,
"total_tasks": len(self.results),
"passed": sum(r.passed for r in self.results),
"avg_latency_ms": np.mean([r.latency_ms for r in self.results]),
"avg_tokens": np.mean([r.tokens_used for r in self.results]),
},
"results": [asdict(r) for r in self.results],
"timestamp": self.timestamp,
"total_time_seconds": self.total_time_seconds
}, indent=2)
class BenchmarkRunner:
def __init__(self, config: BenchmarkConfig, agent):
self.config = config
self.agent = agent
self.results: List[TaskResult] = []
async def run(self, tasks: List[Dict]) -> BenchmarkResults:
start = datetime.now()
for task in tasks:
result = await self._run_task(task)
self.results.append(result)
return BenchmarkResults(
config=self.config,
results=self.results,
timestamp=start.isoformat(),
total_time_seconds=(datetime.now() - start).total_seconds()
)
async def _run_task(self, task: Dict) -> TaskResult:
task_start = datetime.now()
try:
response = await self.agent.execute(
task["input"],
timeout=self.config.timeout_seconds
)
passed = self._evaluate(response.output, task["expected"])
return TaskResult(
task_id=task["id"],
passed=passed,
latency_ms=(datetime.now() - task_start).total_seconds() * 1000,
tokens_used=response.tokens_used,
tool_calls=len(response.tool_calls),
trajectory=response.reasoning_trace,
output=response.output
)
except Exception as e:
return TaskResult(
task_id=task["id"],
passed=False,
latency_ms=(datetime.now() - task_start).total_seconds() * 1000,
tokens_used=0,
tool_calls=0,
trajectory=[],
output="",
error=str(e)
)

C# implementation
using System.Diagnostics;
using System.Linq;
using System.Text.Json;

public record BenchmarkConfig(
string Name,
string Version,
string AgentModel,
double Temperature = 0.0,
int MaxTokens = 4096,
int TimeoutSeconds = 1800
);
public record TaskResult(
string TaskId,
bool Passed,
double LatencyMs,
int TokensUsed,
int ToolCalls,
List<string> Trajectory,
string Output,
string? Error = null
);
public class BenchmarkResults
{
public BenchmarkConfig Config { get; init; }
public List<TaskResult> Results { get; init; } = new();
public DateTime Timestamp { get; init; }
public double TotalTimeSeconds { get; init; }
public double Accuracy =>
(double)Results.Count(r => r.Passed) / Results.Count;
public (double Lower, double Upper) ConfidenceInterval
{
get
{
// Wilson score interval
int n = Results.Count;
double p = Accuracy;
double z = 1.96; // 95% CI
double denom = 1 + z * z / n;
double centre = (p + z * z / (2 * n)) / denom;
double spread = z * Math.Sqrt((p * (1 - p) + z * z / (4 * n)) / n) / denom;
return (centre - spread, centre + spread);
}
}
public string ToJson() => JsonSerializer.Serialize(new
{
Config,
Summary = new
{
Accuracy,
ConfidenceInterval,
TotalTasks = Results.Count,
Passed = Results.Count(r => r.Passed),
AvgLatencyMs = Results.Average(r => r.LatencyMs),
AvgTokens = Results.Average(r => r.TokensUsed)
},
Results,
Timestamp,
TotalTimeSeconds
}, new JsonSerializerOptions { WriteIndented = true });
}
public class BenchmarkRunner
{
private readonly BenchmarkConfig _config;
private readonly IAgent _agent;
private readonly List<TaskResult> _results = new();
public BenchmarkRunner(BenchmarkConfig config, IAgent agent)
{
_config = config;
_agent = agent;
}
public async Task<BenchmarkResults> RunAsync(List<BenchmarkTask> tasks)
{
var start = DateTime.UtcNow;
foreach (var task in tasks)
{
var result = await RunTaskAsync(task);
_results.Add(result);
}
return new BenchmarkResults
{
Config = _config,
Results = _results,
Timestamp = start,
TotalTimeSeconds = (DateTime.UtcNow - start).TotalSeconds
};
}
private async Task<TaskResult> RunTaskAsync(BenchmarkTask task)
{
var sw = Stopwatch.StartNew();
try
{
var response = await _agent.ExecuteAsync(
task.Input,
TimeSpan.FromSeconds(_config.TimeoutSeconds)
);
bool passed = Evaluate(response.Output, task.Expected);
return new TaskResult(
task.Id, passed, sw.ElapsedMilliseconds,
response.TokensUsed, response.ToolCalls.Count,
response.ReasoningTrace, response.Output
);
}
catch (Exception ex)
{
return new TaskResult(
task.Id, false, sw.ElapsedMilliseconds,
0, 0, new List<string>(), "", ex.Message
);
}
}
}

Results Schema
Standardized JSON schema for benchmark results enables comparison across agents and runs.
{
"config": {
"name": "SWE-bench Verified",
"version": "1.0",
"agent_model": "claude-3-sonnet-20240229",
"temperature": 0.0,
"max_tokens": 4096,
"tools_enabled": ["file_read", "file_write", "bash", "search"]
},
"summary": {
"accuracy": 0.52,
"confidence_interval": [0.48, 0.56],
"total_tasks": 500,
"passed": 260,
"failed": 240,
"avg_latency_ms": 45000,
"avg_tokens_per_task": 12500,
"total_cost_usd": 125.50
},
"category_breakdown": {
"django": { "passed": 45, "total": 85, "accuracy": 0.53 },
"flask": { "passed": 32, "total": 60, "accuracy": 0.53 },
"scikit-learn": { "passed": 28, "total": 55, "accuracy": 0.51 }
},
"results": [
{
"task_id": "django__django-11039",
"passed": true,
"latency_ms": 38500,
"tokens_used": 11200,
"tool_calls": 15,
"trajectory": ["search", "read", "read", "think", "write", "bash"],
"output": "patch content..."
}
],
"timestamp": "2025-01-26T10:30:00Z",
"total_time_seconds": 22500
}

Which Benchmark Should You Use?
┌─────────────────────────────────────────────────────────────────────────────┐
│                          Benchmark Selection Guide                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  What does your agent do?                                                   │
│                                                                             │
│  ├── Software engineering (code, bugs, PRs)?                                │
│  │   └──▶ SWE-bench (Verified for quality, Full for breadth)                │
│  │                                                                          │
│  ├── Code generation only?                                                  │
│  │   └──▶ HumanEval (quick), MBPP (broader), SWE-bench Lite (realistic)     │
│  │                                                                          │
│  ├── Web browsing and automation?                                           │
│  │   └──▶ WebArena (comprehensive), MiniWoB++ (simpler)                     │
│  │                                                                          │
│  ├── Tool/function calling?                                                 │
│  │   └──▶ τ-bench (multi-turn), ToolBench (single-turn)                     │
│  │                                                                          │
│  ├── General assistant tasks?                                               │
│  │   └──▶ GAIA (reasoning), AgentBench (multi-environment)                  │
│  │                                                                          │
│  └── Multiple capabilities?                                                 │
│      └──▶ AgentBench (8 environments), custom composite benchmark           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Common Pitfalls
Benchmark Overfitting
Optimizing specifically for benchmark patterns rather than general capability. Agents may game metrics without real improvement.
Data Contamination
LLMs trained on benchmark data have artificially inflated scores. Use held-out test sets and check for contamination.
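One crude screen is to measure n-gram overlap between benchmark tasks and whatever training or retrieval corpora you control; the sketch below is illustrative only and no substitute for a proper contamination audit.
# Illustrative 13-gram overlap screen; real decontamination pipelines are more thorough.
def ngram_overlap(task_text: str, corpus_text: str, n: int = 13) -> float:
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    task_ngrams = ngrams(task_text)
    if not task_ngrams:
        return 0.0
    return len(task_ngrams & ngrams(corpus_text)) / len(task_ngrams)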
Ignoring Cost
High scores achieved with 100x more tokens may not be practical. Always report cost alongside accuracy.
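Reporting cost can be as simple as folding token counts into dollars per task; the per-million-token rates below are placeholders, not real prices.
# Sketch: cost per task from token usage. Rates are placeholders -- substitute
# your provider's actual per-million-token pricing.
PRICE_PER_M_INPUT = 3.00    # placeholder: USD per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # placeholder: USD per 1M output tokens

def task_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# Report accuracy and cost together, e.g. "52% at $0.25/task" vs "55% at $4.10/task".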
Cherry-Picking
Running a benchmark many times and reporting only the best result, or selecting favorable subsets. Report all runs with confidence intervals.