Agent Benchmarks

Standardized benchmarks for measuring AI agent capabilities. From software engineering to web navigation, these benchmarks define how progress toward the state of the art is measured and compared.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Agent Benchmark Landscape                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CODE & SOFTWARE                    WEB & NAVIGATION                        │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ SWE-bench          │             │ WebArena           │                  │
│  │ HumanEval          │             │ MiniWoB++          │                  │
│  │ MBPP               │             │ Mind2Web           │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  GENERAL ASSISTANT                  TOOL & FUNCTION                         │
│  ┌────────────────────┐             ┌────────────────────┐                  │
│  │ GAIA               │             │ τ-bench            │                  │
│  │ AgentBench         │             │ API-Bank           │                  │
│  │ AssistantBench     │             │ ToolBench          │                  │
│  └────────────────────┘             └────────────────────┘                  │
│                                                                             │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━    │
│  Human Baseline ████████████████████████████████████████████ 78-95%         │
│  Current SOTA   ██████████████████████                       40-72%         │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Current State of the Art (2025)

Human vs AI Gap

The gap between human performance and AI agent performance remains significant on most benchmarks, especially those requiring long-horizon planning, error recovery, and real-world interaction.
Benchmark            Top Score   Top Agent             Human Baseline   Gap
SWE-bench Verified   72%         Claude 3.5 + Devin    ~95%             23%
WebArena             42%         GPT-4V + SoM          78%              36%
GAIA Level 1         75%         Various               92%              17%
GAIA Level 2         60%         Various               92%              32%
GAIA Level 3         40%         Various               92%              52%
τ-bench Retail       45%         GPT-4                 ~94%             49%
HumanEval            95%+        Various               ~95%             ~0%

Current state of the art vs human baseline

Benchmark Details

SWE-bench Verified

Category: Software Engineering | Difficulty: Hard | Top Score: 72%

The gold standard for code agents: 500 human-verified GitHub issues drawn from 12 Python repositories. The agent must generate a patch that makes the repository's test suite pass.

Metrics: Resolved Rate, Patch Validity, Test Pass Rate
Tasks: 500
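The Resolved Rate boils down to: apply the model's patch at the issue's base commit, then rerun the tests the gold patch is known to fix. The sketch below is illustrative only, not the official harness (which runs each instance in its own container and also checks that previously passing tests still pass); repo_dir and fail_to_pass are assumed inputs.

import subprocess
from typing import List

def resolves_issue(repo_dir: str, model_patch: str, fail_to_pass: List[str]) -> bool:
    # A task counts as resolved only if the patch applies cleanly and the
    # previously failing tests now pass.
    apply = subprocess.run(
        ["git", "apply", "-"], input=model_patch, text=True,
        cwd=repo_dir, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # malformed or non-applying patch

    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    return tests.returncode == 0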

SWE-bench Full

Category: Software Engineering | Difficulty: Hard | Top Score: 51%

The complete dataset of 2,294 real GitHub issues. More diverse, but noisier, than the Verified subset.

Metrics: Resolved Rate, Category Breakdown
Tasks: 2,294

WebArena

Category: Web Navigation | Difficulty: Hard | Top Score: 42%

Web navigation benchmark with 812 tasks across 5 self-hosted websites (shopping, Reddit, GitLab, OpenStreetMap, Wikipedia).

Metrics: Task Success Rate, Action Efficiency
Tasks: 812

GAIA

Category: General Assistant | Difficulty: Medium-Hard | Top Score: 75% (Level 1)

General AI Assistant benchmark: 466 questions requiring web search, file processing, and multi-step reasoning, split across three difficulty levels.

Metrics: Level 1/2/3 Accuracy, Exact Match
Tasks: 466
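GAIA answers are scored against a single ground-truth string, so most of the scoring work is normalization. A simplified sketch of exact-match style scoring follows; the official scorer applies more rules (numbers, comma-separated lists), and the normalize helper here is illustrative only.

def normalize(answer: str) -> str:
    # Lowercase, trim, and strip common formatting so superficially different
    # strings (e.g. "1,000 km" vs "1000 km") can still match.
    text = answer.strip().lower()
    for ch in (",", "$", "%"):
        text = text.replace(ch, "")
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)

assert exact_match(" 1,000 KM ", "1000 km")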

τ-bench (Tau-bench)

Category: Tool + Conversation | Difficulty: Medium | Top Score: 45% (Retail)

Multi-turn conversations requiring tool use across airline, retail, and banking domains with simulated APIs.

Metrics: Pass Rate, Tool Accuracy, Response Quality
Tasks: 680

AgentBench

Category: Multi-Environment | Difficulty: Varied | Top Score: Model-dependent

Multi-dimensional evaluation across 8 environments: Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing.

Metrics: Per-Environment Score, Overall Capability
Tasks: 1,632

HumanEval

Category: Code Generation | Difficulty: Medium | Top Score: 95%+

Code generation benchmark: 164 Python programming problems scored for functional correctness against unit tests.

Metrics: Pass@1, Pass@10, Pass@100
Tasks: 164
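Pass@k is normally reported with the unbiased estimator from the HumanEval paper: sample n >= k completions per problem, count the c that pass the unit tests, and estimate the chance that at least one of k random draws passes. A minimal sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k), computed in a
    # numerically stable product form.
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 200 samples per problem, 50 of which pass the unit tests
print(pass_at_k(n=200, c=50, k=1))   # 0.25 (equals c/n when k=1)
print(pass_at_k(n=200, c=50, k=10))  # much higher than 0.25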

MBPP

Category: Code Generation | Difficulty: Easy-Medium | Top Score: 85%+

Mostly Basic Python Problems: 974 entry-level programming tasks.

Metrics: Pass@1, Pass@80
Tasks: 974

Evaluation Methodology

Proper benchmarking requires reproducible methodology. Key principles:

Environment Isolation

Run evaluations in containerized environments (Docker) with pinned dependencies. This ensures reproducibility across runs and machines.
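One way to enforce isolation is to launch every task in a throwaway container with pinned dependencies, no network, and a fixed resource budget. The image tag, mount path, and harness entry point below are placeholders, not a real harness:

import subprocess

def run_task_isolated(task_id: str, workdir: str) -> subprocess.CompletedProcess:
    # Run one benchmark task in a throwaway container; image tag and harness
    # module are placeholders for whatever the benchmark ships.
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",            # no outbound network during evaluation
            "--memory", "4g", "--cpus", "2",
            "-v", f"{workdir}:/workspace:ro",
            "agent-eval:pinned-2025-01",    # placeholder image with pinned dependencies
            "python", "-m", "harness.run_one", task_id,   # placeholder entry point
        ],
        capture_output=True, text=True, timeout=1800,
    )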

Deterministic Settings

Use temperature=0 and fixed seeds where possible, and document any settings that cannot be made deterministic.

Confidence Intervals

Report 95% confidence intervals, not just point estimates. Small test sets can have wide intervals.

Category Breakdown

Report performance by task category/difficulty. Aggregate scores hide important patterns.
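A per-category rollup is cheap to compute from raw results and often shows that a flat overall score hides uneven behavior. A minimal sketch, assuming each result record carries a category label and a passed flag:

from collections import defaultdict

def breakdown_by_category(results: list) -> dict:
    # Group raw per-task results (each with "category" and "passed" keys)
    # into per-category pass counts and accuracy.
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(bool(r["passed"]))
    return {
        category: {"passed": sum(v), "total": len(v), "accuracy": sum(v) / len(v)}
        for category, v in buckets.items()
    }

example = [
    {"category": "django", "passed": True},
    {"category": "django", "passed": False},
    {"category": "flask", "passed": True},
]
print(breakdown_by_category(example))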

Benchmark Evaluation Framework
// Benchmark Evaluation Methodology
// Standard approach for running agent benchmarks

1. ENVIRONMENT SETUP
   // Isolated, reproducible environment
   create_docker_container(benchmark_image)
   setup_dependencies(benchmark.requirements)
   initialize_test_harness()

2. AGENT CONFIGURATION
   configure_agent(
       model = "agent_model_name",
       max_tokens = 4096,
       temperature = 0,  // Deterministic for reproducibility
       tools = benchmark.available_tools
   )

3. EVALUATION LOOP
   for each task in benchmark.tasks:
       // Record start time
       start_time = now()

       // Agent attempts task
       result = agent.execute(task.input)

       // Capture metrics
       metrics = {
           "latency": now() - start_time,
           "tokens_used": result.total_tokens,
           "tool_calls": result.tool_call_count,
           "trajectory": result.reasoning_trace
       }

       // Evaluate result
       passed = benchmark.evaluate(result.output, task.expected)
       record_result(task.id, passed, metrics)

4. AGGREGATE RESULTS
   compute_accuracy()
   compute_confidence_intervals()
   generate_breakdown_by_category()
   export_results(format="json")
Python implementation

import json
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Any, Dict, List, Optional

import numpy as np

@dataclass
class BenchmarkConfig:
    name: str
    version: str
    agent_model: str
    temperature: float = 0.0
    max_tokens: int = 4096
    timeout_seconds: int = 1800

@dataclass
class TaskResult:
    task_id: str
    passed: bool
    latency_ms: float
    tokens_used: int
    tool_calls: int
    trajectory: List[str]
    output: str
    error: Optional[str] = None

@dataclass
class BenchmarkResults:
    config: BenchmarkConfig
    results: List[TaskResult]
    timestamp: str
    total_time_seconds: float

    @property
    def accuracy(self) -> float:
        return sum(r.passed for r in self.results) / len(self.results)

    @property
    def confidence_interval(self) -> tuple:
        """95% confidence interval using Wilson score."""
        n = len(self.results)
        p = self.accuracy
        z = 1.96  # 95% CI
        denominator = 1 + z**2/n
        centre = (p + z**2/(2*n)) / denominator
        spread = z * np.sqrt((p*(1-p) + z**2/(4*n))/n) / denominator
        return (centre - spread, centre + spread)

    def to_json(self) -> str:
        return json.dumps({
            "config": asdict(self.config),
            "summary": {
                "accuracy": self.accuracy,
                "confidence_interval": self.confidence_interval,
                "total_tasks": len(self.results),
                "passed": sum(r.passed for r in self.results),
                "avg_latency_ms": np.mean([r.latency_ms for r in self.results]),
                "avg_tokens": np.mean([r.tokens_used for r in self.results]),
            },
            "results": [asdict(r) for r in self.results],
            "timestamp": self.timestamp,
            "total_time_seconds": self.total_time_seconds
        }, indent=2)


class BenchmarkRunner:
    def __init__(self, config: BenchmarkConfig, agent):
        self.config = config
        self.agent = agent
        self.results: List[TaskResult] = []

    async def run(self, tasks: List[Dict]) -> BenchmarkResults:
        start = datetime.now()

        for task in tasks:
            result = await self._run_task(task)
            self.results.append(result)

        return BenchmarkResults(
            config=self.config,
            results=self.results,
            timestamp=start.isoformat(),
            total_time_seconds=(datetime.now() - start).total_seconds()
        )

    async def _run_task(self, task: Dict) -> TaskResult:
        task_start = datetime.now()
        try:
            response = await self.agent.execute(
                task["input"],
                timeout=self.config.timeout_seconds
            )
            passed = self._evaluate(response.output, task["expected"])

            return TaskResult(
                task_id=task["id"],
                passed=passed,
                latency_ms=(datetime.now() - task_start).total_seconds() * 1000,
                tokens_used=response.tokens_used,
                tool_calls=len(response.tool_calls),
                trajectory=response.reasoning_trace,
                output=response.output
            )
        except Exception as e:
            return TaskResult(
                task_id=task["id"],
                passed=False,
                latency_ms=(datetime.now() - task_start).total_seconds() * 1000,
                tokens_used=0,
                tool_calls=0,
                trajectory=[],
                output="",
                error=str(e)
            )

    def _evaluate(self, output: str, expected: Any) -> bool:
        # Benchmark-specific scoring (exact match, running tests, etc.);
        # concrete benchmark runners override or replace this method.
        raise NotImplementedError
C# implementation

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;

public record BenchmarkConfig(
    string Name,
    string Version,
    string AgentModel,
    double Temperature = 0.0,
    int MaxTokens = 4096,
    int TimeoutSeconds = 1800
);

public record TaskResult(
    string TaskId,
    bool Passed,
    double LatencyMs,
    int TokensUsed,
    int ToolCalls,
    List<string> Trajectory,
    string Output,
    string? Error = null
);

public class BenchmarkResults
{
    public BenchmarkConfig Config { get; init; }
    public List<TaskResult> Results { get; init; } = new();
    public DateTime Timestamp { get; init; }
    public double TotalTimeSeconds { get; init; }

    public double Accuracy =>
        (double)Results.Count(r => r.Passed) / Results.Count;

    public (double Lower, double Upper) ConfidenceInterval
    {
        get
        {
            // Wilson score interval
            int n = Results.Count;
            double p = Accuracy;
            double z = 1.96; // 95% CI
            double denom = 1 + z * z / n;
            double centre = (p + z * z / (2 * n)) / denom;
            double spread = z * Math.Sqrt((p * (1 - p) + z * z / (4 * n)) / n) / denom;
            return (centre - spread, centre + spread);
        }
    }

    public string ToJson() => JsonSerializer.Serialize(new
    {
        Config,
        Summary = new
        {
            Accuracy,
            // ValueTuples are not serialized by System.Text.Json, so expose as an array
            ConfidenceInterval = new[] { ConfidenceInterval.Lower, ConfidenceInterval.Upper },
            TotalTasks = Results.Count,
            Passed = Results.Count(r => r.Passed),
            AvgLatencyMs = Results.Average(r => r.LatencyMs),
            AvgTokens = Results.Average(r => r.TokensUsed)
        },
        Results,
        Timestamp,
        TotalTimeSeconds
    }, new JsonSerializerOptions { WriteIndented = true });
}

public class BenchmarkRunner
{
    private readonly BenchmarkConfig _config;
    private readonly IAgent _agent;
    private readonly List<TaskResult> _results = new();

    public BenchmarkRunner(BenchmarkConfig config, IAgent agent)
    {
        _config = config;
        _agent = agent;
    }

    public async Task<BenchmarkResults> RunAsync(List<BenchmarkTask> tasks)
    {
        var start = DateTime.UtcNow;

        foreach (var task in tasks)
        {
            var result = await RunTaskAsync(task);
            _results.Add(result);
        }

        return new BenchmarkResults
        {
            Config = _config,
            Results = _results,
            Timestamp = start,
            TotalTimeSeconds = (DateTime.UtcNow - start).TotalSeconds
        };
    }

    private async Task<TaskResult> RunTaskAsync(BenchmarkTask task)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            var response = await _agent.ExecuteAsync(
                task.Input,
                TimeSpan.FromSeconds(_config.TimeoutSeconds)
            );

            bool passed = Evaluate(response.Output, task.Expected);

            return new TaskResult(
                task.Id, passed, sw.ElapsedMilliseconds,
                response.TokensUsed, response.ToolCalls.Count,
                response.ReasoningTrace, response.Output
            );
        }
        catch (Exception ex)
        {
            return new TaskResult(
                task.Id, false, sw.ElapsedMilliseconds,
                0, 0, new List<string>(), "", ex.Message
            );
        }
    }

    // Benchmark-specific scoring; concrete benchmark runners override this.
    protected virtual bool Evaluate(string output, string expected) =>
        throw new NotImplementedException();
}

Results Schema

A standardized results format enables comparison across agents and runs. The example below shows the output of one run.

benchmark_results.json (example)
{
  "config": {
    "name": "SWE-bench Verified",
    "version": "1.0",
    "agent_model": "claude-3-sonnet-20240229",
    "temperature": 0.0,
    "max_tokens": 4096,
    "tools_enabled": ["file_read", "file_write", "bash", "search"]
  },
  "summary": {
    "accuracy": 0.52,
    "confidence_interval": [0.48, 0.56],
    "total_tasks": 500,
    "passed": 260,
    "failed": 240,
    "avg_latency_ms": 45000,
    "avg_tokens_per_task": 12500,
    "total_cost_usd": 125.50
  },
  "category_breakdown": {
    "django": { "passed": 45, "total": 85, "accuracy": 0.53 },
    "flask": { "passed": 32, "total": 60, "accuracy": 0.53 },
    "scikit-learn": { "passed": 28, "total": 55, "accuracy": 0.51 }
  },
  "results": [
    {
      "task_id": "django__django-11039",
      "passed": true,
      "latency_ms": 38500,
      "tokens_used": 11200,
      "tool_calls": 15,
      "trajectory": ["search", "read", "read", "think", "write", "bash"],
      "output": "patch content..."
    }
  ],
  "timestamp": "2025-01-26T10:30:00Z",
  "total_time_seconds": 22500
}
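Because every run writes the same format, comparing two agents reduces to loading two files and checking whether their accuracies, and their confidence intervals, actually separate. A small sketch, assuming files shaped like the example above:

import json

def compare_runs(path_a: str, path_b: str) -> None:
    # Print a side-by-side summary of two benchmark result files and flag
    # rankings whose confidence intervals overlap.
    runs = []
    for path in (path_a, path_b):
        with open(path) as f:
            runs.append(json.load(f))

    for run in runs:
        summary = run["summary"]
        lo, hi = summary["confidence_interval"]
        print(f"{run['config']['agent_model']:<30} "
              f"acc={summary['accuracy']:.1%}  CI=[{lo:.1%}, {hi:.1%}]")

    (a_lo, a_hi), (b_lo, b_hi) = (r["summary"]["confidence_interval"] for r in runs)
    if a_lo <= b_hi and b_lo <= a_hi:
        print("Confidence intervals overlap: treat the ranking as inconclusive.")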

Which Benchmark Should You Use?

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Benchmark Selection Guide                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   What does your agent do?                                                  │
│                                                                             │
│   ├── Software engineering (code, bugs, PRs)?                               │
│   │   └──▶ SWE-bench (Verified for quality, Full for breadth)               │
│   │                                                                         │
│   ├── Code generation only?                                                 │
│   │   └──▶ HumanEval (quick), MBPP (broader), SWE-bench Lite (realistic)    │
│   │                                                                         │
│   ├── Web browsing and automation?                                          │
│   │   └──▶ WebArena (comprehensive), MiniWoB++ (simpler)                    │
│   │                                                                         │
│   ├── Tool/function calling?                                                │
│   │   └──▶ τ-bench (multi-turn), ToolBench (single-turn)                    │
│   │                                                                         │
│   ├── General assistant tasks?                                              │
│   │   └──▶ GAIA (reasoning), AgentBench (multi-environment)                 │
│   │                                                                         │
│   └── Multiple capabilities?                                                │
│       └──▶ AgentBench (8 environments), custom composite benchmark          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Common Pitfalls

Benchmark Overfitting

Optimizing specifically for benchmark patterns rather than general capability. Agents may game metrics without real improvement.

Data Contamination

LLMs trained on benchmark data have artificially inflated scores. Use held-out test sets and check for contamination.
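A crude but useful screen is to measure how much of each benchmark task already appears verbatim in the corpora a model is known (or suspected) to have trained on. The 13-gram overlap heuristic below sketches that idea; it is not a full contamination audit:

def ngram_overlap(task_text: str, corpus_text: str, n: int = 13) -> float:
    # Fraction of the task's n-grams that appear verbatim in the corpus.
    # High overlap is a red flag worth investigating, not proof of contamination.
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    task_ngrams = ngrams(task_text)
    if not task_ngrams:
        return 0.0
    return len(task_ngrams & ngrams(corpus_text)) / len(task_ngrams)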

Ignoring Cost

High scores achieved with 100x more tokens may not be practical. Always report cost alongside accuracy.
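Reporting cost next to accuracy is straightforward once token counts are captured per task. A sketch assuming each result records input and output token counts; the per-token prices are placeholders, not any provider's actual rates:

def cost_report(results: list,
                usd_per_input_token: float = 3e-6,     # placeholder price
                usd_per_output_token: float = 15e-6):  # placeholder price
    # Summarize spend alongside accuracy so "72% at $0.50/task" and
    # "74% at $50/task" are not presented as equivalent results.
    total_cost = sum(
        r["input_tokens"] * usd_per_input_token +
        r["output_tokens"] * usd_per_output_token
        for r in results
    )
    passed = sum(r["passed"] for r in results)
    return {
        "accuracy": passed / len(results),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_task_usd": round(total_cost / len(results), 4),
        "cost_per_resolved_task_usd": round(total_cost / max(passed, 1), 4),
    }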

Cherry-Picking

Running an evaluation many times and reporting only the best result, or reporting only favorable subsets. Report all runs, with confidence intervals.

Related Topics