Evaluation & Metrics
How do you know if your agent is actually working? Evaluation is the discipline of measuring agent performance across multiple dimensions: component accuracy, task completion, and system-level metrics. This guide covers evaluation taxonomy, agent-specific metrics, and major benchmarks.
Evaluation Taxonomy
[Diagram: Agent Evaluation Layers]

Layer 3: SYSTEM     · E2E Latency (P95 < 30s) · Cost/Task ($/query) · Safety Compliance · User Preference (win rate)
        ↑
Layer 2: TASK       · Completion Rate · Step Efficiency · Error Recovery · Output Quality
        ↑
Layer 1: COMPONENT  · Tool Call Accuracy · Argument Extraction · Response Format · Context Utilization

Evaluation flows bottom-up: strong component metrics are necessary but not sufficient for good task metrics, which are in turn necessary but not sufficient for good system metrics.
Agent evaluation operates at three layers. Component-level metrics measure individual capabilities (tool calling, parsing). Task-level metrics measure goal achievement. System-level metrics measure real-world deployment concerns (latency, cost, safety).
Evaluation Metrics Reference
| Layer | Metric | Description | Evaluation Method |
|---|---|---|---|
| Component | Tool Calling Accuracy | Correct tool selection rate | Compare to ground truth tool sequence |
| Component | Argument Extraction F1 | Parameter parsing accuracy | Compare extracted args to expected |
| Component | Response Format Validity | Structured output correctness | JSON schema validation |
| Task | Task Completion Rate | Goal achievement percentage | Binary pass/fail per task |
| Task | Step Efficiency | Steps taken vs optimal path | Ratio of actual to optimal steps |
| Task | Error Recovery Rate | Recovery from failures | Track retry success rate |
| System | End-to-End Latency | Total response time | Measure P50, P95, P99 |
| System | Cost per Task | Resource consumption | Tokens/API calls/dollars |
| System | Safety Compliance | Guardrail adherence | Red team testing |
Evaluation metrics organized by layer
// Agent Evaluation Taxonomy
// Three-layer evaluation approach
Layer 1: Component Evaluation
├── Tool Calling Accuracy // Does the agent select correct tools?
├── Argument Extraction // Are parameters correctly parsed?
├── Response Formatting // Is output properly structured?
└── Context Utilization // Does it use available context well?
Layer 2: Task Evaluation
├── Task Completion Rate // Did the agent achieve the goal?
├── Step Efficiency // How many steps vs optimal?
├── Error Recovery // Did it handle failures gracefully?
└── Output Quality // Is the result correct and complete?
Layer 3: System Evaluation
├── End-to-End Latency // Total time from request to result
├── Cost Efficiency // Tokens/API calls per task
├── Safety Compliance // Did it stay within guardrails?
└── User Satisfaction      // Human preference ratings

from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
class EvaluationLayer(Enum):
COMPONENT = "component"
TASK = "task"
SYSTEM = "system"
@dataclass
class Metric:
name: str
layer: EvaluationLayer
description: str
higher_is_better: bool = True
# Define evaluation taxonomy
EVALUATION_TAXONOMY = [
# Layer 1: Component Evaluation
Metric("tool_calling_accuracy", EvaluationLayer.COMPONENT,
"Percentage of correct tool selections", True),
Metric("argument_extraction_f1", EvaluationLayer.COMPONENT,
"F1 score for parameter extraction", True),
Metric("response_format_validity", EvaluationLayer.COMPONENT,
"Percentage of valid JSON/structured outputs", True),
Metric("context_utilization", EvaluationLayer.COMPONENT,
"Relevant context retrieval rate", True),
# Layer 2: Task Evaluation
Metric("task_completion_rate", EvaluationLayer.TASK,
"Percentage of successfully completed tasks", True),
Metric("step_efficiency", EvaluationLayer.TASK,
"Actual steps / optimal steps ratio", False),
Metric("error_recovery_rate", EvaluationLayer.TASK,
"Successful recoveries from errors", True),
Metric("output_correctness", EvaluationLayer.TASK,
"Factual accuracy of generated output", True),
# Layer 3: System Evaluation
Metric("e2e_latency_p95", EvaluationLayer.SYSTEM,
"95th percentile end-to-end latency", False),
Metric("cost_per_task", EvaluationLayer.SYSTEM,
"Average cost in tokens/dollars per task", False),
Metric("safety_compliance", EvaluationLayer.SYSTEM,
"Percentage of responses within guardrails", True),
Metric("user_preference", EvaluationLayer.SYSTEM,
"Human preference win rate vs baseline", True),
]
def get_metrics_by_layer(layer: EvaluationLayer) -> List[Metric]:
"""Get all metrics for a specific evaluation layer."""
    return [m for m in EVALUATION_TAXONOMY if m.layer == layer]

public enum EvaluationLayer
{
Component,
Task,
System
}
public record Metric(
string Name,
EvaluationLayer Layer,
string Description,
bool HigherIsBetter = true
);
public static class EvaluationTaxonomy
{
public static readonly List<Metric> Metrics = new()
{
// Layer 1: Component Evaluation
new("ToolCallingAccuracy", EvaluationLayer.Component,
"Percentage of correct tool selections"),
new("ArgumentExtractionF1", EvaluationLayer.Component,
"F1 score for parameter extraction"),
new("ResponseFormatValidity", EvaluationLayer.Component,
"Percentage of valid JSON/structured outputs"),
new("ContextUtilization", EvaluationLayer.Component,
"Relevant context retrieval rate"),
// Layer 2: Task Evaluation
new("TaskCompletionRate", EvaluationLayer.Task,
"Percentage of successfully completed tasks"),
new("StepEfficiency", EvaluationLayer.Task,
"Actual steps / optimal steps ratio", false),
new("ErrorRecoveryRate", EvaluationLayer.Task,
"Successful recoveries from errors"),
new("OutputCorrectness", EvaluationLayer.Task,
"Factual accuracy of generated output"),
// Layer 3: System Evaluation
new("E2ELatencyP95", EvaluationLayer.System,
"95th percentile end-to-end latency", false),
new("CostPerTask", EvaluationLayer.System,
"Average cost in tokens/dollars per task", false),
new("SafetyCompliance", EvaluationLayer.System,
"Percentage of responses within guardrails"),
new("UserPreference", EvaluationLayer.System,
"Human preference win rate vs baseline")
};
public static IEnumerable<Metric> GetByLayer(EvaluationLayer layer) =>
Metrics.Where(m => m.Layer == layer);
}

Agent-Specific Metrics (DeepEval)
DeepEval is an open-source evaluation framework with metrics specifically designed for agentic systems. Unlike traditional NLP metrics, these capture tool use, grounding, and multi-step reasoning.
Tool Correctness
Measures whether the agent called the correct tools with correct arguments. Compares actual tool calls to expected calls.
Faithfulness
Measures whether the agent's response is grounded in retrieved context. Detects hallucination and fabrication.
Answer Relevancy
Measures whether the response actually addresses the user's question. Detects tangential or off-topic responses.
Task Completion
Uses an LLM judge to determine if the agent achieved the stated goal. More nuanced than binary pass/fail.
// DeepEval: Agent-Specific Metrics
// Framework for evaluating agentic behaviors
function evaluate_agent_response(response, context):
results = {}
// 1. Tool Call Correctness
// Did the agent call the right tools with right arguments?
results["tool_correctness"] = evaluate_tool_calls(
actual_calls = response.tool_calls,
expected_calls = context.ground_truth_tools
)
// 2. Faithfulness (Grounding)
// Is the response grounded in retrieved context?
results["faithfulness"] = evaluate_faithfulness(
response = response.text,
retrieved_context = context.retrieved_docs
)
// 3. Answer Relevancy
// Does the answer address the original question?
results["relevancy"] = evaluate_relevancy(
question = context.original_query,
answer = response.text
)
// 4. Task Completion
// Did the agent complete the assigned task?
results["task_completion"] = evaluate_task(
task = context.task_definition,
result = response.final_output,
expected = context.expected_output
)
// 5. Trajectory Quality
// Was the agent's reasoning path efficient?
results["trajectory_quality"] = evaluate_trajectory(
steps = response.reasoning_trace,
optimal_path = context.optimal_steps
)
    return results

from deepeval import evaluate
from deepeval.metrics import (
ToolCorrectnessMetric,
FaithfulnessMetric,
AnswerRelevancyMetric,
TaskCompletionMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall
# Define test case for agent evaluation
test_case = LLMTestCase(
input="Find the weather in London and book a restaurant nearby",
actual_output="The weather in London is 15°C with light rain. I found 'The Ivy' restaurant nearby and booked a table for 7pm.",
expected_output="Weather retrieved and restaurant booked successfully",
retrieval_context=[
"London weather: 15°C, light rain, humidity 80%",
"Nearby restaurants: The Ivy (0.3mi), Sketch (0.5mi)"
],
tools_called=[
ToolCall(name="get_weather", args={"city": "London"}),
ToolCall(name="search_restaurants", args={"location": "London", "cuisine": "any"}),
ToolCall(name="book_restaurant", args={"restaurant": "The Ivy", "time": "19:00"})
],
expected_tools=[
ToolCall(name="get_weather", args={"city": "London"}),
ToolCall(name="search_restaurants", args={"location": "London"}),
ToolCall(name="book_restaurant", args={}) # Args can vary
]
)
# Define metrics
metrics = [
ToolCorrectnessMetric(
threshold=0.8,
include_args=True # Also check arguments
),
FaithfulnessMetric(
threshold=0.7,
model="gpt-4" # Judge model
),
AnswerRelevancyMetric(
threshold=0.7,
model="gpt-4"
),
TaskCompletionMetric(
threshold=0.8,
model="gpt-4"
)
]
# Run evaluation
results = evaluate([test_case], metrics)
# Access results
for metric_result in results:
print(f"{metric_result.name}: {metric_result.score:.2f}")
if metric_result.reason:
print(f" Reason: {metric_result.reason}") // Agent evaluation in C# using a custom framework
public interface IAgentMetric
{
string Name { get; }
double Threshold { get; }
Task<MetricResult> EvaluateAsync(AgentTestCase testCase);
}
public record MetricResult(
string MetricName,
double Score,
bool Passed,
string? Reason = null
);
public record AgentTestCase(
string Input,
string ActualOutput,
string ExpectedOutput,
List<string> RetrievalContext,
List<ToolCall> ToolsCalled,
List<ToolCall> ExpectedTools
);
public class ToolCorrectnessMetric : IAgentMetric
{
public string Name => "ToolCorrectness";
public double Threshold { get; init; } = 0.8;
public bool IncludeArgs { get; init; } = true;
public Task<MetricResult> EvaluateAsync(AgentTestCase testCase)
{
var expectedTools = testCase.ExpectedTools.Select(t => t.Name).ToHashSet();
var actualTools = testCase.ToolsCalled.Select(t => t.Name).ToHashSet();
// Calculate precision and recall
var truePositives = expectedTools.Intersect(actualTools).Count();
var precision = actualTools.Count > 0
? (double)truePositives / actualTools.Count : 0;
var recall = expectedTools.Count > 0
? (double)truePositives / expectedTools.Count : 0;
// F1 score
var score = precision + recall > 0
? 2 * (precision * recall) / (precision + recall) : 0;
return Task.FromResult(new MetricResult(
Name, score, score >= Threshold,
$"Precision: {precision:P0}, Recall: {recall:P0}"
));
}
}
// Evaluate agent
public class AgentEvaluator
{
private readonly List<IAgentMetric> _metrics;
public AgentEvaluator(List<IAgentMetric> metrics)
{
_metrics = metrics;
}
public async Task<List<MetricResult>> EvaluateAsync(AgentTestCase testCase)
{
var results = new List<MetricResult>();
foreach (var metric in _metrics)
{
results.Add(await metric.EvaluateAsync(testCase));
}
return results;
}
}

Trajectory-Level Evaluation
Beyond final outputs, trajectory evaluation examines the agent's reasoning path. This is crucial for understanding why an agent succeeded or failed, and for detecting issues like reasoning loops or inefficient strategies.
Why Trajectory Evaluation?
Two agents can return the same final answer by very different paths: one in three purposeful steps, another after looping, backtracking, and recovering by luck. Output-only metrics cannot tell these apart; trajectory metrics can.
// Trajectory-Level Evaluation
// Evaluate the full reasoning path, not just final output
function evaluate_trajectory(trajectory, task):
metrics = {}
// 1. Step Count Efficiency
// Compare actual steps to known optimal
optimal_steps = get_optimal_path(task)
metrics["step_ratio"] = len(trajectory.steps) / len(optimal_steps)
// 2. Action Diversity
// Penalize repetitive actions (agent stuck in loop)
unique_actions = set(s.action for s in trajectory.steps)
metrics["action_diversity"] = len(unique_actions) / len(trajectory.steps)
// 3. Progress Rate
// How much progress per step toward goal?
progress_scores = []
for i, step in enumerate(trajectory.steps):
progress = estimate_progress(step.state, task.goal)
progress_scores.append(progress)
metrics["progress_rate"] = average_improvement(progress_scores)
// 4. Error Recovery
// Did agent recover from mistakes?
errors = [s for s in trajectory.steps if s.is_error]
recoveries = count_successful_recoveries(errors, trajectory)
metrics["recovery_rate"] = recoveries / max(len(errors), 1)
// 5. Reasoning Quality (LLM-as-Judge)
// Have an LLM evaluate reasoning coherence
metrics["reasoning_quality"] = llm_judge_reasoning(
trajectory.reasoning_traces,
task.description
)
    return metrics

from dataclasses import dataclass
from typing import Any, Dict, List, Optional
@dataclass
class TrajectoryStep:
state: Dict[str, Any]
action: str
observation: str
reasoning: str
is_error: bool = False
@dataclass
class AgentTrajectory:
steps: List[TrajectoryStep]
final_result: Any
task_completed: bool
class TrajectoryEvaluator:
def __init__(self, llm_judge=None):
self.llm_judge = llm_judge
def evaluate(self,
trajectory: AgentTrajectory,
                 optimal_steps: Optional[int] = None) -> Dict[str, float]:
metrics = {}
# 1. Step efficiency
if optimal_steps:
metrics["step_efficiency"] = min(
1.0, optimal_steps / len(trajectory.steps)
)
# 2. Action diversity (detect loops)
actions = [s.action for s in trajectory.steps]
metrics["action_diversity"] = len(set(actions)) / len(actions)
# 3. Detect repetition (exact action sequences)
metrics["repetition_score"] = self._detect_repetition(actions)
# 4. Error recovery rate
errors = [s for s in trajectory.steps if s.is_error]
if errors:
recoveries = self._count_recoveries(trajectory, errors)
metrics["recovery_rate"] = recoveries / len(errors)
else:
metrics["recovery_rate"] = 1.0 # No errors = perfect
# 5. LLM-as-judge for reasoning quality
if self.llm_judge:
metrics["reasoning_quality"] = self._judge_reasoning(
trajectory
)
return metrics
def _detect_repetition(self, actions: List[str]) -> float:
"""Return 1.0 if no repetition, lower if patterns repeat."""
if len(actions) < 4:
return 1.0
# Check for repeating patterns of length 2-3
for pattern_len in [2, 3]:
for i in range(len(actions) - pattern_len * 2):
pattern = actions[i:i + pattern_len]
next_seq = actions[i + pattern_len:i + pattern_len * 2]
if pattern == next_seq:
return 0.5 # Repetition detected
return 1.0
def _count_recoveries(self,
trajectory: AgentTrajectory,
errors: List[TrajectoryStep]) -> int:
"""Count how many errors led to successful recovery."""
recoveries = 0
for error in errors:
error_idx = trajectory.steps.index(error)
# Check if subsequent steps show different approach
if error_idx + 1 < len(trajectory.steps):
next_step = trajectory.steps[error_idx + 1]
if next_step.action != error.action:
recoveries += 1
return recoveries
def _judge_reasoning(self, trajectory: AgentTrajectory) -> float:
"""Use LLM to judge reasoning quality."""
reasoning_trace = "\n".join(
f"Step {i}: {s.reasoning}"
for i, s in enumerate(trajectory.steps)
)
prompt = f"""Evaluate the quality of this agent's reasoning trace.
Reasoning trace:
{reasoning_trace}
Rate from 0-1 based on:
- Logical coherence
- Goal-directed behavior
- Appropriate use of observations
- Clear decision making
Return only a number between 0 and 1."""
response = self.llm_judge.generate(prompt)
        return float(response.strip())

public record TrajectoryStep(
Dictionary<string, object> State,
string Action,
string Observation,
string Reasoning,
bool IsError = false
);
public record AgentTrajectory(
List<TrajectoryStep> Steps,
object FinalResult,
bool TaskCompleted
);
public class TrajectoryEvaluator
{
private readonly ILLMJudge? _llmJudge;
public TrajectoryEvaluator(ILLMJudge? llmJudge = null)
{
_llmJudge = llmJudge;
}
public async Task<Dictionary<string, double>> EvaluateAsync(
AgentTrajectory trajectory,
int? optimalSteps = null)
{
var metrics = new Dictionary<string, double>();
// 1. Step efficiency
if (optimalSteps.HasValue)
{
metrics["step_efficiency"] = Math.Min(
1.0, (double)optimalSteps.Value / trajectory.Steps.Count
);
}
// 2. Action diversity
var actions = trajectory.Steps.Select(s => s.Action).ToList();
metrics["action_diversity"] =
(double)actions.Distinct().Count() / actions.Count;
// 3. Repetition detection
metrics["repetition_score"] = DetectRepetition(actions);
// 4. Error recovery
var errors = trajectory.Steps.Where(s => s.IsError).ToList();
if (errors.Any())
{
var recoveries = CountRecoveries(trajectory, errors);
metrics["recovery_rate"] = (double)recoveries / errors.Count;
}
else
{
metrics["recovery_rate"] = 1.0;
}
// 5. LLM judge
if (_llmJudge != null)
{
metrics["reasoning_quality"] =
await JudgeReasoningAsync(trajectory);
}
return metrics;
}
private double DetectRepetition(List<string> actions)
{
if (actions.Count < 4) return 1.0;
for (int patternLen = 2; patternLen <= 3; patternLen++)
{
for (int i = 0; i < actions.Count - patternLen * 2; i++)
{
var pattern = actions.Skip(i).Take(patternLen).ToList();
var next = actions.Skip(i + patternLen).Take(patternLen).ToList();
if (pattern.SequenceEqual(next)) return 0.5;
}
}
return 1.0;
}
private int CountRecoveries(
AgentTrajectory trajectory,
List<TrajectoryStep> errors)
{
int recoveries = 0;
foreach (var error in errors)
{
int idx = trajectory.Steps.IndexOf(error);
if (idx + 1 < trajectory.Steps.Count)
{
var next = trajectory.Steps[idx + 1];
if (next.Action != error.Action) recoveries++;
}
}
return recoveries;
}
private async Task<double> JudgeReasoningAsync(AgentTrajectory trajectory)
{
var trace = string.Join("\n", trajectory.Steps.Select(
(s, i) => $"Step {i}: {s.Reasoning}"
));
var result = await _llmJudge!.JudgeAsync(trace);
return result;
}
}

Major Agent Benchmarks
Benchmarks provide standardized tasks for comparing agent capabilities. Each focuses on different aspects of agentic behavior.
| Benchmark | Domain | Tasks | Primary Metric | Top Score (2025) |
|---|---|---|---|---|
| SWE-bench Verified | Software Engineering | 500 | Resolved Rate | ~72% (Claude 3.5 Sonnet) |
| SWE-bench Full | Software Engineering | 2,294 | Resolved Rate | ~51% |
| WebArena | Web Navigation | 812 | Task Success Rate | ~42% |
| GAIA Level 1 | General Assistant | ~165 | Exact Match | ~75% |
| GAIA Level 2 | General Assistant | ~186 | Exact Match | ~60% |
| GAIA Level 3 | General Assistant | ~115 | Exact Match | ~40% |
| τ-bench | Tool + Conversation | 680 | Pass Rate | ~45% (Retail) |
| AgentBench | Multi-Environment | 1,632 | Overall Score | Model-dependent |
| HumanEval | Code Generation | 164 | Pass@1 | ~95% |
Major agent benchmarks comparison
Benchmark Details
SWE-bench: Software Engineering
SWE-bench evaluates agents on real GitHub issues from popular Python repositories (Django, Flask, scikit-learn, etc.). The agent must understand the issue, locate relevant code, and generate a patch that passes the repository's test suite.
Variants
SWE-bench Full (all 2,294 issues), SWE-bench Verified (a 500-issue human-verified subset), and SWE-bench Lite (300 simpler issues).
WebArena: Web Navigation
WebArena tests agents on realistic web navigation tasks across 5 self-hosted websites (shopping, forums, code hosting, maps, wiki). Tasks include filling forms, navigating menus, and extracting information.
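WebArena scores each task with a functional check on the final page state rather than by comparing action sequences. The sketch below is a simplified, hypothetical version of that idea; the checker names and the FinalState shape are illustrative assumptions, not the official harness API.

# Hypothetical WebArena-style functional check: the checker names and the
# task/config shapes here are illustrative, not the official harness API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class FinalState:
    url: str           # URL of the page the agent ended on
    page_text: str     # visible text of the final page
    answer: str        # free-text answer the agent returned, if any

def exact_match(expected: str) -> Callable[[FinalState], bool]:
    return lambda s: s.answer.strip().lower() == expected.strip().lower()

def must_include(*phrases: str) -> Callable[[FinalState], bool]:
    return lambda s: all(p.lower() in s.page_text.lower() for p in phrases)

def url_match(expected_prefix: str) -> Callable[[FinalState], bool]:
    return lambda s: s.url.startswith(expected_prefix)

# A task passes only if every configured check on the final state passes.
def score_task(checks: Dict[str, Callable[[FinalState], bool]],
               final_state: FinalState) -> bool:
    return all(check(final_state) for check in checks.values())

checks = {
    "answer": exact_match("$42.50"),
    "page": must_include("Order history"),
}
print(score_task(checks, FinalState(
    url="http://shop.example/orders",
    page_text="Order history: total $42.50",
    answer="$42.50",
)))  # True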
τ-bench (Tau-bench): Tool + Conversation
τ-bench evaluates agents on multi-turn conversations that require tool use. Unlike single-turn benchmarks, it tests the agent's ability to maintain context across turns while calling appropriate tools.
Domains
Simulated airline, retail, and banking customer-service environments, each backed by mock APIs and domain policies the agent is expected to follow.
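τ-bench's central idea is that a multi-turn episode is judged by its effects: did the agent leave the simulated backend in the annotated goal state, and did it make the required tool calls? The sketch below is a minimal, hypothetical illustration of that check, not the official harness; the Episode shape and field names are assumptions.

# Illustrative τ-bench-style check (not the official harness): a multi-turn
# episode passes if the final environment state matches the ground-truth state
# and every required tool call was actually made.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Episode:
    tool_calls: List[Tuple[str, Dict]]          # (tool_name, args) in order
    final_db_state: Dict[str, Dict]             # e.g. orders, reservations

def episode_passes(episode: Episode,
                   expected_db_state: Dict[str, Dict],
                   required_calls: List[str]) -> bool:
    made = {name for name, _ in episode.tool_calls}
    calls_ok = all(name in made for name in required_calls)
    state_ok = episode.final_db_state == expected_db_state
    return calls_ok and state_ok

episode = Episode(
    tool_calls=[("find_booking", {"id": "R1"}), ("cancel_booking", {"id": "R1"})],
    final_db_state={"R1": {"status": "cancelled"}},
)
print(episode_passes(
    episode,
    expected_db_state={"R1": {"status": "cancelled"}},
    required_calls=["cancel_booking"],
))  # True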
GAIA: General AI Assistant
GAIA tests general-purpose assistant capabilities with questions requiring web search, file processing, calculation, and multi-step reasoning. Questions are designed so humans can answer them but require agentic capabilities for AI.
// Major Agent Benchmarks Overview
// SWE-bench: Software Engineering
// - 2,294 real GitHub issues from popular Python repos
// - Agent must: understand issue, locate code, implement fix
// - Evaluated by running repository's test suite
SWEBench:
input: GitHub issue description + repository snapshot
output: Git patch that fixes the issue
evaluation: run_tests(patched_repo) → pass/fail
variants:
- SWE-bench Full: all 2,294 issues
- SWE-bench Verified: 500 human-verified subset
- SWE-bench Lite: 300 simpler issues
// WebArena: Web Navigation
// - 812 tasks across 5 real websites
// - Agent must navigate, fill forms, extract info
// - Evaluated by checking final page state
WebArena:
input: Natural language instruction
output: Sequence of browser actions
evaluation: check_page_state(final_page) → match/no_match
websites: shopping, reddit, gitlab, maps, wikipedia
// τ-bench (Tau-bench): Tool Use + Conversation
// - 680 multi-turn conversations requiring tool use
// - Tests realistic agentic assistance scenarios
// - Evaluates both tool use and conversational ability
TauBench:
input: Multi-turn user conversation
output: Agent responses + tool calls
evaluation:
- Tool call correctness
- Response helpfulness
- Goal achievement
domains: airline, retail, banking (simulated APIs)
// GAIA: General AI Assistant
// - 466 questions requiring multi-step reasoning
// - Three difficulty levels
// - Requires web search, calculation, reasoning
GAIA:
input: Question (may include files/images)
output: Final answer (short text)
evaluation: exact_match(answer, ground_truth)
levels:
- Level 1: 1-2 steps
- Level 2: 3-5 steps
    - Level 3: 6+ steps

# Running SWE-bench evaluation
from swebench.harness.run_evaluation import run_evaluation
from swebench.harness.constants import KEY_INSTANCE_ID, KEY_MODEL, KEY_PREDICTION
# Prepare predictions (your agent's patches)
predictions = [
{
KEY_INSTANCE_ID: "django__django-11039",
KEY_MODEL: "my_agent_v1",
KEY_PREDICTION: '''
--- a/django/db/models/sql/query.py
+++ b/django/db/models/sql/query.py
@@ -1234,6 +1234,8 @@ class Query:
def add_ordering(self, *ordering):
+ if not ordering:
+ return
errors = []
'''
}
]
# Run evaluation
results = run_evaluation(
predictions=predictions,
swe_bench_tasks="princeton-nlp/SWE-bench_Verified",
log_dir="./logs",
timeout=1800 # 30 min per instance
)
# Results structure
for result in results:
print(f"Instance: {result['instance_id']}")
print(f" Resolved: {result['resolved']}")
print(f" Tests passed: {result['tests_passed']}")
# Running GAIA evaluation
from datasets import load_dataset
# Load GAIA dataset
gaia = load_dataset("gaia-benchmark/GAIA", "2023_all")
def evaluate_gaia_response(prediction: str, ground_truth: str) -> bool:
"""GAIA uses relaxed exact match."""
# Normalize both strings
pred = prediction.lower().strip()
truth = ground_truth.lower().strip()
# Remove common suffixes/prefixes
pred = pred.rstrip('.')
truth = truth.rstrip('.')
return pred == truth
# Evaluate your agent
for level in [1, 2, 3]:
    correct = 0
    total = 0
level_data = gaia['test'].filter(lambda x: x['level'] == level)
for item in level_data:
agent_answer = my_agent.answer(item['question'], item.get('file'))
if evaluate_gaia_response(agent_answer, item['ground_truth']):
correct += 1
total += 1
print(f"Level {level}: {correct}/{total} = {correct/total:.1%}") // Benchmark evaluation framework in C#
public interface IBenchmark
{
string Name { get; }
Task<BenchmarkResult> EvaluateAsync(IAgent agent);
}
public record BenchmarkResult(
string BenchmarkName,
int TotalTasks,
int Passed,
double Accuracy,
Dictionary<string, double> CategoryScores,
TimeSpan TotalTime
);
public class SWEBenchEvaluator : IBenchmark
{
public string Name => "SWE-bench";
private readonly List<SWEBenchTask> _tasks;
private readonly IDockerRunner _docker;
public SWEBenchEvaluator(List<SWEBenchTask> tasks, IDockerRunner docker)
{
_tasks = tasks;
_docker = docker;
}
public async Task<BenchmarkResult> EvaluateAsync(IAgent agent)
{
var sw = Stopwatch.StartNew();
var results = new List<TaskResult>();
var categoryScores = new Dictionary<string, List<bool>>();
foreach (var task in _tasks)
{
// Agent generates patch
var patch = await agent.GeneratePatchAsync(
task.IssueDescription,
task.RepoSnapshot
);
// Apply patch and run tests in Docker
var testResult = await _docker.RunTestsAsync(
task.RepoSnapshot,
patch,
task.TestCommand,
timeoutSeconds: 1800
);
var passed = testResult.AllTestsPassed;
results.Add(new TaskResult(task.InstanceId, passed));
// Track by category
if (!categoryScores.ContainsKey(task.Category))
categoryScores[task.Category] = new List<bool>();
categoryScores[task.Category].Add(passed);
}
sw.Stop();
return new BenchmarkResult(
Name,
_tasks.Count,
results.Count(r => r.Passed),
(double)results.Count(r => r.Passed) / _tasks.Count,
categoryScores.ToDictionary(
kv => kv.Key,
kv => (double)kv.Value.Count(p => p) / kv.Value.Count
),
sw.Elapsed
);
}
}
public class GAIAEvaluator : IBenchmark
{
public string Name => "GAIA";
public async Task<BenchmarkResult> EvaluateAsync(IAgent agent)
{
var levelScores = new Dictionary<int, (int passed, int total)>
{
[1] = (0, 0), [2] = (0, 0), [3] = (0, 0)
};
foreach (var task in LoadGAIATasks())
{
var answer = await agent.AnswerAsync(task.Question, task.File);
var passed = EvaluateAnswer(answer, task.GroundTruth);
var (p, t) = levelScores[task.Level];
levelScores[task.Level] = (p + (passed ? 1 : 0), t + 1);
}
var totalPassed = levelScores.Values.Sum(x => x.passed);
var totalTasks = levelScores.Values.Sum(x => x.total);
return new BenchmarkResult(
Name, totalTasks, totalPassed,
(double)totalPassed / totalTasks,
levelScores.ToDictionary(
kv => $"Level {kv.Key}",
kv => (double)kv.Value.passed / kv.Value.total
),
TimeSpan.Zero
);
}
private bool EvaluateAnswer(string prediction, string groundTruth)
{
// GAIA relaxed exact match
return prediction.Trim().ToLower().TrimEnd('.')
== groundTruth.Trim().ToLower().TrimEnd('.');
}
}

LLM-as-Judge
Many agent behaviors are hard to evaluate with deterministic metrics. LLM-as-Judge uses a capable model to evaluate agent outputs, providing more nuanced assessment.
[Diagram: LLM-as-Judge Pipeline]

Agent Output ──▶ Judge Prompt + Rubric ──▶ Structured Evaluation (score 0-1, reasoning, suggestions)
Ground Truth (optional) ──▶ compared against the agent output by the judge

Typical judge models: GPT-4, Claude 3 Opus, Gemini 1.5 Pro
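A minimal judging loop looks like the sketch below. The llm.generate call stands in for whichever model client you use, and the rubric and JSON response shape are illustrative assumptions; the key points are an explicit rubric, a bounded score, and failing closed on unparseable judge output.

# Minimal LLM-as-Judge sketch. `llm.generate` stands in for whatever chat/LLM
# client you use; the rubric and JSON shape are illustrative, not a standard.
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.

Task: {task}
Agent answer: {answer}
Reference answer (may be empty): {reference}

Score the answer from 0.0 to 1.0 for correctness and completeness.
Respond with JSON only: {{"score": <float>, "reasoning": "<one sentence>"}}"""

def judge(llm, task: str, answer: str, reference: str = "") -> dict:
    raw = llm.generate(JUDGE_PROMPT.format(
        task=task, answer=answer, reference=reference))
    try:
        result = json.loads(raw)
        result["score"] = min(1.0, max(0.0, float(result["score"])))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed judge output: fail closed rather than guessing a score.
        result = {"score": 0.0, "reasoning": "unparseable judge output"}
    return result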
Judge Bias
LLM judges carry systematic biases: position bias (favoring the first or last response in pairwise comparisons), verbosity bias (favoring longer answers), and self-preference (favoring outputs from the same model family). Mitigate them by randomizing or swapping candidate order, scoring against an explicit rubric, and spot-checking judge scores against human labels.
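One cheap mitigation for position bias in pairwise judging is to run the comparison twice with the candidate order swapped and only accept verdicts that agree across both orderings. A sketch, reusing the hypothetical llm.generate client from above:

# Position-bias mitigation sketch: judge A-vs-B twice with the order swapped.
# Only a verdict that survives the swap counts; otherwise report a tie.
PAIRWISE_PROMPT = """Task: {task}

Response 1: {first}
Response 2: {second}

Which response is better? Answer with exactly "1", "2", or "tie"."""

def pairwise_judge(llm, task: str, a: str, b: str) -> str:
    v1 = llm.generate(PAIRWISE_PROMPT.format(task=task, first=a, second=b)).strip()
    v2 = llm.generate(PAIRWISE_PROMPT.format(task=task, first=b, second=a)).strip()
    if v1 == "1" and v2 == "2":
        return "a"      # A preferred in both orderings
    if v1 == "2" and v2 == "1":
        return "b"      # B preferred in both orderings
    return "tie"        # inconsistent or explicit tie: treat as no preference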
Evaluation Best Practices
Do
- Evaluate at multiple layers (component, task, system)
- Use held-out test sets separate from development
- Include adversarial/edge cases in test suite
- Track metrics over time to detect regression
- Combine automated metrics with human evaluation
- Report confidence intervals, not just point estimates (see the bootstrap sketch at the end of this section)
Don't
- Rely only on final answer accuracy
- Overfit to benchmark-specific patterns
- Ignore trajectory quality (how the answer was reached)
- Skip safety/guardrail evaluation
- Use only synthetic test cases
- Assume benchmark scores reflect production performance
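To make the confidence-interval recommendation concrete: with a few hundred evaluation tasks, a completion-rate point estimate can move by several points between runs. A bootstrap confidence interval is a simple, framework-free way to report that uncertainty, assuming you have per-task pass/fail results as a list of booleans:

# Bootstrap a confidence interval for task completion rate from per-task
# pass/fail results. Pure standard library; no eval framework assumed.
import random

def bootstrap_ci(results: list[bool], n_resamples: int = 10_000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float, float]:
    rng = random.Random(seed)
    n = len(results)
    point = sum(results) / n
    rates = []
    for _ in range(n_resamples):
        sample = [results[rng.randrange(n)] for _ in range(n)]
        rates.append(sum(sample) / n)
    rates.sort()
    lo = rates[int((1 - confidence) / 2 * n_resamples)]
    hi = rates[int((1 + confidence) / 2 * n_resamples) - 1]
    return point, lo, hi

# Example: 300 tasks, 192 passed
results = [True] * 192 + [False] * 108
point, lo, hi = bootstrap_ci(results)
print(f"Completion rate: {point:.1%} (95% CI {lo:.1%}-{hi:.1%})")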