Metrics¶
Metrics provide quantitative measurements about agent execution. Unlike graders (which return pass/fail), metrics return numeric values for analysis and tracking.
Overview¶
| Metric | Description | Unit |
|---|---|---|
| step_count | Total steps taken | count |
| token_usage | Total tokens consumed | tokens |
| tool_call_count | Number of tool calls | count |
| llm_call_count | Number of LLM calls | count |
| duration | Execution time | milliseconds |
| tool_diversity | Unique tools / total calls | ratio |
| step_efficiency | Steps used / max allowed | ratio |
| error_rate | Failed steps / total steps | ratio |
Built-in Metrics¶
Evaldeck automatically calculates metrics for every evaluation.
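The snippets below run each metric by hand against a trace. If you only want the values Evaldeck computes automatically, a minimal sketch (assuming the built-in metrics are attached to every `evaluate()` result by default) looks like this:

```python
from evaldeck import Evaluator

# Sketch: read the automatically computed metrics from a single evaluation.
# Assumes the built-in metrics are included on every result by default;
# trace and test_case come from your own agent run.
evaluator = Evaluator()
result = evaluator.evaluate(trace, test_case)

for metric in result.metrics:
    unit = f" {metric.unit}" if metric.unit else ""
    print(f"{metric.metric_name}: {metric.value}{unit}")
```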
StepCountMetric¶
Total number of steps in the trace.
from evaldeck.metrics import StepCountMetric
metric = StepCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 5
TokenUsageMetric¶
Total tokens consumed across all LLM calls.
from evaldeck.metrics import TokenUsageMetric
metric = TokenUsageMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 1250
print(result.details) # {"prompt": 800, "completion": 450}
ToolCallCountMetric¶
Number of tool calls made.
from evaldeck.metrics import ToolCallCountMetric
metric = ToolCallCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 3
LLMCallCountMetric¶
Number of LLM calls made.
from evaldeck.metrics import LLMCallCountMetric
metric = LLMCallCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 4
DurationMetric¶
Total execution time.
from evaldeck.metrics import DurationMetric
metric = DurationMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 2500
print(result.unit) # "ms"
ToolDiversityMetric¶
Ratio of unique tools to total tool calls.
from evaldeck.metrics import ToolDiversityMetric
metric = ToolDiversityMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.75 (3 unique tools / 4 total calls)
Interpretation:
- 1.0 = Every tool call was a different tool
- 0.5 = Half as many unique tools as calls (some repetition)
- 0.1 = Same tool called repeatedly
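The ratio is simply unique tool names divided by total tool calls. A standalone illustration with made-up tool names:

```python
# Illustrative only: how the diversity ratio is derived from a trace's tool calls.
tool_calls = ["search_flights", "search_flights", "book_flight", "search_flights"]

diversity = len(set(tool_calls)) / len(tool_calls)
print(diversity)  # 2 unique tools / 4 calls = 0.5
```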
StepEfficiencyMetric¶
Ratio of actual steps to maximum allowed.
from evaldeck.metrics import StepEfficiencyMetric
metric = StepEfficiencyMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.6 (6 steps / 10 max)
Interpretation:
- < 1.0 = Under budget (good)
- 1.0 = At maximum
- > 1.0 = Over budget

If the test case sets no max_steps, a default of 10 is used.
ErrorRateMetric¶
Ratio of failed steps to total steps.
from evaldeck.metrics import ErrorRateMetric
metric = ErrorRateMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.1 (1 failed step / 10 total)
MetricResult Structure¶
Every metric returns a MetricResult:
@dataclass
class MetricResult:
    metric_name: str      # Identifier
    value: float          # The measurement
    unit: str | None      # Optional unit (e.g., "ms", "tokens")
    details: dict | None  # Additional breakdown
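For example, the token-usage result shown earlier could be represented like this (illustrative values):

```python
from evaldeck.results import MetricResult

# Illustrative values matching the TokenUsageMetric example above.
result = MetricResult(
    metric_name="token_usage",
    value=1250,
    unit="tokens",
    details={"prompt": 800, "completion": 450},
)
print(f"{result.metric_name}: {result.value} {result.unit}")  # token_usage: 1250 tokens
```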
Viewing Metrics¶
CLI Verbose Output¶
✓ book_flight_basic (1.2s)
Metrics:
├─ step_count: 4
├─ token_usage: 1250 tokens
│ └─ prompt: 800, completion: 450
├─ tool_call_count: 2
├─ llm_call_count: 2
├─ duration: 1200 ms
├─ tool_diversity: 1.0
├─ step_efficiency: 0.4
└─ error_rate: 0.0
JSON Output¶
{
  "results": [
    {
      "test_case": "book_flight_basic",
      "metrics": [
        {"name": "step_count", "value": 4},
        {"name": "token_usage", "value": 1250, "unit": "tokens"},
        {"name": "tool_call_count", "value": 2},
        {"name": "duration", "value": 1200, "unit": "ms"}
      ]
    }
  ]
}
Creating Custom Metrics¶
Basic Custom Metric¶
from evaldeck.metrics import BaseMetric
from evaldeck import Trace, EvalCase
from evaldeck.results import MetricResult
class AverageStepDuration(BaseMetric):
    """Calculate average duration per step."""

    def calculate(self, trace: Trace, test_case: EvalCase) -> MetricResult:
        if not trace.steps:
            return MetricResult(
                metric_name="avg_step_duration",
                value=0.0,
                unit="ms"
            )

        total_duration = sum(
            step.duration_ms or 0
            for step in trace.steps
        )
        avg = total_duration / len(trace.steps)

        return MetricResult(
            metric_name="avg_step_duration",
            value=round(avg, 2),
            unit="ms",
            details={
                "total_duration": total_duration,
                "step_count": len(trace.steps)
            }
        )
Using Custom Metrics¶
from evaldeck import Evaluator
from my_metrics import AverageStepDuration
evaluator = Evaluator()
evaluator.add_metric(AverageStepDuration())
result = evaluator.evaluate(trace, test_case)
for metric in result.metrics:
    print(f"{metric.metric_name}: {metric.value}")
Metrics vs Graders¶
| Aspect | Metrics | Graders |
|---|---|---|
| Output | Numeric value | Pass/Fail |
| Purpose | Measurement | Evaluation |
| Example | "Used 1250 tokens" | "Token budget exceeded" |
| Use case | Tracking, analysis | CI/CD gates |
Combining Both¶
Use metrics for tracking, graders for pass/fail:
# Metric: measure tokens
class TokenUsageMetric(BaseMetric):
    def calculate(self, trace, test_case):
        return MetricResult("token_usage", trace.total_tokens, "tokens")

# Grader: enforce limit
class TokenBudgetGrader(BaseGrader):
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens

    def grade(self, trace, test_case):
        if trace.total_tokens <= self.max_tokens:
            return GradeResult.passed_result(...)
        return GradeResult.failed_result(...)
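A sketch of wiring both onto one evaluator: add_metric is documented above, while add_grader is assumed here to be the grader-side equivalent.

```python
from evaldeck import Evaluator

evaluator = Evaluator()
evaluator.add_metric(TokenUsageMetric())                    # tracked on every result
evaluator.add_grader(TokenBudgetGrader(max_tokens=2000))    # assumed API: pass/fail gate

result = evaluator.evaluate(trace, test_case)  # trace/test_case from your agent run
```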
Analyzing Metrics¶
Aggregate Statistics¶
from evaldeck import EvaluationRunner
runner = EvaluationRunner(config)
run_result = runner.run(suites, agent_func)
# Collect metrics across all tests
all_tokens = []
all_durations = []
for suite_result in run_result.suite_results:
    for eval_result in suite_result.results:
        for metric in eval_result.metrics:
            if metric.metric_name == "token_usage":
                all_tokens.append(metric.value)
            elif metric.metric_name == "duration":
                all_durations.append(metric.value)
print(f"Avg tokens: {sum(all_tokens) / len(all_tokens):.0f}")
print(f"Avg duration: {sum(all_durations) / len(all_durations):.0f}ms")
Trend Analysis¶
Track metrics over time to detect regressions. For example, compare the current run against a saved baseline:
import json

with open("results-baseline.json") as f:
    baseline = json.load(f)
with open("results-current.json") as f:
    current = json.load(f)

# Compare average token usage. In the JSON output, "metrics" is a list of
# {"name", "value"} objects, so look the token_usage value up by name.
def token_usage(result):
    return next(m["value"] for m in result["metrics"] if m["name"] == "token_usage")

baseline_tokens = [token_usage(r) for r in baseline["results"]]
current_tokens = [token_usage(r) for r in current["results"]]

change = (sum(current_tokens) - sum(baseline_tokens)) / sum(baseline_tokens) * 100
print(f"Token usage change: {change:+.1f}%")
Best Practices¶
- Track metrics over time - Detect performance regressions
- Set baselines - Know what "normal" looks like
- Alert on anomalies - Catch issues early; see the regression-gate sketch after this list
- Use metrics for optimization - Find efficiency opportunities
- Correlate with graders - Understand why tests fail
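As one way to act on the alerting point above, here is a minimal sketch, assuming result files in the JSON shape shown earlier, that fails a CI job when average duration regresses past a chosen threshold:

```python
import json
import sys

THRESHOLD_PCT = 20.0  # assumed budget: alert if average duration grows more than 20%

def avg_metric(path, name):
    """Average of one named metric across all results in an Evaldeck JSON report."""
    with open(path) as f:
        data = json.load(f)
    values = [
        m["value"]
        for r in data["results"]
        for m in r["metrics"]
        if m["name"] == name
    ]
    return sum(values) / len(values)

baseline = avg_metric("results-baseline.json", "duration")
current = avg_metric("results-current.json", "duration")
change_pct = (current - baseline) / baseline * 100

print(f"Avg duration: {baseline:.0f} ms -> {current:.0f} ms ({change_pct:+.1f}%)")
if change_pct > THRESHOLD_PCT:
    sys.exit(1)  # fail the CI job on a latency regression
```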