Metrics¶
Metrics provide quantitative measurements about agent execution. Unlike graders (which return pass/fail), metrics return numeric values for analysis and tracking.
Overview¶
| Metric | Description | Unit |
|---|---|---|
| step_count | Total steps taken | count |
| token_usage | Total tokens consumed | tokens |
| tool_call_count | Number of tool calls | count |
| llm_call_count | Number of LLM calls | count |
| duration | Execution time | milliseconds |
| tool_diversity | Unique tools / total calls | ratio |
| step_efficiency | Steps used / max allowed | ratio |
| error_rate | Failed steps / total steps | ratio |
Built-in Metrics¶
Evaldeck automatically calculates metrics for every evaluation.
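The snippets below run each metric by hand against a trace. If you only want the values Evaldeck computes automatically, a minimal sketch (assuming the built-in metrics are attached to every `evaluate()` result by default) looks like this:

```python
from evaldeck import Evaluator

# Sketch: read the automatically computed metrics from a single evaluation.
# Assumes the built-in metrics are included on every result by default;
# trace and test_case come from your own agent run.
evaluator = Evaluator()
result = evaluator.evaluate(trace, test_case)

for metric in result.metrics:
    unit = f" {metric.unit}" if metric.unit else ""
    print(f"{metric.metric_name}: {metric.value}{unit}")
```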
StepCountMetric¶
Total number of steps in the trace.
from evaldeck.metrics import StepCountMetric
metric = StepCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 5
TokenUsageMetric¶
Total tokens consumed across all LLM calls.
from evaldeck.metrics import TokenUsageMetric
metric = TokenUsageMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 1250
print(result.details) # {"prompt": 800, "completion": 450}
ToolCallCountMetric¶
Number of tool calls made.
from evaldeck.metrics import ToolCallCountMetric
metric = ToolCallCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 3
LLMCallCountMetric¶
Number of LLM calls made.
from evaldeck.metrics import LLMCallCountMetric
metric = LLMCallCountMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 4
DurationMetric¶
Total execution time.
from evaldeck.metrics import DurationMetric
metric = DurationMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 2500
print(result.unit) # "ms"
ToolDiversityMetric¶
Ratio of unique tools to total tool calls.
from evaldeck.metrics import ToolDiversityMetric
metric = ToolDiversityMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.75 (3 unique tools / 4 total calls)
Interpretation:
- 1.0 = Every tool call was a different tool
- 0.5 = Half as many unique tools as calls (some repetition)
- 0.1 = Same tool called repeatedly
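The ratio is simply unique tool names divided by total tool calls. A standalone illustration with made-up tool names:

```python
# Illustrative only: how the diversity ratio is derived from a trace's tool calls.
tool_calls = ["search_flights", "search_flights", "book_flight", "search_flights"]

diversity = len(set(tool_calls)) / len(tool_calls)
print(diversity)  # 2 unique tools / 4 calls = 0.5
```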
StepEfficiencyMetric¶
Ratio of actual steps to maximum allowed.
from evaldeck.metrics import StepEfficiencyMetric
metric = StepEfficiencyMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.6 (6 steps / 10 max)
Interpretation:
- < 1.0 = Under budget (good)
- 1.0 = At maximum
- > 1.0 = Over budget

If the test case sets no max_steps, a default of 10 is used.
ErrorRateMetric¶
Ratio of failed steps to total steps.
from evaldeck.metrics import ErrorRateMetric
metric = ErrorRateMetric()
result = metric.calculate(trace, test_case)
print(result.value) # e.g., 0.1 (1 failed step / 10 total)
MetricResult Structure¶
Every metric returns a MetricResult:
@dataclass
class MetricResult:
    metric_name: str      # Identifier
    value: float          # The measurement
    unit: str | None      # Optional unit (e.g., "ms", "tokens")
    details: dict | None  # Additional breakdown
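For example, the token-usage result shown earlier could be represented like this (illustrative values):

```python
from evaldeck.results import MetricResult

# Illustrative values matching the TokenUsageMetric example above.
result = MetricResult(
    metric_name="token_usage",
    value=1250,
    unit="tokens",
    details={"prompt": 800, "completion": 450},
)
print(f"{result.metric_name}: {result.value} {result.unit}")  # token_usage: 1250 tokens
```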
Viewing Metrics¶
CLI Verbose Output¶
✓ book_flight_basic (1.2s)
Metrics:
├─ step_count: 4
├─ token_usage: 1250 tokens
│ └─ prompt: 800, completion: 450
├─ tool_call_count: 2
├─ llm_call_count: 2
├─ duration: 1200 ms
├─ tool_diversity: 1.0
├─ step_efficiency: 0.4
└─ error_rate: 0.0
JSON Output¶
{
  "results": [
    {
      "test_case": "book_flight_basic",
      "metrics": [
        {"name": "step_count", "value": 4},
        {"name": "token_usage", "value": 1250, "unit": "tokens"},
        {"name": "tool_call_count", "value": 2},
        {"name": "duration", "value": 1200, "unit": "ms"}
      ]
    }
  ]
}
Creating Custom Metrics¶
Basic Custom Metric¶
from evaldeck.metrics import BaseMetric
from evaldeck import Trace, EvalCase
from evaldeck.results import MetricResult
class AverageStepDuration(BaseMetric):
    """Calculate average duration per step."""

    def calculate(self, trace: Trace, test_case: EvalCase) -> MetricResult:
        if not trace.steps:
            return MetricResult(
                metric_name="avg_step_duration",
                value=0.0,
                unit="ms"
            )

        total_duration = sum(
            step.duration_ms or 0
            for step in trace.steps
        )
        avg = total_duration / len(trace.steps)

        return MetricResult(
            metric_name="avg_step_duration",
            value=round(avg, 2),
            unit="ms",
            details={
                "total_duration": total_duration,
                "step_count": len(trace.steps)
            }
        )
Using Custom Metrics¶
from evaldeck import Evaluator
from my_metrics import AverageStepDuration
evaluator = Evaluator()
evaluator.add_metric(AverageStepDuration())
result = evaluator.evaluate(trace, test_case)
for metric in result.metrics:
    print(f"{metric.metric_name}: {metric.value}")
Metrics vs Graders¶
| Aspect | Metrics | Graders |
|---|---|---|
| Output | Numeric value | Pass/Fail |
| Purpose | Measurement | Evaluation |
| Example | "Used 1250 tokens" | "Token budget exceeded" |
| Use case | Tracking, analysis | CI/CD gates |
Combining Both¶
Use metrics for tracking, graders for pass/fail:
# Metric: measure tokens
class TokenUsageMetric(BaseMetric):
    def calculate(self, trace, test_case):
        return MetricResult("token_usage", trace.total_tokens, "tokens")

# Grader: enforce limit
class TokenBudgetGrader(BaseGrader):
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens

    def grade(self, trace, test_case):
        if trace.total_tokens <= self.max_tokens:
            return GradeResult.passed_result(...)
        return GradeResult.failed_result(...)
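A sketch of wiring both onto one evaluator: add_metric is documented above, while add_grader is assumed here to be the grader-side equivalent.

```python
from evaldeck import Evaluator

evaluator = Evaluator()
evaluator.add_metric(TokenUsageMetric())                    # tracked on every result
evaluator.add_grader(TokenBudgetGrader(max_tokens=2000))    # assumed API: pass/fail gate

result = evaluator.evaluate(trace, test_case)  # trace/test_case from your agent run
```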
Analyzing Metrics¶
Aggregate Statistics¶
from evaldeck import EvaluationRunner
runner = EvaluationRunner(config)
run_result = runner.run(suites, agent_func)
# Collect metrics across all tests
all_tokens = []
all_durations = []
for suite_result in run_result.suite_results:
    for eval_result in suite_result.results:
        for metric in eval_result.metrics:
            if metric.metric_name == "token_usage":
                all_tokens.append(metric.value)
            elif metric.metric_name == "duration":
                all_durations.append(metric.value)
print(f"Avg tokens: {sum(all_tokens) / len(all_tokens):.0f}")
print(f"Avg duration: {sum(all_durations) / len(all_durations):.0f}ms")
Trend Analysis¶
Track metrics over time to detect regressions. For example, compare the current run against a saved baseline:
import json

with open("results-baseline.json") as f:
    baseline = json.load(f)
with open("results-current.json") as f:
    current = json.load(f)

# Compare average token usage. In the JSON output, "metrics" is a list of
# {"name", "value"} objects, so look the token_usage value up by name.
def token_usage(result):
    return next(m["value"] for m in result["metrics"] if m["name"] == "token_usage")

baseline_tokens = [token_usage(r) for r in baseline["results"]]
current_tokens = [token_usage(r) for r in current["results"]]

change = (sum(current_tokens) - sum(baseline_tokens)) / sum(baseline_tokens) * 100
print(f"Token usage change: {change:+.1f}%")
Best Practices¶
- Track metrics over time - Detect performance regressions
- Set baselines - Know what "normal" looks like
- Alert on anomalies - Catch issues early; see the regression-gate sketch after this list
- Use metrics for optimization - Find efficiency opportunities
- Correlate with graders - Understand why tests fail
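As one way to act on the alerting point above, here is a minimal sketch, assuming result files in the JSON shape shown earlier, that fails a CI job when average duration regresses past a chosen threshold:

```python
import json
import sys

THRESHOLD_PCT = 20.0  # assumed budget: alert if average duration grows more than 20%

def avg_metric(path, name):
    """Average of one named metric across all results in an Evaldeck JSON report."""
    with open(path) as f:
        data = json.load(f)
    values = [
        m["value"]
        for r in data["results"]
        for m in r["metrics"]
        if m["name"] == name
    ]
    return sum(values) / len(values)

baseline = avg_metric("results-baseline.json", "duration")
current = avg_metric("results-current.json", "duration")
change_pct = (current - baseline) / baseline * 100

print(f"Avg duration: {baseline:.0f} ms -> {current:.0f} ms ({change_pct:+.1f}%)")
if change_pct > THRESHOLD_PCT:
    sys.exit(1)  # fail the CI job on a latency regression
```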