Metrics
evaldeck.metrics.BaseMetric
Bases: ABC
Base class for all metrics.
Metrics calculate quantitative measurements from traces. Unlike graders, metrics do not pass or fail; they only measure.
Supports both sync and async calculation. Override calculate_async() for metrics that need to make async I/O calls (e.g., fetching external benchmark data).
calculate (abstractmethod)
Calculate the metric value (sync).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trace | Trace | The execution trace to measure. | required |
| test_case | EvalCase \| None | Optional test case for context. | None |
Returns:
| Type | Description |
|---|---|
| MetricResult | MetricResult with the calculated value. |
Source code in src/evaldeck/metrics/base.py
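As an illustration of the contract described above, here is a minimal sketch of a custom metric. The `Trace`, `MetricResult`, and `BaseMetric` definitions are hypothetical stand-ins written so the snippet runs on its own; the real evaldeck classes may have different fields and constructors, and the `StepCount` class and the `steps`, `metric_name`, and `value` attribute names are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Hypothetical stand-ins so the sketch runs without evaldeck installed.
# The real evaldeck Trace and MetricResult may look different.
@dataclass
class Trace:
    steps: list = field(default_factory=list)


@dataclass
class MetricResult:
    metric_name: str
    value: float


class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, trace, test_case=None):
        """Return a MetricResult for the given trace."""


class StepCount(BaseMetric):
    """Counts the steps in a trace. It only measures; it never passes or fails."""

    def calculate(self, trace, test_case=None):
        # test_case is optional context and is not needed for a plain count.
        return MetricResult(metric_name="step_count", value=float(len(trace.steps)))


result = StepCount().calculate(Trace(steps=["plan", "search", "answer"]))
print(result)
```

Because `calculate` is abstract, instantiating `BaseMetric` directly (or a subclass that does not implement it) raises `TypeError`.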
calculate_async (async)
Calculate the metric value (async).
Default implementation runs sync calculate() in a thread pool. Override this method for true async behavior (e.g., async API calls for external benchmarking services).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trace | Trace | The execution trace to measure. | required |
| test_case | EvalCase \| None | Optional test case for context. | None |
Returns:
| Type | Description |
|---|---|
| MetricResult | MetricResult with the calculated value. |
Source code in src/evaldeck/metrics/base.py
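The sync/async split described above can be sketched as follows. This is not the evaldeck source: it assumes the default delegates to a worker thread via `asyncio.to_thread`, and `MetricResult`, `SyncOnlyMetric`, `RemoteBaselineMetric`, and `_fetch_baseline` are hypothetical names used only for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical stand-in for evaldeck's MetricResult.
@dataclass
class MetricResult:
    metric_name: str
    value: float


class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, trace, test_case=None):
        """Sync calculation; every metric must implement this."""

    async def calculate_async(self, trace, test_case=None):
        # Assumed default: run the sync method in a worker thread so the
        # event loop is not blocked while it computes.
        return await asyncio.to_thread(self.calculate, trace, test_case)


class SyncOnlyMetric(BaseMetric):
    """Implements only calculate(); inherits the thread-pool async default."""

    def calculate(self, trace, test_case=None):
        return MetricResult("length", float(len(trace)))


class RemoteBaselineMetric(BaseMetric):
    """Overrides calculate_async() because it needs real async I/O."""

    def calculate(self, trace, test_case=None):
        return MetricResult("vs_baseline", float(len(trace)))

    async def calculate_async(self, trace, test_case=None):
        baseline = await self._fetch_baseline()
        return MetricResult("vs_baseline", len(trace) / baseline)

    async def _fetch_baseline(self):
        await asyncio.sleep(0)  # placeholder for an async API call
        return 4.0


async def main():
    trace = ["step-a", "step-b"]
    return await asyncio.gather(
        SyncOnlyMetric().calculate_async(trace),
        RemoteBaselineMetric().calculate_async(trace),
    )


sync_result, async_result = asyncio.run(main())
print(sync_result, async_result)
```

With this shape, callers can always await `calculate_async()` and do not need to know whether a given metric overrides it.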
evaldeck.metrics.StepCountMetric
evaldeck.metrics.TokenUsageMetric
evaldeck.metrics.ToolCallCountMetric
evaldeck.metrics.DurationMetric
evaldeck.metrics.ToolDiversityMetric
evaldeck.metrics.StepEfficiencyMetric
Bases: BaseMetric
Measure step efficiency compared to expected max steps.
Returns 1.0 if the trace stays within the expected maximum number of steps, and a value below 1.0 if it exceeds that maximum.
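One scoring rule consistent with that description is sketched below. The source only states the 1.0 and below-1.0 behaviour; the ratio `max_steps / actual_steps` used for the exceeded case, and the function and parameter names, are assumptions rather than the library's confirmed formula.

```python
def step_efficiency(actual_steps: int, max_steps: int) -> float:
    """Score 1.0 when within budget, otherwise the fraction of budget to steps used."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    if actual_steps <= max_steps:
        return 1.0
    # Assumed penalty: shrinks toward 0 as the trace overshoots the budget.
    return max_steps / actual_steps


print(step_efficiency(3, 5))   # within budget
print(step_efficiency(10, 5))  # twice the budget
```

Under this assumed rule, a trace that uses twice the expected steps scores 0.5.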