Metrics

evaldeck.metrics.BaseMetric

Bases: ABC

Base class for all metrics.

Metrics calculate quantitative measurements from traces. Unlike graders, metrics do not pass or fail; they simply measure.

Supports both sync and async calculation. Override calculate_async() for metrics that need to make async I/O calls (e.g., fetching external benchmark data).

calculate abstractmethod

calculate(trace, test_case=None)

Calculate the metric value (sync).

Parameters:

    trace (Trace): The execution trace to measure. Required.
    test_case (EvalCase | None): Optional test case for context. Default: None.

Returns:

    MetricResult: MetricResult with the calculated value.

Source code in src/evaldeck/metrics/base.py
@abstractmethod
def calculate(self, trace: Trace, test_case: EvalCase | None = None) -> MetricResult:
    """Calculate the metric value (sync).

    Args:
        trace: The execution trace to measure.
        test_case: Optional test case for context.

    Returns:
        MetricResult with the calculated value.
    """
    pass
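A subclass provides the sync measurement by implementing calculate(). A minimal self-contained sketch of the pattern (the Trace, MetricResult, and ErrorStepMetric classes below are simplified stand-ins for illustration, not evaldeck's actual definitions):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Trace:
    """Simplified stand-in for evaldeck's Trace: a list of step dicts."""
    steps: list = field(default_factory=list)

@dataclass
class MetricResult:
    """Simplified stand-in for evaldeck's MetricResult."""
    name: str
    value: float

class BaseMetric(ABC):
    @abstractmethod
    def calculate(self, trace, test_case=None) -> MetricResult:
        """Calculate the metric value (sync)."""

class ErrorStepMetric(BaseMetric):
    """Hypothetical custom metric: count steps flagged as errors."""

    def calculate(self, trace, test_case=None) -> MetricResult:
        errors = sum(1 for step in trace.steps if step.get("error"))
        return MetricResult(name="error_steps", value=float(errors))

trace = Trace(steps=[{"error": True}, {"error": False}, {"error": True}])
result = ErrorStepMetric().calculate(trace)
print(result.value)  # 2.0
```

Because calculate() is a pure function of the trace, the default async path can safely offload it to a thread without further changes.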

calculate_async async

calculate_async(trace, test_case=None)

Calculate the metric value (async).

Default implementation runs sync calculate() in a thread pool. Override this method for true async behavior (e.g., async API calls for external benchmarking services).

Parameters:

    trace (Trace): The execution trace to measure. Required.
    test_case (EvalCase | None): Optional test case for context. Default: None.

Returns:

    MetricResult: MetricResult with the calculated value.

Source code in src/evaldeck/metrics/base.py
async def calculate_async(
    self, trace: Trace, test_case: EvalCase | None = None
) -> MetricResult:
    """Calculate the metric value (async).

    Default implementation runs sync calculate() in a thread pool.
    Override this method for true async behavior (e.g., async API calls
    for external benchmarking services).

    Args:
        trace: The execution trace to measure.
        test_case: Optional test case for context.

    Returns:
        MetricResult with the calculated value.
    """
    return await asyncio.to_thread(self.calculate, trace, test_case)

evaldeck.metrics.StepCountMetric

Bases: BaseMetric

Count total number of steps in the trace.


evaldeck.metrics.TokenUsageMetric

Bases: BaseMetric

Total token usage across all LLM calls.


evaldeck.metrics.ToolCallCountMetric

Bases: BaseMetric

Count number of tool calls.


evaldeck.metrics.DurationMetric

Bases: BaseMetric

Total execution duration.


evaldeck.metrics.ToolDiversityMetric

Bases: BaseMetric

Measure diversity of tools used (unique tools / total calls).
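The diversity ratio described above (unique tools divided by total calls) can be sketched directly. This is a simplified illustration, not evaldeck's implementation; the tool-call list shape is an assumption:

```python
def tool_diversity(tool_calls: list[str]) -> float:
    """Unique tool names divided by total tool calls; 0.0 for no calls."""
    if not tool_calls:
        return 0.0
    return len(set(tool_calls)) / len(tool_calls)

# Four calls across two distinct tools -> 2 / 4 = 0.5
ratio = tool_diversity(["search", "search", "calculator", "search"])
print(ratio)  # 0.5
```

A ratio near 1.0 means every call used a different tool; a ratio near 0.0 means the agent hammered the same tool repeatedly.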


evaldeck.metrics.StepEfficiencyMetric

Bases: BaseMetric

Measure step efficiency compared to expected max steps.

Returns 1.0 if within expected steps, <1.0 if exceeded.
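One plausible reading of that contract is a ratio that caps at 1.0 and decays as the step budget is exceeded. The exact formula below is an assumption for illustration, not evaldeck's documented behavior:

```python
def step_efficiency(actual_steps: int, expected_max_steps: int) -> float:
    """1.0 when within budget; expected/actual when exceeded (assumed formula)."""
    if actual_steps <= expected_max_steps:
        return 1.0
    return expected_max_steps / actual_steps

within = step_efficiency(4, expected_max_steps=5)   # under budget
over = step_efficiency(10, expected_max_steps=5)    # double the budget
print(within, over)  # 1.0 0.5
```

Under this reading, taking twice the expected number of steps halves the score, which gives runs a smooth penalty rather than a hard pass/fail cutoff.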