Evaluator¶
evaldeck.evaluator.Evaluator ¶
Main evaluation engine.
Evaluates agent traces against test cases using graders and metrics.
Choosing sync vs. async methods:

Use `evaluate()` (sync) when:

- Running a single quick evaluation with code-based graders
- Your graders are all CPU-bound (`ContainsGrader`, `RegexGrader`, etc.)
- You're in a sync context without an event loop

Use `evaluate_async()` when:

- Using `LLMGrader` or other I/O-bound graders
- Running multiple graders that make API calls
- You want concurrent grader execution for better throughput
- Your custom graders/metrics make async API calls

Use `evaluate_suite_async()` when:

- Running multiple test cases (concurrent execution)
- Your agent function is async
- You want to control concurrency with `max_concurrent`
Performance comparison:

```python
# Sync: graders run sequentially.
# 3 LLMGraders × 2 seconds each = ~6 seconds total
result = evaluator.evaluate(trace, test_case)

# Async: graders run concurrently.
# 3 LLMGraders × 2 seconds each = ~2 seconds total
result = await evaluator.evaluate_async(trace, test_case)
```
Initialize the evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `graders` | `list[BaseGrader] \| None` | List of graders to use. If `None`, uses defaults based on the test case. | `None` |
| `metrics` | `list[BaseMetric] \| None` | List of metrics to calculate. If `None`, uses defaults. | `None` |
| `config` | `EvaldeckConfig \| None` | Evaldeck configuration. | `None` |
Source code in src/evaldeck/evaluator.py
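For illustration, a minimal construction sketch. The grader class names come from this page, but the `evaldeck.graders` import path and the constructor arguments are assumptions, not documented API:

```python
from evaldeck.evaluator import Evaluator
from evaldeck.graders import ContainsGrader, LLMGrader  # import path assumed

evaluator = Evaluator(
    graders=[
        ContainsGrader(expected="refund issued"),  # constructor args assumed
        LLMGrader(model="gpt-4o-mini"),            # constructor args assumed
    ],
    metrics=None,  # None: use the default metrics
    config=None,   # None: use the default configuration
)
```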
evaluate ¶
Evaluate a single trace against a test case (sync).
Runs graders and metrics sequentially. Best for:

- Code-based graders (`ContainsGrader`, `RegexGrader`, etc.)
- Quick evaluations without I/O-bound operations
- Contexts without an async event loop

For I/O-bound graders (`LLMGrader`) or concurrent execution, use `evaluate_async()` instead.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `trace` | `Trace` | The execution trace to evaluate. | *required* |
| `test_case` | `EvalCase` | The test case defining expected behavior. | *required* |
Returns:

| Type | Description |
|---|---|
| `EvaluationResult` | `EvaluationResult` with grades and metrics. |
Source code in src/evaldeck/evaluator.py
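A usage sketch; how `Trace` and `EvalCase` instances are built is not covered on this page, so the helper and constructor below are hypothetical placeholders:

```python
from evaldeck.evaluator import Evaluator

trace = run_my_agent("What is 2 + 2?")        # hypothetical helper returning a Trace
test_case = EvalCase(input="What is 2 + 2?")  # constructor fields assumed

evaluator = Evaluator()  # None graders/metrics: defaults per the test case
result = evaluator.evaluate(trace, test_case)
print(result)  # EvaluationResult with grades and metrics
```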
evaluate_async `async` ¶
Evaluate a single trace against a test case (async).
Runs graders and metrics concurrently using `asyncio.gather()`. Recommended for:

- `LLMGrader` (makes async API calls to OpenAI/Anthropic)
- Custom async graders that call external services
- Custom async metrics that fetch benchmark data
- Any scenario with multiple I/O-bound operations

Performance benefit: with 3 `LLMGrader`s each taking 2 seconds, the sync `evaluate()` takes ~6 seconds while `evaluate_async()` takes ~2 seconds.

Code-based graders (`ContainsGrader`, etc.) automatically run in a thread pool via `asyncio.to_thread()` to avoid blocking the event loop.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `trace` | `Trace` | The execution trace to evaluate. | *required* |
| `test_case` | `EvalCase` | The test case defining expected behavior. | *required* |
Returns:

| Type | Description |
|---|---|
| `EvaluationResult` | `EvaluationResult` with grades and metrics. |
Source code in src/evaldeck/evaluator.py
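A sketch of the async path, reusing the `evaluator`, `trace`, and `test_case` placeholders from the previous example:

```python
import asyncio

async def main() -> None:
    # I/O-bound graders (e.g. LLMGrader) run concurrently via
    # asyncio.gather(); code-based graders go to a thread pool.
    result = await evaluator.evaluate_async(trace, test_case)
    print(result)  # EvaluationResult with grades and metrics

asyncio.run(main())
```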
evaluate_suite ¶
Evaluate all test cases in a suite (sync wrapper).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `suite` | `EvalSuite` | The test suite to evaluate. | *required* |
| `agent_func` | `Callable[[str], Trace] \| Callable[[str], Awaitable[Trace]]` | Function that takes an input string and returns a `Trace`. Can be sync or async. | *required* |
| `on_result` | `Callable[[EvaluationResult], None] \| None` | Optional callback called after each test case. | `None` |
| `max_concurrent` | `int` | Maximum concurrent tests. `0` = unlimited. | `0` |
Returns:

| Type | Description |
|---|---|
| `SuiteResult` | `SuiteResult` with all evaluation results. |
Source code in src/evaldeck/evaluator.py
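A sketch, assuming `suite` is an `EvalSuite` loaded elsewhere; `my_agent` is a placeholder with the documented `(str) -> Trace` shape:

```python
def my_agent(user_input: str):
    ...  # run the agent under test and return its execution Trace

def print_progress(result) -> None:
    print(f"completed: {result}")  # fires after each test case

suite_result = evaluator.evaluate_suite(
    suite,                     # an EvalSuite (loading elided)
    agent_func=my_agent,       # sync here; async callables also accepted
    on_result=print_progress,
    max_concurrent=4,          # 0 = unlimited
)
```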
evaluate_suite_async `async` ¶
Evaluate all test cases in a suite concurrently.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `suite` | `EvalSuite` | The test suite to evaluate. | *required* |
| `agent_func` | `Callable[[str], Trace] \| Callable[[str], Awaitable[Trace]]` | Function that takes an input string and returns a `Trace`. Can be sync or async. | *required* |
| `on_result` | `Callable[[EvaluationResult], None] \| None` | Optional callback called after each test case. | `None` |
| `max_concurrent` | `int` | Maximum concurrent tests. `0` = unlimited. | `0` |
Returns:

| Type | Description |
|---|---|
| `SuiteResult` | `SuiteResult` with all evaluation results. |
Source code in src/evaldeck/evaluator.py
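The same flow with an async agent, sketched under the same assumptions:

```python
import asyncio

async def my_async_agent(user_input: str):
    ...  # await the agent under test and return its Trace

async def main() -> None:
    suite_result = await evaluator.evaluate_suite_async(
        suite,
        agent_func=my_async_agent,
        max_concurrent=8,  # 0 = unlimited
    )
    print(suite_result)  # SuiteResult with all evaluation results

asyncio.run(main())
```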
evaldeck.evaluator.EvaluationRunner ¶
High-level runner for executing evaluations.
Initialize the runner.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `EvaldeckConfig \| None` | Evaldeck configuration. If `None`, loads from file. | `None` |
Source code in src/evaldeck/evaluator.py
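A minimal sketch of both construction modes:

```python
from evaldeck.evaluator import EvaluationRunner

runner = EvaluationRunner()  # config=None: load configuration from file

# Or pass configuration explicitly (EvaldeckConfig construction elided):
# runner = EvaluationRunner(config=my_config)
```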
run ¶
Run evaluation on multiple suites (sync wrapper).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `suites` | `list[EvalSuite] \| None` | Test suites to run. If `None`, discovers from config. | `None` |
| `agent_func` | `Callable[[str], Trace] \| Callable[[str], Awaitable[Trace]] \| None` | Function to run the agent. If `None`, loads from config. Can be sync or async. | `None` |
| `tags` | `list[str] \| None` | Filter test cases by tags. | `None` |
| `on_result` | `Callable[[EvaluationResult], None] \| None` | Callback for each result. | `None` |
| `max_concurrent` | `int \| None` | Max concurrent tests per suite. `None` = use config. | `None` |
Returns:

| Type | Description |
|---|---|
| `RunResult` | `RunResult` with all suite results. |
Source code in src/evaldeck/evaluator.py
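A sketch of a config-driven run; the `"smoke"` tag is an illustrative filter, not a convention of the library:

```python
# suites=None and agent_func=None: discover both from configuration.
run_result = runner.run(
    tags=["smoke"],       # only run test cases carrying this tag
    max_concurrent=None,  # None: use the configured limit
)
print(run_result)  # RunResult with all suite results
```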
run_async `async` ¶
Run evaluation on multiple suites asynchronously.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `suites` | `list[EvalSuite] \| None` | Test suites to run. If `None`, discovers from config. | `None` |
| `agent_func` | `Callable[[str], Trace] \| Callable[[str], Awaitable[Trace]] \| None` | Function to run the agent. If `None`, loads from config. Can be sync or async. | `None` |
| `tags` | `list[str] \| None` | Filter test cases by tags. | `None` |
| `on_result` | `Callable[[EvaluationResult], None] \| None` | Callback for each result. | `None` |
| `max_concurrent` | `int \| None` | Max concurrent tests per suite. `None` = use config. | `None` |
Returns:

| Type | Description |
|---|---|
| `RunResult` | `RunResult` with all suite results. |

Source code in src/evaldeck/evaluator.py
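And the async equivalent, for callers already inside an event loop; the `"regression"` tag is again illustrative:

```python
import asyncio

async def main() -> None:
    # Same semantics as run(), but awaitable within an existing loop.
    run_result = await runner.run_async(tags=["regression"])
    print(run_result)  # RunResult with all suite results

asyncio.run(main())
```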