# Code-Based Graders

Deterministic graders for rule-based evaluation.
## evaldeck.graders.ContainsGrader

Bases: BaseGrader

Check if output contains expected values.

Initialize contains grader.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| values | list[str] \| None | Strings that must be present. If None, uses test_case.expected. | None |
| field | str | Field to check ("output" or "reasoning"). | 'output' |
| case_sensitive | bool | Whether to do case-sensitive matching. | False |
Source code in src/evaldeck/graders/code.py
### grade

Check if all values are present in the output.
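The containment check behaves roughly like the following standalone sketch (illustrative only, not the library's source; the helper name is hypothetical):

```python
def contains_check(output: str, values: list[str], case_sensitive: bool = False):
    """Substring containment: return (passed, missing_values)."""
    haystack = output if case_sensitive else output.lower()
    missing = [v for v in values
               if (v if case_sensitive else v.lower()) not in haystack]
    return (not missing, missing)

# With the documented default case_sensitive=False:
print(contains_check("Hello, World!", ["hello", "world"]))  # (True, [])
print(contains_check("Hello", ["world"]))                   # (False, ['world'])
```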
## evaldeck.graders.NotContainsGrader

Bases: BaseGrader

Check that output does NOT contain certain values.

### grade

Check that no forbidden values are present.
## evaldeck.graders.EqualsGrader

Bases: BaseGrader

Check if output exactly equals expected value.

### grade

Check exact equality.
## evaldeck.graders.RegexGrader

Bases: BaseGrader

Check if output matches a regex pattern.

### grade

Check regex match.
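A minimal sketch of the regex check, assuming search semantics (match anywhere in the output); whether the actual grader anchors the pattern instead is an implementation detail not shown here:

```python
import re

def regex_check(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output (re.search semantics)."""
    return re.search(pattern, output) is not None

print(regex_check("Order #12345 confirmed", r"#\d{5}"))  # True
print(regex_check("Order pending", r"#\d{5}"))           # False
```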
## evaldeck.graders.ToolCalledGrader

Bases: BaseGrader

Check that required tools were called.

Initialize tool called grader.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| required | list[str] \| None | List of tool names that must be called. If None, uses test_case.expected.tools_called. | None |
### grade

Check that all required tools were called.
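The required-tools check amounts to an inclusion test over the tool calls recorded in the trace. A sketch, with the trace reduced to a plain list of tool names (an assumption for illustration, not the library's trace type):

```python
def tools_called_check(called: list[str], required: list[str]):
    """Every required tool must appear among the recorded calls; order and repeats are irrelevant."""
    missing = [t for t in required if t not in called]
    return (not missing, missing)

print(tools_called_check(["search", "fetch", "search"], ["search", "fetch"]))  # (True, [])
print(tools_called_check(["search"], ["search", "fetch"]))                     # (False, ['fetch'])
```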
## evaldeck.graders.ToolNotCalledGrader

Bases: BaseGrader

Check that certain tools were NOT called.

### grade

Check that forbidden tools were not called.
## evaldeck.graders.ToolOrderGrader

Bases: BaseGrader

Check that tools were called in the correct order.

### grade

Check tool call ordering.
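An ordering check like this can be read as a subsequence test: the expected tools must appear in that relative order, possibly with other calls interleaved. Whether the library demands a contiguous run or a subsequence is not stated here; this sketch assumes subsequence semantics:

```python
def tool_order_check(called: list[str], expected_order: list[str]) -> bool:
    """True if expected_order is a (possibly non-contiguous) subsequence of called."""
    it = iter(called)
    # `tool in it` advances the iterator, so each match must occur after the previous one.
    return all(tool in it for tool in expected_order)

print(tool_order_check(["search", "fetch", "summarize"], ["search", "summarize"]))  # True
print(tool_order_check(["summarize", "search"], ["search", "summarize"]))           # False
```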
## evaldeck.graders.MaxStepsGrader

Bases: BaseGrader

Check that agent completed within maximum steps.

### grade

Check step count.
## evaldeck.graders.MaxToolCallsGrader

Bases: BaseGrader

Check that agent completed within maximum tool calls.

Unlike max_steps, which counts all trace steps (including internal framework steps captured by OTel), this only counts actual tool calls.

### grade

Check tool call count.
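The distinction from max_steps can be sketched by filtering the trace down to tool-call steps before counting. The step dicts and their "type" field below are assumptions for illustration, not the library's trace schema:

```python
def max_tool_calls_check(steps: list[dict], max_tool_calls: int) -> bool:
    """Count only tool-call steps; LLM calls and internal framework steps do not count."""
    n_tools = sum(1 for step in steps if step.get("type") == "tool_call")
    return n_tools <= max_tool_calls

trace_steps = [
    {"type": "llm_call"},   # planning step: ignored
    {"type": "tool_call"},
    {"type": "framework"},  # internal OTel-captured step: ignored
    {"type": "tool_call"},
]
print(max_tool_calls_check(trace_steps, 2))  # True  (2 tool calls <= 2)
print(max_tool_calls_check(trace_steps, 1))  # False (2 tool calls > 1)
```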
## evaldeck.graders.MaxLLMCallsGrader

Bases: BaseGrader

Check that agent completed within maximum LLM calls.

Counts only LLM call steps, not internal framework steps.

### grade

Check LLM call count.
## evaldeck.graders.TaskCompletedGrader

Bases: BaseGrader

Check if the agent completed the task (based on trace status).

### grade

Check task completion status.
## evaldeck.graders.CustomGrader

Bases: BaseGrader

Run a custom grading function.

Supports both synchronous and asynchronous custom functions. When using evaluate_async(), async functions are awaited directly, while sync functions run in a thread pool to avoid blocking the event loop.
Example with sync function:

```python
def my_grader(trace, test_case):
    if "error" in trace.output:
        return GradeResult.failed_result("custom", "Found error")
    return GradeResult.passed_result("custom", "No errors")

grader = CustomGrader(func=my_grader)
```
Example with async function:

```python
async def my_async_grader(trace, test_case):
    # Can make async API calls here
    result = await external_validation_api(trace.output)
    if result.valid:
        return GradeResult.passed_result("custom", "Valid")
    return GradeResult.failed_result("custom", "Invalid")

grader = CustomGrader(func=my_async_grader)
```
Initialize custom grader.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| func | Callable[[Trace, EvalCase], GradeResult] \| None | Custom grading function. Can be sync or async. Signature: (trace, test_case) -> GradeResult. | None |
| module | str \| None | Module path to import the function from (alternative to func). | None |
| function | str \| None | Function name to import from module. | None |

Provide either func directly, or module and function to import.
### grade

Run the custom grading function (sync).

Note: If your custom function is async, use grade_async() instead, which will properly await the function.
### grade_async

Run the custom grading function (async).

Automatically detects whether the custom function is async or sync:

- Async functions are awaited directly.
- Sync functions run in a thread pool to avoid blocking the event loop.

This allows custom graders to make async API calls (e.g., external validation services) without blocking other concurrent evaluations.
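The dispatch described above can be sketched with inspect.iscoroutinefunction and asyncio.to_thread. This is a minimal re-implementation for illustration; run_custom is a hypothetical name, not an evaldeck API:

```python
import asyncio
import inspect

async def run_custom(func, trace, test_case):
    """Await coroutine functions directly; push sync functions off the event loop."""
    if inspect.iscoroutinefunction(func):
        return await func(trace, test_case)
    # asyncio.to_thread runs the sync function in a worker thread,
    # so other concurrent evaluations are not blocked while it executes.
    return await asyncio.to_thread(func, trace, test_case)

def sync_grader(trace, test_case):
    return "sync result"

async def async_grader(trace, test_case):
    return "async result"

print(asyncio.run(run_custom(sync_grader, None, None)))   # sync result
print(asyncio.run(run_custom(async_grader, None, None)))  # async result
```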