Code-Based Graders

Code-based graders run deterministic checks against an agent's execution trace. They're fast, free, and predictable.

When to Use

Use code-based graders for:

  • Verifying specific tools were called
  • Checking that the output contains (or doesn't contain) specific strings
  • Validating output format with regex
  • Enforcing step limits
  • Any objective, rule-based criteria

Available Graders

ContainsGrader

Checks that the output contains expected strings.

YAML:

expected:
  output_contains:
    - "confirmation"
    - "booking reference"

Python:

from evaldeck.graders import ContainsGrader

grader = ContainsGrader(
    values=["confirmation", "booking reference"],
    case_sensitive=False  # Default: False
)

Behavior:

  • All values must be present
  • Case-insensitive by default
  • Partial matches count (e.g., "confirm" matches "confirmation")
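
These rules amount to case-folded substring checks. A minimal sketch of equivalent logic (contains_all is a hypothetical helper for illustration, not evaldeck's implementation):

def contains_all(output: str, values: list[str], case_sensitive: bool = False) -> bool:
    haystack = output if case_sensitive else output.lower()
    needles = values if case_sensitive else [v.lower() for v in values]
    # Plain substring checks: "confirm" also matches inside "confirmation"
    return all(needle in haystack for needle in needles)

contains_all(
    "Booking Reference: XYZ. Confirmation sent.",
    ["confirmation", "booking reference"],
)  # True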

NotContainsGrader

Checks that the output does NOT contain forbidden strings.

YAML:

expected:
  output_not_contains:
    - "error"
    - "failed"
    - "exception"

Python:

from evaldeck.graders import NotContainsGrader

grader = NotContainsGrader(
    values=["error", "failed", "exception"],
    case_sensitive=False
)

EqualsGrader

Checks for an exact output match.

YAML:

expected:
  output_equals: "Operation completed successfully."

Python:

from evaldeck.graders import EqualsGrader

grader = EqualsGrader(
    expected="Operation completed successfully.",
    strip_whitespace=True  # Default: True
)

Note: Rarely used in practice, since most agent outputs contain dynamic content; prefer ContainsGrader or RegexGrader when the output varies.
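
The comparison itself is simple. A sketch of the equivalent check, assuming strip_whitespace trims leading and trailing whitespace on both sides (equals is a hypothetical helper, not evaldeck's code):

def equals(output: str, expected: str, strip_whitespace: bool = True) -> bool:
    # With the default strip_whitespace=True, surrounding whitespace is ignored
    if strip_whitespace:
        output, expected = output.strip(), expected.strip()
    return output == expected

equals("Operation completed successfully.\n", "Operation completed successfully.")  # True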


RegexGrader

Checks output against a regex pattern.

YAML:

expected:
  output_matches: "Confirmation: [A-Z]{3}\\d{6}"

Python:

from evaldeck.graders import RegexGrader

grader = RegexGrader(
    pattern=r"Confirmation: [A-Z]{3}\d{6}",
    flags=0  # re module flags
)

Examples:

# Email format
output_matches: "[\\w.-]+@[\\w.-]+\\.\\w+"

# JSON object
output_matches: "\\{.*\\}"

# Date format
output_matches: "\\d{4}-\\d{2}-\\d{2}"

ToolCalledGrader

Verifies that required tools were called.

YAML:

expected:
  tools_called:
    - search_flights
    - book_flight

Python:

from evaldeck.graders import ToolCalledGrader

grader = ToolCalledGrader(
    required=["search_flights", "book_flight"]
)

Behavior:

  • All listed tools must be called at least once
  • Order doesn't matter
  • Extra tool calls are allowed

Failure output:

ToolCalledGrader: FAIL
  Expected: ['search_flights', 'book_flight']
  Actual: ['search_flights']
  Missing: ['book_flight']
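
The Missing line is just the difference between required and actual calls. A sketch of equivalent logic (missing_tools is a hypothetical helper, not evaldeck's implementation):

def missing_tools(required: list[str], actual: list[str]) -> list[str]:
    # Order and repetition don't matter; each required tool
    # must simply appear at least once
    seen = set(actual)
    return [tool for tool in required if tool not in seen]

missing_tools(["search_flights", "book_flight"], ["search_flights"])
# -> ['book_flight']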

ToolNotCalledGrader

Verifies that forbidden tools were NOT called.

YAML:

expected:
  tools_not_called:
    - delete_account
    - admin_override
    - drop_database

Python:

from evaldeck.graders import ToolNotCalledGrader

grader = ToolNotCalledGrader(
    forbidden=["delete_account", "admin_override"]
)

Use cases:

  • Ensuring dangerous tools aren't called
  • Verifying agent stays within scope
  • Security constraints

ToolOrderGrader

Verifies tools were called in a specific order.

YAML:

expected:
  tool_call_order:
    - authenticate
    - fetch_data
    - process_data
    - save_result

Python:

from evaldeck.graders import ToolOrderGrader

grader = ToolOrderGrader(
    expected_order=["authenticate", "fetch_data", "process_data", "save_result"]
)

Behavior:

  • Tools must appear in the specified sequence
  • Other tools may be called in between
  • Each tool in the sequence must appear after the previous one

Example:

Expected order: [A, B, C]
Actual calls:   [A, X, B, Y, C]  → PASS (A→B→C preserved)
Actual calls:   [A, C, B]        → FAIL (C before B)
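
In other words, the expected order must appear as a subsequence of the actual call list. A minimal sketch of an equivalent check (in_order is hypothetical, not evaldeck's code):

def in_order(actual: list[str], expected: list[str]) -> bool:
    # Consume the actual calls left to right; each expected tool must
    # be found after the point where the previous one was found
    calls = iter(actual)
    return all(tool in calls for tool in expected)

in_order(["A", "X", "B", "Y", "C"], ["A", "B", "C"])  # True
in_order(["A", "C", "B"], ["A", "B", "C"])            # False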

MaxStepsGrader

Enforces a maximum step count (counts all trace steps including internal framework steps).

YAML:

expected:
  max_steps: 10

Python:

from evaldeck.graders import MaxStepsGrader

grader = MaxStepsGrader(max_steps=10)

Note: When using OpenTelemetry instrumentation, step counts include all captured spans (LLM calls, parsing, internal framework steps). For more intuitive limits based on actual tool calls, use MaxToolCallsGrader instead.


MaxToolCallsGrader

Enforces a maximum number of tool calls.

YAML:

expected:
  max_tool_calls: 5

Python:

from evaldeck.graders import MaxToolCallsGrader

grader = MaxToolCallsGrader(max_tool_calls=5)

Use case: Ensure agent efficiency by limiting actual tool invocations. Unlike max_steps, this only counts tool calls, not internal framework steps captured by OpenTelemetry.


MaxLLMCallsGrader

Enforces a maximum number of LLM calls.

YAML:

expected:
  max_llm_calls: 3

Python:

from evaldeck.graders import MaxLLMCallsGrader

grader = MaxLLMCallsGrader(max_llm_calls=3)

Use case: Control costs and latency by limiting how many times the agent calls the LLM. Useful for ensuring the agent doesn't get stuck in reasoning loops.


TaskCompletedGrader

Checks that the agent run completed successfully.

YAML:

expected:
  task_completed: true

Python:

from evaldeck.graders import TaskCompletedGrader

grader = TaskCompletedGrader()

Behavior: Checks trace.status == TraceStatus.SUCCESS

Combining Graders

All expected conditions are combined with AND logic:

expected:
  tools_called: [search, book]     # AND
  output_contains: [confirmed]     # AND
  max_steps: 5                     # All must pass

Programmatic Usage

Use graders directly in Python:

from evaldeck import Trace, EvalCase
from evaldeck.graders import ToolCalledGrader, ContainsGrader

# Create graders
tool_grader = ToolCalledGrader(required=["search", "book"])
output_grader = ContainsGrader(values=["confirmed"])

# Grade a trace
trace = Trace(...)
test_case = EvalCase(...)

result1 = tool_grader.grade(trace, test_case)
result2 = output_grader.grade(trace, test_case)

print(f"Tools: {result1.status}")  # PASS or FAIL
print(f"Output: {result2.status}")

Best Practices

1. Start with Tool Checks

Tool selection is often the first point of failure:

expected:
  tools_called: [required_tool]

2. Add Negative Checks

Prevent dangerous or irrelevant actions:

expected:
  tools_not_called: [dangerous_tool]
  output_not_contains: [error, failed]

3. Set Reasonable Limits

Prevent runaway executions:

expected:
  max_steps: 10

4. Use Regex for Structured Output

Validate format without exact matching:

expected:
  output_matches: "Reference: [A-Z0-9]{8}"