Graders

Graders evaluate whether an agent's execution meets expectations. Evaldeck provides two types of graders: code-based (deterministic) and LLM-based (model-as-judge).

Overview

Type        Best For               Pros                       Cons
Code-based  Objective checks       Fast, deterministic, free  Limited to rule-based logic
LLM-based   Subjective evaluation  Flexible, nuanced          Slower, costs API calls, non-deterministic

Most evaluations combine both types for comprehensive coverage.

How Grading Works

flowchart LR
    A[Trace] --> B[Graders]
    C[Test Case] --> B
    B --> D[GradeResult]
    D --> E{PASS/FAIL}

Each grader receives:

  • Trace: The complete execution record
  • Test Case: The test definition with expectations

And returns a GradeResult with:

  • Status: PASS, FAIL, ERROR, or SKIP
  • Score: Optional numeric score (0.0-1.0)
  • Message: Explanation of the result
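
Conceptually, grading is a loop over this contract: every grader sees the same trace and test case, and the runner aggregates the results. A minimal sketch of that flow, with hypothetical names (grade() and run_graders() are illustrative, not Evaldeck's documented API):

# Illustrative only: function and method names here are assumptions,
# not Evaldeck's documented API.
def run_graders(trace, test_case, graders):
    # Each grader inspects the same trace against the same test case.
    results = [g.grade(trace, test_case) for g in graders]
    # By default, the overall verdict is PASS only if every grader passed.
    # (Real code likely compares against the GradeStatus enum, not a string.)
    passed = all(r.status == "PASS" for r in results)
    return results, passed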

Built-in Graders

Evaldeck automatically creates graders from your expected block:

expected:
  tools_called: [search, book]      # → ToolCalledGrader
  tools_not_called: [delete]        # → ToolNotCalledGrader
  output_contains: ["confirmed"]    # → ContainsGrader
  max_steps: 5                      # → MaxStepsGrader

You can also add explicit graders:

graders:
  - type: llm
    prompt: "Is this response helpful?"
    model: gpt-4o-mini

Code-Based Graders

Fast, deterministic checks for objective criteria:

  • ContainsGrader - Output contains expected strings
  • NotContainsGrader - Output doesn't contain forbidden strings
  • EqualsGrader - Exact output match
  • RegexGrader - Regex pattern match
  • ToolCalledGrader - Required tools were called
  • ToolNotCalledGrader - Forbidden tools weren't called
  • ToolOrderGrader - Tools called in correct sequence
  • MaxStepsGrader - Within step limit
  • TaskCompletedGrader - Agent completed successfully
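
These can also be constructed and run directly in Python. A short sketch reusing the constructor signatures from the composite-grader example further down; the grade() call is a hypothetical invocation, since the exact method name isn't shown in this guide:

from evaldeck.graders import ContainsGrader, ToolCalledGrader

# Deterministic checks: string containment and required tool calls.
contains = ContainsGrader(values=["confirmed"])
tools_called = ToolCalledGrader(required=["search", "book"])

# Hypothetical invocation; the actual method name may differ:
# result = contains.grade(trace, test_case)
# print(result.status, result.message)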

Learn more about code-based graders →

LLM-Based Graders

Use an LLM to evaluate subjective criteria:

  • LLMGrader - Pass/fail based on prompt
  • LLMRubricGrader - Multi-criteria scoring

For example, a simple pass/fail check with LLMGrader:

graders:
  - type: llm
    prompt: |
      Is this response helpful and accurate?
      Response: {{ output }}
      Answer PASS or FAIL.
    model: gpt-4o-mini
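
LLMRubricGrader scores several criteria at once. Its exact configuration isn't shown in this guide, so the sketch below is a hypothetical shape (the import path, the criteria parameter, and its structure are all assumptions):

# Hypothetical configuration: the criteria parameter and its shape
# are assumptions, not documented Evaldeck syntax.
from evaldeck.graders import LLMRubricGrader  # assumed import path

rubric = LLMRubricGrader(
    model="gpt-4o-mini",
    criteria={
        "helpful": "Does the response address the user's request?",
        "accurate": "Are the stated facts correct?",
        "professional": "Is the tone professional?",
    },
)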

Learn more about LLM graders →

Custom Graders

Create your own grading logic:

graders:
  - type: code
    module: my_graders
    function: check_format
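
What check_format might look like: a minimal sketch, assuming the function receives the trace and test case and returns a GradeResult (the signature Evaldeck actually expects may differ):

# my_graders.py (sketch; the signature Evaldeck expects for custom
# grader functions may differ)
import re

from evaldeck.graders import GradeResult  # assumed import path

def check_format(trace, test_case):
    """Pass if the final output contains a booking reference like ABC-1234."""
    output = trace.final_output  # assumed Trace attribute
    ok = re.search(r"\b[A-Z]{3}-\d{4}\b", output) is not None
    return GradeResult(
        grader_name="check_format",
        status="PASS" if ok else "FAIL",  # real code likely uses GradeStatus
        score=1.0 if ok else 0.0,
        message="Booking reference found" if ok else "No booking reference in output",
        details=None,
        expected="a reference like ABC-1234",
        actual=output,
    )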

Learn more about custom graders →

Combining Graders

All Must Pass (Default)

By default, all graders must pass:

expected:
  tools_called: [search]      # Must pass
  output_contains: [result]   # AND must pass

Composite Graders

For complex logic, use composite graders programmatically:

from evaldeck.graders import CompositeGrader, ContainsGrader, ToolCalledGrader

# All must pass
grader = CompositeGrader(
    graders=[
        ContainsGrader(values=["confirmed"]),
        ToolCalledGrader(required=["book"]),
    ],
    mode="all"  # all must pass
)

# Any can pass
grader = CompositeGrader(
    graders=[
        ContainsGrader(values=["success"]),
        ContainsGrader(values=["completed"]),
    ],
    mode="any"  # at least one must pass
)
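
Since a composite grader is itself a grader, the two modes should compose. Nesting them to express AND-of-OR logic is an assumption based on the interface above, continuing the same imports:

# "book must be called, AND the output must say success OR completed".
# Nesting composites like this is an assumption based on the interface above.
grader = CompositeGrader(
    graders=[
        ToolCalledGrader(required=["book"]),
        CompositeGrader(
            graders=[
                ContainsGrader(values=["success"]),
                ContainsGrader(values=["completed"]),
            ],
            mode="any",
        ),
    ],
    mode="all",
)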

Grading Strategy

Layer Your Evaluation

Start with fast, deterministic checks, then add LLM evaluation:

# Layer 1: Quick checks (free, deterministic)
expected:
  tools_called: [required_tool]
  output_not_contains: [error]

# Layer 2: Nuanced evaluation (costs API calls)
graders:
  - type: llm
    prompt: "Is this response professional and helpful?"

Match Grader to Criteria

Criteria                 Grader Type
Tool was called          Code (ToolCalledGrader)
Output format correct    Code (RegexGrader)
Response is helpful      LLM
Tone is professional     LLM
No errors occurred       Code (NotContainsGrader)
Information is accurate  LLM

GradeResult

Every grader returns a GradeResult:

from dataclasses import dataclass
from typing import Any

@dataclass
class GradeResult:
    grader_name: str           # Name of the grader
    status: GradeStatus        # Evaldeck's status enum: PASS, FAIL, ERROR, SKIP
    score: float | None        # Optional 0.0-1.0 score
    message: str               # Explanation
    details: dict | None       # Additional data
    expected: Any | None       # What was expected
    actual: Any | None         # What was observed

Example output:

ToolCalledGrader: FAIL
  Expected: ['search', 'book']
  Actual: ['search']
  Missing: ['book']
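
Programmatically, the same failure can be inspected field by field. Constructing the GradeResult by hand here stands in for whatever a real grader run would return (the import path is assumed, and real code likely compares status against the GradeStatus enum rather than a string):

from evaldeck.graders import GradeResult  # assumed import path

# Stand-in for a result a real grader run would return.
result = GradeResult(
    grader_name="ToolCalledGrader",
    status="FAIL",
    score=0.0,
    message="Missing required tool calls",
    details={"missing": ["book"]},
    expected=["search", "book"],
    actual=["search"],
)

if result.status != "PASS":
    print(f"{result.grader_name}: {result.status}")
    print(f"  Expected: {result.expected}")
    print(f"  Actual:   {result.actual}")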