Test Cases
Test cases define what your agent should do and how to evaluate it. This guide covers all test case options.
Test Case Structure
A test case is a YAML file with a turns array defining the conversation:
# Required fields
name: unique_test_name
turns:
  - user: "The input to send to your agent"
    expected:
      tools_called: [...]
      output_contains: [...]

# Optional: custom graders (applied to all turns)
graders:
  - type: llm
    prompt: "..."

# Optional: metadata
description: "What this test verifies"
timeout: 30
retries: 0
tags: [category, priority]
metadata:
  custom_key: value
Single-Turn vs Multi-Turn
Single-Turn (Simple)
For simple single-turn tests:
name: book_flight_basic
turns:
  - user: "Book me a flight from NYC to LA"
    expected:
      tools_called: [search_flights, book_flight]
      output_contains: ["confirmation"]
Multi-Turn Conversations
For multi-turn conversations, add multiple entries to turns:
name: booking_conversation
turns:
  - user: "I want to book a flight"
    expected:
      output_contains: ["help", "where"]
  - user: "NYC to LA on March 15"
    expected:
      tools_called: [search_flights]
      output_contains: ["found", "flights"]
  - user: "Book the cheapest one"
    expected:
      tools_called: [book_flight]
      output_contains: ["confirmation"]
Fail-fast behavior: If any turn fails, subsequent turns are skipped. This is intentional—if turn 1 fails, the conversation context is broken.
Required Fields
name
Unique identifier for the test case:
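For example:

name: booking_rejects_past_dates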
Best practices:
- Use snake_case
- Be descriptive: book_flight_roundtrip, not test1
- Include the feature being tested
turns
Array of conversation turns. Each turn has:
- user (required): The user's message
- expected (optional): Expected behavior for this turn
- graders (optional): Turn-specific graders
turns:
  - user: "Book me a flight from NYC to LA on March 15th"
    expected:
      tools_called: [search_flights, book_flight]
For multi-line user input:
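A sketch using YAML's block scalar syntax (the message content is illustrative):

turns:
  - user: |
      I need to book a flight from NYC to LA.
      I'd prefer a morning departure.
    expected:
      tools_called: [search_flights]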
Expected Behavior
The expected block defines what your agent should do.
Tool Expectations
tools_called
Tools that must be called:
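For example:

expected:
  tools_called: [search_flights, book_flight]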
All listed tools must be called at least once. Order doesn't matter.
tools_not_called
Tools that must NOT be called:
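For example (the tool names here are illustrative):

expected:
  tools_not_called: [charge_payment, delete_booking]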
Useful for ensuring the agent doesn't take dangerous or irrelevant actions.
tool_call_order
Require tools to be called in a specific sequence:
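For example:

expected:
  tool_call_order: [search_flights, book_flight]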
The agent may call other tools, but these must appear in this order.
Output Expectations
output_contains
Strings that must appear in the output:
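For example:

expected:
  output_contains: ["confirmation", "booked"]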
All strings must be present (case-insensitive by default).
output_not_contains
Strings that must NOT appear:
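For example:

expected:
  output_not_contains: ["error", "unable to"]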
output_equals
Exact output match:
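For example (the expected text is illustrative):

expected:
  output_equals: "Your flight has been booked."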
Rarely used—most outputs have dynamic content.
output_matches
Regex pattern match:
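For example, to check for a confirmation code (the pattern is illustrative):

expected:
  output_matches: 'confirmation number: [A-Z]{2}\d+'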
Useful for validating structured output formats.
Execution Expectations
max_steps
Maximum allowed steps (all trace steps including internal framework steps):
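For example:

expected:
  max_steps: 10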
Note: When using OpenTelemetry instrumentation, step counts include all captured spans. For more intuitive limits, use max_tool_calls or max_llm_calls instead.
max_tool_calls
Maximum allowed tool calls:
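For example:

expected:
  max_tool_calls: 5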
Only counts actual tool invocations, not internal framework steps. Recommended over max_steps for most use cases.
max_llm_calls
Maximum allowed LLM calls:
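For example:

expected:
  max_llm_calls: 3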
Limits how many times the agent calls the LLM. Useful for controlling costs and preventing reasoning loops.
min_steps
Minimum required steps:
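For example:

expected:
  min_steps: 2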
Ensures the agent doesn't skip necessary steps.
task_completed
Whether the agent must complete successfully:
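For example:

expected:
  task_completed: true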
Checks that trace.status == "success".
Custom Graders
Add custom evaluation logic with graders:
LLM Grader
Use an LLM to evaluate the output:
graders:
  - type: llm
    prompt: |
      Evaluate if this response is helpful and accurate.

      User asked: {{ input }}
      Agent responded: {{ output }}

      Consider:
      1. Is the information accurate?
      2. Is the response complete?
      3. Is the tone appropriate?

      Answer: PASS or FAIL
      Reason: <your explanation>
    model: gpt-4o-mini
Available template variables:
| Variable | Description |
|---|---|
| {{ input }} | The test case input |
| {{ output }} | The agent's output |
| {{ trace }} | Full trace as JSON |
| {{ task }} | Test case description |
With Threshold
For scored evaluation:
graders:
  - type: llm
    prompt: |
      Score this response from 1-5 for helpfulness.

      Response: {{ output }}

      SCORE: <number>
    model: gpt-4o-mini
    threshold: 4  # Must score 4 or higher
Code Grader
Use a custom Python function:
# my_graders.py
from evaldeck import Trace, EvalCase, GradeResult

def custom_check(trace: Trace, test_case: EvalCase) -> GradeResult:
    # Custom logic: pass only if the output mentions "important"
    if "important" in trace.output.lower():
        return GradeResult.passed_result("custom_check", "Found important content")
    return GradeResult.failed_result("custom_check", "Missing important content")
Test Metadata
description
Human-readable description:
description: |
  Tests that the booking agent can handle a basic one-way flight
  booking with specified departure date.
timeout
Override default timeout:
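For example, to give a slow agent more time:

timeout: 60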
retries
Number of retries on failure:
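For example:

retries: 2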
tags
Categorize tests for filtering:
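For example:

tags: [booking, critical, smoke]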
Run by tag:
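A hypothetical invocation (the exact command and flag name may differ; check your runner's CLI help):

evaldeck run --tags critical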
metadata
Custom key-value pairs:
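For example (keys and values are arbitrary):

metadata:
  team: booking
  priority: high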
Multiple Test Cases Per File
Use YAML document separators for multiple tests:
# tests/evals/booking.yaml
name: book_flight_basic
turns:
  - user: "Book a flight to LA"
    expected:
      tools_called: [book_flight]
tags: [booking, simple]
---
name: book_flight_roundtrip
turns:
  - user: "Book a roundtrip flight to LA"
    expected:
      tools_called: [book_flight]
      output_contains: ["roundtrip", "return"]
tags: [booking, complex]
---
name: book_flight_with_preferences
turns:
  - user: "Book a flight to LA, window seat, vegetarian meal"
    expected:
      tools_called: [book_flight, set_preferences]
      output_contains: ["window", "vegetarian"]
tags: [booking, complex]
Reference Data
reference_output
Expected output for comparison:
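For example (the wording is illustrative):

reference_output: |
  I've booked your flight from NYC to LA on March 15.
  Confirmation number: AA123.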
Useful for LLM graders that compare against expected output.
reference_tools
Expected tool call sequence with arguments:
reference_tools:
  - name: search_flights
    args:
      from: NYC
      to: LA
      date: "2024-03-15"
  - name: book_flight
    args:
      flight_id: AA123
Best Practices
1. One Behavior Per Test
# Good: focused test
name: book_flight_validates_date
turns:
  - user: "Book a flight for yesterday"
    expected:
      output_contains: ["invalid date"]

# Avoid: testing too many things
name: book_flight_everything
turns:
  - user: "Book a flight"
    expected:
      tools_called: [search, filter, sort, book, confirm, notify]
2. Use Descriptive Names
# Good
name: booking_rejects_past_dates
name: search_handles_empty_results
name: auth_requires_valid_token
# Avoid
name: test1
name: booking_test
name: should_work
3. Tag Strategically
# Recommended tag categories
tags:
  - critical    # Must pass for deploy
  - smoke       # Quick sanity checks
  - regression  # Full regression suite
  - booking     # Feature area
  - slow        # Long-running tests
4. Document Edge Cases
name: booking_handles_sold_out_flight
description: |
  Verifies the agent gracefully handles the case where
  the selected flight becomes unavailable during booking.
turns:
  - user: "Book flight AA123"  # This flight is configured to be sold out
    expected:
      output_contains: ["unavailable", "alternative"]
      tools_not_called: [charge_payment]  # Should not charge if sold out