# User Guide
This guide covers everything you need to effectively use Evaldeck for agent evaluation.
## Overview
The User Guide is organized into the following sections:
| Section | Description |
|---|---|
| Configuration | Project setup and evaldeck.yaml options |
| Test Cases | Writing and organizing test cases |
| CLI Reference | Command-line interface options |
| Graders | Evaluation strategies |
| Metrics | Quantitative measurements |
| Integrations | Framework adapters |
| CI/CD | Continuous integration setup |
## Quick Reference

### Project Structure
A typical Evaldeck project:
```
my-project/
├── evaldeck.yaml        # Configuration
├── my_agent.py          # Your agent code
├── tests/
│   └── evals/
│       ├── booking/
│       │   ├── basic.yaml
│       │   └── complex.yaml
│       └── search/
│           └── web_search.yaml
└── .evaldeck/           # Output (gitignore)
    └── results/
```
### Essential Commands
```bash
# Initialize new project
evaldeck init

# Run all tests
evaldeck run

# Run with specific config
evaldeck run --config custom.yaml

# Run tests by tag
evaldeck run --tag critical

# Verbose output
evaldeck run --verbose

# Generate reports
evaldeck run --output junit --output-file results.xml
```
### Test Case Quick Reference
```yaml
name: test_name                     # Required: unique identifier
description: What this tests        # Optional: documentation
input: "User message"               # Required: agent input

expected:                           # Expected behavior
  tools_called: [tool1, tool2]      # Required tools
  tools_not_called: [bad_tool]      # Forbidden tools
  tool_call_order: [a, b, c]        # Sequence requirement
  output_contains: ["phrase"]       # Output must contain
  output_not_contains: ["error"]    # Output must not contain
  output_equals: "exact match"      # Exact output match
  output_matches: "regex.*"         # Regex match
  max_steps: 10                     # Step limit
  min_steps: 2                      # Minimum steps
  task_completed: true              # Must succeed

graders:                            # Custom graders
  - type: llm
    prompt: "Grade this..."
    model: gpt-4o-mini
    threshold: 0.8

timeout: 30                         # Timeout in seconds
retries: 2                          # Retry on failure
tags: [critical, booking]           # Categorization
metadata:                           # Custom metadata
  author: team-a
  priority: high
```
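Per the annotations above, only `name` and `input` are required; everything marked optional can be omitted. A minimal test case that just checks a tool call might look like this (file name and values are illustrative):

```yaml
name: weather_uses_search
input: "What's the weather in Tokyo?"
expected:
  tools_called: [search]
```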
### Configuration Quick Reference
```yaml
# evaldeck.yaml
version: 1

agent:
  module: my_agent        # Python module
  function: run           # Function name

test_dir: tests/evals     # Test case directory

defaults:
  timeout: 30             # Default timeout
  retries: 0              # Default retries

graders:
  llm:
    model: gpt-4o-mini    # Default LLM model
    provider: openai      # openai or anthropic

thresholds:
  min_pass_rate: 0.9      # Minimum pass rate

output_dir: .evaldeck     # Results directory
```
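The `agent` section points Evaldeck at the Python function to invoke for each test case's `input`. As a rough illustration only — the exact signature and return shape depend on your Evaldeck version and integration, so check the Integrations section — `my_agent.run` might look like:

```python
# my_agent.py
def run(user_message: str) -> str:
    """Hypothetical entry point: receives a test case's `input` and
    returns the agent's final output for graders to inspect."""
    # ... call into your agent framework here ...
    return f"Agent response to: {user_message}"
```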
## Workflow Tips

### 1. Organize Tests by Feature
```
tests/evals/
├── booking/
│   ├── flights.yaml
│   └── hotels.yaml
├── search/
│   └── web.yaml
└── auth/
    └── login.yaml
```
### 2. Use Tags Strategically
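Tags let you slice the suite into fast and full runs: tag your smoke tests `critical` and run only those on every commit, saving the full suite for scheduled runs. A sketch using the `tags` field and `--tag` flag from the quick references above (file name and test values are illustrative):

```yaml
# tests/evals/booking/flights.yaml
name: book_flight_basic
input: "Book me a flight to Paris"
tags: [critical, booking]
```

```bash
evaldeck run --tag critical   # fast feedback on every commit
evaldeck run                  # full suite, e.g. nightly
```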
### 3. Layer Your Grading
Start with deterministic checks, then add LLM grading for nuance:
```yaml
expected:
  tools_called: [search]        # Deterministic
  output_contains: ["result"]   # Deterministic
graders:
  - type: llm                   # Nuanced evaluation
    prompt: "Is this helpful?"
```
### 4. Set Appropriate Thresholds
For CI/CD, set the bar strict so regressions block the merge. The values below are illustrative, reusing the `min_pass_rate` option from the configuration reference:
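```yaml
thresholds:
  min_pass_rate: 0.95   # e.g. fail CI below a 95% pass rate
```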
For development, loosen it so you can iterate without every run failing:
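```yaml
thresholds:
  min_pass_rate: 0.5    # tolerate churn while tests are in flux
```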