Evaldeck¶
The evaluation framework for AI agents. Pytest for agents.
Evaldeck helps you answer one question: "Is my agent actually working?"
Unlike LLM evaluation tools that focus on single input→output scoring, Evaldeck evaluates the entire agent execution—how it reasons, which tools it selects, and whether it achieves the goal.
- 5-Minute Setup: Get started with a single command. No complex configuration needed.
- Framework Agnostic: Works with LangChain, CrewAI, AutoGen, or your custom agent framework.
- Comprehensive Evaluation: Evaluate tool selection, execution traces, step efficiency, and more.
- Flexible Grading: Combine deterministic code-based checks with LLM-as-judge evaluation.
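To illustrate what combining the two grading styles means, here is a minimal sketch. It is not the Evaldeck API; the function names (`output_contains`, `judge_score`, `grade`) and the gating logic are illustrative assumptions, and a real setup would call an actual model instead of the stub judge.

```python
# Illustrative sketch (NOT the Evaldeck API): combine a deterministic
# code-based check with an LLM-as-judge score into one pass/fail verdict.

def output_contains(output: str, required: list[str]) -> bool:
    """Deterministic check: every required phrase appears in the output."""
    return all(phrase.lower() in output.lower() for phrase in required)

def judge_score(output: str) -> float:
    """Stand-in for an LLM judge; a real grader would call a model here."""
    return 0.9 if "confirmation" in output.lower() else 0.3

def grade(output: str, required: list[str], threshold: float = 0.7) -> bool:
    # Cheap deterministic checks gate first; the judge handles the
    # fuzzier "is this actually a good answer?" question.
    return output_contains(output, required) and judge_score(output) >= threshold

print(grade("Booked! Confirmation #A1 for March 15.", ["confirmation", "March 15"]))
```

Running deterministic checks before the judge keeps evaluation cheap: model calls happen only for outputs that already pass the hard requirements.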
Why Evaldeck?¶
Traditional LLM evaluation tools treat models as black boxes—they measure whether the final output is "good" but ignore how the agent got there. This approach fails for agents because:
- Agents are multi-step: A booking agent might search, filter, compare, and book. Each step matters.
- Tool selection is critical: Calling the wrong tool or passing bad arguments causes cascading failures.
- Efficiency matters: An agent that takes 20 steps to do a 3-step task is wasting time and tokens.
Evaldeck captures the complete execution trace and provides granular feedback on exactly where things went wrong.
Quick Example¶
Define what your agent should do in YAML:
```yaml
name: book_flight_basic
turns:
  - user: "Book me a flight from NYC to LA on March 15th"
    expected:
      tools_called:
        - search_flights
        - book_flight
      output_contains:
        - "confirmation"
        - "March 15"
      max_steps: 5
```
Run the evaluation:
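The exact command was not preserved here; assuming a CLI entry point named `evaldeck` (hypothetical invocation), it would look something like:

```bash
# Hypothetical CLI invocation; check your installed version's --help.
evaldeck run tests/
```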
Get actionable feedback:
```
Running 3 tests...
✓ book_flight_basic (1.2s)
✓ book_flight_roundtrip (2.1s)
✗ book_flight_with_preferences (1.8s)
  └─ FAIL at step 3: Wrong tool called
     Expected: search_flights_with_filters
     Got: search_flights

Results: 2/3 passed (66.7%)
```
Installation¶

```bash
pip install evaldeck
```

With LLM graders:

```bash
pip install evaldeck[openai]     # OpenAI model graders
pip install evaldeck[anthropic]  # Anthropic model graders
pip install evaldeck[all]        # Everything
```
Next Steps¶
- Getting Started: Install Evaldeck and run your first evaluation.
- User Guide: Learn how to configure test cases, graders, and CI/CD.
- Concepts: Understand traces, evaluation workflows, and grading strategies.
- API Reference: Detailed documentation for all classes and functions.