Concepts¶
This section explains the core concepts behind Evaldeck and agent evaluation.
Overview¶
Understanding these concepts will help you write better tests and debug evaluation failures:
| Concept | Description |
|---|---|
| Architecture | How Evaldeck's components work together |
| Traces & Steps | The execution record of an agent |
| Evaluation Workflow | How evaluation proceeds |
| Grading Strategies | Approaches to agent evaluation |
The Problem Evaldeck Solves¶
Traditional LLM evaluation treats models as black boxes:
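You send one input in, get one output back, and compare strings. A minimal sketch of that style (everything here is a hypothetical stand-in, not a real client):

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a single LLM request/response."""
    return "Sure - your flight to NYC is booked."

def evaluate_black_box(prompt: str, expected_substring: str) -> bool:
    # One input, one output, one naive string check - no visibility
    # into the steps taken in between.
    output = call_model(prompt)
    return expected_substring.lower() in output.lower()

print(evaluate_black_box("Book a flight to NYC", "booked"))  # True
```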
This works for simple Q&A but fails for agents because:
- Agents are multi-step: A booking agent might search, filter, compare, and book
- Tool selection matters: Calling the wrong tool causes cascading failures
- Efficiency matters: 20 steps for a 3-step task wastes time and money
- Intermediate states matter: An agent can produce the right output via a wrong or unsafe path
Evaldeck's Approach¶
Evaldeck instruments the agent run and records every step into a trace. It then evaluates the entire journey, not just the final answer: which tools were called, in what order, how many steps were taken, and what the output was.
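A minimal picture of what gets captured, using plain Python dicts rather than Evaldeck's actual types:

```python
# Every action the agent takes is appended to a running trace.
trace = {"input": "Book a flight to NYC", "steps": [], "output": None}

trace["steps"].append({"type": "TOOL_CALL", "tool_name": "search_flights"})
trace["steps"].append({"type": "LLM_CALL", "model": "gpt-4o-mini"})
trace["steps"].append({"type": "TOOL_CALL", "tool_name": "book_flight"})
trace["output"] = "Your flight is booked. Confirmation: ABC123"

# Grading then reads the whole journey, not just trace["output"].
```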
Key Concepts at a Glance¶
Trace¶
A complete record of agent execution:
```python
Trace(
    input="Book a flight to NYC",
    steps=[
        Step(type=TOOL_CALL, tool_name="search_flights", ...),
        Step(type=LLM_CALL, model="gpt-4o-mini", ...),
        Step(type=TOOL_CALL, tool_name="book_flight", ...),
    ],
    output="Your flight is booked. Confirmation: ABC123",
    status=SUCCESS,
)
```
Step¶
A single action in the trace:
- TOOL_CALL: Agent called a tool
- LLM_CALL: Agent called an LLM
- REASONING: Agent's internal reasoning
- HUMAN_INPUT: Human-in-the-loop input
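Step types make a trace easy to slice. For instance, pulling out just the tool calls from a list of step records (illustrative shapes, not a guaranteed schema):

```python
steps = [
    {"type": "TOOL_CALL", "tool_name": "search_flights"},
    {"type": "LLM_CALL", "model": "gpt-4o-mini"},
    {"type": "TOOL_CALL", "tool_name": "book_flight"},
]
tool_calls = [s["tool_name"] for s in steps if s["type"] == "TOOL_CALL"]
print(tool_calls)  # ['search_flights', 'book_flight']
```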
Test Case¶
What the agent should do:
```yaml
name: book_flight
input: "Book a flight to NYC"
expected:
  tools_called: [search_flights, book_flight]
  output_contains: ["confirmation"]
  max_steps: 5
```
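Because test cases are plain YAML, they can be loaded and inspected with any YAML library. A quick sketch using PyYAML, mirroring the example above:

```python
import yaml  # PyYAML

test_case = yaml.safe_load("""
name: book_flight
input: "Book a flight to NYC"
expected:
  tools_called: [search_flights, book_flight]
  output_contains: ["confirmation"]
  max_steps: 5
""")
print(test_case["expected"]["max_steps"])  # 5
```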
Grader¶
Evaluates a trace against expectations:
- Code-based: Deterministic checks (tool called, output contains)
- LLM-based: Model-as-judge for subjective criteria
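A code-based grader can be as small as a function from trace to pass/fail; an LLM-based grader would instead send the trace to a judge model with a rubric. A sketch of the code-based kind (hypothetical shapes, not the Evaldeck interface):

```python
def tool_called(trace: dict, tool_name: str) -> bool:
    # Deterministic pass/fail: did the agent call `tool_name` at least once?
    return any(s.get("tool_name") == tool_name for s in trace["steps"])

def output_contains(trace: dict, phrase: str) -> bool:
    # Deterministic pass/fail: does the final output mention `phrase`?
    return phrase.lower() in (trace["output"] or "").lower()
```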
Metric¶
Quantitative measurements:
- Token usage
- Step count
- Duration
- Error rate
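Most of these fall straight out of the trace. For example (the `error` field here is an illustrative assumption, not a guaranteed schema):

```python
def step_count(trace: dict) -> int:
    return len(trace["steps"])

def error_rate(trace: dict) -> float:
    steps = trace["steps"]
    return sum(1 for s in steps if s.get("error")) / len(steps) if steps else 0.0
```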
The Evaluation Formula¶
An evaluation passes only when every grader passes; a single failing grader fails the test case.
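Expressed as code, with graders modeled as plain functions returning booleans (a sketch of the rule, not Evaldeck's actual signature):

```python
def evaluate(trace: dict, graders: list) -> bool:
    # The rule: every grader must pass for the evaluation to pass.
    return all(grader(trace) for grader in graders)
```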
Why This Matters¶
Without Evaldeck¶
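Manual spot-checking typically looks something like this, where `run_agent` is a hypothetical stand-in for your agent's entry point:

```python
def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for your agent's entry point."""
    return "Done! I've booked your flight to NYC."

output = run_agent("Book a flight to NYC")
print(output)  # eyeball the string and hope it really booked something
```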
Problems:
- How do we know it actually booked?
- Did it call the right APIs?
- Was it efficient?
- Will it work next time?
With Evaldeck¶
The trace shows:
1. ✓ Called search_flights
2. ✓ Called book_flight
3. ✓ Confirmation in output
4. ✓ Completed in 3 steps (under limit of 5)
Result: PASS
Benefits:
- Verifiable execution path
- Reproducible tests
- Automated CI/CD
- Regression detection