Evaldeck

The evaluation framework for AI agents. Pytest for agents.


Evaldeck helps you answer one question: "Is my agent actually working?"

Unlike LLM evaluation tools that focus on single input→output scoring, Evaldeck evaluates the entire agent execution—how it reasons, which tools it selects, and whether it achieves the goal.

  • 5-Minute Setup

    Get started with a single command. No complex configuration needed.

    Quick Start

  • Framework Agnostic

    Works with LangChain, CrewAI, AutoGen, or your custom agent framework.

    Integrations

  • Comprehensive Evaluation

    Evaluate tool selection, execution traces, step efficiency, and more.

    Metrics

  • Flexible Grading

    Combine deterministic code-based checks with LLM-as-judge evaluation.

    Graders

Why Evaldeck?

Traditional LLM evaluation tools treat models as black boxes—they measure whether the final output is "good" but ignore how the agent got there. This approach fails for agents because:

  • Agents are multi-step: A booking agent might search, filter, compare, and book. Each step matters.
  • Tool selection is critical: Calling the wrong tool or passing bad arguments causes cascading failures.
  • Efficiency matters: An agent that takes 20 steps to do a 3-step task is wasting time and tokens.

Evaldeck captures the complete execution trace and provides granular feedback on exactly where things went wrong.
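
To make "execution trace" concrete, the sketch below models the kind of information a trace records. It is illustrative only; the class and field names are assumptions, not Evaldeck's actual data model.

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # which tool the agent invoked
    arguments: dict  # the arguments it passed

@dataclass
class Step:
    reasoning: str              # the agent's intermediate reasoning
    tool_call: ToolCall | None  # None for pure-reasoning steps

@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)
    final_output: str = ""

    def tools_called(self) -> list[str]:
        # Ordered tool names: the basis for tool-selection checks.
        return [s.tool_call.name for s in self.steps if s.tool_call]

With a structure like this in hand, an evaluator can assert on the order and arguments of every tool call, not just on the final answer.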

Quick Example

Define what your agent should do in YAML:

tests/evals/booking.yaml
name: book_flight_basic
turns:
  - user: "Book me a flight from NYC to LA on March 15th"
    expected:
      tools_called:
        - search_flights
        - book_flight
      output_contains:
        - "confirmation"
        - "March 15"
      max_steps: 5

Run the evaluation:

evaldeck run

Get actionable feedback:

Running 3 tests...

  ✓ book_flight_basic (1.2s)
  ✓ book_flight_roundtrip (2.1s)
  ✗ book_flight_with_preferences (1.8s)
    └─ FAIL at step 3: Wrong tool called
       Expected: search_flights_with_filters
       Got: search_flights

Results: 2/3 passed (66.7%)
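
Since Evaldeck bills itself as pytest for agents, the same expectations also map naturally onto plain assertions. The sketch below is a hypothetical pytest rendering of the YAML case above, built on the Trace model sketched earlier; run_agent is a canned stand-in so the test executes as written, not a documented Evaldeck helper.

def run_agent(prompt: str) -> Trace:
    # Stand-in returning a canned trace; a real harness would execute the agent.
    return Trace(
        steps=[
            Step("find flights matching the request",
                 ToolCall("search_flights", {"origin": "NYC", "destination": "LA"})),
            Step("book the selected flight",
                 ToolCall("book_flight", {"flight_id": "UA123"})),
        ],
        final_output="Booked! Your confirmation for March 15 is ABC123.",
    )

def test_book_flight_basic():
    trace = run_agent("Book me a flight from NYC to LA on March 15th")

    # tools_called: both tools must appear in the trace
    assert "search_flights" in trace.tools_called()
    assert "book_flight" in trace.tools_called()

    # output_contains: substrings the final answer must include
    assert "confirmation" in trace.final_output
    assert "March 15" in trace.final_output

    # max_steps: at most 5 steps end to end
    assert len(trace.steps) <= 5

YAML keeps simple cases declarative and easy to diff; dropping to code is useful when expectations depend on computed values.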

Installation

pip install evaldeck

With LLM graders:

pip install evaldeck[openai]      # OpenAI model graders
pip install evaldeck[anthropic]   # Anthropic model graders
pip install evaldeck[all]         # Everything
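
An LLM grader complements deterministic checks such as output_contains. As a rough illustration of what LLM-as-judge grading involves (not Evaldeck's implementation), the sketch below calls the openai package, which evaldeck[openai] would pull in; it assumes OPENAI_API_KEY is set, and the model name is only an example.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(output: str, criterion: str) -> bool:
    # Ask a model to grade the agent's output against a natural-language criterion.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, not a prescribed default
        messages=[{
            "role": "user",
            "content": (
                f"Criterion: {criterion}\n\n"
                f"Agent output:\n{output}\n\n"
                "Answer with exactly PASS or FAIL."
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper() == "PASS"

Deterministic checks stay cheap and exact; a judge like this covers fuzzier criteria such as tone, completeness, or policy compliance.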

Next Steps

  • Getting Started

    Install Evaldeck and run your first evaluation.

    Get Started

  • User Guide

    Learn how to configure test cases, graders, and CI/CD.

    User Guide

  • Concepts

    Understand traces, evaluation workflows, and grading strategies.

    Concepts

  • API Reference

    Detailed documentation for all classes and functions.

    API Reference