Examples¶
Practical examples showing how to use Evaldeck in common scenarios.
Quick Examples¶
| Example | Description |
|---|---|
| Basic Usage | Core workflow: create trace, define test, evaluate |
| Testing Tool Calls | Verify correct tool selection and arguments |
| LLM-as-Judge | Use LLMs for subjective evaluation |
| LangChain Agent | Evaluate LangChain/LangGraph agents |
Complete Example Project¶
For a full working example, see the evaldeck-langchain-example repository.
Code Snippets¶
Minimal Example¶
from evaldeck import Trace, Step, Evaluator, EvalCase, ExpectedBehavior, Turn

# Create a trace (simulating agent execution)
trace = Trace(input="Search for flights to NYC")
trace.add_step(Step.tool_call("search_flights", {"destination": "NYC"}, {"flights": [...]}))
trace.complete(output="Found 3 flights to NYC")

# Define expectations
test_case = EvalCase(
    name="search_test",
    turns=[
        Turn(
            user="Search for flights to NYC",
            expected=ExpectedBehavior(
                tools_called=["search_flights"],
                output_contains=["flights", "NYC"],
            ),
        )
    ],
)

# Evaluate
result = Evaluator().evaluate(trace, test_case)
print(f"Passed: {result.passed}")
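To make the two expectations in the example above concrete, the following is a minimal sketch of what checks like `tools_called` and `output_contains` amount to conceptually. It uses only the standard library and is not Evaldeck's implementation: `SketchTrace` and `check_expectations` are hypothetical names, and case-sensitive substring matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SketchTrace:
    """Hypothetical stand-in for a trace: tool names called plus the final output."""
    tools: list[str] = field(default_factory=list)
    output: str = ""


def check_expectations(trace, tools_called, output_contains):
    """Return a list of failure messages; an empty list means the trace passed."""
    failures = []
    for tool in tools_called:
        if tool not in trace.tools:
            failures.append(f"expected tool not called: {tool}")
    for fragment in output_contains:
        # Assumption: substring matching is case-sensitive.
        if fragment not in trace.output:
            failures.append(f"output missing: {fragment!r}")
    return failures


trace = SketchTrace(tools=["search_flights"], output="Found 3 flights to NYC")
failures = check_expectations(trace, ["search_flights"], ["flights", "NYC"])
print("Passed:", not failures)
```

Returning a list of failure messages rather than a bare boolean keeps the reason for a failed test visible, which is usually what you want when debugging an agent.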
With YAML Test Cases¶
from evaldeck import EvalSuite, Evaluator, Trace

# Load tests from YAML files
suite = EvalSuite.from_directory("tests/evals")

# Your agent function (must accept input and history)
def my_agent(input: str, history=None) -> Trace:
    trace = Trace(input=input)
    # ... your agent logic: record steps on the trace, then complete it ...
    return trace

# Run all tests
evaluator = Evaluator()
result = evaluator.evaluate_suite(suite, my_agent)
print(f"Results: {result.passed}/{result.total} passed")
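The suite run above relies on a simple contract: the agent is a callable that takes an input and an optional history, and the runner tallies how many cases pass. The sketch below illustrates that contract with the standard library only. It is a simplified illustration, not how Evaldeck runs suites: `run_suite`, `echo_agent`, and the `(input, required_substring)` case format are all hypothetical, and the agent returns a plain string instead of a trace.

```python
from typing import Callable, Optional


def run_suite(cases, agent: Callable[[str, Optional[list]], str]):
    """Call the agent once per case and count the passes.

    Each case is a (user_input, required_substring) pair: a hypothetical,
    much-reduced stand-in for a YAML-defined test case.
    """
    passed = 0
    for user_input, required in cases:
        output = agent(user_input, None)  # no prior history in this sketch
        if required in output:
            passed += 1
    return passed, len(cases)


def echo_agent(input: str, history=None) -> str:
    # Toy agent: echoes the request so the check has something to match.
    return f"Handled: {input}"


cases = [("Search for flights to NYC", "NYC"), ("Book a hotel", "confirmed")]
passed, total = run_suite(cases, echo_agent)
print(f"Results: {passed}/{total} passed")
```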
With LLM Grading¶
from evaldeck import Trace, EvalCase, Evaluator
from evaldeck.graders import LLMGrader

# Add LLM grader for subjective evaluation
llm_grader = LLMGrader(
    prompt="Is this response helpful and professional? Answer PASS or FAIL.",
    model="gpt-4o-mini",
)
evaluator = Evaluator(graders=[llm_grader])

# `trace` and `test_case` are built as in the Minimal Example above
result = evaluator.evaluate(trace, test_case)
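A prompt that ends with "Answer PASS or FAIL" still produces free text, so something has to map the judge's reply to a boolean. The sketch below shows one way that mapping could be done. It is an illustration only and makes no claim about how `LLMGrader` parses replies; `parse_verdict` is a hypothetical helper.

```python
import re


def parse_verdict(reply: str) -> bool:
    """Map a judge model's free-text reply to pass/fail.

    Raises ValueError when the reply contains neither verdict or both,
    so an ambiguous reply is never silently counted as a pass.
    """
    found = set(re.findall(r"\b(PASS|FAIL)\b", reply.upper()))
    if len(found) != 1:
        raise ValueError(f"ambiguous judge reply: {reply!r}")
    return found == {"PASS"}


print(parse_verdict("PASS"))
print(parse_verdict("Fail. The tone is dismissive."))
```

Raising on ambiguous replies is a deliberate choice here: treating an unparseable answer as a failure or a pass would hide problems with the judge prompt itself.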
File Structure¶
Example project structure: