Architecture Overview¶
This document explains how Evaldeck's components work together.
High-Level Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ Evaldeck │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ CLI │ │ Python API │ │ Config │ │
│ │ evaldeck run │ │ Evaluator │ │ YAML files │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Evaluation Engine │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Graders │ │ Metrics │ │ Results │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ ▲ │
│ ┌──────────────────┼──────────────────┐ │
│ │ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Trace │ │ Test Case │ │ Integrations│ │
│ │ Models │ │ Models │ │ LangChain │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Component Overview¶
Data Models¶
Trace Models (trace.py)
- Trace - Complete execution record
- Step - Single action (tool call, LLM call, etc.)
- TokenUsage - Token consumption tracking
- Enums: StepType, StepStatus, TraceStatus
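As a rough illustration of how these models relate, here is a minimal self-contained sketch; the field names below are assumptions for illustration, not Evaldeck's actual schema:

```python
# Illustrative sketch of the trace models; field names are assumptions,
# not Evaldeck's actual schema.
from dataclasses import dataclass, field
from enum import Enum

class StepType(Enum):
    TOOL_CALL = "tool_call"
    LLM_CALL = "llm_call"

@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

@dataclass
class Step:
    type: StepType
    name: str

@dataclass
class Trace:
    steps: list = field(default_factory=list)
    usage: TokenUsage = field(default_factory=TokenUsage)

    def add_step(self, step: Step) -> None:
        self.steps.append(step)

trace = Trace()
trace.add_step(Step(StepType.TOOL_CALL, "search"))
```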
Test Case Models (test_case.py)
- EvalCase - Single test definition
- EvalSuite - Collection of test cases
- ExpectedBehavior - What the agent should do
- GraderConfig - Custom grader configuration
Result Models (results.py)
- GradeResult - Single grader output
- MetricResult - Single metric output
- EvaluationResult - Complete evaluation of one test
- SuiteResult - Results for a test suite
- RunResult - Results for the entire run
Evaluation Engine¶
Evaluator (evaluator.py)
- Core evaluation logic
- Builds graders from expectations
- Runs graders and collects results
- Calculates metrics
EvaluationRunner (evaluator.py)
- High-level orchestration
- Suite discovery
- Agent loading
- Result aggregation
Graders¶
Base (graders/base.py)
BaseGrader- Abstract base class
Code-Based (graders/code.py)
- ContainsGrader, NotContainsGrader
- EqualsGrader, RegexGrader
- ToolCalledGrader, ToolNotCalledGrader, ToolOrderGrader
- MaxStepsGrader, TaskCompletedGrader
- CustomGrader, CompositeGrader
LLM-Based (graders/llm.py)
- LLMGrader - Pass/fail with a prompt
- LLMRubricGrader - Multi-criteria scoring
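To illustrate the shape of a code-based grader, here is a simplified, self-contained version of a ContainsGrader-style check; the stub GradeResult and its fields are assumptions, not the classes defined in graders/code.py:

```python
# Simplified sketch of a code-based grader; the GradeResult stub and its
# fields are assumptions for illustration, not Evaldeck's real classes.
from dataclasses import dataclass

@dataclass
class GradeResult:
    grader: str
    passed: bool
    reason: str = ""

class ContainsGrader:
    """Passes if the expected substring appears in the final output."""
    def __init__(self, expected: str):
        self.expected = expected

    def grade(self, final_output: str) -> GradeResult:
        ok = self.expected in final_output
        return GradeResult(
            grader="contains",
            passed=ok,
            reason="" if ok else f"{self.expected!r} not found in output",
        )

result = ContainsGrader("refund").grade("Your refund has been processed.")
```

Each grader checks exactly one condition, which is what makes them composable.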
Metrics¶
Base (metrics/base.py)
BaseMetric- Abstract base class
Built-in (metrics/builtin.py)
- StepCountMetric, TokenUsageMetric
- ToolCallCountMetric, LLMCallCountMetric
- DurationMetric, ToolDiversityMetric
- StepEfficiencyMetric, ErrorRateMetric
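A metric follows the same pattern as a grader but produces a measurement rather than a pass/fail verdict. A minimal StepCountMetric-style sketch, with stub types standing in for Evaldeck's (assumptions, not the real API):

```python
# Illustrative sketch of a built-in metric; MetricResult's fields and the
# trace representation are assumptions, not Evaldeck's actual API.
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    value: float

class StepCountMetric:
    """Counts steps in a trace; a trace here is just a list of step dicts."""
    def calculate(self, trace: list) -> MetricResult:
        return MetricResult(name="step_count", value=float(len(trace)))

m = StepCountMetric().calculate([{"type": "llm_call"}, {"type": "tool_call"}])
```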
Interface Layer¶
CLI (cli.py)
- Click-based command interface
- init and run commands
- Output formatting (text, JSON, JUnit)
Configuration (config.py)
- YAML configuration loading
- Defaults and validation
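For orientation, an evaldeck.yaml might look roughly like this; the key names below are illustrative assumptions, and config.py holds the authoritative schema:

```yaml
# Hypothetical example -- key names are assumptions, not the authoritative schema.
agent: my_app.agent:run   # dotted path to the agent function
tests: tests/evals        # directory containing *.yaml eval suites
output: text              # text | json | junit
```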
Data Flow¶
Evaluation Flow¶
1. Load Configuration
evaldeck.yaml → EvaldeckConfig
2. Discover Tests
tests/evals/*.yaml → EvalSuite[]
3. Load Agent
config.agent → agent_function
4. For each test case:
a. Run agent
test_case.input → agent_function → Trace
b. Build graders
test_case.expected → Grader[]
c. Run graders
Trace + TestCase → Grader[] → GradeResult[]
d. Calculate metrics
Trace → Metric[] → MetricResult[]
e. Aggregate results
GradeResult[] + MetricResult[] → EvaluationResult
5. Aggregate suite results
EvaluationResult[] → SuiteResult → RunResult
6. Output results
RunResult → text/JSON/JUnit
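The per-test loop above can be sketched in plain Python; every name here is an illustrative assumption mirroring the flow, not Evaldeck's actual internals:

```python
# Sketch of the evaluation loop described above; all names are illustrative
# assumptions standing in for Evaldeck's internals.
def run_suite(suite, agent_function, graders_for, metrics):
    results = []
    for case in suite:
        trace = agent_function(case["input"])       # 4a. run agent
        graders = graders_for(case)                 # 4b. build graders
        grades = [g(trace, case) for g in graders]  # 4c. run graders
        measured = [m(trace) for m in metrics]      # 4d. calculate metrics
        results.append({                            # 4e. aggregate results
            "case": case["name"],
            "passed": all(grades),
            "metrics": measured,
        })
    return results  # 5. aggregated upward into suite/run results

suite = [{"name": "greets", "input": "hi", "expect": "hello"}]
out = run_suite(
    suite,
    agent_function=lambda text: f"hello, you said {text}",
    graders_for=lambda case: [lambda trace, c: c["expect"] in trace],
    metrics=[lambda trace: ("length", len(trace))],
)
```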
Grading Flow¶
┌─────────────────────────────────────────────────────┐
│ Grading │
├─────────────────────────────────────────────────────┤
│ │
│ ExpectedBehavior Custom Graders │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ tools_called │──┐ │ LLMGrader │ │
│ │ output_contains │ │ │ CustomGrader │ │
│ │ max_steps │ │ └────────┬────────┘ │
│ └─────────────────┘ │ │ │
│ ▼ │ │
│ ┌────────────────┐ │ │
│ │ Auto-build │ │ │
│ │ Graders │ │ │
│ └───────┬────────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────┐ │
│ │ Combined Graders │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ grade(trace, case) │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ GradeResult[] │ │
│ └─────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────┘
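The merge of auto-built and custom graders shown in the diagram can be sketched as follows; the builder logic and trace shape are assumptions for illustration:

```python
# Sketch of combining auto-built graders (from ExpectedBehavior fields) with
# custom graders, per the diagram above; the builder logic is an assumption.
def build_graders(expected: dict, custom: list) -> list:
    graders = []
    if "output_contains" in expected:
        needle = expected["output_contains"]
        graders.append(lambda trace, n=needle: n in trace["output"])
    if "max_steps" in expected:
        limit = expected["max_steps"]
        graders.append(lambda trace, k=limit: len(trace["steps"]) <= k)
    return graders + custom  # combined graders share one grade(trace) shape

graders = build_graders(
    {"output_contains": "done", "max_steps": 3},
    custom=[lambda trace: trace["output"].endswith(".")],
)
trace = {"output": "All done.", "steps": ["plan", "act"]}
results = [g(trace) for g in graders]
```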
Extension Points¶
Adding a New Grader¶
- Inherit from BaseGrader
- Implement grade(trace, test_case) -> GradeResult
- Export from evaldeck.graders
class MyGrader(BaseGrader):
    def grade(self, trace: Trace, test_case: EvalCase) -> GradeResult:
        # Custom logic
        return GradeResult(...)
Adding a New Metric¶
- Inherit from BaseMetric
- Implement calculate(trace, test_case) -> MetricResult
- Export from evaldeck.metrics
class MyMetric(BaseMetric):
    def calculate(self, trace: Trace, test_case: EvalCase) -> MetricResult:
        # Custom calculation
        return MetricResult(...)
Adding a New Integration¶
The recommended approach is to use OpenTelemetry with OpenInference instrumentors. See evaldeck.integrations.opentelemetry for the implementation.
For frameworks with OpenInference support, no additional code is needed:
from evaldeck.integrations import setup_otel_tracing
from openinference.instrumentation.langchain import LangChainInstrumentor
processor = setup_otel_tracing()
LangChainInstrumentor().instrument()
# Traces captured automatically
For custom frameworks without OpenTelemetry support, create an adapter that builds Trace objects:
class MyFrameworkTracer:
    def __init__(self):
        self.trace = Trace(...)

    def on_event(self, event):
        self.trace.add_step(...)

    def get_trace(self) -> Trace:
        return self.trace
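A runnable version of such an adapter, with a stub Trace standing in for Evaldeck's model (the stub fields and event shape are assumptions for illustration):

```python
# Runnable sketch of a custom-framework adapter; the Trace stub and the
# event dict shape are assumptions standing in for Evaldeck's actual models.
from dataclasses import dataclass, field

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def add_step(self, step: dict) -> None:
        self.steps.append(step)

class MyFrameworkTracer:
    """Collects framework events and converts them to the common Trace format."""
    def __init__(self):
        self.trace = Trace()

    def on_event(self, event: dict) -> None:
        self.trace.add_step({"type": event["kind"], "name": event["name"]})

    def get_trace(self) -> Trace:
        return self.trace

tracer = MyFrameworkTracer()
tracer.on_event({"kind": "tool_call", "name": "search"})
tracer.on_event({"kind": "llm_call", "name": "answer"})
```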
Design Principles¶
1. Framework Agnostic¶
The Trace model is independent of any agent framework. Integrations convert framework-specific events to this common format.
2. Composable Graders¶
Graders are independent units that can be combined. Each grader checks one thing.
3. Separation of Concerns¶
- Models: Data structures
- Graders: Pass/fail logic
- Metrics: Measurements
- Evaluator: Orchestration
- CLI: User interface
4. YAML-First Configuration¶
Test cases and config use YAML for readability and version control friendliness.
5. Python API Parity¶
Everything available in YAML is also available programmatically.