Traces & Steps¶

A Trace is the complete record of an agent's execution. Understanding traces is essential for effective agent evaluation.

What is a Trace?¶

A trace captures everything that happened during agent execution:

Input: "Book a flight to NYC"
                 │
                 ▼
┌─────────────────────────────────────┐
│              Trace                  │
├─────────────────────────────────────┤
│ Step 1: LLM_CALL                    │
│   "I'll search for flights..."      │
├─────────────────────────────────────┤
│ Step 2: TOOL_CALL (search_flights)  │
│   → Found 3 flights                 │
├─────────────────────────────────────┤
│ Step 3: REASONING                   │
│   "AA123 is cheapest, selecting..." │
├─────────────────────────────────────┤
│ Step 4: TOOL_CALL (book_flight)     │
│   → Confirmation: ABC123            │
├─────────────────────────────────────┤
│ Step 5: LLM_CALL                    │
│   "Your flight is booked..."        │
└─────────────────────────────────────┘
                 │
                 ▼
Output: "Your flight to NYC is booked. Confirmation: ABC123"

Trace Structure¶

from evaldeck import Trace, TraceStatus

trace = Trace(
    # Required
    input="User's request",

    # Set after execution
    output="Agent's response",
    status=TraceStatus.SUCCESS,

    # Execution details
    steps=[...],
    started_at=datetime(...),
    completed_at=datetime(...),
    duration_ms=1500,

    # Metadata
    framework="langchain",
    agent_name="BookingAgent",
    metadata={"user_id": "123"}
)

Trace Fields¶

Field	Type	Description
`id`	str	Unique identifier (auto-generated)
`input`	str	User input to the agent
`output`	str	Agent's final output
`status`	TraceStatus	SUCCESS, FAILURE, TIMEOUT, ERROR
`steps`	list[Step]	Ordered list of execution steps
`started_at`	datetime	When execution started
`completed_at`	datetime	When execution completed
`duration_ms`	int	Total execution time
`framework`	str	Framework name (e.g., "langchain")
`agent_name`	str	Agent identifier
`metadata`	dict	Custom key-value data

Trace Status¶

Status	When to Use
`SUCCESS`	Agent completed task successfully
`FAILURE`	Agent failed to complete task
`TIMEOUT`	Execution exceeded time limit
`ERROR`	Unexpected error occurred

Step Types¶

Each step in a trace has a type:

TOOL_CALL¶

Agent invoked a tool/function:

from evaldeck import Step

step = Step.tool_call(
    tool_name="search_flights",
    tool_args={"from": "NYC", "to": "LA", "date": "2024-03-15"},
    tool_result={"flights": [{"id": "AA123", "price": 299}]},
    duration_ms=150,
    status="success"
)

Fields:

tool_name: Name of the tool called
tool_args: Arguments passed to the tool
tool_result: What the tool returned
status: "success" or "error"
error: Error message if failed

LLM_CALL¶

Agent called a language model:

from evaldeck import Step, TokenUsage

step = Step.llm_call(
    model="gpt-4o-mini",
    input="Parse this request: Book a flight to NYC",
    output='{"destination": "NYC", "type": "flight"}',
    token_usage=TokenUsage(prompt_tokens=50, completion_tokens=20, total_tokens=70),
    duration_ms=500
)

Fields:

model: Model identifier
input: Prompt sent to model
output: Model's response
token_usage: Token consumption

REASONING¶

Agent's internal reasoning:

step = Step.reasoning(
    text="User wants to book a flight. I should search for available flights first.",
    duration_ms=10
)

Fields:

reasoning_text: The reasoning content

HUMAN_INPUT¶

Human-in-the-loop input:

step = Step(
    type=StepType.HUMAN_INPUT,
    input="Is this the correct flight?",
    output="Yes, proceed with booking"
)

Step Structure¶

All steps share common fields:

Step(
    # Identity
    id="step_001",
    type=StepType.TOOL_CALL,

    # Timing
    timestamp=datetime.now(),
    duration_ms=150,

    # Status
    status=StepStatus.SUCCESS,  # or FAILURE, ERROR
    error=None,

    # Type-specific fields
    tool_name="search",
    tool_args={...},
    tool_result={...},

    # Or for LLM calls
    model="gpt-4o-mini",
    input="...",
    output="...",
    token_usage=TokenUsage(...),

    # Metadata
    metadata={"custom": "data"}
)

Building Traces¶

Manual Construction¶

from evaldeck import Trace, Step, TokenUsage

trace = Trace(input="Book a flight to NYC")

# Add steps as execution proceeds
trace.add_step(Step.llm_call(
    model="gpt-4o-mini",
    input="Parse: Book a flight to NYC",
    output="destination=NYC, type=flight",
    token_usage=TokenUsage(prompt_tokens=30, completion_tokens=10, total_tokens=40)
))

trace.add_step(Step.tool_call(
    tool_name="search_flights",
    tool_args={"destination": "NYC"},
    tool_result={"flights": [...]}
))

# Complete the trace
trace.complete(
    output="Your flight is booked!",
    status="success"
)

With OpenTelemetry Integration¶

Use the built-in OpenTelemetry adapter to capture traces automatically:

from evaldeck.integrations import setup_otel_tracing
from openinference.instrumentation.langchain import LangChainInstrumentor

# Setup once at module level
processor = setup_otel_tracing()
LangChainInstrumentor().instrument()

# Run agent (traces captured automatically)
processor.reset()
result = agent.invoke({"input": "..."})
trace = processor.get_latest_trace()

Accessing Trace Data¶

Step Iteration¶

for step in trace.steps:
    print(f"{step.type}: {step.status}")

Filtered Access¶

# Get all tool calls
tool_calls = trace.tool_calls
for tc in tool_calls:
    print(f"{tc.tool_name}({tc.tool_args}) -> {tc.tool_result}")

# Get all LLM calls
llm_calls = trace.llm_calls
for lc in llm_calls:
    print(f"{lc.model}: {lc.token_usage.total} tokens")

Computed Properties¶

# Set of tool names called
tools = trace.tools_called  # {"search", "book"}

# Total token usage
tokens = trace.total_tokens  # 250

# Step count
count = trace.step_count  # 5

# Duration
duration = trace.duration_ms  # 1500

Serialization¶

To Dictionary¶

data = trace.to_dict()
# {'id': '...', 'input': '...', 'steps': [...], ...}

From Dictionary¶

trace = Trace.from_dict(data)

To/From JSON¶

import json

# Serialize
json_str = json.dumps(trace.to_dict())

# Deserialize
trace = Trace.from_dict(json.loads(json_str))

Why Traces Matter¶

1. Granular Debugging¶

When a test fails, the trace shows exactly where:

Step 1: LLM_CALL ✓
Step 2: TOOL_CALL (search) ✓
Step 3: TOOL_CALL (book) ✗  ← Failed here
        Error: "Invalid flight ID"

2. Behavioral Verification¶

Check not just output, but how the agent got there:

expected:
  tools_called: [search, book]  # Verify the path
  tool_call_order: [search, book]  # Verify the sequence

3. Efficiency Analysis¶

Measure and optimize agent behavior:

print(f"Steps: {trace.step_count}")
print(f"Tokens: {trace.total_tokens}")
print(f"Duration: {trace.duration_ms}ms")

4. Regression Detection¶

Compare traces across versions to catch regressions:

# Before: 3 steps, 150 tokens
# After: 7 steps, 450 tokens  ← Regression!

Best Practices¶

Capture everything - Include reasoning steps, not just tool calls
Track timing - duration_ms helps identify bottlenecks
Include metadata - Add context that helps debugging
Use status correctly - Distinguish between failure modes
Complete traces properly - Always call complete() with final status