LLM-Based Graders¶
LLM-based graders use a language model to evaluate agent outputs. They're ideal for subjective criteria that can't be captured by simple rules.
When to Use¶
Use LLM graders for:
- Evaluating helpfulness or quality
- Checking accuracy of information
- Assessing tone and professionalism
- Comparing against reference outputs
- Any subjective evaluation
Setup¶
API Keys¶
LLM graders require API keys:
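A minimal sketch, assuming evaldeck picks up the standard provider environment variables used by the OpenAI and Anthropic SDKs:

export OPENAI_API_KEY="sk-..."        # for OpenAI grader models
export ANTHROPIC_API_KEY="sk-ant-..." # for Anthropic grader models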
Installation¶
pip install evaldeck[openai] # For OpenAI models
pip install evaldeck[anthropic] # For Anthropic models
pip install evaldeck[all] # Both
LLMGrader¶
The basic LLM grader evaluates pass/fail based on a prompt.
YAML Configuration¶
graders:
  - type: llm
    prompt: |
      Evaluate if this response is helpful and accurate.
      User asked: {{ input }}
      Agent responded: {{ output }}
      Is the response helpful? Answer PASS or FAIL.
    model: gpt-4o-mini
Template Variables¶
| Variable | Description |
|---|---|
| {{ input }} | The test case input |
| {{ output }} | The agent's final output |
| {{ trace }} | Full trace as JSON |
| {{ task }} | Test case description |
| {{ reference }} | Reference output (if provided) |
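The variables can be combined freely in a prompt. For example, a sketch of a grader that judges the agent's behavior from the full trace rather than only the final text (the exact trace JSON structure depends on your agent integration):

graders:
  - type: llm
    prompt: |
      Task: {{ task }}
      Full trace (JSON): {{ trace }}
      Did the agent take reasonable steps to complete this task? Answer PASS or FAIL.
    model: gpt-4o-mini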
Full Example¶
name: customer_support_quality
input: "I can't log into my account"
graders:
  - type: llm
    prompt: |
      You are evaluating a customer support AI agent.
      Customer message: {{ input }}
      Agent response: {{ output }}
      Evaluate the response on these criteria:
      1. Does it acknowledge the customer's issue?
      2. Does it provide actionable next steps?
      3. Is the tone empathetic and professional?
      If all criteria are met, respond with: VERDICT: PASS
      If any criteria fail, respond with: VERDICT: FAIL
      Explain your reasoning briefly.
    model: gpt-4o-mini
Response Parsing¶
The grader looks for these patterns in the LLM response:
- VERDICT: PASS or VERDICT: FAIL
- PASS or FAIL (at the start of a line)
- Result: PASS or Result: FAIL
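For example, a grader reply like the following is parsed as a failing grade, and the explanation is surfaced in the result message:

VERDICT: FAIL
REASON: The response never tells the user how to reset their password.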
Threshold-Based Scoring¶
For numeric scoring instead of pass/fail:
graders:
  - type: llm
    prompt: |
      Score this response from 1-5 for helpfulness.
      Response: {{ output }}
      Provide your score as: SCORE: <number>
    model: gpt-4o-mini
    threshold: 4  # Must score 4 or higher to pass
The grader extracts the number after SCORE: and compares it against the threshold; with threshold: 4, a reply of SCORE: 4 passes while SCORE: 3 fails.
LLMRubricGrader¶
For detailed multi-criteria evaluation:
graders:
  - type: llm_rubric
    prompt: |
      Evaluate this customer service response.
      Response: {{ output }}
    rubric:
      accuracy:
        description: "Information provided is correct"
        weight: 0.4
      helpfulness:
        description: "Response helps solve the problem"
        weight: 0.3
      tone:
        description: "Professional and empathetic tone"
        weight: 0.3
    model: gpt-4o-mini
    threshold: 3.5  # Weighted average must be ≥ 3.5
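As a worked example (assuming each criterion is scored 1-5): scores of 4 for accuracy, 3 for helpfulness, and 4 for tone give a weighted average of 0.4×4 + 0.3×3 + 0.3×4 = 3.7, which clears the 3.5 threshold and passes; dropping helpfulness to 2 gives 3.4 and fails.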
Model Selection¶
Available Models¶
OpenAI:
- gpt-4o - Most capable, higher cost
- gpt-4o-mini - Good balance (recommended)
- gpt-4-turbo - Previous generation
Anthropic:
- claude-3-opus - Most capable
- claude-3-sonnet - Good balance
- claude-3-haiku - Fast and cheap
Configuration¶
Set a default model in evaldeck.yaml and override it per grader where needed, as in the sketch below.
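A minimal sketch, assuming evaldeck.yaml accepts a default-model setting for LLM graders (the llm.model key here is illustrative) and that a model: set on an individual grader takes precedence:

# evaldeck.yaml (assumed default-model key)
llm:
  model: gpt-4o-mini

# in a test file, one grader overrides the default
graders:
  - type: llm
    prompt: "..."
    model: gpt-4o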
Python Usage¶
from evaldeck.graders import LLMGrader
grader = LLMGrader(
    prompt="""
Is this response helpful?
Response: {{ output }}
Answer PASS or FAIL.
""",
    model="gpt-4o-mini",
    provider="openai",  # or "anthropic"
    threshold=None,  # None for pass/fail, float for scoring
    timeout=60,  # Timeout in seconds
)
result = grader.grade(trace, test_case)
print(result.status) # PASS or FAIL
print(result.message) # LLM's explanation
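LLMRubricGrader can be used from Python as well. A sketch, assuming it lives in evaldeck.graders alongside LLMGrader and that its constructor mirrors the YAML options with a rubric dictionary (the exact signature may differ):

from evaldeck.graders import LLMRubricGrader  # assumed import path

grader = LLMRubricGrader(
    prompt="Evaluate this customer service response.\nResponse: {{ output }}",
    rubric={
        "accuracy": {"description": "Information provided is correct", "weight": 0.4},
        "helpfulness": {"description": "Response helps solve the problem", "weight": 0.3},
        "tone": {"description": "Professional and empathetic tone", "weight": 0.3},
    },
    model="gpt-4o-mini",
    threshold=3.5,  # weighted average must be >= 3.5 to pass
)

result = grader.grade(trace, test_case)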
Prompt Engineering Tips¶
1. Be Specific About Criteria¶
# Vague (bad)
prompt: "Is this good?"

# Specific (good)
prompt: |
  Evaluate the response on:
  1. Accuracy of information
  2. Completeness of answer
  3. Clarity of explanation
2. Provide Context¶
prompt: |
  You are evaluating an AI travel agent.
  The agent helps users book flights and hotels.
  User request: {{ input }}
  Agent response: {{ output }}
  Did the agent appropriately handle this travel request?
3. Specify Output Format¶
prompt: |
  Evaluate and respond in this exact format:
  VERDICT: PASS or FAIL
  REASON: <brief explanation>
4. Include Reference When Available¶
name: accuracy_check
input: "What is the capital of France?"
reference_output: "The capital of France is Paris."
graders:
  - type: llm
    prompt: |
      Compare the agent's response to the reference.
      Reference: {{ reference }}
      Agent response: {{ output }}
      Is the agent's response accurate? PASS or FAIL.
Cost Management¶
LLM graders cost money. Optimize by:
1. Use Cheaper Models for Simple Checks¶
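Reserve the larger models for nuanced judgments and use gpt-4o-mini or claude-3-haiku for simple yes/no checks. For example (the prompt here is illustrative):

graders:
  - type: llm
    prompt: |
      Does the response mention a refund? Answer PASS or FAIL.
      Response: {{ output }}
    model: gpt-4o-mini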
2. Layer with Code-Based Graders¶
Run fast checks first, LLM only if needed:
expected:
  # Free, fast checks
  tools_called: [required_tool]
  output_not_contains: [error]
graders:
  # Only runs if above pass
  - type: llm
    prompt: "..."
3. Batch Similar Tests¶
Group tests that need the same type of evaluation.
Determinism¶
LLM graders are non-deterministic. The same input may produce different results.
Strategies for Consistency¶
- Use a low temperature when the grader model supports it (see the sketch after this list)
- Clear, specific prompts reduce ambiguity
- Multiple runs for critical tests
- Thresholds instead of binary pass/fail
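A sketch of pinning the sampling temperature, assuming the llm grader accepts a temperature option (this key is hypothetical; omit it if your version does not support it):

graders:
  - type: llm
    prompt: "..."
    model: gpt-4o-mini
    temperature: 0  # hypothetical option; reduces run-to-run variance if supported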
Error Handling¶
When an LLM call fails (for example, a timeout or an invalid API key), the grader returns GradeStatus.ERROR with the error details in the result message.
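A sketch of checking for grading errors in Python, assuming GradeStatus is importable from evaldeck.graders alongside LLMGrader:

from evaldeck.graders import GradeStatus  # assumed import path

result = grader.grade(trace, test_case)
if result.status == GradeStatus.ERROR:
    # The LLM call itself failed (e.g. timeout); details are in the message
    print(f"Grading error: {result.message}")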
Best Practices¶
- Combine with code-based graders - Use LLM for what rules can't check
- Start with gpt-4o-mini - Upgrade only if needed
- Test your prompts - Run manually before automating
- Monitor costs - Track API usage
- Cache when possible - Same input = same grade (coming soon)