LLM Graders¶
Model-as-judge graders using LLM APIs.
evaldeck.graders.LLMGrader¶
LLMGrader(prompt=None, model='gpt-4o-mini', provider=None, api_key=None, threshold=None, temperature=0.0, task=None)
Bases: BaseGrader
Use an LLM to grade agent output.
This grader sends the trace/output to an LLM with a grading prompt and parses the response to determine pass/fail. It supports the OpenAI and Anthropic APIs (you provide your own API key); see the example after the parameter table.
Initialize LLM grader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str \| None` | Custom grading prompt. Use `{input}`, `{output}`, `{trace}` placeholders. | `None` |
| `model` | `str` | Model to use (e.g., `"gpt-4o-mini"`, `"claude-3-haiku-20240307"`). | `'gpt-4o-mini'` |
| `provider` | `str \| None` | API provider (`"openai"` or `"anthropic"`). Auto-detected from the model. | `None` |
| `api_key` | `str \| None` | API key. If `None`, uses the environment variable. | `None` |
| `threshold` | `float \| None` | Score threshold for pass (if using scored evaluation). | `None` |
| `temperature` | `float` | Model temperature. | `0.0` |
| `task` | `str \| None` | Task description for the default prompt. | `None` |
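For illustration, here is one way to configure the grader using only the parameters documented above; the prompt text itself is a made-up example:

```python
from evaldeck.graders import LLMGrader

# Custom grading prompt; the {input} and {output} placeholders are
# filled in from the item being graded.
grader = LLMGrader(
    prompt=(
        "Task input:\n{input}\n\n"
        "Agent output:\n{output}\n\n"
        "Does the output correctly answer the input? Answer PASS or FAIL."
    ),
    model="gpt-4o-mini",  # provider auto-detected as "openai"
    temperature=0.0,      # deterministic grading
)

# Alternatively, rely on the default prompt and just describe the task:
task_grader = LLMGrader(task="Summarize support tickets in one sentence")
```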
grade¶
Grade the trace using an LLM (sync).
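A minimal sketch of synchronous use. This reference does not show the method's full signature or return type, so the single `trace` argument and the comments below are assumptions for illustration:

```python
# Sketch only: `trace` stands for a trace object produced by an
# evaldeck run; its exact type is not documented on this page.
result = grader.grade(trace)  # blocking call to the LLM API
print(result)                 # pass/fail parsed from the LLM response
```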
grade_async¶
Grade the trace using an LLM (async).
Uses async API clients for better performance in concurrent evaluation.
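Because `grade_async` uses async clients, multiple traces can be graded concurrently. A sketch, assuming each element of `traces` is a trace object the grader accepts:

```python
import asyncio

async def grade_all(grader, traces):
    # Fan out one grade_async call per trace and await them together;
    # this is where the async clients pay off in concurrent evaluation.
    return await asyncio.gather(*(grader.grade_async(t) for t in traces))

# results = asyncio.run(grade_all(grader, traces))
```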
evaldeck.graders.LLMRubricGrader¶
Bases: LLMGrader
LLM grader with a detailed scoring rubric.
Initialize rubric grader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rubric` | `dict[str, str]` | Dict mapping criterion names to descriptions. | *required* |
| `pass_threshold` | `float` | Minimum score ratio to pass (0-1). | `0.7` |
| `**kwargs` | `Any` | Passed to `LLMGrader`. | `{}` |
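A construction sketch using only the documented parameters; the criteria shown are invented examples, and `model` is forwarded to `LLMGrader` through `**kwargs`:

```python
from evaldeck.graders import LLMRubricGrader

grader = LLMRubricGrader(
    rubric={
        "correctness": "The final answer is factually correct.",
        "completeness": "All parts of the question are addressed.",
        "tone": "The response is polite and professional.",
    },
    pass_threshold=0.7,  # pass when the scored ratio is at least 0.7
    model="claude-3-haiku-20240307",  # forwarded to LLMGrader
)
```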