CLI Reference¶

Evaldeck provides a command-line interface for initializing projects and running evaluations.

Commands Overview¶

evaldeck --help

Usage: evaldeck [OPTIONS] COMMAND [ARGS]...

  Evaldeck - The evaluation framework for AI agents.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  init  Initialize a new Evaldeck project.
  run   Run evaluations.

evaldeck init¶

Initialize a new Evaldeck project in the current directory.

evaldeck init [OPTIONS]

Options¶

Option	Description
`--force`	Overwrite existing files
`--help`	Show help message

What It Creates¶

your-project/
├── evaldeck.yaml           # Configuration file
├── tests/
│   └── evals/
│       └── example.yaml  # Example test case
└── .evaldeck/              # Output directory

Example¶

# Initialize in current directory
evaldeck init

# Force overwrite existing files
evaldeck init --force

evaldeck run¶

Run evaluations against your agent.

evaldeck run [OPTIONS]

Options¶

Option	Short	Description
`--config PATH`	`-c`	Path to configuration file
`--suite NAME`	`-s`	Run specific test suite
`--tag TAG`	`-t`	Filter tests by tag (comma-separated for multiple)
`--output FORMAT`	`-o`	Output format: `text`, `json`, `junit`
`--output-file PATH`	`-f`	Write output to file
`--verbose`	`-v`	Show detailed output including traces
`--dry-run`		Validate config without running
`--help`		Show help message

Examples¶

Basic Run¶

Run all tests with default configuration:

evaldeck run

Custom Configuration¶

Use a specific config file:

evaldeck run --config config/evaldeck.ci.yaml
evaldeck run -c evaldeck.production.yaml

Filter by Suite¶

Run a specific test suite:

evaldeck run --suite critical
evaldeck run -s smoke

Filter by Tag¶

Run tests with specific tags:

# Single tag
evaldeck run --tag critical

# Multiple tags (OR logic)
evaldeck run --tag "critical,smoke"
evaldeck run -t booking -t search

Output Formats¶

Text (default) - Human-readable terminal output:

evaldeck run

Running 5 tests...

  ✓ book_flight_basic (1.2s)
  ✓ book_flight_roundtrip (2.1s)
  ✗ book_flight_preferences (1.8s)
    └─ FAIL: ToolCalledGrader - Expected tool 'set_preferences' not called

Results: 2/3 passed (66.7%)

JSON - Machine-readable format:

evaldeck run --output json
evaldeck run -o json --output-file results.json

{
  "total": 3,
  "passed": 2,
  "failed": 1,
  "pass_rate": 0.667,
  "duration_ms": 5100,
  "results": [...]
}

JUnit XML - For CI/CD integration:

evaldeck run --output junit --output-file results.xml

<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="evaldeck" tests="3" failures="1" time="5.1">
    <testcase name="book_flight_basic" time="1.2"/>
    <testcase name="book_flight_roundtrip" time="2.1"/>
    <testcase name="book_flight_preferences" time="1.8">
      <failure message="ToolCalledGrader failed">
        Expected tool 'set_preferences' not called
      </failure>
    </testcase>
  </testsuite>
</testsuites>

Verbose Mode¶

Show detailed execution traces:

evaldeck run --verbose
evaldeck run -v

  ✗ book_flight_preferences (1.8s)

    Trace:
    ├─ [1] LLM_CALL (gpt-4o-mini) - 0.3s
    │   └─ Parsed user request
    ├─ [2] TOOL_CALL (search_flights) - 0.5s
    │   └─ Found 5 flights
    ├─ [3] TOOL_CALL (book_flight) - 0.4s
    │   └─ Booked flight AA123
    └─ [4] LLM_CALL (gpt-4o-mini) - 0.2s
        └─ Generated response

    Grades:
    ├─ ToolCalledGrader: FAIL
    │   Expected: ['search_flights', 'set_preferences', 'book_flight']
    │   Actual: ['search_flights', 'book_flight']
    │   Missing: ['set_preferences']
    └─ ContainsGrader: PASS

Dry Run¶

Validate configuration without executing:

evaldeck run --dry-run

Configuration valid.
Found 15 test cases across 3 suites.
Agent: my_agent.run_agent

Would run:
  - critical: 5 tests
  - regression: 8 tests
  - smoke: 2 tests

Exit Codes¶

Code	Meaning
0	All tests passed
1	One or more tests failed
2	Configuration or runtime error

Use exit codes in scripts:

evaldeck run --tag critical
if [ $? -eq 0 ]; then
  echo "All critical tests passed!"
else
  echo "Some tests failed"
  exit 1
fi

Environment Variables¶

Variable	Description
`EVALDECK_CONFIG`	Default config file path
`EVALDECK_VERBOSE`	Set to `1` for verbose output
`OPENAI_API_KEY`	API key for OpenAI LLM graders
`ANTHROPIC_API_KEY`	API key for Anthropic LLM graders

# Use environment variables
export EVALDECK_CONFIG=configs/evaldeck.ci.yaml
export EVALDECK_VERBOSE=1
evaldeck run

Combining Options¶

Options can be combined:

# Run critical tests with JSON output to file
evaldeck run \
  --config evaldeck.ci.yaml \
  --tag critical \
  --output json \
  --output-file results.json \
  --verbose

Shell Completion¶

Enable shell completion for bash/zsh:

# Bash
eval "$(_EVALDECK_COMPLETE=bash_source evaldeck)"

# Zsh
eval "$(_EVALDECK_COMPLETE=zsh_source evaldeck)"

# Add to your shell profile for persistence
echo 'eval "$(_EVALDECK_COMPLETE=bash_source evaldeck)"' >> ~/.bashrc