Eval Panel (`/i/dev/eval`)

The Eval Panel is a built-in devtools page that lets you run live evaluation cases — small fixed prompts with deterministic expectations — directly against your locally running agent. It exercises the real LLM, real tools, and real prompt assembly without polluting your conversation history.

It is available when the devtools UI is enabled with mushroom-agent start or mushroom-agent serve --ui. It is not mounted in plain production serve mode.

When to use

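Scenario	Eval Panel	`mushroom-agent eval-live`
You changed the system prompt or tools and want to verify interactively on your machine	✅ Eval Panel	⚠️ (works, but has no UI)
Running live regression in CI / scripts	❌	✅ `eval-live` + `eval-live.yml`
Checking whether the current model still follows a specific instruction	✅	✅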

Live runs share your agent's ctx.think and ctx.tools — they will hit your configured LLM provider and consume real tokens. Plan accordingly.

Open the page

mushroom-agent start
Open http://127.0.0.1:7860/i/dev/eval (or click "Eval →" from the chat top-bar)

Run a suite

Pick one or more cases from the left list (use All / None for quick toggling). All bundled cases run live.
Set Samples (1–5). Each sample is an independent LLM call.
(Optional) Tick Enable LLM-as-judge to score outputs with judge.instruction_following (and optionally judge.task_completion from the Advanced section). The checkbox is disabled if EVAL_JUDGE_MODEL is not set.
Click Run — a confirmation modal shows the estimated number of LLM calls (cases × samples × (1 + judge_metrics)).
Watch the live progress bar / per-sample table. Use Cancel to abort an in-flight run.
The summary line at the bottom shows pass/fail counts and any host-context drift warnings.

Each run is persisted to <workspace>/eval_runs/run_<hex>.json (or ~/.mushroom_agent/eval_runs/ if no workspace is configured). Only the newest N run files are kept, where N is the value of dev.eval.keep_runs; older runs are pruned.
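
The retention rule can be pictured with a minimal sketch. It assumes that "newest" means most recent modification time and that run files match run_*.json; prune_eval_runs is a hypothetical helper for illustration, not the panel's actual code.

```python
from pathlib import Path


def prune_eval_runs(run_dir: Path, keep_runs: int) -> list[Path]:
    """Delete all but the newest `keep_runs` run files; return the removed paths.

    Hypothetical helper: "newest" is decided by file mtime, which is an
    assumption about how the real panel orders runs.
    """
    runs = sorted(
        run_dir.glob("run_*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = runs[max(keep_runs, 0):]
    for path in stale:
        path.unlink()
    return stale
```

Calling prune_eval_runs(Path("eval_runs"), 20) would leave at most 20 run files in that directory.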

Token budget

A typical run spends:

calls = N_cases × N_samples × (1 + N_judge_metrics)

With the bundled 10 live cases, 1 sample, and no judge metrics, a run makes 10 × 1 × (1 + 0) = 10 LLM calls. With both judge metrics enabled and 3 samples, it makes 10 × 3 × (1 + 2) = 90 calls. Confirm the estimate in the modal before clicking through.
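
To sanity-check a planned run before opening the modal, the formula can be evaluated directly. The function name estimate_llm_calls is illustrative only; the arithmetic is the formula stated in this section.

```python
def estimate_llm_calls(n_cases: int, n_samples: int, n_judge_metrics: int = 0) -> int:
    """Return N_cases x N_samples x (1 + N_judge_metrics)."""
    if min(n_cases, n_samples, n_judge_metrics) < 0:
        raise ValueError("counts must be non-negative")
    # Each sample costs one agent call plus one call per enabled judge metric.
    return n_cases * n_samples * (1 + n_judge_metrics)


print(estimate_llm_calls(10, 1))     # 10
print(estimate_llm_calls(10, 3, 2))  # 90
```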

LLM-as-judge configuration

Judging requires its own LLM credentials, not the production OPENAI_API_KEY (this is enforced — see mushroom-evals/mushroom_evals/judge.py for the rationale):

export EVAL_JUDGE_MODEL=gpt-4o-mini
export EVAL_JUDGE_API_KEY=sk-eval-...
# optional
export EVAL_JUDGE_BASE_URL=https://your-proxy/v1

Without these, the judge metric rows show status="skipped" and do not count as failures. With them set incorrectly (e.g. wrong key), the row shows status="error" and the run continues.
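
A rough sketch of the pre-flight check implied here: judging is attempted only when the dedicated judge variables are present, and there is no fallback to OPENAI_API_KEY. The function name and the "ready" value are assumptions made for illustration; only "skipped" and "error" are statuses named in this document, and "error" can only be determined by actually calling the judge.

```python
import os
from typing import Mapping, Optional

# Variables this page lists as required for judging.
REQUIRED_JUDGE_VARS = ("EVAL_JUDGE_MODEL", "EVAL_JUDGE_API_KEY")


def judge_precheck(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "ready" if judge credentials are set, otherwise "skipped".

    Hypothetical helper. It deliberately never reads OPENAI_API_KEY.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_JUDGE_VARS if not env.get(name)]
    return "skipped" if missing else "ready"
```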

Configuration (`config.yaml`)

dev:
  eval:
    enabled: true            # set false to hide /i/dev/eval entirely
    max_concurrency: 1       # how many runs may execute at the same time (hard cap 4)
    keep_runs: 20            # disk retention
    default_enable_judge: false  # whether the UI judge checkbox starts checked

If dev.eval.enabled is false, register_eval_routes returns None and /i/dev/eval responds with a 404.
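
As an illustration of how these keys might be read once config.yaml has been parsed into a dict, here is a sketch. EvalPanelConfig and load_eval_config are hypothetical names, the values shown in the example above are treated as defaults (an assumption), and clamping max_concurrency into the range 1 to 4 is one possible reading of "hard cap 4"; the real loader may reject out-of-range values instead.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping

HARD_CONCURRENCY_CAP = 4  # the "hard cap 4" noted next to max_concurrency


@dataclass(frozen=True)
class EvalPanelConfig:
    # Assumed defaults, copied from the example config above.
    enabled: bool = True
    max_concurrency: int = 1
    keep_runs: int = 20
    default_enable_judge: bool = False


def load_eval_config(raw: Mapping[str, Any]) -> EvalPanelConfig:
    """Build the dev.eval settings from an already-parsed config mapping."""
    section = (raw.get("dev") or {}).get("eval") or {}
    cfg = EvalPanelConfig(**section)
    # Assumption: out-of-range concurrency is clamped rather than rejected.
    clamped = min(max(cfg.max_concurrency, 1), HARD_CONCURRENCY_CAP)
    return replace(cfg, max_concurrency=clamped)
```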

Difference from `mushroom-agent eval-live`

Both entry points share the exact same execution path (mushroom_evals.runners.agent_runner.run_case_live). The differences are operational:

	`mushroom-agent eval-live` (CLI)	Eval Panel (`/i/dev/eval`)
LLM cost	real	real
Where it executes	CI / scripts (no UI)	inside the dev process
Trigger	shell command	manual click
Memory	real `ctx.memory`; runner sets `turn_ctx.extras['is_eval']=True` and the memory implementation decides whether to short-circuit	same
Trace IDs	`eval-{run_id}-{case.id}-s{sample}`	same
Use case	scheduled / on-demand live regression	exploratory verification
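
The two shared rows (Memory and Trace IDs) can be illustrated with a small sketch. The trace ID format is taken from the table; whether the sample index starts at 0 or 1 is not stated here, so the caller chooses. should_short_circuit_memory is a hypothetical example of how a memory implementation could react to the is_eval flag, not the behavior of any bundled implementation.

```python
from typing import Any, Mapping


def eval_trace_id(run_id: str, case_id: str, sample: int) -> str:
    """Format a trace ID as eval-{run_id}-{case.id}-s{sample}."""
    return f"eval-{run_id}-{case_id}-s{sample}"


def should_short_circuit_memory(extras: Mapping[str, Any]) -> bool:
    """Hypothetical check a memory implementation could make on turn_ctx.extras."""
    return bool(extras.get("is_eval"))


print(eval_trace_id("ab12", "tool_call_basic", 0))  # eval-ab12-tool_call_basic-s0
```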

For the CI/scripts entry point, see mushroom-agent eval-live below and the .github/workflows/eval-live.yml workflow.

`mushroom-agent eval-live`

Same live execution path as the dev panel, but as a non-interactive CLI for CI / scripts.

# actually runs live LLM (requires a usable ~/.mushroom_agent/config.yaml + llm.api_key)
mushroom-agent eval-live --suite all --samples 2

# dry-run: discover cases, build agent, but do NOT call LLM
mushroom-agent eval-live --dry-run

Key flags:

Flag	Default	Notes
`--suite`	`all`	`smoke` or `all`
`--samples`	`1`	independent runs per case
`--capability` / `--case-id`	unset	filters
`--enable-judge` / `--no-enable-judge`	auto	needs `EVAL_JUDGE_MODEL` + `EVAL_JUDGE_API_KEY`
`--judge-metrics`	`judge.instruction_following`	comma-separated
`--case-timeout`	`120`	seconds; per-case asyncio.wait_for
`--write-baseline NAME`	unset	writes candidate baseline under `mushroom-evals/mushroom_evals/baselines/`
`--dry-run`	off	exits 0 after schema/agent validation, no LLM
`--silence`	off	suppress per-case progress lines
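
The `--case-timeout` note refers to wrapping each case in asyncio.wait_for. A minimal sketch of that pattern follows; the wrapper name and the "ok" / "timeout" result shape are assumptions for illustration, not the runner's actual output.

```python
import asyncio
from typing import Any, Awaitable, Dict


async def run_with_case_timeout(case: Awaitable[Any], timeout_s: float = 120.0) -> Dict[str, Any]:
    """Await one case, giving up after `timeout_s` seconds."""
    try:
        result = await asyncio.wait_for(case, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising.
        return {"status": "timeout", "result": None}
    return {"status": "ok", "result": result}
```

Because wait_for cancels the awaited task on timeout, a stuck case does not keep running in the background while the next case starts.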

Reports:

mushroom-evals/reports/live-{suite}-{ts}-detail.jsonl — one row per case×sample with all metrics
mushroom-evals/reports/live-{suite}-{ts}-summary.jsonl — aggregated per case (pass@1, avg_score)
exit code: 0 all green / 1 regression / 2 setup error (e.g. missing llm.api_key)
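
To show how the detail rows relate to the per-case summary, here is a sketch that aggregates JSONL rows into pass@1 and avg_score. The row field names case_id, passed, and score are assumptions, since the actual report schema is not documented here, and pass@1 is computed as the fraction of samples that passed.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def summarize_detail(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Aggregate detail rows (one per case x sample) into per-case metrics."""
    by_case = defaultdict(list)
    for line in lines:
        if line.strip():  # tolerate blank lines
            row = json.loads(line)
            by_case[row["case_id"]].append(row)  # assumed field name

    summary = {}
    for case_id, rows in by_case.items():
        n = len(rows)
        summary[case_id] = {
            "pass@1": sum(1 for r in rows if r["passed"]) / n,
            "avg_score": sum(r["score"] for r in rows) / n,
        }
    return summary
```

Passing an open detail file, for example `summarize_detail(open(path))`, yields one entry per case.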

The GitHub Action .github/workflows/eval-live.yml is the canonical scheduled / on-demand runner; it only triggers via workflow_dispatch.
