Eval Panel (/i/dev/eval)
The Eval Panel is a built-in devtools page that lets you run live evaluation cases — small fixed prompts with deterministic expectations — directly against your locally running agent. It exercises the real LLM, real tools, and real prompt assembly without polluting your conversation history.
It is available when the devtools UI is enabled with mushroom-agent start or mushroom-agent serve --ui. It is not mounted in plain production serve mode.
When to use
| Scenario | Use Eval Panel | Use mushroom-agent eval-live |
|---|---|---|
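| Changed the system prompt or tools and want to verify interactively on your machine | ✅ Eval Panel | ⚠️ (works, but no UI) |
| Running live regression in CI or scripts | ❌ | ✅ eval-live + eval-live.yml |
| Checking whether the current model still follows a specific instruction | ✅ | ✅ |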
Live runs share your agent's ctx.think and ctx.tools, so they will hit your configured LLM provider and consume real tokens. Plan accordingly.
Open the page
- Run mushroom-agent start
- Open http://127.0.0.1:7860/i/dev/eval (or click "Eval →" from the chat top-bar)
Run a suite
- Pick one or more cases from the left list (use All / None for quick toggling). All bundled cases run live.
- Set Samples (1–5). Each sample is an independent LLM call.
- (Optional) Tick Enable LLM-as-judge to score outputs with judge.instruction_following (and optionally judge.task_completion from the Advanced section). The checkbox is disabled if EVAL_JUDGE_MODEL is not set.
- Click Run. A confirmation modal shows the estimated number of LLM calls (cases × samples × (1 + judge_metrics)).
- Watch the live progress bar / per-sample table. Use Cancel to abort an in-flight run.
- The summary line at the bottom shows pass/fail counts and any host-context drift warnings.
Each run is persisted to <workspace>/eval_runs/run_<hex>.json (or ~/.mushroom_agent/eval_runs/ if no workspace is configured). The newest dev.eval.keep_runs files are kept; older runs are pruned.
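The retention rule above can be sketched as follows. This is a minimal illustration, not the panel's actual pruning code: the function name `prune_eval_runs` is hypothetical, and ordering files by modification time is an assumption about how "newest" is decided.

```python
from pathlib import Path


def prune_eval_runs(runs_dir: Path, keep_runs: int) -> list[Path]:
    """Delete all but the newest `keep_runs` run_*.json files.

    Hypothetical helper: returns the paths that were deleted.
    """
    if keep_runs < 0:
        raise ValueError("keep_runs must be non-negative")

    # Newest first. Using mtime as the ordering key is an assumption.
    runs = sorted(
        runs_dir.glob("run_*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )

    stale = runs[keep_runs:]
    for path in stale:
        path.unlink()
    return stale
```

With keep_runs set to 20, calling this on a directory holding 25 run files would remove the 5 oldest.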
Token budget
A typical run spends:
calls = N_cases × N_samples × (1 + N_judge_metrics)
With the 10 bundled live cases, 1 sample, and no judge metrics, a run makes 10 × 1 × (1 + 0) = 10 LLM calls. With both judge metrics enabled and 3 samples, it makes 10 × 3 × (1 + 2) = 90 calls. Confirm the estimate in the modal before clicking through.
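To sanity-check the estimate outside the UI, the formula can be written as a small helper. The function name `estimate_llm_calls` is hypothetical; only the formula itself comes from this page.

```python
def estimate_llm_calls(n_cases: int, n_samples: int, n_judge_metrics: int = 0) -> int:
    """Return N_cases × N_samples × (1 + N_judge_metrics)."""
    if min(n_cases, n_samples, n_judge_metrics) < 0:
        raise ValueError("all counts must be non-negative")
    # Each sample costs one agent call plus one call per enabled judge metric.
    return n_cases * n_samples * (1 + n_judge_metrics)


print(estimate_llm_calls(10, 1, 0))  # 10
print(estimate_llm_calls(10, 3, 2))  # 90
```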
LLM-as-judge configuration
Judging requires its own LLM credentials, not the production OPENAI_API_KEY (this is enforced — see mushroom-evals/mushroom_evals/judge.py for the rationale):
export EVAL_JUDGE_MODEL=gpt-4o-mini
export EVAL_JUDGE_API_KEY=sk-eval-...
# optional
export EVAL_JUDGE_BASE_URL=https://your-proxy/v1
Without these, the judge metric rows show status="skipped" and do not count as failures. With them set incorrectly (e.g. wrong key), the row shows status="error" and the run continues.
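A quick pre-flight check for the judge variables might look like the sketch below. It is not the logic in judge.py: the helper name `missing_judge_vars` is hypothetical, and treating both EVAL_JUDGE_MODEL and EVAL_JUDGE_API_KEY as required follows the CLI flag notes on this page rather than the panel's source.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# EVAL_JUDGE_BASE_URL is optional, so it is not checked here.
REQUIRED_JUDGE_VARS = ("EVAL_JUDGE_MODEL", "EVAL_JUDGE_API_KEY")


def missing_judge_vars(env: Mapping[str, str] | None = None) -> list[str]:
    """Return the required judge variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_JUDGE_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_judge_vars()
    if missing:
        print("judge rows would be skipped; missing:", ", ".join(missing))
    else:
        print("judge credentials are present")
```

A check like this only detects absent values. A wrong key still surfaces at run time as status="error".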
Configuration (config.yaml)
dev:
  eval:
    enabled: true                # set false to hide /i/dev/eval entirely
    max_concurrency: 1           # how many runs may execute at the same time (hard cap 4)
    keep_runs: 20                # disk retention
    default_enable_judge: false  # whether the UI judge checkbox starts checked
If dev.eval.enabled is false, register_eval_routes returns None and /i/dev/eval responds with a 404.
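One way to read these settings from an already-parsed config dictionary is sketched below. Everything here is illustrative: `EvalPanelSettings` and `load_eval_settings` are hypothetical names, the defaults simply mirror the sample values above, and clamping an out-of-range max_concurrency into 1..4 (rather than rejecting it) is an assumption about how the hard cap is applied.

```python
from dataclasses import dataclass, replace

MAX_CONCURRENCY_HARD_CAP = 4  # "hard cap 4" from the config comment


@dataclass(frozen=True)
class EvalPanelSettings:
    # Defaults mirror the sample config; the real defaults may differ.
    enabled: bool = True
    max_concurrency: int = 1
    keep_runs: int = 20
    default_enable_judge: bool = False


def load_eval_settings(config: dict) -> EvalPanelSettings:
    """Build settings from the dev.eval section of a parsed config.yaml."""
    raw = (config.get("dev") or {}).get("eval") or {}
    settings = EvalPanelSettings(**raw)
    # Assumption: values outside 1..4 are clamped instead of raising.
    capped = max(1, min(settings.max_concurrency, MAX_CONCURRENCY_HARD_CAP))
    return replace(settings, max_concurrency=capped)
```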
Difference from mushroom-agent eval-live
Both entry points share the exact same execution path (mushroom_evals.runners.agent_runner.run_case_live). The differences are operational:
| | mushroom-agent eval-live (CLI) | Eval Panel (/i/dev/eval) |
|---|---|---|
| LLM cost | real | real |
| Where it executes | CI / scripts (no UI) | inside the dev process |
| Trigger | shell command | manual click |
| Memory | real ctx.memory; runner sets turn_ctx.extras['is_eval']=True and the memory implementation decides whether to short-circuit | same |
| Trace IDs | eval-{run_id}-{case.id}-s{sample} | same |
| Use case | scheduled / on-demand live regression | exploratory verification |
For the CI/scripts entry point, see mushroom-agent eval-live below and the
.github/workflows/eval-live.yml workflow.
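The Memory row says the runner only sets turn_ctx.extras['is_eval']=True and leaves the decision to the memory implementation. A memory backend that honors the flag could look like the sketch below. `TurnContext` and `EvalAwareMemory` are hypothetical stand-ins, not the agent's real classes, and the `remember` method is an invented interface for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TurnContext:
    """Hypothetical stand-in for the agent's turn context."""
    extras: dict[str, Any] = field(default_factory=dict)


class EvalAwareMemory:
    """Hypothetical memory backend that skips writes during eval runs."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def remember(self, turn_ctx: TurnContext, text: str) -> bool:
        # The runner sets extras['is_eval']; honoring it is the backend's choice.
        if turn_ctx.extras.get("is_eval"):
            return False  # short-circuit: nothing is stored
        self.entries.append(text)
        return True
```

A backend that ignores the flag would store eval turns like any other turn, which is why the table describes the behavior as implementation-defined.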
mushroom-agent eval-live
Same live execution path as the dev panel, but as a non-interactive CLI for CI / scripts.
# actually runs live LLM (requires a usable ~/.mushroom_agent/config.yaml + llm.api_key)
mushroom-agent eval-live --suite all --samples 2
# dry-run: discover cases, build agent, but do NOT call LLM
mushroom-agent eval-live --dry-run
Key flags:
| Flag | Default | Notes |
|---|---|---|
| --suite | all | smoke or all |
| --samples | 1 | independent runs per case |
| --capability / --case-id | unset | filters |
| --enable-judge / --no-enable-judge | auto | needs EVAL_JUDGE_MODEL + EVAL_JUDGE_API_KEY |
| --judge-metrics | judge.instruction_following | comma-separated |
| --case-timeout | 120 | seconds; per-case asyncio.wait_for |
| --write-baseline NAME | unset | writes candidate baseline under mushroom-evals/mushroom_evals/baselines/ |
| --dry-run | off | exits 0 after schema/agent validation, no LLM |
| --silence | off | suppress per-case progress lines |
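The --case-timeout note refers to a per-case asyncio.wait_for. The general pattern is sketched below under stated assumptions: `run_case_with_timeout` and the "ok" / "timeout" status strings are hypothetical, and the real runner may report timeouts differently.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_case_with_timeout(
    run_case: Callable[[], Awaitable[str]],
    timeout_s: float = 120.0,  # mirrors the documented --case-timeout default
) -> dict:
    """Await one case, cancelling it if it exceeds timeout_s seconds."""
    try:
        output = await asyncio.wait_for(run_case(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return {"status": "timeout", "output": None}
    return {"status": "ok", "output": output}
```

Because the timeout wraps each case separately, one slow case does not consume the time budget of the cases that follow it.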
Reports:
- mushroom-evals/reports/live-{suite}-{ts}-detail.jsonl: one row per case × sample with all metrics
- mushroom-evals/reports/live-{suite}-{ts}-summary.jsonl: aggregated per case (pass@1, avg_score)
- exit code: 0 all green / 1 regression / 2 setup error (e.g. missing llm.api_key)
The GitHub Action .github/workflows/eval-live.yml is the canonical scheduled / on-demand
runner; it only triggers via workflow_dispatch.
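A wrapper script that calls the CLI can translate the exit codes listed under Reports into readable messages. The helper below is a hypothetical sketch; only the three code meanings come from this page.

```python
EXIT_CODE_MEANINGS = {
    0: "all green",
    1: "regression",
    2: "setup error",  # e.g. missing llm.api_key
}


def describe_exit_code(code: int) -> str:
    """Map an eval-live exit code to the meaning documented above."""
    return EXIT_CODE_MEANINGS.get(code, f"unexpected exit code {code}")
```

In a script, the code would typically come from subprocess.run(["mushroom-agent", "eval-live", "--suite", "smoke"]).returncode, so that a regression (1) can be handled differently from a setup error (2).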
See also
- Implementation walk-through: docs/design/agent-eval-framework.md §9.3
- Schema: mushroom-evals/mushroom_evals/schema.py
- Dev console (chat): /i/chat