
Eval Panel (/i/dev/eval)

The Eval Panel is a built-in devtools page that lets you run live evaluation cases — small fixed prompts with deterministic expectations — directly against your locally running agent. It exercises the real LLM, real tools, and real prompt assembly without polluting your conversation history.

It is available when the devtools UI is enabled with mushroom-agent start or mushroom-agent serve --ui. It is not mounted in plain production serve mode.

When to use

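| Scenario | Use this | Use mushroom-agent eval-live |
| --- | --- | --- |
| Changed the system prompt or tools; interactive local verification | ✅ Eval Panel | ⚠️ (works, but no UI) |
| Running live regression in CI / scripts | | eval-live + eval-live.yml |
| Checking whether the current model still follows a given instruction | | |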

Live runs share your agent's ctx.think and ctx.tools — they will hit your configured LLM provider and consume real tokens. Plan accordingly.

Open the page

  1. mushroom-agent start
  2. Open http://127.0.0.1:7860/i/dev/eval (or click "Eval →" from the chat top-bar)

Run a suite

  1. Pick one or more cases from the left list (use All / None for quick toggling). All bundled cases run live.
  2. Set Samples (1–5). Each sample is an independent LLM call.
  3. (Optional) Tick Enable LLM-as-judge to score outputs with judge.instruction_following (and optionally judge.task_completion from the Advanced section). The checkbox is disabled if EVAL_JUDGE_MODEL is not set.
  4. Click Run — a confirmation modal shows the estimated number of LLM calls (cases × samples × (1 + judge_metrics)).
  5. Watch the live progress bar / per-sample table. Use Cancel to abort an in-flight run.
  6. The summary line at the bottom shows pass/fail counts and any host-context drift warnings.

Each run is persisted to <workspace>/eval_runs/run_<hex>.json (or ~/.mushroom_agent/eval_runs/ if no workspace is configured). The newest dev.eval.keep_runs files are kept; older runs are pruned.
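
The retention rule can be pictured with a minimal sketch. It assumes "newest" means most recent file modification time and that run files match run_*.json; the function name prune_runs is hypothetical and not taken from the mushroom-agent source.

```python
from pathlib import Path


def prune_runs(run_dir: Path, keep_runs: int) -> list[Path]:
    """Delete all but the newest `keep_runs` run files; return the deleted paths.

    Assumption: recency is judged by modification time, not by file name.
    """
    runs = sorted(
        run_dir.glob("run_*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = runs[max(keep_runs, 0):]  # everything past the retention window
    for path in stale:
        path.unlink()
    return stale
```

Calling prune_runs(Path("eval_runs"), 20) would mirror the keep_runs: 20 setting shown in the configuration section.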

Token budget

A typical run spends:

calls = N_cases × N_samples × (1 + N_judge_metrics)

With the 10 bundled live cases, 1 sample, and no judges, a run makes 10 × 1 × (1 + 0) = 10 LLM calls. With both judges enabled and 3 samples, it makes 10 × 3 × (1 + 2) = 90 calls. Confirm the estimate in the modal before clicking through.
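
To sanity-check a planned run before opening the modal, the formula above can be written as a small helper. This is only an illustration of the arithmetic; estimate_llm_calls is a hypothetical name, not part of mushroom-agent.

```python
def estimate_llm_calls(n_cases: int, n_samples: int, n_judge_metrics: int = 0) -> int:
    """Return N_cases × N_samples × (1 + N_judge_metrics)."""
    if min(n_cases, n_samples, n_judge_metrics) < 0:
        raise ValueError("counts must be non-negative")
    # Each sample costs one agent call plus one call per judge metric.
    return n_cases * n_samples * (1 + n_judge_metrics)


print(estimate_llm_calls(10, 1, 0))  # 10
print(estimate_llm_calls(10, 3, 2))  # 90
```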

LLM-as-judge configuration

Judging requires its own LLM credentials, not the production OPENAI_API_KEY (this is enforced — see mushroom-evals/mushroom_evals/judge.py for the rationale):

export EVAL_JUDGE_MODEL=gpt-4o-mini
export EVAL_JUDGE_API_KEY=sk-eval-...
# optional
export EVAL_JUDGE_BASE_URL=https://your-proxy/v1

Without these, the judge metric rows show status="skipped" and do not count as failures. With them set incorrectly (e.g. wrong key), the row shows status="error" and the run continues.
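
A quick pre-flight check can tell you whether judge rows are likely to be skipped. The sketch below only inspects the environment variables named above; missing_judge_vars is a hypothetical helper, and it cannot detect a wrong key, which only surfaces as status="error" at call time.

```python
import os
from typing import Mapping, Optional

# Variables the judge needs; EVAL_JUDGE_BASE_URL is optional and not checked.
REQUIRED_JUDGE_VARS = ("EVAL_JUDGE_MODEL", "EVAL_JUDGE_API_KEY")


def missing_judge_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required judge variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_JUDGE_VARS if not env.get(name)]
```

An empty result means both variables are present; any other result lists what still needs to be exported.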

Configuration (config.yaml)

dev:
  eval:
    enabled: true                # set false to hide /i/dev/eval entirely
    max_concurrency: 1           # how many runs may execute at the same time (hard cap 4)
    keep_runs: 20                # disk retention
    default_enable_judge: false  # whether the UI judge checkbox starts checked

If dev.eval.enabled is false, register_eval_routes returns None, the route is never mounted, and /i/dev/eval responds with a 404.
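
As a rough picture of how these keys might be read, the sketch below parses an already-loaded config mapping into a typed object. The class name EvalPanelConfig, the use of the values shown above as defaults, and the choice to clamp max_concurrency into the range 1 to 4 (rather than reject out-of-range values) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping

MAX_CONCURRENCY_HARD_CAP = 4  # "hard cap 4" from the config comment


@dataclass(frozen=True)
class EvalPanelConfig:
    enabled: bool = True
    max_concurrency: int = 1
    keep_runs: int = 20
    default_enable_judge: bool = False

    @classmethod
    def from_mapping(cls, config: Mapping[str, Any]) -> "EvalPanelConfig":
        """Build the config from a parsed config.yaml mapping."""
        raw = (config.get("dev") or {}).get("eval") or {}
        defaults = cls()
        requested = int(raw.get("max_concurrency", defaults.max_concurrency))
        return cls(
            enabled=bool(raw.get("enabled", defaults.enabled)),
            # Assumption: out-of-range values are clamped, not rejected.
            max_concurrency=min(max(requested, 1), MAX_CONCURRENCY_HARD_CAP),
            keep_runs=int(raw.get("keep_runs", defaults.keep_runs)),
            default_enable_judge=bool(
                raw.get("default_enable_judge", defaults.default_enable_judge)
            ),
        )
```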

Difference from mushroom-agent eval-live

Both entry points share the exact same execution path (mushroom_evals.runners.agent_runner.run_case_live). The differences are operational:

| | mushroom-agent eval-live (CLI) | Eval Panel (/i/dev/eval) |
| --- | --- | --- |
| LLM cost | real | real |
| Where it executes | CI / scripts (no UI) | inside the dev process |
| Trigger | shell command | manual click |
| Memory | real ctx.memory; runner sets turn_ctx.extras['is_eval']=True and the memory implementation decides whether to short-circuit | same |
| Trace IDs | eval-{run_id}-{case.id}-s{sample} | same |
| Use case | scheduled / on-demand live regression | exploratory verification |
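
The trace ID pattern in the table makes eval traffic easy to separate from ordinary conversations in logs. A minimal sketch follows; both function names are hypothetical, and whether the sample index starts at 0 or 1 is not stated here, so the example value is arbitrary.

```python
def eval_trace_id(run_id: str, case_id: str, sample: int) -> str:
    """Format a trace ID as eval-{run_id}-{case.id}-s{sample}."""
    return f"eval-{run_id}-{case_id}-s{sample}"


def is_eval_trace(trace_id: str) -> bool:
    """Heuristic filter: eval traces carry the 'eval-' prefix."""
    return trace_id.startswith("eval-")
```

Because case IDs may themselves contain hyphens, splitting a trace ID back into its parts is ambiguous; matching on the prefix is the safer filter.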

For the CI/scripts entry point, see mushroom-agent eval-live below and the .github/workflows/eval-live.yml workflow.

mushroom-agent eval-live

Same live execution path as the dev panel, but as a non-interactive CLI for CI / scripts.

# actually runs live LLM (requires a usable ~/.mushroom_agent/config.yaml + llm.api_key)
mushroom-agent eval-live --suite all --samples 2

# dry-run: discover cases, build agent, but do NOT call LLM
mushroom-agent eval-live --dry-run

Key flags:

| Flag | Default | Notes |
| --- | --- | --- |
| --suite | all | smoke or all |
| --samples | 1 | independent runs per case |
| --capability / --case-id | unset | filters |
| --enable-judge / --no-enable-judge | auto | needs EVAL_JUDGE_MODEL + EVAL_JUDGE_API_KEY |
| --judge-metrics | judge.instruction_following | comma-separated |
| --case-timeout | 120 | seconds; per-case asyncio.wait_for |
| --write-baseline NAME | unset | writes candidate baseline under mushroom-evals/mushroom_evals/baselines/ |
| --dry-run | off | exits 0 after schema/agent validation, no LLM |
| --silence | off | suppress per-case progress lines |
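
The --case-timeout note refers to asyncio.wait_for, which cancels the awaited task once the budget is exceeded. The sketch below shows that pattern in isolation; run_case_with_timeout, slow_case, and fast_case are stand-ins, and the "timeout" status string is an assumption about how a runner might report the outcome.

```python
import asyncio
from typing import Awaitable, Callable


async def run_case_with_timeout(
    run_case: Callable[[], Awaitable[str]],
    timeout_s: float = 120.0,  # mirrors the documented --case-timeout default
) -> str:
    """Await one case, returning "timeout" if it exceeds the budget."""
    try:
        return await asyncio.wait_for(run_case(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return "timeout"


async def slow_case() -> str:
    """Stand-in for a case whose LLM call hangs."""
    await asyncio.sleep(0.5)
    return "passed"


async def fast_case() -> str:
    """Stand-in for a case that finishes immediately."""
    return "passed"
```

Wrapping each case separately means one hung LLM call costs at most the timeout and does not stall the rest of the suite.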

Reports:

  • mushroom-evals/reports/live-{suite}-{ts}-detail.jsonl — one row per case×sample with all metrics
  • mushroom-evals/reports/live-{suite}-{ts}-summary.jsonl — aggregated per case (pass@1, avg_score)
  • exit code: 0 all green / 1 regression / 2 setup error (e.g. missing llm.api_key)
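
A wrapper script can branch on the documented exit codes. The sketch below assumes mushroom-agent is on PATH; run_eval_live and describe_exit_code are hypothetical helper names, and only the three exit codes listed above are interpreted.

```python
import subprocess

# Exit codes as documented: 0 all green / 1 regression / 2 setup error.
EXIT_MEANINGS = {0: "all green", 1: "regression", 2: "setup error"}


def describe_exit_code(code: int) -> str:
    """Map an eval-live exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_eval_live(args: list[str]) -> int:
    """Run `mushroom-agent eval-live` with extra args and return its exit code."""
    completed = subprocess.run(["mushroom-agent", "eval-live", *args], check=False)
    return completed.returncode
```

For example, describe_exit_code(run_eval_live(["--suite", "smoke", "--samples", "1"])) yields a short label suitable for a CI log line.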

The GitHub Action .github/workflows/eval-live.yml is the canonical scheduled / on-demand runner; it only triggers via workflow_dispatch.

See also