
Agent

This section focuses on how the agent works internally — the runtime, the cognitive pipeline, and how it extends into multi-node systems.

Core pipeline

channel -> communication -> sensor -> agent -> think -> skill

In this pipeline:

  • channel handles platform protocols and ingress (Feishu, Discord, WebSocket, HTTP, etc.)
  • communication handles transport and unified send/receive behavior
  • sensor turns external input into agent-ready perception — text, voice transcription, file content, events
  • agent owns the runtime orchestration: context assembly, conversation lifecycle, turn management
  • think makes decisions, reasons, and produces reply plans using the LLM
  • skill provides reusable procedures and operating guidance, loaded on demand based on the current context
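The ordering above can be sketched as a chain of stages that each enrich one shared message object. This is only an illustration: Envelope, the stage functions, and run_pipeline are hypothetical names chosen to mirror the layer names, not MushroomAgent's actual API, and each stage body is a trivial stand-in for the real work.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Envelope:
    """Hypothetical message object handed from one layer to the next."""
    platform: str                                      # filled in at ingress
    raw: str                                           # payload as received
    percept: str = ""                                  # agent-ready input (sensor)
    context: List[str] = field(default_factory=list)   # assembled by the agent
    plan: str = ""                                     # produced by think
    trace: List[str] = field(default_factory=list)     # stages visited, in order


Stage = Callable[[Envelope], Envelope]


def channel(env: Envelope) -> Envelope:
    # Platform protocol handling would happen here.
    env.trace.append("channel")
    return env


def communication(env: Envelope) -> Envelope:
    # Stand-in for transport normalization: trim surrounding whitespace.
    env.raw = env.raw.strip()
    env.trace.append("communication")
    return env


def sensor(env: Envelope) -> Envelope:
    # Stand-in for perception: text passes through unchanged.
    env.percept = env.raw
    env.trace.append("sensor")
    return env


def agent(env: Envelope) -> Envelope:
    # Stand-in for context assembly.
    env.context = [f"user: {env.percept}"]
    env.trace.append("agent")
    return env


def think(env: Envelope) -> Envelope:
    # Stand-in for the LLM call: reply whenever there is something to reply to.
    env.plan = "reply" if env.percept else "ignore"
    env.trace.append("think")
    return env


def skill(env: Envelope) -> Envelope:
    # Skill guidance would be loaded here when the plan calls for it.
    env.trace.append("skill")
    return env


PIPELINE: List[Stage] = [channel, communication, sensor, agent, think, skill]


def run_pipeline(env: Envelope) -> Envelope:
    """Pass the envelope through every stage in pipeline order."""
    for stage in PIPELINE:
        env = stage(env)
    return env
```

Calling run_pipeline(Envelope(platform="discord", raw="  hello  ")) visits the six stages in the documented order and leaves the trimmed text in percept.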

Three pillars

MushroomAgent's architecture rests on three design principles:

  • Unified subject: no matter how many nodes sit underneath, users interact with one agent identity. All sessions, memories, and tool results converge on a single cognitive core.
  • Unified world model: inputs from different modalities (text, voice, files, sensor data) and different locations (server, edge device, phone) merge into one shared context. The agent sees the full picture.
  • Unified action orchestration: capabilities can be distributed across devices, while planning and decision-making stay together in the think engine.
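One way to picture the unified world model is a single store, owned by one agent identity, that merges observations from any node and any modality into one time-ordered context. The sketch below assumes that reading; Observation, WorldModel, and their fields are hypothetical and say nothing about how MushroomAgent actually stores its context.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Observation:
    """Hypothetical record of one input, tagged with where it came from."""
    timestamp: float
    node: str        # e.g. "server", "edge", "phone"
    modality: str    # e.g. "text", "voice", "file"
    content: str


@dataclass
class WorldModel:
    """Hypothetical shared context for a single agent identity."""
    agent_id: str
    observations: List[Observation] = field(default_factory=list)

    def merge(self, obs: Observation) -> None:
        # Observations may arrive out of order from different nodes,
        # so keep the store sorted by timestamp after every merge.
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.timestamp)

    def context(self) -> List[str]:
        # Render one combined view regardless of source node or modality.
        return [f"[{o.node}/{o.modality}] {o.content}" for o in self.observations]
```

For example, merging a phone voice transcript with timestamp 2.0 and then a server text message with timestamp 1.0 yields a context that lists the server message first.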

Two modes: Local and Remote

MushroomAgent supports two operating modes:

Local mode (start)

Thinking and execution run on the same device. The mushroom-agent start command launches the agent server and attaches the configured local device runtime in the same process, so all inputs, reasoning, and actions happen in one place.

Remote mode

A remote agent service handles the thinking, while one or more devices connect to it and execute actions locally.

Workflow:

  1. Start mushroom-agent serve on the remote host — this starts the think engine
  2. Configure each device's remote.yaml, then run mushroom-agent node attach — the device connects to the remote service

The remote serve instance supports multiple devices. Each device collects its own inputs (audio, video, text) and sends them to the remote think engine. The think engine processes the input, makes decisions, and dispatches actions back to the appropriate devices for execution.
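The dispatch step described above can be sketched as a think engine that keeps a registry of attached devices and routes each decided action to the devices able to execute it. Everything here is an assumption made for illustration: the ThinkEngine class, the capability names, and the keyword-based decide method (a stand-in for the LLM) are hypothetical, and the real attach protocol and remote.yaml contents are not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ThinkEngine:
    """Hypothetical remote think engine with a device registry."""
    devices: Dict[str, Set[str]] = field(default_factory=dict)

    def attach(self, device_id: str, capabilities: Set[str]) -> None:
        # Record which actions this device can execute locally.
        self.devices[device_id] = set(capabilities)

    def decide(self, text: str) -> str:
        # Stand-in for LLM reasoning: a keyword check picks the action.
        return "light.on" if "light" in text.lower() else "speak"

    def handle_input(self, source_device: str, text: str) -> List[Tuple[str, str]]:
        """Return (device_id, action) pairs to dispatch for one input.

        In this sketch routing depends only on capabilities, so the action
        may be sent to a device other than the one that supplied the input.
        """
        action = self.decide(text)
        targets = sorted(d for d, caps in self.devices.items() if action in caps)
        return [(d, action) for d in targets]
```

With a phone attached as {"speak"} and a lamp hub attached as {"light.on"}, a "Turn on the light" request from the phone is dispatched to the lamp hub, while a greeting is routed back to the phone.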

Agent lifecycle

A typical turn goes through these stages:

  1. Receive — input arrives via a channel (chat message, voice, HTTP request)
  2. Perceive — the sensor layer processes the input (transcribe voice, parse files, extract intent)
  3. Contextualize — the agent assembles conversation history, loaded Skills, tool results, and world model into a prompt
  4. Reason — the think layer calls the LLM to decide what to do
  5. Act — execute tools, invoke Skills, dispatch node actions, or produce a reply
  6. Respond — send the output back through the channel
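The six stages above can be sketched as one function that runs a single turn and carries conversation history into the next one. This is a minimal sketch under stated assumptions: Conversation, run_turn, and the llm and send callables are hypothetical names, the LLM is injected as a plain function, and the Act stage is reduced to producing a reply with no tool or node calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Conversation:
    """Hypothetical per-conversation state kept across turns."""
    history: List[str] = field(default_factory=list)


def run_turn(
    conv: Conversation,
    raw_input: str,
    llm: Callable[[str], str],
    send: Callable[[str], None],
) -> str:
    # 1. Receive: the channel hands over the raw input.
    incoming = raw_input
    # 2. Perceive: stand-in for the sensor layer (just trims whitespace).
    percept = incoming.strip()
    # 3. Contextualize: prior history plus the new input become the prompt.
    prompt = "\n".join(conv.history + [f"user: {percept}"])
    # 4. Reason: the think layer asks the LLM what to do.
    decision = llm(prompt)
    # 5. Act: in this sketch the only action is producing a reply.
    reply = decision
    conv.history.extend([f"user: {percept}", f"agent: {reply}"])
    # 6. Respond: send the output back through the channel.
    send(reply)
    return reply
```

Passing a fake llm such as lambda prompt: f"seen {len(prompt.splitlines())} line(s)" makes the effect of the Contextualize stage visible: the first turn reports one line and the second reports three, because the earlier exchange is now part of the prompt.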

Start here