Agent

This section covers how the agent works internally: the runtime, the cognitive pipeline, and how it extends into multi-node systems.

Core pipeline

channel -> communication -> sensor -> agent -> think -> skill

In this pipeline:

channel handles platform protocols and ingress (Feishu, Discord, WebSocket, HTTP, etc.)
communication handles transport and unified send/receive behavior
sensor turns external input into agent-ready perception — text, voice transcription, file content, events
agent owns the runtime orchestration: context assembly, conversation lifecycle, turn management
think makes decisions, reasons, and produces reply plans using the LLM
skill provides reusable procedures and operating guidance loaded on demand by context
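
The stage order above can be sketched as a chain of functions that each receive and return the same envelope. This is a minimal illustration only: `Envelope`, `make_stage`, and `run_pipeline` are hypothetical names invented for this sketch, not MushroomAgent APIs, and each stage just records that it ran instead of doing real work.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Envelope:
    """Hypothetical container handed from stage to stage (not a MushroomAgent type)."""
    payload: str
    trace: list[str] = field(default_factory=list)


Stage = Callable[[Envelope], Envelope]

# Stage order taken from the pipeline line in this section.
PIPELINE_ORDER = ["channel", "communication", "sensor", "agent", "think", "skill"]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(env: Envelope) -> Envelope:
        env.trace.append(name)
        return env
    return stage


def run_pipeline(env: Envelope, stages: list[Stage]) -> Envelope:
    """Pass the envelope through every stage, in order."""
    for stage in stages:
        env = stage(env)
    return env


stages = [make_stage(name) for name in PIPELINE_ORDER]
result = run_pipeline(Envelope(payload="hello"), stages)
print(" -> ".join(result.trace))
```

Keeping every stage behind the same call signature is one way to let a layer be swapped (for example, a different channel) without touching the layers after it.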

Three pillars

MushroomAgent's architecture rests on three design principles:

Unified subject: no matter how many nodes sit underneath, users interact with one agent identity. All sessions, memories, and tool results converge to a single cognitive core.
Unified world model: inputs from different modalities (text, voice, files, sensor data) and different locations (server, edge device, phone) merge into one shared context. The agent sees the full picture.
Unified action orchestration: capabilities can be distributed across devices, while planning and decision-making stay together in the think engine.
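
As a rough illustration of the unified world model, the sketch below merges observations from several modalities and locations into one time-ordered context owned by a single agent identity. `Observation`, `WorldModel`, and their fields are assumptions made for this example; the real data model may look quite different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """Hypothetical record of one input; field names are illustrative."""
    source: str      # where it came from, e.g. "server", "edge-device", "phone"
    modality: str    # e.g. "text", "voice", "file", "sensor"
    content: str
    timestamp: float


@dataclass
class WorldModel:
    """One shared context for one agent identity, however many nodes feed it."""
    agent_id: str
    observations: list[Observation] = field(default_factory=list)

    def merge(self, obs: Observation) -> None:
        # Keep the list ordered by time so late-arriving inputs land in place.
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.timestamp)

    def as_context(self) -> str:
        # Render every observation into one text block for the think engine.
        return "\n".join(
            f"[{o.source}/{o.modality}] {o.content}" for o in self.observations
        )


world = WorldModel(agent_id="agent-1")
world.merge(Observation("phone", "voice", "turn on the lamp", timestamp=2.0))
world.merge(Observation("edge-device", "sensor", "room is dark", timestamp=1.0))
context = world.as_context()
print(context)
```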

Two modes: Local and Remote

MushroomAgent supports two operating modes:

Local node (`start`)

Thinking and execution run on the same device. `mushroom-agent start` starts the agent server and attaches the configured local device runtime in the current process. All inputs, reasoning, and actions happen in one place.

Remote mode

A remote agent service handles the thinking, while one or more devices connect to it and execute actions locally.

Workflow:

1. Start `mushroom-agent serve` on the remote host. This starts the think engine.
2. Configure each device's `remote.yaml`, then run `mushroom-agent node attach`. The device then connects to the remote service.

The remote serve instance supports multiple devices. Each device collects its own inputs (audio, video, text) and sends them to the remote think engine. The think engine processes the input, makes decisions, and dispatches actions back to the appropriate devices for execution.
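
The routing described above might look roughly like the following sketch: devices attach to one think engine, any of them can send input, and the resulting action is queued for whichever device can carry it out. All names here (`ThinkEngine`, `Action`, capability labels such as `"speak"`) are hypothetical, and the LLM decision is replaced by a fixed placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    capability: str  # what must be done, e.g. "speak" (illustrative label)
    argument: str


@dataclass
class ThinkEngine:
    # device_id -> capabilities the device advertised when it attached
    devices: dict[str, set[str]] = field(default_factory=dict)
    # device_id -> actions queued for that device to execute locally
    outbox: dict[str, list[Action]] = field(default_factory=dict)

    def attach(self, device_id: str, capabilities: set[str]) -> None:
        self.devices[device_id] = capabilities
        self.outbox[device_id] = []

    def handle_input(self, device_id: str, text: str) -> str:
        if device_id not in self.devices:
            raise KeyError(f"device {device_id!r} is not attached")
        # Placeholder for the real decision step: always answer by speaking.
        action = Action(capability="speak", argument=f"heard: {text}")
        return self.dispatch(action)

    def dispatch(self, action: Action) -> str:
        # Route to the first attached device that advertises the capability.
        for device_id, capabilities in self.devices.items():
            if action.capability in capabilities:
                self.outbox[device_id].append(action)
                return device_id
        raise LookupError(f"no attached device can {action.capability!r}")


engine = ThinkEngine()
engine.attach("mic-node", {"listen"})
engine.attach("speaker-node", {"speak"})
target = engine.handle_input("mic-node", "hello")
print(target, engine.outbox[target])
```

Note that the input arrives from one device while the action is queued for another, which is the point of keeping the decision in one place.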

Agent lifecycle

A typical turn goes through these stages:

1. Receive: input arrives via a channel (chat message, voice, HTTP request)
2. Perceive: the sensor layer processes the input (transcribes voice, parses files, extracts intent)
3. Contextualize: the agent assembles conversation history, loaded Skills, tool results, and the world model into a prompt
4. Reason: the think layer calls the LLM to decide what to do
5. Act: the agent executes tools, invokes Skills, dispatches node actions, or produces a reply
6. Respond: the output is sent back through the channel
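
A single turn through these stages can be sketched as one function. This is a simplified illustration, not the actual runtime: `run_turn` and `fake_llm` are hypothetical, the LLM is passed in as a plain callable, and the Act stage is reduced to producing a reply, with no tools or node actions.

```python
from typing import Callable


def run_turn(raw_input: str, history: list[str], llm: Callable[[str], str]) -> str:
    """Run one simplified turn and return the reply text."""
    # 1. Receive: the channel hands over the raw input.
    received = raw_input

    # 2. Perceive: normalize the input (stand-in for transcription or parsing).
    perceived = received.strip()

    # 3. Contextualize: build the prompt from prior history plus the new input.
    prompt = "\n".join(history + [f"user: {perceived}"])

    # 4. Reason: ask the LLM what to do.
    decision = llm(prompt)

    # 5. Act: here the only possible action is to reply with the decision.
    reply = decision
    history.extend([f"user: {perceived}", f"agent: {reply}"])

    # 6. Respond: return the reply so the channel can send it back.
    return reply


def fake_llm(prompt: str) -> str:
    """Stand-in model that reports how many prompt lines it was given."""
    return f"seen {len(prompt.splitlines())} lines"


history: list[str] = []
first = run_turn("  hi  ", history, fake_llm)
second = run_turn("again", history, fake_llm)
print(first, "|", second)
```

The second turn sees a longer prompt because the first exchange was appended to the history, which is the effect the Contextualize stage is responsible for.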

Start here

Agent Loop

How the think loop works — input, context assembly, LLM call, tool execution.

System Prompt

The 5-layer prompt the agent sees before each LLM call.

Agent Workspace

Directory structure, context files, and how they're loaded.

Mushroom Architecture

How the pipeline extends into embodied multi-node systems.
