Agent
This section focuses on how the agent works internally — the runtime, the cognitive pipeline, and how it extends into multi-node systems.
Core pipeline
channel -> communication -> sensor -> agent -> think -> skill
In this pipeline:
- `channel` handles platform protocols and ingress (Feishu, Discord, WebSocket, HTTP, etc.)
- `communication` handles transport and unified send/receive behavior
- `sensor` turns external input into agent-ready perception: text, voice transcription, file content, events
- `agent` owns the runtime orchestration: context assembly, conversation lifecycle, turn management
- `think` makes decisions, reasons, and produces reply plans using the LLM
- `skill` provides reusable procedures and operating guidance loaded on demand by context
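The ingress half of the pipeline (channel, communication, sensor) can be pictured with a minimal Python sketch. Every name here (`Message`, `Perception`, `from_discord`, `from_http`, `sense`, `fake_transcribe`) and both payload shapes are assumptions made for illustration; none of them come from MushroomAgent's actual code.

```python
from dataclasses import dataclass
from typing import Callable

# All names and payload shapes below are hypothetical, for illustration only.

@dataclass
class Message:
    """Unified envelope, as the communication layer might produce it."""
    channel: str
    sender: str
    kind: str      # "text", "voice", "file", or "event"
    payload: str

@dataclass
class Perception:
    """Agent-ready input, as the sensor layer might produce it."""
    source: str
    text: str

def from_discord(raw: dict) -> Message:
    # channel: adapt a Discord-shaped payload to the unified envelope
    return Message("discord", raw["author"], "text", raw["content"])

def from_http(raw: dict) -> Message:
    # channel: adapt an HTTP-shaped payload to the unified envelope
    return Message("http", raw["user"], raw.get("kind", "text"), raw["body"])

def sense(msg: Message, transcribe: Callable[[str], str]) -> Perception:
    # sensor: voice payloads are transcribed, everything else passes through
    text = transcribe(msg.payload) if msg.kind == "voice" else msg.payload
    return Perception(source=f"{msg.channel}:{msg.sender}", text=text)

def fake_transcribe(audio_ref: str) -> str:
    # Stand-in for a real speech-to-text call
    return f"[transcript of {audio_ref}]"

a = sense(from_discord({"author": "ana", "content": "hi"}), fake_transcribe)
b = sense(from_http({"user": "bo", "kind": "voice", "body": "clip.wav"}), fake_transcribe)
```

Whatever the platform, the agent stage downstream only ever sees a `Perception`, which is the point of keeping protocol handling in the channel stage.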
Three pillars
MushroomAgent's architecture rests on three design principles:
- Unified subject: no matter how many nodes sit underneath, users interact with one agent identity. All sessions, memories, and tool results converge to a single cognitive core.
- Unified world model: inputs from different modalities (text, voice, files, sensor data) and different locations (server, edge device, phone) merge into one shared context. The agent sees the full picture.
- Unified action orchestration: capabilities can be distributed across devices, while planning and decision-making stay together in the think engine.
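As a rough illustration of the unified world model, the sketch below merges observations from several nodes and modalities into one time-ordered context owned by a single agent identity. `Observation`, `WorldModel`, the node names, and the rendering format are all hypothetical; the real data model may look quite different.

```python
from dataclasses import dataclass, field

# Hypothetical types; MushroomAgent's actual world model is not shown in this section.

@dataclass(frozen=True)
class Observation:
    node: str         # where it came from, e.g. "server", "edge", "phone"
    modality: str     # "text", "voice", "file", or "sensor"
    content: str
    timestamp: float

@dataclass
class WorldModel:
    agent_id: str     # one identity, regardless of how many nodes report in
    observations: list = field(default_factory=list)

    def merge(self, obs: Observation) -> None:
        # Keep a single timeline no matter which node reported first
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.timestamp)

    def render(self) -> str:
        # Flatten the shared context into text a prompt could include
        return "\n".join(
            f"[{o.node}/{o.modality}] {o.content}" for o in self.observations
        )

world = WorldModel(agent_id="agent-1")
world.merge(Observation("phone", "voice", "turn on the lamp", 2.0))
world.merge(Observation("edge", "sensor", "room is dark", 1.0))
world.merge(Observation("server", "text", "lamp is in the study", 3.0))
context = world.render()
```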
Two modes: Local and Remote
MushroomAgent supports two operating modes:
Local node (start)
Thinking and execution run on the same device. `mushroom-agent start` starts the agent server and attaches the configured local device runtime in the current process. All inputs, reasoning, and actions happen in one place.
Remote mode
A remote agent service handles the thinking, while one or more devices connect to it and execute actions locally.
Workflow:
- Start `mushroom-agent serve` on the remote host. This starts the think engine.
- Configure each device's `remote.yaml`, then run `mushroom-agent node attach`. The device connects to the remote service.
The remote serve instance supports multiple devices. Each device collects its own inputs (audio, video, text) and sends them to the remote think engine. The think engine processes the input, makes decisions, and dispatches actions back to the appropriate devices for execution.
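The dispatch step described above might be sketched as follows. This is an in-process toy, not the real network protocol: `Device`, `ThinkEngine`, the capability names, and the routing rule (prefer the originating device, otherwise the first attached device with the capability) are assumptions, and `decide` is a stand-in for the LLM call.

```python
from dataclasses import dataclass, field

# Hypothetical sketch; the actual serve/attach protocol is not described here.

@dataclass
class Device:
    name: str
    capabilities: set
    executed: list = field(default_factory=list)

    def execute(self, action: str, arg: str) -> str:
        # Actions run locally on the device that receives them
        self.executed.append((action, arg))
        return f"{self.name} did {action}({arg})"

class ThinkEngine:
    def __init__(self) -> None:
        self.devices = {}

    def attach(self, device: Device) -> None:
        self.devices[device.name] = device

    def decide(self, text: str) -> tuple:
        # Stand-in for the LLM decision
        if "photo" in text:
            return ("take_photo", "front")
        return ("speak", text)

    def handle(self, source: str, text: str) -> str:
        action, arg = self.decide(text)
        origin = self.devices[source]
        # Assumed routing rule: originating device first, then any capable device
        if action in origin.capabilities:
            target = origin
        else:
            target = next(
                (d for d in self.devices.values() if action in d.capabilities),
                None,
            )
        if target is None:
            return f"no attached device can {action}"
        return target.execute(action, arg)

engine = ThinkEngine()
speaker = Device("speaker", {"speak"})
camera = Device("camera", {"take_photo"})
engine.attach(speaker)
engine.attach(camera)

r1 = engine.handle("speaker", "take a photo")   # input from speaker, action on camera
r2 = engine.handle("camera", "hello")           # input from camera, action on speaker
```

Input arriving on one device can therefore result in an action on another, while the decision itself stays in one place.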
Agent lifecycle
A typical turn goes through these stages:
- Receive — input arrives via a channel (chat message, voice, HTTP request)
- Perceive — the sensor layer processes the input (transcribe voice, parse files, extract intent)
- Contextualize — the agent assembles conversation history, loaded Skills, tool results, and world model into a prompt
- Reason — the think layer calls the LLM to decide what to do
- Act — execute tools, invoke Skills, dispatch node actions, or produce a reply
- Respond — send the output back through the channel
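The six stages above can be traced in a small Python sketch. It is a simplification under stated assumptions: `Turn`, `run_turn`, `fake_llm`, and the `TOOLS` table are hypothetical names, the LLM is replaced by a stub, and the prompt here is just the joined history rather than the real layered prompt.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; the stub below stands in for a real model call.

@dataclass
class Turn:
    user_text: str
    history: list = field(default_factory=list)
    trace: list = field(default_factory=list)

def fake_llm(prompt: str) -> dict:
    # Requests a tool when the latest user line mentions "time"
    if "time" in prompt.splitlines()[-1]:
        return {"tool": "clock", "args": {}}
    return {"reply": "Hello!"}

TOOLS = {"clock": lambda: "12:00"}

def run_turn(turn: Turn, llm=fake_llm) -> str:
    turn.trace.append("receive")          # input has arrived via a channel
    perceived = turn.user_text.strip()    # sensor layer normalizes the input
    turn.trace.append("perceive")
    prompt = "\n".join(turn.history + [f"user: {perceived}"])
    turn.trace.append("contextualize")
    decision = llm(prompt)                # think layer decides what to do
    turn.trace.append("reason")
    if "tool" in decision:
        result = TOOLS[decision["tool"]](**decision["args"])
        reply = f"Tool {decision['tool']} returned {result}"
    else:
        reply = decision["reply"]
    turn.trace.append("act")
    turn.history += [f"user: {perceived}", f"agent: {reply}"]
    turn.trace.append("respond")          # reply goes back through the channel
    return reply

t1 = Turn("  what time is it? ")
reply1 = run_turn(t1)
t2 = Turn("hi")
reply2 = run_turn(t2)
```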
Start here
How the think loop works — input, context assembly, LLM call, tool execution.
The 5-layer prompt the agent sees before each LLM call.
Directory structure, context files, and how they're loaded.
How the pipeline extends into embodied multi-node systems.