Technology · Architecture Reference
The Mind Layer for consumer AI agents — how it actually works.
Sonzai is the persistent cognitive substrate that sits between any LLM and any application. Eight composable modules — memory, personality, mood, relationships, knowledge, learning, media, and an agent runtime — on managed infrastructure most teams would otherwise spend a year building.
A composable substrate between any LLM and any application.
Sonzai is a multi-tenant AI Mind Layer: stateless at the edge, stateful at the core, and provider-agnostic. Drop it in front of Gemini, GPT, Grok, Claude, or any combination, and the agent retains memory, personality, mood, relationships, and learning across every session.
Core principle
Memory is the floor of an agent's mind, not the whole of it. Sonzai treats memory, personality, mood, relationships, knowledge, learning, media, and orchestration as one composable substrate — because in practice they're entangled, and pretending otherwise leaks complexity into every application that tries to do it itself.
Eight modules. Pick one, pick all.
Each layer is independently consumable through the same SDK. The Memory Layer can stand alone. Personality, Mood, and Relationships compose on top. Learning Systems rewrite every layer below them. Agent Runtime caps the stack with provider-agnostic orchestration.
Module reference
| Module | What it does | What it enables |
|---|---|---|
| Memory Layer | Cite-and-verify extraction of atomic facts, ranked by confidence, decayed on the Ebbinghaus curve, deduped by embedding similarity, consolidated nightly. | Recall that ages gracefully. |
| Personality | Big-5 trait tracking with evolution deltas, full history, and per-scenario overlays. | A self that drifts with experience. |
| Mood & Emotion | Live 4-D affective vector with theme detection per turn and contagion across multi-agent scenes. | Affective state, not robotic transcript. |
| Relationship | Directional love/trust scores per pair · shared-memory channels · 3 privacy tiers. | Multi-agent dynamics, not just calls. |
| Knowledge Base | Tenant- and project-scoped knowledge store with retrieval + write tools, gated by access control. | Grounded, ACL-aware retrieval. |
| Learning Systems | RL with shadow-model promotion · nightly self-learning · federated cross-agent concept catalog. | Agents that improve in production. |
| Agent Runtime | Provider-agnostic LLM orchestration with tool calling, SSE streaming, and priority failover. | One integration, zero SPOF. |
| Media Generation | Image, video, TTS, music, and SFX through best-of-breed providers. | Multi-modal output, one client. |
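The Memory Layer row mentions confidence that decays on the Ebbinghaus curve and is reinforced by use. A minimal sketch of that idea, assuming the common exponential form of the forgetting curve; the `Fact` class, the `stability_days` field, and the boost constants are illustrative assumptions, not Sonzai's schema or tuning.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    confidence: float        # confidence at the time of the last reinforcement
    stability_days: float    # hypothetical: larger means slower forgetting


def decayed_confidence(fact: Fact, days_since_reinforced: float) -> float:
    """Exponential forgetting curve: confidence * exp(-t / stability)."""
    return fact.confidence * math.exp(-days_since_reinforced / fact.stability_days)


def reinforce(fact: Fact, boost: float = 0.1, stability_gain: float = 1.5) -> Fact:
    """A retrieval hit raises confidence (capped at 1.0) and slows future decay."""
    return Fact(
        text=fact.text,
        confidence=min(1.0, fact.confidence + boost),
        stability_days=fact.stability_days * stability_gain,
    )


fact = Fact("prefers oat milk", confidence=0.9, stability_days=30.0)
after_month = decayed_confidence(fact, 30.0)            # 0.9 * e^-1
after_hit = decayed_confidence(reinforce(fact), 30.0)   # reinforced fact decays more slowly
```

A fact that keeps getting retrieved stays near the top of the ranking; one that is never touched fades without anyone deleting it.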
A production agent stack is a dozen distributed systems wired together. Sonzai operates all of them.
The Mind Layer is the part visible through the API. Underneath sits the infrastructure that makes a multi-tenant, evolving, learning agent platform actually work in production — transactional and columnar stores, a hybrid index, a hot cache, a queue with DLQ and retries, a distributed worker pool, a versioned per-user weight store, a scheduler for cadence-driven jobs, a key vault, a cost ledger, observability, and an eval gate.
If you DIY vs. with Sonzai
| If you DIY | With Sonzai |
|---|---|
| Pick a queue, tune retries, run dead-letter handling. | Already wired and idempotent. |
| Shard a database per tenant, replicate cross-region. | Multi-tenant isolation out of the box. |
| Stand up a vector index and keep embeddings fresh. | Vector + entity + temporal indexes managed. |
| Cluster GPU/CPU compute for extraction and embedding. | Distributed compute pool, autoscaled. |
| Store and version per-user RL weights and overlays. | Per-user weight store, hot-swappable. |
| Schedule nightly consolidation, hourly decay, sweep jobs. | Background scheduler runs the full cadence. |
| Build a cost ledger before usage explodes. | Per-user / per-day / per-month caps included. |
| Wire eval gates so quality doesn't regress on model swaps. | SOTOPIA 6-dim gate runs on every release. |
Order of magnitude
The left column is the work a platform team typically takes 12+ months to stand up and another 12 to harden. Every row of that work is already running under the API.
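To make the first row of the comparison concrete, here is a toy sketch of what "retries, dead-letter handling, idempotent" means for a job consumer. Everything in it (`drain`, the in-process `deque`, the attempt limit) is an illustrative assumption; a production queue is a distributed service, not a loop in one process.

```python
from collections import deque


def drain(jobs, handler, max_attempts: int = 3):
    """Apply each job at most once; retry failures, then park them in a dead-letter list."""
    queue = deque((job_id, payload, 1) for job_id, payload in jobs)
    done: set[str] = set()          # idempotency: job IDs already applied
    dead_letter: list[str] = []
    while queue:
        job_id, payload, attempt = queue.popleft()
        if job_id in done:
            continue                 # duplicate delivery, skip it
        try:
            handler(payload)
            done.add(job_id)
        except Exception:
            if attempt < max_attempts:
                queue.append((job_id, payload, attempt + 1))   # retry later
            else:
                dead_letter.append(job_id)                      # give up, keep for inspection
    return done, dead_letter


calls = {"flaky": 0}


def handler(payload: str) -> None:
    if payload == "poison":
        raise ValueError("cannot parse")
    if payload == "flaky":
        calls["flaky"] += 1
        if calls["flaky"] < 2:
            raise TimeoutError("transient")


done, dead = drain([("a", "ok"), ("b", "flaky"), ("a", "ok"), ("c", "poison")], handler)
```

The duplicate delivery of job `a` is ignored, the flaky job succeeds on its second attempt, and the poison job lands in the dead-letter list after three tries.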
Plain RAG embeds, top-k's, and hopes. Sonzai's retrieval is agentic.
The model reasons about what it needs to know, chooses which memory tools to call, inspects the results, and iterates until it has enough context. The ReAct loop applied to memory — not just to web search.
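The loop described above can be sketched in a few lines. This is a toy under stated assumptions: `stub_policy` is a deterministic stand-in for the LLM's reasoning step, the two dictionaries stand in for real indexes, and only the tool names `recall` and `recall_shared` come from this page.

```python
from typing import Callable, Optional

# Toy in-memory stores standing in for the real indexes (illustrative data).
PRIVATE = {"coffee": "User drinks oat-milk lattes.", "sister": "User's sister is named Mei."}
SHARED = {"sister": "Mei is visiting in June."}

# Tool bodies are stand-ins: a keyword match instead of hybrid search.
TOOLS: dict[str, Callable[[str], list[str]]] = {
    "recall": lambda q: [v for k, v in PRIVATE.items() if k in q],
    "recall_shared": lambda q: [v for k, v in SHARED.items() if k in q],
}


def stub_policy(query: str, observations: list[str], used: list[str]) -> Optional[str]:
    """Stand-in for the LLM: pick the next tool, or None when context is sufficient."""
    if "recall" not in used:
        return "recall"
    # Refine: if private memory surfaced a person, also check shared memory.
    if "recall_shared" not in used and any("sister" in o for o in observations):
        return "recall_shared"
    return None


def agentic_retrieve(query: str, max_steps: int = 4) -> list[str]:
    observations: list[str] = []
    used: list[str] = []
    for _ in range(max_steps):
        tool = stub_policy(query, observations, used)   # reason and choose a tool
        if tool is None:
            break
        used.append(tool)
        observations.extend(TOOLS[tool](query))          # observe the result
    return observations


context = agentic_retrieve("what should I get my sister?")
```

The second tool call happens only because the first result mentioned a person. That conditional step is what separates the loop from a single top-k query.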
Vector RAG vs. agentic retrieval
| Plain Vector RAG | Sonzai agentic retrieval |
|---|---|
| Single embedding query, top-k dump. | ReAct loop: reason → choose tool → observe → refine. |
| Whole-document chunks, semantic-only. | Atomic facts with entity, temporal, and confidence dimensions. |
| One index, one signal. | Hybrid: vector + BM25 + entity graph + temporal range. |
| Static — every query treated the same. | Tool-calling agent picks recall / recall_shared / check_emotional_alignment per turn. |
| Hallucinations leak through. | Cite-and-verify — every fact traceable, filtered before storage. |
| Stale or contradictory facts coexist silently. | Polarity groups form on contradiction; confidence decays; consolidation resolves. |
| Same answer regardless of relationship or mood. | Retrieval is context-conditioned on relationship, mood, personality, goals. |
| No learning. | Retrieval reinforces — hits boost confidence, misses decay it. |
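The hybrid row lists four signals but not how they are merged. One common way to combine ranked lists from different indexes is reciprocal rank fusion (RRF), sketched below. RRF is a swapped-in illustration, not a statement of how Sonzai fuses results; the fact IDs and the `k = 60` constant are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of fact IDs; score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, fact_id in enumerate(ranked_ids, start=1):
            scores[fact_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda fid: (-scores[fid], fid))


rankings = {
    "vector":   ["f2", "f1", "f3"],
    "bm25":     ["f1", "f2"],
    "entity":   ["f1", "f4"],
    "temporal": ["f3", "f1"],
}
fused = reciprocal_rank_fusion(rankings)
```

A fact that appears in all four lists (`f1`) outranks one that tops a single list, which is the point of using more than one signal.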
The mental model
“Sonzai treats memory the way a reasoning agent treats the world — as something to interrogate, not something to flush into the prompt.”
Five patterns. Same Mind Layer. Pick the shape that fits your stack.
Each pattern is independently usable. Adopt one and you can graduate to another without re-platforming — the surface area changes, the substrate doesn't.
Process Endpoint
Memory layered onto an existing chat stack — one POST per turn
You already run your own LLM and chat — you want memory, personality, and learning layered on top without replacing what you have.
You own: the LLM call · the response stream · the UI.
Sonzai owns: fact extraction · memory persistence · mood / personality / relationship deltas.
```python
import os

from sonzai import Client

sz = Client(api_key=os.environ["SONZAI_API_KEY"])


async def handle_turn(user_id, agent_id, messages):
    # Sonzai extracts facts, persists memory, applies deltas, and returns audit info.
    deltas = await sz.agents.process(
        agent_id=agent_id, user_id=user_id, messages=messages,
    )
    # deltas: { memories_created, facts_extracted, mood, personality, relationship }
    return deltas
```

Real-Time Sessions
Your chat UI, our memory lifecycle — explicit start / per-turn / end
You want explicit per-conversation lifecycle — a clean start, per-turn enrichment and extraction, end-of-session consolidation.
You own: the LLM call · the message stream.
Sonzai owns: context retrieval · per-turn extraction · async consolidation on close.
```typescript
const s = await client.agents.sessions.start({ agent, user, sessionId });

for (const message of stream) {
  const ctx = await s.context({ query: message }); // 7-layer enriched system block
  const reply = await yourLLM([ctx.systemBlock, message]);
  await s.turn({ messages: [message, reply] }); // async extract + learn
}

await s.end({ messages }); // triggers consolidation
```

Agent Chat Endpoint
Full hosted runtime — SSE deltas, tools, multi-provider failover
Greenfield apps that want a complete agent in one call — streaming, tool calling, side-effect events for memory mutations.
You own: UI only.
Sonzai owns: context assembly · LLM orchestration · tool dispatch · memory persistence · provider fallback.
```python
async for evt in client.agents.chat(
    agent=agent, messages=[...], stream=True, tools=[...]
):
    if evt.type == "delta":
        render(evt.text)
    elif evt.type == "tool_call":
        handle_tool(evt)
    elif evt.type == "complete":
        show_usage(evt.usage)
```

Hermes Plugin
Drop-in for Nous Research's Hermes Agent — two lines of YAML
You already run Hermes Agent and want the Mind Layer added with two lines of YAML and zero handler changes.
You own: Hermes config.
Sonzai owns: memory recall on prefetch · fact extraction after each turn · intelligent context compression on overflow.
```yaml
# Two plugins, cooperating:
# Memory Provider runs every turn; Context Engine fires only on token-budget hit.
plugins:
  memory: sonzai
  context: sonzai
sonzai:
  api_key: ${SONZAI_API_KEY}
```

OpenClaw Plugin
Drop-in for OpenClaw agents — config-flip, zero code
You run OpenClaw and want server-backed enrichment instead of the default local Markdown memory.
You own: OpenClaw config.
Sonzai owns: the full Context Engine lifecycle (bootstrap, assemble, afterTurn, compact, dispose).
```jsonc
{
  "contextEngine": "sonzai",
  "sonzai": {
    "apiKey": "<your-key>",
    "audit": true // composio_app + request_id captured
  }
}
```

Design choice
All five flows share the same Mind Layer underneath. Moving between them is a code-level change, not a re-platforming — per-user state, learned weights, and accumulated memory all carry across.
The agent on day 90 is not the agent on day 1. It has learned this user specifically.
Most platforms ship a single model that serves every user the same. Sonzai stores per-user reinforcement-learning policy weights and personality overlays, hot-loaded into the inference path. The substrate to do this safely — shadow rollouts, promotion gates, versioning, rollback — is the kind of thing teams spend a year building.
What this changes
With per-user policies, the effective model becomes a different one per user over time — safely, with shadow rollout, eval-gated promotion, and sub-second rollback. Personalisation at the weight level, not just the prompt.
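A minimal sketch of the safety mechanics named above: a versioned per-user store with a shadow candidate, an eval-gated promotion, and rollback. The class name, the score margin, and the dictionary-shaped "weights" are all illustrative assumptions; the real store and gate are not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class UserPolicyStore:
    """Versioned weights for one user: a live version plus an optional shadow."""
    versions: list[dict] = field(default_factory=list)  # append-only history
    live: int = -1            # index of the version serving traffic
    previous: int = -1        # index to return to on rollback
    shadow: int = -1          # index of the candidate scored off the request path

    def add_shadow(self, weights: dict) -> int:
        self.versions.append(weights)
        self.shadow = len(self.versions) - 1
        return self.shadow

    def promote_if_better(self, shadow_score: float, live_score: float,
                          margin: float = 0.02) -> bool:
        """Eval gate: promote only when the shadow beats live by a margin."""
        if self.shadow < 0 or shadow_score < live_score + margin:
            return False
        self.previous, self.live, self.shadow = self.live, self.shadow, -1
        return True

    def rollback(self) -> None:
        """Restore the version that was live before the last promotion."""
        if self.previous >= 0:
            self.live, self.previous = self.previous, -1

    def serving_weights(self) -> dict:
        return self.versions[self.live]


store = UserPolicyStore(versions=[{"warmth": 0.5}], live=0)
store.add_shadow({"warmth": 0.7})
rejected = store.promote_if_better(shadow_score=0.81, live_score=0.80)   # within margin
promoted = store.promote_if_better(shadow_score=0.86, live_score=0.80)   # clears the gate
```

Because history is append-only and promotion only moves an index, rollback is a pointer swap rather than a redeploy.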
One request in. One response out. Eleven things in between.
The full lifecycle of a single user turn in Managed Runtime mode. Steps 1–6 are synchronous (in the request path). Steps 7–11 are asynchronous (queued, eventually consistent).
| Step | Sync | What happens |
|---|---|---|
| 1 · Auth & route | ✓ | Tenant + user resolved. Rate limiter checked. Provider keys vault hit. |
| 2 · Load per-user weights | ✓ | RL policy + personality overlay hot-loaded from weight store (§06). |
| 3 · Agentic retrieval | ✓ | ReAct loop — LLM picks memory tools, queries hybrid index, refines (§04). |
| 4 · Context assembly | ✓ | Memory + mood + relationship + personality + knowledge composed into prompt. |
| 5 · LLM call with failover | ✓ | Multi-provider router; priority list; cascade on quota exhaustion. |
| 6 · Stream response + tool calls | ✓ | SSE to your app. Tool calls intercepted, audited, returned. |
| 7 · Cite-and-verify extract | — | New atomic facts extracted, verified against turn source, scored, stored. |
| 8 · Mood + personality drift | — | Affective vector updated. Big-5 deltas applied. |
| 9 · Relationship update | — | Bond scores adjusted. Shared-memory channels checked. |
| 10 · Reinforcement learning | — | RL signal recorded. Shadow model scored. Promotion considered. |
| 11 · Consolidation queue | — | Turn queued for nightly consolidation, decay sweeps, polarity-group formation. |
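Step 5 of the table, priority failover with cascade on quota exhaustion, reduces to a short loop. The sketch below uses stub provider functions and a hypothetical `QuotaExhausted` exception; real providers signal exhaustion in provider-specific ways, and none of these names come from the Sonzai SDK.

```python
from typing import Callable


class QuotaExhausted(Exception):
    """Raised by a provider stub when its quota or rate limit is spent."""


def call_with_failover(providers: list[tuple[str, Callable[[str], str]]],
                       prompt: str) -> tuple[str, str]:
    """Try providers in priority order; cascade to the next on quota exhaustion."""
    skipped: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except QuotaExhausted:
            skipped.append(name)      # record it and fall through to the next provider
    raise RuntimeError(f"all providers exhausted: {skipped}")


def primary(prompt: str) -> str:
    raise QuotaExhausted("primary over quota")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_failover([("primary", primary), ("secondary", secondary)], "hi")
```

Only quota errors trigger the cascade here; any other exception propagates, so a malformed request is not silently replayed against every provider.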
Same primitives. Six surfaces. Pick what fits your stack.
| Surface | For | Shape |
|---|---|---|
| Python SDK | Backend services · batch jobs · eval pipelines | client.agents.chat(...) — sync & async |
| TypeScript SDK | Node · Bun · Deno · edge | Zero-dependency, isomorphic. Same surface area. |
| Go SDK | High-throughput infrastructure | Native client for Go runtimes. |
| MCP Server | Any MCP-compatible host | Memory, knowledge, and tool primitives as MCP servers. |
| Framework Plugins | Hermes · OpenClaw · similar | Drop-in plugin auto-injects <sonzai-context>. No code change. |
| REST API | Anything else | OpenAPI-spec'd, language-agnostic. |
```typescript
import { Sonzai } from "@sonzai-labs/agents";

const sz = new Sonzai({ apiKey: process.env.SONZAI_API_KEY });

// `yield` is only valid inside a generator, so the loop is wrapped in one.
async function* reply(userId: string, message: string) {
  const stream = await sz.agents.stream({
    userId,
    message,
    scene: "front_of_house",
    providers: ["claude-3.5", "gpt-4o"],
    tools: ["composio.gmail", "kb.search"],
  });
  for await (const chunk of stream) yield chunk.text;
}
```

```go
import (
	"os"

	sonzai "github.com/sonz-ai/sonzai-go"
)

// Inside your request handler. Errors are discarded for brevity; check them in real code.
sz, _ := sonzai.New(sonzai.WithAPIKey(os.Getenv("SONZAI_API_KEY")))
facts, _ := sz.Memory.Recall(ctx, &sonzai.RecallReq{UserID: uid, Query: msg})
// ... your LLM call, with facts injected ...
sz.Memory.ExtractAsync(ctx, uid, transcript)
```

Deployment modes: adopt what you need
| Mode | Sonzai owns | You own |
|---|---|---|
| Standalone Memory | Memory · Personality · Mood (via 2 calls / turn) | LLM call · orchestration · UX |
| Drop-In Runtime | The full request loop · all 8 modules · failover · tools | UX · auth · business logic |
| Edge / Local | On-device semantic memory · privacy-sensitive flows | Everything else |
| Research / Benchmark | Eval harness · SOTOPIA scoring | Your candidate memory backend |
| Bring-Your-Own-Key | Routing · failover · all behavioral systems | Provider keys · provider billing |
None of this is one feature. It's nine choices that compound.
Each item below is a deliberate design choice in the substrate. None of them is novel in isolation — retrieval, evals, RL, fallback, all exist elsewhere. The substrate is what's hard: making them work together, per-tenant, under production load, with rollback.
Agentic, multi-signal retrieval
ReAct loop over hybrid vector + BM25 + entity + temporal indexes. The LLM picks tools per turn. Not RAG-on-vector-soup.
Confidence-aware memory ranking
Facts carry decay curves. Retrieval reinforces them. Contradictions form polarity groups instead of silently overwriting.
Adaptive consolidation cadence
Dormant users pay near-zero. Heavy users get more passes. Cost scales with engagement, not headcount.
Cross-tenant concept catalog
Cheap models inherit frontier-model quality via grounded retrieval. The largest economic lever in the stack.
Cite-and-verify pipeline
Every extracted fact is traceable to its source turn. Hallucinated facts are filtered before storage.
Multi-provider failover by priority
Automatic cascade on quota exhaustion. Single point of integration, zero single point of failure.
Per-user model weights, hot-loaded
Each user's agent becomes a different model over time. Shadow rollout, promotion gates, rollback all managed.
SOTOPIA-gated releases
6-dim behavioral scoring — Believability, Relationships, Knowledge, Social Rules, EQ, Goal Completion — on every release.
Workbench = production, accelerated
What you evaluate in minutes of simulated time is exactly what runs in production. Same code path.
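Of the nine choices above, cite-and-verify is the easiest to show in miniature. The sketch assumes the extractor returns each candidate fact with a quoted evidence span, and keeps a fact only if that span appears in the cited turn. The `CandidateFact` shape and the substring check are illustrative assumptions; a substring match confirms the quote exists, not that the claim follows from it.

```python
from dataclasses import dataclass


@dataclass
class CandidateFact:
    claim: str      # the atomic fact the extractor proposes
    quote: str      # the span of the turn the extractor cites as evidence
    turn_id: str    # which turn the quote is supposed to come from


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting doesn't block a match."""
    return " ".join(text.lower().split())


def verify(candidates: list[CandidateFact], turns: dict[str, str]) -> list[CandidateFact]:
    """Keep only facts whose cited quote actually appears in the cited turn."""
    kept = []
    for fact in candidates:
        source = turns.get(fact.turn_id, "")
        if fact.quote and normalize(fact.quote) in normalize(source):
            kept.append(fact)
    return kept


turns = {"t1": "I just moved to Lisbon  and I'm learning Portuguese."}
candidates = [
    CandidateFact("User lives in Lisbon", "moved to Lisbon", "t1"),
    CandidateFact("User owns a dog", "my dog Rex", "t1"),        # quote not in the turn
]
stored = verify(candidates, turns)
```

The fabricated dog fact never reaches storage because its citation cannot be found, which is the filter-before-storage behavior the list describes.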
The Mind Layer
Give any LLM a mind.
One SDK. Five integration patterns. The same Mind Layer underneath whether you adopt it as a memory sidecar, a session runtime, a hosted agent, or a plugin in Hermes or OpenClaw.