Technology · Architecture Reference

The Mind Layer for consumer AI agents — how it actually works.

Sonzai is the persistent cognitive substrate that sits between any LLM and any application. Eight composable modules — memory, personality, mood, relationships, knowledge, learning, media, and an agent runtime — on managed infrastructure most teams would otherwise spend a year building.

Modules: 8 composable
Surfaces: 6 SDKs / protocols
Adoption: 5 deploy modes
Providers: Gemini · GPT · Claude · Grok
01 · What Sonzai is · in one diagram

A composable substrate between any LLM and any application.

Sonzai is a multi-tenant AI Mind Layer — a managed substrate between any LLM and any application. Stateless at the edge, stateful at the core, provider-agnostic. Drop it in front of Gemini, GPT, Grok, Claude, or any combination, and the agent retains memory, personality, mood, relationships, and learning across every session.

Fig. 1: Overview, three-tier view

YOUR APPLICATION (you build / operate)
Web · mobile · API · your product. Holds the user, the UX, the business logic.
Connects through SDK · MCP · REST · plugins.

SONZAI · MANAGED MIND LAYER (Sonzai operates)
Mind Layer · 8 modules: Memory Layer · Knowledge Base · Relationship · Mood & Emotion · Personality · Learning Systems · Media Generation · Agent Runtime.
Each module is independently consumable. Standalone-usable: Memory Layer.
Managed infrastructure:
· Transactional + columnar stores
· Vector / entity / temporal index
· Message queue · DLQ · retries
· Distributed compute (extract/embed)
· Per-user model-weight store
· Background scheduler · key vault
· Cost ledger · observability · evals

PROVIDERS (external)
Gemini · OpenAI GPT · Anthropic Claude · xAI Grok. Routed · failover · BYOK or hosted keys.

Core principle

Memory is the floor of an agent's mind, not the whole of it. Sonzai treats memory, personality, mood, relationships, knowledge, learning, media, and orchestration as one composable substrate — because in practice they're entangled, and pretending otherwise leaks complexity into every application that tries to do it itself.

02 · The Stack · 8 modules, independently consumable

Eight modules. Pick one, pick all.

Each layer is independently consumable through the same SDK. The Memory Layer can stand alone. Personality, Mood, and Relationships compose on top. Learning Systems rewrite every layer below them. Agent Runtime caps the stack with provider-agnostic orchestration.

Fig. 2: The 8-module stack, modular by design

PROVIDER (external LLM): Gemini · GPT · Claude · Grok
08 Agent Runtime: Provider-agnostic orchestration · tool calling · SSE streaming · multi-agent scenes · auto-failover
07 Personality: Big-5 (OCEAN) traits · evolution over time · per-scenario overlays · cross-agent composition
06 Mood & Emotion: 4-D affective vector (happiness · energy · calmness · affection) · theme detection · contagion
05 Relationship: Directional bond scores · shared-memory channels · privacy tiers (private / shared / public)
04 Knowledge Base: Tenant- + project-scoped knowledge store · ACL-gated retrieval & write tools
03 Learning Systems: Reinforcement (shadow → live) · self-learning · federated / cross-agent concept catalog
02 Memory Layer (standalone-usable): Atomic facts · hierarchical tree · confidence decay · embedding-dedup · nightly consolidation
01 Media Generation: Image · video · TTS · music · SFX through best-of-breed providers, one orchestration layer
YOUR APPLICATION: Via SDK (Python · TS · Go) · MCP · REST · framework plugins
Feedback loop: learning rewrites all layers.
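To make the Mood & Emotion module's "4-D affective vector" and "contagion" concrete, here is a minimal sketch. The four dimensions come from the stack description; the `Mood` class, the `contagion` function, the 0-to-1 value range, and the blending rate are all assumptions made for illustration, not Sonzai's actual model.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Mood:
    """Hypothetical 4-D affective vector; values assumed to lie in [0, 1]."""
    happiness: float
    energy: float
    calmness: float
    affection: float


def contagion(own: Mood, others: list[Mood], rate: float = 0.2) -> Mood:
    """Pull one agent's mood toward the mean mood of the other agents in a scene.

    rate=0 ignores the others; rate=1 adopts their mean outright.
    """
    if not others:
        return own
    count = len(others)
    # Per-dimension mean across the other agents.
    mean = [sum(values) / count for values in zip(*(astuple(o) for o in others))]
    blended = [
        min(1.0, max(0.0, (1.0 - rate) * mine + rate * theirs))
        for mine, theirs in zip(astuple(own), mean)
    ]
    return Mood(*blended)
```

A linear blend is only one possible contagion rule; it is used here because it is easy to reason about and keeps every dimension inside the assumed range.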

Module reference

Module | What it does | What it enables
Memory Layer | Cite-and-verify extraction of atomic facts, ranked by confidence, decayed on the Ebbinghaus curve, deduped by embedding similarity, consolidated nightly. | Recall that ages gracefully.
Personality | Big-5 trait tracking with evolution deltas, full history, and per-scenario overlays. | A self that drifts with experience.
Mood & Emotion | Live 4-D affective vector with theme detection per turn and contagion across multi-agent scenes. | Affective state, not robotic transcript.
Relationship | Directional love/trust scores per pair · shared-memory channels · 3 privacy tiers. | Multi-agent dynamics, not just calls.
Knowledge Base | Tenant- and project-scoped knowledge store with retrieval + write tools, gated by access control. | Grounded, ACL-aware retrieval.
Learning Systems | RL with shadow-model promotion · nightly self-learning · federated cross-agent concept catalog. | Agents that improve in production.
Agent Runtime | Provider-agnostic LLM orchestration with tool calling, SSE streaming, and priority failover. | One integration, zero SPOF.
Media Generation | Image, video, TTS, music, and SFX through best-of-breed providers. | Multi-modal output, one client.
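The Memory Layer row mentions confidence decay on the Ebbinghaus curve and deduplication by embedding similarity. The sketch below shows one way those two ideas can be expressed. The exponential form R = exp(-t / S) is the usual statement of the forgetting curve; the function names, the 30-day stability constant, and the 0.92 similarity threshold are assumptions for illustration and are not taken from Sonzai.

```python
import math


def decayed_confidence(confidence: float, days_since_reinforced: float,
                       stability_days: float = 30.0) -> float:
    """Scale a stored confidence by Ebbinghaus-style retention exp(-t / S)."""
    return confidence * math.exp(-days_since_reinforced / stability_days)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_duplicate(candidate: list[float], stored: list[list[float]],
                 threshold: float = 0.92) -> bool:
    """Treat a new fact as a duplicate if any stored embedding is close enough."""
    return any(cosine(candidate, s) >= threshold for s in stored)
```

In this sketch a fact that is never reinforced loses about 63% of its confidence after one stability period, and a near-identical embedding is rejected before it is stored.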
03 · The Managed Platform · the part beneath the API

A production agent stack is a dozen distributed systems wired together. Sonzai operates all of them.

The Mind Layer is the part visible through the API. Underneath sits the infrastructure that makes a multi-tenant, evolving, learning agent platform actually work in production — transactional and columnar stores, a hybrid index, a hot cache, a queue with DLQ and retries, a distributed worker pool, a versioned per-user weight store, a scheduler for cadence-driven jobs, a key vault, a cost ledger, observability, and an eval gate.

Fig. 3: Managed infrastructure, component map. Sonzai operates the whole bottom tier.

APPLICATION: Your app · your UX · your auth
SONZAI MIND LAYER · 8 MODULES: Memory · Personality · Mood · Relationships · Knowledge · Learning · Media · Runtime
SONZAI MANAGED INFRASTRUCTURE · WHAT YOU DON'T BUILD
· Data · transactional: Transactional store. Multi-region · ACID · async replication. 3 replicas · RPO ≪ 1s.
· Data · columnar: Columnar memory. Per-tenant · time-series · append-only. Shard by tenant · auto-rebalance.
· Data · hybrid index: Vector + entity + time. Cosine · BM25 · entity graph · range. Hybrid retrieval · reranked per-tenant.
· Cache · in-memory: In-memory cache. Hot context · read-through · TTL. p99 < 5ms · > 98% hit ratio.
· Queue · messages + DLQ: Message queue + DLQ. Idempotent · ordered · exp. backoff. DLQ inspection · manual replay.
· Compute · distributed: Distributed compute. Extract · embed · infer · CPU + GPU. Spot-capable · autoscaled.
· State · per-user weights: Per-user weight store. RL policy · overlays · LoRA deltas. Versioned (v15, v16, v17 live, v18 shadow) · hot-load · < 1s rollback.
· Jobs · scheduler: Background scheduler. Nightly · hourly · decay sweeps. Adaptive cadence · dormant ≈ 0 cost.
· Security · key vault: Multi-tenant key vault. BYOK + hosted · per-project isolation. Envelope-encrypted at rest.
· FinOps · cost ledger: Cost ledger + limiter. Per-user · per-day · per-month caps. Hard caps · soft alerts before bills.
· Quality · observability: Observability. Tracing · metrics · logs · per-tenant. OTel-native · OpenSearch sink.
· Quality · eval gate: Release eval gate. Believability · KB · EQ · social · goal. SOTOPIA 6-dim · blocks regressions.
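The queue component is described as idempotent, retried with exponential backoff, and backed by a dead-letter queue. The following is a minimal in-process sketch of that consumer pattern, not Sonzai's implementation. The `consume` and `backoff_delay` names, the message shape (a dict with an `id` key), and the retry limits are assumptions chosen for illustration.

```python
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped."""
    return min(cap, base * 2 ** attempt)


def consume(message: dict, handler: Callable[[dict], None],
            seen_ids: set, dead_letters: list, max_attempts: int = 4,
            sleep: Callable[[float], None] = time.sleep) -> str:
    """Process one message at most once, retrying failures before dead-lettering."""
    if message["id"] in seen_ids:          # idempotency: skip redelivered messages
        return "duplicate"
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(message)
            seen_ids.add(message["id"])
            return "processed"
        except Exception as exc:           # any handler failure counts as retryable here
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Retries exhausted: park the message for inspection and manual replay.
    dead_letters.append({"message": message, "error": repr(last_error)})
    return "dead-lettered"
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a production consumer would also persist `seen_ids` and add jitter to the backoff.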

If you DIY vs. with Sonzai

If you DIY | With Sonzai
Pick a queue, tune retries, run dead-letter handling. | Already wired and idempotent.
Shard a database per tenant, replicate cross-region. | Multi-tenant isolation out of the box.
Stand up a vector index and keep embeddings fresh. | Vector + entity + temporal indexes managed.
Cluster GPU/CPU compute for extraction and embedding. | Distributed compute pool, autoscaled.
Store and version per-user RL weights and overlays. | Per-user weight store, hot-swappable.
Schedule nightly consolidation, hourly decay, sweep jobs. | Background scheduler runs the full cadence.
Build a cost ledger before usage explodes. | Per-user / per-day / per-month caps included.
Wire eval gates so quality doesn't regress on model swaps. | SOTOPIA 6-dim gate runs on every release.
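As an illustration of the cost-ledger row, the sketch below shows a spending cap with a hard limit and a soft alert threshold. It is a simplified single-budget model under stated assumptions: the `Budget` class, the `charge` function, and the 80% alert ratio are hypothetical, and a real ledger would track separate per-user, per-day, and per-month windows.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical spending window with a hard cap and a soft alert threshold."""
    hard_cap_usd: float
    soft_alert_ratio: float = 0.8   # assumed: alert once 80% of the cap is used
    spent_usd: float = 0.0


def charge(budget: Budget, cost_usd: float) -> str:
    """Record a cost if it fits under the hard cap; report 'ok', 'alert', or 'blocked'."""
    projected = budget.spent_usd + cost_usd
    if projected > budget.hard_cap_usd:
        return "blocked"                       # hard cap: reject before any spend occurs
    budget.spent_usd = projected
    if projected >= budget.hard_cap_usd * budget.soft_alert_ratio:
        return "alert"                         # soft alert: allowed, but flagged
    return "ok"
```

Checking the projected total before recording it is what makes the cap "hard": a blocked request never changes the ledger.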

Order of magnitude

The left column is the work a platform team typically takes 12+ months to stand up and another 12 to harden. Every row of that work is already running under the API.

04 · Beyond Vector RAG · how Sonzai actually retrieves

Plain RAG embeds, top-k's, and hopes. Sonzai's retrieval is agentic.

The model reasons about what it needs to know, chooses which memory tools to call, inspects the results, and iterates until it has enough context. The ReAct loop applied to memory — not just to web search.

Fig. 4: Agentic retrieval, ReAct loop over memory tools. One turn of retrieval.

INPUT: User turn, e.g. "What did we decide?"
REASON: LLM picks tools, e.g. "I need shared memories with X"
MEMORY TOOLS · CHOSEN PER TURN:
  recall(query, top_k, filters)
  recall_shared_memories(with_id)
  recall_by_entity(entity_id)
  recall_by_time(start, end)
  check_emotional_alignment(topic)
  check_relationship_state(user_id)
  recall_personality_drift(window)
  search_knowledge(query, project)
  remember_fact(text, refs, confidence)
  // + project-defined custom tools
OBSERVE: Hybrid index lookup (vector · BM25 · entity · temporal)
REFINE: LLM judges: enough? try another tool? (↺ iterate)
RESPOND: Grounded answer · cite-and-verify. Every claim traceable to a source memory or knowledge entry.
POST-TURN · ASYNC: Extract atomic facts · update mood · drift personality · reinforce retrieval · queue consolidation. Contradictions form polarity groups; confidence decays without reinforcement; hallucinations filtered before storage.
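The reason, observe, refine cycle in Fig. 4 can be sketched as a small loop. In this sketch the LLM's two decisions (which tool to call, and whether the context is sufficient) are passed in as plain callables so the control flow is visible without a model. The function name `agentic_retrieve`, its signature, and the step limit are assumptions for illustration; only the tool names in the usage example come from the figure.

```python
from typing import Callable, Optional

Tool = Callable[[str], list[str]]


def agentic_retrieve(
    query: str,
    tools: dict[str, Tool],
    choose_tool: Callable[[str, list[str], set[str]], Optional[str]],
    is_enough: Callable[[str, list[str]], bool],
    max_steps: int = 4,
) -> list[str]:
    """Run a bounded reason -> act -> observe -> refine loop over memory tools."""
    observations: list[str] = []
    tried: set[str] = set()
    for _ in range(max_steps):
        name = choose_tool(query, observations, tried)   # REASON: pick the next tool
        if name is None:                                 # nothing left worth trying
            break
        tried.add(name)
        observations.extend(tools[name](query))          # ACT + OBSERVE
        if is_enough(query, observations):               # REFINE: stop once grounded
            break
    return observations


# Usage with stand-in tools and stand-in decision functions.
demo_tools: dict[str, Tool] = {
    "recall": lambda q: [],
    "recall_shared_memories": lambda q: ["example memory: agreed to meet on Friday"],
}
pick_untried = lambda q, obs, tried: next((n for n in demo_tools if n not in tried), None)
has_any = lambda q, obs: len(obs) > 0
result = agentic_retrieve("What did we decide?", demo_tools, pick_untried, has_any)
```

The `max_steps` bound matters in practice: without it, a model that never judges the context sufficient would keep calling tools indefinitely.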

Vector RAG vs. agentic retrieval

Plain Vector RAG | Sonzai agentic retrieval
Single embedding query, top-k dump. | ReAct loop: reason → choose tool → observe → refine.
Whole-document chunks, semantic-only. | Atomic facts with entity, temporal, and confidence dimensions.
One index, one signal. | Hybrid: vector + BM25 + entity graph + temporal range.
Static: every query treated the same. | Tool-calling agent picks recall / recall_shared / check_emotional_alignment per turn.
Hallucinations leak through. | Cite-and-verify: every fact traceable, filtered before storage.
Stale or contradictory facts coexist silently. | Polarity groups form on contradiction; confidence decays; consolidation resolves.
Same answer regardless of relationship or mood. | Retrieval is context-conditioned on relationship, mood, personality, goals.
No learning. | Retrieval reinforces: hits boost confidence, misses decay it.
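The comparison lists hybrid retrieval over several signals (vector, BM25, entity graph, temporal range) but does not say how the ranked lists are combined. One common, simple option is reciprocal rank fusion (RRF), sketched below. RRF is swapped in here purely as an illustration; the source does not state that Sonzai uses it, and the function name and the constant k = 60 are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of fact ids into one ranking.

    Each list contributes 1 / (k + rank) per item, so items ranked highly by
    several signals rise to the top even if no single signal ranks them first.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, fact_id in enumerate(ranking, start=1):
            scores[fact_id] = scores.get(fact_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda fact_id: (-scores[fact_id], fact_id))
```

RRF needs only ranks, not comparable scores, which is why it is a convenient default when signals such as cosine similarity and BM25 live on different scales.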

The mental model

“Sonzai treats memory the way a reasoning agent treats the world — as something to interrogate, not something to flush into the prompt.”

05 · Integration Patterns · five flows, one platform

Five patterns. Same Mind Layer. Pick the shape that fits your stack.

Each pattern is independently usable. Adopt one and you can graduate to another without re-platforming — the surface area changes, the substrate doesn't.

Flow 01

Process Endpoint

Memory layered onto an existing chat stack — one POST per turn

Use when

You already run your own LLM and chat — you want memory, personality, and learning layered on top without replacing what you have.

You own

The LLM call · the response stream · the UI.

Sonzai owns

Fact extraction · memory persistence · mood / personality / relationship deltas.

Fig. 5.1: Process Endpoint sequence, one round-trip per turn

Participants: Your app (your existing chat stack, you operate) · Sonzai Process API (/process) · Sonzai Mind Layer
1. POST /process { messages, userId, agentId }
2. Extract atomic facts · persist memory (cite-and-verify · embedding dedup · confidence rank)
3. Apply mood · personality · relationship deltas
4. Returned: { memories_created, facts_extracted, deltas }
Python · /process
import os

from sonzai import Client

sz = Client(api_key=os.environ["SONZAI_API_KEY"])

async def handle_turn(user_id, agent_id, messages):
    # Sonzai extracts facts, persists memory, applies deltas, and returns audit info.
    deltas = await sz.agents.process(
        agent_id=agent_id, user_id=user_id, messages=messages,
    )
    return deltas  # { memories_created, facts_extracted, mood, personality, relationship }
Flow 02

Real-Time Sessions

Your chat UI, our memory lifecycle — explicit start / per-turn / end

Use when

You want explicit per-conversation lifecycle — a clean start, per-turn enrichment and extraction, end-of-session consolidation.

You own

The LLM call · the message stream.

Sonzai owns

Context retrieval · per-turn extraction · async consolidation on close.

Fig. 5.2: Real-Time Sessions sequence, session-scoped lifecycle

Participants: Your app (your existing chat UI, you operate) · Sonzai Sessions API · Sonzai Mind Layer
1. sessions.start { agent, user, sessionId }
2. Session handle returned
Per turn (repeats):
3. session.context(query)
4. Fetch enriched 7-layer context: memory · mood · personality · bonds · KB
5. systemBlock returned (inject before your LLM call)
   (your LLM call happens between 5 and 6)
6. session.turn(messages)
On close:
7. session.end(messages): triggers async consolidation across the stack
TypeScript · sessions
const s = await client.agents.sessions.start({ agent, user, sessionId });
const messages = [];                                  // running transcript, passed to end()

for (const message of stream) {
  const ctx = await s.context({ query: message });   // 7-layer enriched system block
  const reply = await yourLLM([ctx.systemBlock, message]);
  await s.turn({ messages: [message, reply] });      // async extract + learn
  messages.push(message, reply);
}
await s.end({ messages });                            // triggers consolidation
Flow 03

Agent Chat Endpoint

Full hosted runtime — SSE deltas, tools, multi-provider failover

Use when

Greenfield apps that want a complete agent in one call — streaming, tool calling, side-effect events for memory mutations.

You own

UI only.

Sonzai owns

Context assembly · LLM orchestration · tool dispatch · memory persistence · provider fallback.

Fig. 5.3: Agent Chat SSE sequence, one call, many events

Participants: Your app (UI only, you operate) · Sonzai Agent Chat API · Sonzai Mind + Runtime + LLM
1. agents.chat { agent, messages, stream: true, tools: […] }
2. Assemble enriched ctx · call LLM with failover (Gemini → GPT → Claude → Grok by priority)
SSE stream:
3. phase: "thinking"
4. text delta · token stream
5. tool_call · your app handles, returns result
6. side-effects · memory mutations queued
7. complete · usage · stop_reason
Python · /chat · stream
async for evt in client.agents.chat(
    agent=agent, messages=[...], stream=True, tools=[...]
):
    if   evt.type == "delta":     render(evt.text)
    elif evt.type == "tool_call": handle_tool(evt)
    elif evt.type == "complete":  show_usage(evt.usage)
Flow 04

Hermes Plugin

Drop-in for Nous Research's Hermes Agent — two lines of YAML

Use when

You already run Hermes Agent and want the Mind Layer added with two lines of YAML and zero handler changes.

You own

Hermes config.

Sonzai owns

Memory recall on prefetch · fact extraction after each turn · intelligent context compression on overflow.

Fig. 5.4: Hermes Plugin sequence, zero handler code change

Participants: Hermes Agent (external host) · sonzai-hermes (Sonzai plugin) · Sonzai Mind Layer
config.yaml: memory: sonzai · context: sonzai (zero code change in your handlers)
Per turn (repeats):
1. prefetch(query): Memory Provider hook fires
2. Fetch 7-layer enriched context
3. <sonzai-context> block, injected to system prompt
4. sync_turn(messages), after Hermes runs LLM
5. On context overflow: compress() · flush · consolidate
config.yaml · Hermes
# Two plugins, cooperating:
# Memory Provider runs every turn; Context Engine fires only on token-budget hit.
plugins:
  memory: sonzai
  context: sonzai
sonzai:
  api_key: ${SONZAI_API_KEY}
Flow 05

OpenClaw Plugin

Drop-in for OpenClaw agents — config-flip, zero code

Use when

You run OpenClaw and want server-backed enrichment instead of the default local Markdown memory.

You own

OpenClaw config.

Sonzai owns

The full Context Engine lifecycle — bootstrap, assemble, afterTurn, compact, dispose.

Fig. 5.5: OpenClaw Plugin sequence, full Context Engine lifecycle

Participants: OpenClaw Agent (external host) · Context Engine (Sonzai plugin) · Sonzai Mind Layer
openclaw.json: contextEngine: sonzai (config-flip, zero code)
1. bootstrap(sessionId) · resolve agent · start session
Per turn (repeats):
2. assemble(messages)
3. Fetch enriched · build systemAddition
4. afterTurn(): extract delta (async via Mind)
5. compact() on budget: consolidate
On shutdown:
6. dispose(): end session
openclaw.json
{
  "contextEngine": "sonzai",
  "sonzai": {
    "apiKey": "<your-key>",
    "audit": true           // composio_app + request_id captured
  }
}

Design choice

All five flows share the same Mind Layer underneath. Moving between them is a code-level change, not a re-platforming — per-user state, learned weights, and accumulated memory all carry across.

06 · Per-User Model Weights · inference personalised at the weight level

The agent on day 90 is not the agent on day 1. It has learned this user specifically.

Most platforms ship a single model that serves every user the same. Sonzai stores per-user reinforcement-learning policy weights and personality overlays, hot-loaded into the inference path. The substrate to do this safely — shadow rollouts, promotion gates, versioning, rollback — is the kind of thing teams spend a year building.

Fig. 6: Per-user weights, inference path. Hot-loaded, gated, rollback-safe.

INFERENCE PATH · request time: User turn → load user_id's policy + overlay → assemble prompt → LLM call. Hot-loaded from weight store · <5ms overhead · cache-aware.
PER-USER WEIGHT STORE (user_id → versions): v17 live · v18 shadow · v16 prev · v15 prev. RL policy heads · personality overlays · LoRA deltas. Hot-swappable. Rollback in <1s.
SHADOW → LIVE: Shadow model scored vs current live on real traffic. Confidence ranking, turn-by-turn deltas tracked. Graduated promotion: 1% · 10% · 50% · 100%. Auto-rollback on regression.
PROMOTION GATE (SOTOPIA 6-dim eval): Believability · Relationships · Knowledge · Social Rules · EQ · Goal Completion. Every release. Every user.

What this changes

With per-user policies, the effective model becomes a different one per user over time — safely, with shadow rollout, eval-gated promotion, and sub-second rollback. Personalisation at the weight level, not just the prompt.
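A graduated promotion of the kind described here (1%, 10%, 50%, 100%, with automatic rollback on regression) is often built from two pieces: a deterministic bucketing function that decides which requests the shadow version serves, and a rule for advancing or rolling back. The sketch below shows both under stated assumptions. The stage fractions come from the figure; `in_rollout`, `next_stage`, bucketing by a request id, and comparing a single scalar score are hypothetical simplifications.

```python
import hashlib

STAGES = (0.01, 0.10, 0.50, 1.00)   # rollout fractions from the figure
ROLLBACK = -1                        # sentinel: revert to the previous live version


def in_rollout(request_id: str, fraction: float) -> bool:
    """Deterministically place a request in the first `fraction` of hash space."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")          # integer in [0, 2**64)
    return bucket < int(fraction * 2 ** 64)


def next_stage(stage: int, shadow_score: float, live_score: float,
               tolerance: float = 0.0) -> int:
    """Advance one stage if the shadow holds up, otherwise signal rollback."""
    if shadow_score + tolerance < live_score:
        return ROLLBACK
    return min(stage + 1, len(STAGES) - 1)
```

Hashing instead of random sampling keeps assignment stable: a request that was served by the shadow at 10% is still served by it at 50%, so widening the rollout never flips earlier traffic back.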

07 · A Single Turn, End-to-End · what actually happens

One request in. One response out. Eleven things in between.

The full lifecycle of a single user turn in Managed Runtime mode. Steps 1–6 are synchronous (in the request path). Steps 7–11 are asynchronous (queued, eventually consistent).

Step | Sync | What happens
1 · Auth & route | sync | Tenant + user resolved. Rate limiter checked. Provider keys vault hit.
2 · Load per-user weights | sync | RL policy + personality overlay hot-loaded from weight store (§06).
3 · Agentic retrieval | sync | ReAct loop: LLM picks memory tools, queries hybrid index, refines (§04).
4 · Context assembly | sync | Memory + mood + relationship + personality + knowledge composed into prompt.
5 · LLM call with failover | sync | Multi-provider router; priority list; cascade on quota exhaustion.
6 · Stream response + tool calls | sync | SSE to your app. Tool calls intercepted, audited, returned.
7 · Cite-and-verify extract | async | New atomic facts extracted, verified against turn source, scored, stored.
8 · Mood + personality drift | async | Affective vector updated. Big-5 deltas applied.
9 · Relationship update | async | Bond scores adjusted. Shared-memory channels checked.
10 · Reinforcement learning | async | RL signal recorded. Shadow model scored. Promotion considered.
11 · Consolidation queue | async | Turn queued for nightly consolidation, decay sweeps, polarity-group formation.
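The split between steps 1 to 6 (in the request path) and steps 7 to 11 (queued) can be sketched with `asyncio`. The step names mirror the table; everything else, including `handle_turn`, `drain`, and the use of an in-memory `asyncio.Queue` in place of a durable message queue, is an assumption made to keep the example self-contained.

```python
import asyncio

SYNC_STEPS = ("auth_route", "load_weights", "agentic_retrieval",
              "context_assembly", "llm_call", "stream_response")
ASYNC_STEPS = ("extract_facts", "mood_personality_drift", "relationship_update",
               "reinforcement_learning", "consolidation_queue")


async def handle_turn(turn_id: str, queue: asyncio.Queue) -> list[str]:
    """Run the request-path steps inline, then enqueue the rest and return."""
    completed = []
    for step in SYNC_STEPS:
        completed.append(step)              # stand-in for the real work of each step
    for step in ASYNC_STEPS:
        queue.put_nowait((turn_id, step))   # the response does not wait on these
    return completed


async def drain(queue: asyncio.Queue) -> list[tuple[str, str]]:
    """Stand-in background worker: process whatever has been queued so far."""
    processed = []
    while not queue.empty():
        processed.append(queue.get_nowait())
        queue.task_done()
    return processed


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    completed = await handle_turn("turn-1", queue)
    pending = queue.qsize()                 # work still outstanding when the reply returns
    processed = await drain(queue)
    return completed, pending, processed
```

The point of the shape is latency: the caller gets its response after the six inline steps, while the five queued steps become visible later, which is what "eventually consistent" means in the description above.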
08 · SDKs & Integration Surfaces · six ways to plug in

Same primitives. Six surfaces. Pick what fits your stack.

Surface | For | Shape
Python SDK | Backend services · batch jobs · eval pipelines | client.agents.chat(...), sync & async
TypeScript SDK | Node · Bun · Deno · edge | Zero-dependency, isomorphic. Same surface area.
Go SDK | High-throughput infrastructure | Native client for Go runtimes.
MCP Server | Any MCP-compatible host | Memory, knowledge, and tool primitives as MCP servers.
Framework Plugins | Hermes · OpenClaw · similar | Drop-in plugin auto-injects <sonzai-context>. No code change.
REST API | Anything else | OpenAPI-spec'd, language-agnostic.
TypeScript · managed runtime
import { Sonzai } from "@sonzai-labs/agents";
const sz = new Sonzai({ apiKey: process.env.SONZAI_API_KEY });

// Wrapped in an async generator so that `yield` is valid; callers iterate the text chunks.
async function* reply(userId: string, message: string) {
  const stream = await sz.agents.stream({
    userId,
    message,
    scene: "front_of_house",
    providers: ["claude-3.5", "gpt-4o"],
    tools: ["composio.gmail", "kb.search"],
  });
  for await (const chunk of stream) yield chunk.text;
}
Go · standalone memory
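import (
	"os"

	sonzai "github.com/sonz-ai/sonzai-go"
)

// Fragment from inside a handler: ctx, uid, msg, and transcript come from the caller.
// Errors are discarded for brevity; check them in real code.
sz, _ := sonzai.New(sonzai.WithAPIKey(os.Getenv("SONZAI_API_KEY")))

facts, _ := sz.Memory.Recall(ctx, &sonzai.RecallReq{ UserID: uid, Query: msg })
// ... your LLM call, with facts injected ...
sz.Memory.ExtractAsync(ctx, uid, transcript)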

Deployment modes — adopt what you need

Mode | Sonzai owns | You own
Standalone Memory | Memory · Personality · Mood (via 2 calls / turn) | LLM call · orchestration · UX
Drop-In Runtime | The full request loop · all 8 modules · failover · tools | UX · auth · business logic
Edge / Local | On-device semantic memory · privacy-sensitive flows | Everything else
Research / Benchmark | Eval harness · SOTOPIA scoring | Your candidate memory backend
Bring-Your-Own-Key | Routing · failover · all behavioral systems | Provider keys · provider billing
09 · Architectural Choices · the nine that compound

None of this is one feature. It's nine choices that compound.

Each item below is a deliberate design choice in the substrate. None of them is novel in isolation — retrieval, evals, RL, fallback, all exist elsewhere. The substrate is what's hard: making them work together, per-tenant, under production load, with rollback.

Agentic, multi-signal retrieval

ReAct loop over hybrid vector + BM25 + entity + temporal indexes. The LLM picks tools per turn. Not RAG-on-vector-soup.

Confidence-aware memory ranking

Facts carry decay curves. Retrieval reinforces them. Contradictions form polarity groups instead of silently overwriting.
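One way to picture a polarity group: facts that make competing claims about the same subject and attribute are kept side by side, ordered by confidence, instead of the newer one replacing the older. The sketch below assumes a hypothetical fact shape (`subject`, `predicate`, `value`, `confidence`) and a hypothetical `polarity_groups` function; the source does not specify how Sonzai represents or detects contradictions.

```python
from collections import defaultdict


def polarity_groups(facts: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group facts that disagree about the same (subject, predicate).

    Only keys with more than one distinct value are returned; within a group
    the most confident fact comes first.
    """
    by_key: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for fact in facts:
        by_key[(fact["subject"], fact["predicate"])].append(fact)
    return {
        key: sorted(group, key=lambda f: -f["confidence"])
        for key, group in by_key.items()
        if len({f["value"] for f in group}) > 1
    }
```

Keeping the losing fact around, at lower confidence, is what lets a later consolidation pass resolve the conflict with more evidence instead of trusting whichever statement arrived last.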

Adaptive consolidation cadence

Dormant users pay near-zero. Heavy users get more passes. Cost scales with engagement, not headcount.

Cross-tenant concept catalog

Cheap models inherit frontier-model quality via grounded retrieval. The largest economic lever in the stack.

Cite-and-verify pipeline

Every extracted fact is traceable to its source turn. Hallucinated facts are filtered before storage.
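A minimal version of the verify step is a check that each extracted fact cites a turn that exists and quotes text that actually appears in it. The sketch below does exactly that and nothing more. The fact fields `source_turn_id` and `quote`, and the function names, are assumptions for illustration; a production verifier would likely use something more tolerant than a substring match.

```python
def verify_fact(fact: dict, turns: dict[str, str]) -> bool:
    """Accept a fact only if its quoted evidence appears in the cited turn."""
    source = turns.get(fact.get("source_turn_id", ""))
    quote = fact.get("quote", "").strip()
    if not source or not quote:
        return False                      # no citation or no evidence: reject
    return quote.lower() in source.lower()


def filter_verified(facts: list[dict], turns: dict[str, str]) -> list[dict]:
    """Drop unsupported facts before they reach storage."""
    return [fact for fact in facts if verify_fact(fact, turns)]
```

Rejecting by default is the important design choice here: a fact with a missing or unmatched citation never gets stored, so an extraction error costs recall instead of polluting memory.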

Multi-provider failover by priority

Automatic cascade on quota exhaustion. Single point of integration, zero single point of failure.
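The cascade described here reduces to trying providers in priority order and moving on when one reports quota exhaustion. The sketch below shows that control flow with stand-in provider callables. `QuotaExhausted`, `call_with_failover`, and the provider tuples are hypothetical names; no real provider SDK is called.

```python
from typing import Callable


class QuotaExhausted(Exception):
    """Stand-in for a provider's quota or rate-limit error."""


def call_with_failover(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in priority order; return (name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except QuotaExhausted as exc:
            errors.append(f"{name}: {exc}")   # remember why, then fall through
    raise RuntimeError("all providers exhausted: " + "; ".join(errors))
```

Only `QuotaExhausted` triggers the cascade in this sketch; other exceptions propagate, on the assumption that a malformed request would fail identically on every provider.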

Per-user model weights, hot-loaded

Each user's agent becomes a different model over time. Shadow rollout, promotion gates, rollback all managed.

SOTOPIA-gated releases

6-dim behavioral scoring — Believability, Relationships, Knowledge, Social Rules, EQ, Goal Completion — on every release.
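A release gate over these six dimensions can be as simple as refusing any candidate that scores below the baseline on any one of them. The sketch below shows that rule. The dimension names are taken from the list above; the score dictionaries, the `regressions` and `gate_passes` functions, and the tolerance parameter are assumptions for illustration.

```python
DIMENSIONS = ("believability", "relationships", "knowledge",
              "social_rules", "eq", "goal_completion")


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> list[str]:
    """List every dimension where the candidate falls below baseline minus tolerance."""
    return [d for d in DIMENSIONS if candidate[d] < baseline[d] - tolerance]


def gate_passes(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.0) -> bool:
    """A release passes only if no single dimension regresses."""
    return not regressions(baseline, candidate, tolerance)
```

Gating per dimension instead of on an average means a large gain in one area cannot hide a loss in another.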

Workbench = production, accelerated

What you evaluate in minutes of simulated time is exactly what runs in production. Same code path.

The Mind Layer

Give any LLM a mind.

One SDK. Five integration patterns. The same Mind Layer underneath whether you adopt it as a memory sidecar, a session runtime, a hosted agent, or a plugin in Hermes or OpenClaw.