Technology · Architecture Reference
The Relationship Layer for consumer AI agents — how it actually works.
Sonzai is the persistent cognitive substrate that sits between any LLM and any application. Eight composable modules — memory, personality, mood, relationships, knowledge, learning, media, and an agent runtime — on managed infrastructure most teams would otherwise spend a year building. This page is the system, end to end.
A composable substrate between any LLM and any application.
Sonzai is a multi-tenant AI Relationship Layer: a managed substrate that sits between any LLM and any application. Stateless at the edge, stateful at the core, and provider-agnostic — drop it in front of Gemini, GPT, Grok, Claude, or any combination, and the agent retains memory, personality, mood, relationships, and learning across every session.
- Transactional + columnar stores
- Vector / entity / temporal index
- Message queue · DLQ · retries
- Distributed compute (extract/embed)
- Per-user model-weight store
- Background scheduler · key vault
- Cost ledger · observability · evals
Core principle
Memory is the floor of an agent's mind, not the whole of it. Sonzai treats memory, personality, mood, relationships, knowledge, learning, media, and orchestration as one composable substrate — because in practice they're entangled, and pretending otherwise leaks complexity into every application that tries to do it itself.
Eight modules. Pick one, pick all.
Each layer is independently consumable through the same SDK. The Memory Layer can stand alone. Personality, Mood, and Relationships compose on top. Learning Systems rewrite every layer below them. Agent Runtime caps the stack with provider-agnostic orchestration.
| Module | What it does | What it enables |
|---|---|---|
| Memory Layer | Cite-and-verify extraction of atomic facts, ranked by confidence, decayed on the Ebbinghaus curve, deduped by embedding similarity, consolidated nightly. | Recall that ages gracefully. |
| Personality | Big-5 trait tracking with evolution deltas, full history, and per-scenario overlays. | A self that drifts with experience. |
| Mood & Emotion | Live 4-D affective vector with theme detection per turn and contagion across multi-agent scenes. | Affective state, not robotic transcript. |
| Relationship | Directional love/trust scores per pair · shared-memory channels · 3 privacy tiers. | Multi-agent dynamics, not just calls. |
| Knowledge Base | Tenant- and project-scoped knowledge store with retrieval + write tools, gated by access control. | Grounded, ACL-aware retrieval. |
| Learning Systems | RL with shadow-model promotion · nightly self-learning · federated cross-agent concept catalog. | Agents that improve in production. |
| Agent Runtime | Provider-agnostic LLM orchestration with tool calling, SSE streaming, and priority failover. | One integration, zero SPOF. |
| Media Generation | Image, video, TTS, music, and SFX through best-of-breed providers. | Multi-modal output, one client. |
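The Memory Layer row mentions deduplication by embedding similarity. As a rough illustration of that idea only, the sketch below keeps the highest-confidence fact from each cluster of near-identical embeddings. The `Fact` type, the `dedupe` helper, the toy three-dimensional vectors, and the 0.92 threshold are all assumptions made for the example, not Sonzai's API or tuning.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    embedding: list[float]
    confidence: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(facts: list[Fact], threshold: float = 0.92) -> list[Fact]:
    """Keep the highest-confidence fact from each cluster of near-duplicates."""
    kept: list[Fact] = []
    # Visit high-confidence facts first so they win over weaker duplicates.
    for fact in sorted(facts, key=lambda f: f.confidence, reverse=True):
        if all(cosine(fact.embedding, k.embedding) < threshold for k in kept):
            kept.append(fact)
    return kept


# Toy vectors stand in for real embeddings.
facts = [
    Fact("Lives in Lisbon", [1.0, 0.0, 0.1], 0.9),
    Fact("Based in Lisbon", [0.98, 0.02, 0.12], 0.6),
    Fact("Allergic to peanuts", [0.0, 1.0, 0.0], 0.8),
]
unique = dedupe(facts)
```

The pairwise comparison is quadratic, which is fine for a sketch; a production system would more likely lean on the vector index for the neighbour lookup.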
A production agent stack is a dozen distributed systems wired together. Sonzai operates all of them.
The Relationship Layer is the part visible through the API. Underneath sits the infrastructure that makes a multi-tenant, evolving, learning agent platform actually work in production: transactional and columnar stores for state, a hybrid index for retrieval, a cache for hot context, a queue with DLQ and retries for async work, a worker pool for extraction and embedding, a versioned per-user weight store for RL policies, a scheduler for cadence-driven jobs, a key vault for credentials, a cost ledger for guardrails, observability for traces and metrics, and an eval gate for releases.
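To make "a queue with DLQ and retries" concrete, here is a minimal in-memory sketch of the pattern: a failed job is re-queued until it runs out of attempts, then parked in a dead-letter list for inspection. The job shape, the `drain` helper, and the limit of three attempts are assumptions for illustration; nothing here describes Sonzai's actual queue.

```python
from collections import deque
from typing import Callable


def drain(queue: deque, handler: Callable[[str], None], max_attempts: int = 3) -> list[dict]:
    """Process jobs until the queue is empty; return the dead-letter list."""
    dead_letters: list[dict] = []
    while queue:
        job = queue.popleft()
        try:
            handler(job["payload"])
        except Exception as exc:
            job["attempts"] += 1
            job["last_error"] = repr(exc)
            if job["attempts"] >= max_attempts:
                dead_letters.append(job)  # park it for inspection
            else:
                queue.append(job)         # retry after the jobs already waiting
    return dead_letters


processed: list[str] = []


def extract(payload: str) -> None:
    """Hypothetical worker step that fails on one bad input."""
    if payload == "malformed":
        raise ValueError("cannot parse turn")
    processed.append(payload)


jobs = deque({"payload": p, "attempts": 0} for p in ("turn-1", "malformed", "turn-2"))
dead = drain(jobs, extract)
```

A real queue would add backoff between attempts and persist the dead-letter entries; the sketch only shows the control flow.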
If you DIY vs. with Sonzai
Order of magnitude
The left column is the work a platform team typically takes 12+ months to stand up and another 12 to harden. Every row of that work is already running under the API.
Plain RAG embeds, top-k's, and hopes. Sonzai's retrieval is agentic.
The model reasons about what it needs to know, chooses which memory tools to call, inspects the results, and iterates until it has enough context. This is the ReAct loop applied to memory — not just to web search.
- REASON: The LLM judges the user turn and decides which memory tools it needs.
- ACT: It calls one or more tools (by entity, by time, by relationship, by mood).
- OBSERVE: The hybrid index returns ranked, deduped, confidence-scored hits.
- REFINE: If under-confident, the LLM picks another tool and iterates.
- RESPOND: Grounded reply. Every claim traceable to a source memory.
- recall(query, top_k, filters)
- recall_shared_memories(with_id)
- recall_by_entity(entity_id)
- recall_by_time(start, end)
- check_emotional_alignment(topic)
- check_relationship_state(user_id)
- recall_personality_drift(window)
- search_knowledge(query, project)
- remember_fact(text, refs, confidence)
- // + project-defined custom tools
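The reason / act / observe / refine loop over these tools can be sketched in a few lines. This is a simplified illustration under stated assumptions: `agentic_recall` and `first_untried` are hypothetical names, the planner is a fixed-order stand-in where a real system would ask the LLM, the tools are stubs that take only a query string, and the 0.7 confidence bar is arbitrary.

```python
from typing import Callable, Optional

Hit = dict  # {"text": str, "confidence": float}
Tool = Callable[[str], list[Hit]]
Chooser = Callable[[str, list[Hit], list[str]], Optional[str]]


def agentic_recall(
    query: str,
    tools: dict[str, Tool],
    choose_tool: Chooser,
    min_confidence: float = 0.7,
    max_steps: int = 4,
) -> tuple[list[Hit], list[str]]:
    """Call memory tools until one hit is confident enough or the budget runs out."""
    hits: list[Hit] = []
    tried: list[str] = []
    for _ in range(max_steps):
        name = choose_tool(query, hits, tried)   # REASON: pick the next tool
        if name is None:
            break
        tried.append(name)
        hits.extend(tools[name](query))          # ACT + OBSERVE: run it, collect hits
        if any(h["confidence"] >= min_confidence for h in hits):
            break                                # confident enough, stop refining
    hits.sort(key=lambda h: h["confidence"], reverse=True)
    return hits, tried


# Stand-in planner: a real one would be an LLM call.
def first_untried(query: str, hits: list[Hit], tried: list[str]) -> Optional[str]:
    for name in ("recall", "recall_by_entity"):
        if name not in tried:
            return name
    return None


# Stub tools: real ones would query the hybrid index.
tools: dict[str, Tool] = {
    "recall": lambda q: [{"text": "mentioned a trip", "confidence": 0.4}],
    "recall_by_entity": lambda q: [{"text": "trip to Kyoto in May", "confidence": 0.9}],
}
hits, tried = agentic_recall("where did they travel?", tools, first_untried)
```

In this run the first tool comes back under-confident, so the loop takes a second step before stopping, which is the behavior plain top-k retrieval cannot express.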
Vector RAG vs. agentic retrieval
The mental model
“Sonzai treats memory the way a reasoning agent treats the world — as something to interrogate, not something to flush into the prompt.”
Five patterns. Same Relationship Layer. Pick the shape that fits your stack.
Each pattern is independently usable. Adopt one and you can graduate to another without re-platforming — the surface area changes, the substrate doesn't. Start with Process for the lightest touch, move to Sessions when you want lifecycle, or go straight to Agent Chat for the full hosted runtime. Hermes and OpenClaw are config-flips if you're already on those agent frameworks.
Process Endpoint
Memory layered onto an existing chat stack — one POST per turn
You already run your own LLM and chat — you want memory, personality, and learning layered on top without replacing what you have.
You own: the LLM call · the response stream · the UI.
Sonzai owns: fact extraction · memory persistence · mood / personality / relationship deltas.

    import os

    from sonzai import Client

    sz = Client(api_key=os.environ["SONZAI_API_KEY"])


    async def handle_turn(user_id, agent_id, messages):
        # Sonzai extracts facts, persists memory, and applies deltas, then returns audit info.
        deltas = await sz.agents.process(
            agent_id=agent_id, user_id=user_id, messages=messages,
        )
        return deltas  # { memories_created, facts_extracted, mood, personality, relationship }

Real-Time Sessions
Your chat UI, our memory lifecycle — explicit start / per-turn / end
You want explicit per-conversation lifecycle — a clean start, per-turn enrichment and extraction, end-of-session consolidation.
You own: the LLM call · the message stream.
Sonzai owns: context retrieval · per-turn extraction · async consolidation on close.

    const s = await client.agents.sessions.start({ agent, user, sessionId });
    for (const message of stream) {
      const ctx = await s.context({ query: message }); // 7-layer enriched system block
      const reply = await yourLLM([ctx.systemBlock, message]);
      await s.turn({ messages: [message, reply] }); // async extract + learn
    }
    await s.end({ messages }); // triggers consolidation

Agent Chat Endpoint
Full hosted runtime — SSE deltas, tools, multi-provider failover
Greenfield apps that want a complete agent in one call — streaming, tool calling, side-effect events for memory mutations.
You own: the UI only.
Sonzai owns: context assembly · LLM orchestration · tool dispatch · memory persistence · provider fallback.

    async for evt in client.agents.chat(
        agent=agent, messages=[...], stream=True, tools=[...]
    ):
        if evt.type == "delta":
            render(evt.text)
        elif evt.type == "tool_call":
            handle_tool(evt)
        elif evt.type == "complete":
            show_usage(evt.usage)

Hermes Plugin
Drop-in for Nous Research's Hermes Agent — two lines of YAML
You already run Hermes Agent and want the Relationship Layer added with two lines of YAML and zero handler changes.
You own: the Hermes config.
Sonzai owns: memory recall on prefetch · fact extraction after each turn · intelligent context compression on overflow.

    # Two plugins, cooperating:
    # Memory Provider runs every turn; Context Engine fires only on token-budget hit.
    plugins:
      memory: sonzai
      context: sonzai
    sonzai:
      api_key: ${SONZAI_API_KEY}

OpenClaw Plugin
Drop-in for OpenClaw agents — config-flip, zero code
You run OpenClaw and want server-backed enrichment instead of the default local Markdown memory.
You own: the OpenClaw config.
Sonzai owns: the full Context Engine lifecycle (bootstrap, assemble, afterTurn, compact, dispose).

    {
      "contextEngine": "sonzai",
      "sonzai": {
        "apiKey": "<your-key>",
        "audit": true // composio_app + request_id captured
      }
    }

Choosing a flow
Match your existing stack to the flow that touches it least. All five share the same backend, the same primitives, and the same per-user state — moving between them is a code change, not a re-platforming.
| If your stack looks like… | Use | Why |
|---|---|---|
| Your own LLM and chat — just need memory. | 01 · Process | Lowest surface area. One call per turn. |
| Your own chat, but you want session lifecycle. | 02 · Sessions | Explicit start / turn / end for clean conversation boundaries. |
| Hosted agent — streaming, tools, the whole runtime. | 03 · Agent Chat | Full runtime in one SSE call. |
| Hermes Agent. | 04 · Hermes | Two YAML lines, zero code. |
| OpenClaw. | 05 · OpenClaw | Config-flip, zero code. |
Design choice
All five flows share the same Relationship Layer underneath. Moving between them is a code-level change, not a re-platforming — per-user state, learned weights, and accumulated memory all carry across.
The agent on day 90 is not the agent on day 1. It has learned this user specifically.
Most platforms ship a single model that serves every user the same. Sonzai stores per-user reinforcement-learning policy weights and personality overlays, hot-loaded into the inference path. The substrate to do this safely — shadow rollouts, promotion gates, versioning, rollback — is the kind of thing teams spend a year building. Sonzai ships it.
- v18 · shadow
- v17 · live
- v16 · prev
- v15 · prev
- Shadow model scored vs current live on real traffic.
- Confidence ranking with turn-by-turn deltas tracked.
- Promotion at 1% → 10% → 50% → 100%.
- Believability
- Relationships
- Knowledge
- Social Rules
- EQ
- Goal Completion
What this changes
Most platforms ship one model that serves every user identically. With per-user policies, the effective model becomes a different one per user over time — safely, with shadow rollout, eval-gated promotion, and sub-second rollback. Personalisation at the weight level, not just the prompt.
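One common way to implement a 1% → 10% → 50% → 100% promotion is stable hash bucketing plus a score gate. The sketch below assumes the percentages refer to a share of traffic keyed by some identifier, and that the gate is a simple mean-score comparison; `bucket`, `serves_candidate`, `next_stage`, and the salt value are hypothetical names, and the real promotion criteria are not documented here.

```python
import hashlib
from statistics import mean

STAGES = [0.01, 0.10, 0.50, 1.00]


def bucket(key: str, salt: str = "policy-v18") -> float:
    """Map a key to a stable value in [0, 1) using 53 bits of a SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def serves_candidate(key: str, stage: float) -> bool:
    """True if this key falls inside the candidate's current traffic share."""
    return bucket(key) < stage


def next_stage(current: float, shadow_scores: list[float], live_scores: list[float]) -> float:
    """Advance one stage if the candidate scores at least as well as live, else roll back."""
    if mean(shadow_scores) < mean(live_scores):
        return 0.0  # rollback: nothing is served by the candidate
    later = [s for s in STAGES if s > current]
    return later[0] if later else current
```

Because each key's bucket is fixed, widening the stage only ever adds traffic: anything served by the candidate at 1% is still served by it at 10%, which keeps the experience stable during a rollout.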
One request in. One response out. Eleven things in between.
The full lifecycle of a single user turn in Managed Runtime mode. Steps 1–6 are synchronous (in the request path). Steps 7–11 are asynchronous (queued, eventually consistent).
| Step | Sync | What happens |
|---|---|---|
| 1 · Auth & route | ✓ | Tenant + user resolved. Rate limiter checked. Provider keys fetched from the key vault. |
| 2 · Load per-user weights | ✓ | RL policy + personality overlay hot-loaded from weight store (§06). |
| 3 · Agentic retrieval | ✓ | ReAct loop — LLM picks memory tools, queries hybrid index, refines (§04). |
| 4 · Context assembly | ✓ | Memory + mood + relationship + personality + knowledge composed into prompt. |
| 5 · LLM call with failover | ✓ | Multi-provider router; priority list; cascade on quota exhaustion. |
| 6 · Stream response + tool calls | ✓ | SSE to your app. Tool calls intercepted, audited, returned. |
| 7 · Cite-and-verify extract | — | New atomic facts extracted, verified against turn source, scored, stored. |
| 8 · Mood + personality drift | — | Affective vector updated. Big-5 deltas applied. |
| 9 · Relationship update | — | Bond scores adjusted. Shared-memory channels checked. |
| 10 · Reinforcement learning | — | RL signal recorded. Shadow model scored. Promotion considered. |
| 11 · Consolidation queue | — | Turn queued for nightly consolidation, decay sweeps, polarity-group formation. |
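The split between the synchronous and asynchronous halves of the table can be pictured with a small `asyncio` sketch: the request path returns the reply as soon as it has queued the turn, and a background worker runs the remaining steps afterwards. Everything here is a stand-in; `handle_turn`, `worker`, the echo reply, and the step names are illustrative assumptions, not the platform's internals.

```python
import asyncio


async def handle_turn(user_id: str, message: str, background: asyncio.Queue) -> str:
    """Request path: do the synchronous steps, enqueue the rest, return."""
    reply = f"echo: {message}"  # stands in for steps 1-6 (auth through streaming)
    await background.put({"user_id": user_id, "turn": (message, reply)})
    return reply


async def worker(background: asyncio.Queue, log: list[str]) -> None:
    """Background path: run the queued steps after the reply has gone out."""
    while True:
        job = await background.get()
        for step in ("extract", "mood", "relationship", "rl", "consolidate"):
            log.append(f"{step}:{job['user_id']}")  # stands in for steps 7-11
        background.task_done()


async def main() -> tuple[str, list[str]]:
    background: asyncio.Queue = asyncio.Queue()
    log: list[str] = []
    task = asyncio.create_task(worker(background, log))
    reply = await handle_turn("u1", "hi", background)
    await background.join()  # only the demo waits; a live request would not
    task.cancel()
    return reply, log


reply, log = asyncio.run(main())
```

The `join()` call exists only so the example can inspect the log; in the lifecycle described above the caller already has its response by the time the queued steps run.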
Same primitives. Six surfaces. Pick what fits your stack.
| Surface | For | Shape |
|---|---|---|
| Python SDK | Backend services · batch jobs · eval pipelines | client.agents.chat(...) — sync & async |
| TypeScript SDK | Node · Bun · Deno · edge | Zero-dependency, isomorphic. Same surface area. |
| Go SDK | High-throughput infrastructure | Native client for Go runtimes. |
| MCP Server | Any MCP-compatible host | Memory, knowledge, and tool primitives as MCP servers. |
| Framework Plugins | Hermes · OpenClaw · similar | Drop-in plugin auto-injects <sonzai-context>. No code change. |
| REST API | Anything else | OpenAPI-spec'd, language-agnostic. |
TypeScript SDK:

    import { Sonzai } from "@sonzai-labs/agents";

    const sz = new Sonzai({ apiKey: process.env.SONZAI_API_KEY });

    // Wrapped in an async generator (illustrative name) so that `yield` is valid.
    async function* reply(userId: string, message: string) {
      const stream = await sz.agents.stream({
        userId,
        message,
        scene: "front_of_house",
        providers: ["claude-3.5", "gpt-4o"],
        tools: ["composio.gmail", "kb.search"],
      });
      for await (const chunk of stream) yield chunk.text;
    }

Go SDK:

    import sonzai "github.com/sonz-ai/sonzai-go"

    // Fragment: ctx, uid, msg, and transcript come from the caller; errors are ignored for brevity.
    sz, _ := sonzai.New(sonzai.WithAPIKey(os.Getenv("SONZAI_API_KEY")))
    facts, _ := sz.Memory.Recall(ctx, &sonzai.RecallReq{UserID: uid, Query: msg})
    // ... your LLM call, with facts injected ...
    sz.Memory.ExtractAsync(ctx, uid, transcript)

Deployment modes: adopt what you need
| Mode | Sonzai owns | You own |
|---|---|---|
| Standalone Memory | Memory · Personality · Mood (via 2 calls / turn) | LLM call · orchestration · UX |
| Drop-In Runtime | The full request loop · all 8 modules · failover · tools | UX · auth · business logic |
| Edge / Local | On-device semantic memory · privacy-sensitive flows | Everything else |
| Research / Benchmark | Eval harness · SOTOPIA scoring | Your candidate memory backend |
| Bring-Your-Own-Key | Routing · failover · all behavioral systems | Provider keys · provider billing |
None of this is one feature. It's nine choices that compound.
Each item below is a deliberate design choice in the substrate. None of them is novel in isolation — retrieval, evals, RL, fallback all exist elsewhere. The substrate is what's hard: making them work together, per-tenant, under production load, with rollback. Read these as the nine commitments the platform has already made, so applications built on top don't have to.
Agentic, multi-signal retrieval
ReAct loop over hybrid vector + BM25 + entity + temporal indexes. The LLM picks tools per turn. Not RAG-on-vector-soup.
Confidence-aware memory ranking
Facts carry decay curves. Retrieval reinforces them. Contradictions form polarity groups instead of silently overwriting.
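A small sketch of what "decay curves" and "retrieval reinforces them" could look like, using the textbook Ebbinghaus form R = exp(-t / S). The `MemoryTrace` class, the multiplicative 1.5 stability boost, and ranking by confidence times retention are assumptions chosen for the example; the platform's actual scoring is not specified in this page.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryTrace:
    text: str
    confidence: float      # extraction confidence in [0, 1]
    stability_days: float  # S in R = exp(-t / S)
    age_days: float = 0.0  # time since the last reinforcement

    def retention(self) -> float:
        """Ebbinghaus-style retention: 1.0 when fresh, falling toward 0 with age."""
        return math.exp(-self.age_days / self.stability_days)

    def score(self) -> float:
        """Ranking score: confidence discounted by how much has been forgotten."""
        return self.confidence * self.retention()

    def reinforce(self, boost: float = 1.5) -> None:
        """A successful retrieval resets the clock and slows future decay."""
        self.stability_days *= boost
        self.age_days = 0.0


trace = MemoryTrace("prefers window seats", confidence=0.8, stability_days=10.0, age_days=10.0)
before = trace.score()  # 0.8 * exp(-1)
trace.reinforce()
after = trace.score()   # back to 0.8, and it now decays more slowly
```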
Adaptive consolidation cadence
Dormant users pay near-zero. Heavy users get more passes. Cost scales with engagement, not headcount.
Cross-tenant concept catalog
Cheap models inherit frontier-model quality via grounded retrieval. The largest economic lever in the stack.
Cite-and-verify pipeline
Every extracted fact is traceable to its source turn. Hallucinated facts are filtered before storage.
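In its simplest form, a cite-and-verify check asks the extractor to attach a supporting quote and a turn id to every fact, then drops any fact whose quote cannot be found in that turn. The sketch below shows only that filter; the fact shape (`claim`, `quote`, `turn_id`) and the whitespace- and case-insensitive substring match are assumptions, and a production verifier would likely be more tolerant of paraphrase.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify(facts: list[dict], turns: dict[str, str]) -> list[dict]:
    """Keep only facts whose cited quote appears in the cited turn."""
    verified = []
    for fact in facts:
        source = turns.get(fact["turn_id"], "")
        if fact["quote"] and normalize(fact["quote"]) in normalize(source):
            verified.append(fact)
    return verified


turns = {"t1": "I moved to Lisbon last spring and I love it."}
facts = [
    {"claim": "User lives in Lisbon", "quote": "moved to  Lisbon", "turn_id": "t1"},
    {"claim": "User owns a dog", "quote": "my dog", "turn_id": "t1"},       # no support
    {"claim": "User likes jazz", "quote": "love jazz", "turn_id": "t9"},    # unknown turn
]
kept = verify(facts, turns)
```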
Multi-provider failover by priority
Automatic cascade on quota exhaustion. Single point of integration, zero single point of failure.
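Priority failover reduces to trying providers in order and moving on only when one reports quota exhaustion. A minimal sketch follows, with stub providers in place of real SDK calls; `QuotaExhausted`, `call_with_failover`, and the provider names are hypothetical.

```python
from typing import Callable


class QuotaExhausted(Exception):
    """Raised by a provider stub when its quota or rate limit is hit."""


def call_with_failover(
    providers: list[tuple[str, Callable[[str], str]]], prompt: str
) -> tuple[str, str]:
    """Try providers in priority order; cascade only on quota exhaustion."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except QuotaExhausted as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers exhausted: " + "; ".join(errors))


def exhausted(prompt: str) -> str:
    raise QuotaExhausted("429")


def healthy(prompt: str) -> str:
    return f"reply to: {prompt}"


used, text = call_with_failover([("primary", exhausted), ("secondary", healthy)], "hello")
```

Other exceptions are deliberately left to propagate: cascading on every error would hide genuine bugs behind a working fallback.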
Per-user model weights, hot-loaded
Each user's agent becomes a different model over time. Shadow rollout, promotion gates, rollback all managed.
SOTOPIA-gated releases
6-dim behavioral scoring — Believability, Relationships, Knowledge, Social Rules, EQ, Goal Completion — on every release.
Workbench = production, accelerated
What you evaluate in minutes of simulated time is exactly what runs in production. Same code path.
Built so the on-call rotation doesn't become your problem.
The questions that actually decide whether a stack ships to production aren't about features. They're about what happens at p99, what happens when a provider 429s, what happens when finance asks where the bill came from, and what happens when legal asks where the data sits. Sonzai answers each one before you have to.
Security & isolation by construction
Isolation lives in the storage layer, not the API. BYOK belongs in the platform, not the wrapper. Audit lives in the substrate, not the integration. The boundaries an architect cares about are drawn before the request reaches the model.
Per-tenant isolation
Every tenant gets its own data namespace across the transactional store, columnar memory, vector index, and key vault. Queries can never cross the tenant boundary — enforced at the storage layer, not just at the API.
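One way to read "enforced at the storage layer" is that the tenant is bound when the store handle is created, so application code has no parameter through which it could name another tenant's keys. The toy sketch below illustrates that shape over a plain dictionary; `TenantStore` and the key-prefix scheme are assumptions, and real isolation would rest on the database's own namespaces and access controls.

```python
from typing import Optional


class TenantStore:
    """A view over a shared backend that can only see one tenant's keys."""

    def __init__(self, backend: dict, tenant_id: str):
        # Reject ids containing the separator so "acme" can never alias "acme:other".
        if not tenant_id or ":" in tenant_id:
            raise ValueError("invalid tenant id")
        self._backend = backend
        self._prefix = f"{tenant_id}:"

    def put(self, key: str, value: str) -> None:
        self._backend[self._prefix + key] = value

    def get(self, key: str) -> Optional[str]:
        return self._backend.get(self._prefix + key)

    def keys(self) -> list[str]:
        n = len(self._prefix)
        return [k[n:] for k in self._backend if k.startswith(self._prefix)]


backend: dict = {}
acme = TenantStore(backend, "acme")
globex = TenantStore(backend, "globex")
acme.put("user:1/mood", "calm")
```

Here `globex.get("user:1/mood")` returns `None` even though both handles share one backend, because the lookup key is always built from the handle's own tenant prefix.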
BYOK & envelope encryption
Bring your own provider keys. They land in a multi-tenant key vault, envelope-encrypted at rest, with per-project isolation. We never persist plaintext provider credentials. Sonzai's 25% service fee still applies on routed work units; you keep the provider relationship.
Per-user memory isolation
User A's conversations, mood history, and relationship state are completely walled off from User B's — even when both talk to the same shared persona. Memory namespaces are per-user under the persona; the persona is shared, the state is not.
Cite-and-verify audit trail
Every extracted fact is traceable to the source turn that produced it. Every tool call has a request_id. Every promotion goes through an evaluated gate. Audit trail is built into the substrate — you don't bolt one on.
Cost guardrails
Per-user, per-day, per-month caps stop runaway spend before it happens. Soft alerts fire before hard caps trigger. Cost ledger surfaces every line of consumption — by tenant, by user, by agent, by provider.
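The soft-alert-then-hard-cap behavior can be sketched as a single check per charge. This is an illustration under assumptions: the `Budget` class, the 80% soft threshold, and the three return values are invented for the example, and a real ledger would track caps per user, per day, and per month separately.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    hard_cap: float
    soft_ratio: float = 0.8  # alert once this share of the cap is used
    spent: float = 0.0

    def charge(self, amount: float) -> str:
        """Record spend and return "ok", "alert", or "blocked"."""
        if self.spent + amount > self.hard_cap:
            return "blocked"  # refuse before the cap is exceeded; nothing is recorded
        self.spent += amount
        if self.spent >= self.soft_ratio * self.hard_cap:
            return "alert"
        return "ok"


daily = Budget(hard_cap=10.0)
results = [daily.charge(5.0), daily.charge(3.5), daily.charge(2.0)]
```

The third charge is refused rather than partially applied, so recorded spend never passes the cap.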
Data residency & exit
Your data is yours. Single-region pinning available; full export via API at any time; deletion enforced down to the columnar store and weight overlays. No proprietary file format — exports are documented JSON.
The architect's promise
We've already drawn the lines the security review is going to ask about. The boundary between tenants is in the storage, not the API. The boundary between you and the model provider is in BYOK, not in our trust assertions. The audit trail is the substrate, not a feature we'll add later.
The fastest way to die in this space is to lock customers in. So we didn't.
Every place a vendor could insert a wedge — the model, the provider keys, the SDK, the data format, the protocol — we left open. Sonzai is worth using because it's better, not because you can't leave.
Bring your own LLM (BYOLLM)
Route across Gemini, GPT, Claude, Grok — or any OpenAI-compatible endpoint. Failover is configurable; we don't pick winners.
Bring your own key (BYOK)
Pay the provider directly at your negotiated rates. Use your volume discounts and rate limits. Audit our routing line-by-line against the provider invoice.
Bring your own model (BYOM)
Run self-hosted Llama, Qwen, DeepSeek, or your own fine-tunes via vLLM, Ollama, or TGI. Sonzai handles orchestration; you keep the weights inside your perimeter.
Open protocol surfaces
REST + OpenAPI. MCP servers for memory, knowledge, and tool primitives. Framework plugins for Hermes and OpenClaw. No SDK lock-in — every primitive is reachable over the wire.
Exit on your terms
Full data export at any time: memories, knowledge, user weight overlays, evaluation history. Documented JSON, not a black box. Migrate in, migrate out.
Composable adoption
Start with Standalone Memory, graduate to the full Runtime, drop in plugins where they fit. Adoption is not all-or-nothing — and the per-user state carries across modes.
Why this matters to your CIO
The procurement question isn't “will it work?”, it's “what happens in year three when we want to switch?” Sonzai is designed so the answer is short: you export your data, point your traffic somewhere else, and the switching cost is the integration time — not a hostage negotiation over your own state.
The nine questions your architect is going to ask.
We've been through enough security reviews and vendor assessments to know the questions before they're asked. Here's the short answer to each — read this as the section you forward to your platform-ops counterpart.
Build vs. buy, plainly
Standing this up internally is twelve to twenty-four months of platform work — multi-tenant isolation, per-user weight stores, consolidation cadence, eval gates, BYOK plumbing, cost ledgers, DLQ wiring — before you've shipped a single agent feature. The product team is going to ask you for a date. Sonzai turns that date into next quarter.
The Relationship Layer
Give any LLM a mind.
One SDK. Five integration patterns. The same Relationship Layer underneath whether you adopt it as a memory sidecar, a session runtime, a hosted agent, or a plugin in Hermes or OpenClaw.