Pattern 4: Standalone Memory (Real-Time)
You own the LLM and the chat loop. Sonzai owns memory, mood, personality, and relationships. Per-turn — sessions.start → loop of session.context() + your LLM + session.turn() → sessions.end.
You keep your existing chat loop. Before each LLM call, you ask Sonzai
for the enriched context for the user's message; after the LLM replies,
you submit just that exchange via session.turn(). Mood lands inline
(~300–500 ms). Deeper extraction — facts, personality drift, habit
detection, goal updates — runs asynchronously 5–15 seconds later in the
background. Sonzai never sees your tool execution and never picks
your model.
This is the right shape for chat companions, voice agents, agent frameworks (OpenAI Agents SDK, LangChain, LiveKit), and anywhere you already had a working LLM loop in production before adopting Sonzai.
When to use this
- You already have a production LLM loop with custom tools, evals, prompt templates, or a specific provider.
- You need fresh per-turn context, not a once-a-conversation pull.
- You want mood, facts, personality, habits, goals, and relationship signal — without ceding LLM choice or tool execution.
When to switch
- You can't afford to wait on .turn() after every exchange — switch to Pattern 5: Standalone Batch.
- Sonzai owning the LLM call is fine — switch to Pattern 1: Managed Runtime and delete most of this code.
Architecture
┌─────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Your App │ │ Sonzai API │ │ Your LLM │
└──────┬──────┘ └────────┬─────────┘ └──────┬───────┘
│ │ │
│ sessions.start │ │
│────────────────────>│ (prewarms memory) │
│ <── Session ───────│ │
│ │ │
│ ─── Per turn ──────────────────────────── │
│ │ │
│ session.context() │ │
│────────────────────>│ │
│ <── enriched ctx ──│ │
│ personality, mood│ │
│ memories, goals │ │
│ │ │
│ Your LLM loop ─────┼──────────────────────>│
│ + your tools │ │
│ + your multimodal │ │
│ <── reply ─────────┼───────────────────────│
│ │ │
│ sendToUser(reply) (no waiting on Sonzai) │
│ │ │
│ session.turn() │ │
│────────────────────>│ ⇒ sync mood ~300ms │
│ <── mood, status ──│ ⇒ background extract │
│ │ (5–15s) │
│ │ │
│ ─── Repeat ────────────────────────────── │
│ │ │
│ session.end() │ │
│────────────────────>│── consolidate │
│ │ long-term memory │
└─────────────────────┴───────────────────────┘
End-to-end snippet
The minimum viable loop with a real harness. The OpenAI Agents SDK owns
conversation state, model selection, and tool dispatch. Sonzai sits
outside that loop: it supplies the system prompt via
session.context() before the run, and ingests the finished exchange
via session.turn() after. No OPENAI_API_KEY needed — the Agents SDK
is pointed at Gemini's OpenAI-compat endpoint.
import os, uuid
from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel, function_tool, set_tracing_disabled
from sonzai import Sonzai

set_tracing_disabled(True)  # Agents SDK tries to ship traces to OpenAI; we don't have a key.

# Your LLM harness — owns history, tool dispatch, multi-step reasoning.
gemini = AsyncOpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.environ["GEMINI_API_KEY"],
)
model = OpenAIChatCompletionsModel(model="gemini-3.1-flash-lite-preview", openai_client=gemini)

@function_tool
def get_current_time() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat(timespec="seconds")

# Sonzai = memory layer. Never sees the LLM client.
sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])

def run_conversation(agent_id: str, user_id: str):
    session = sonzai.agents.sessions.start(
        agent_id,
        user_id=user_id,
        session_id=f"session-{uuid.uuid4().hex[:8]}",
        provider="gemini",  # default for the deferred-extraction LLM
        model="gemini-3.1-flash-lite-preview",
    )

    def turn(user_message: str) -> str:
        # 1. Fresh, query-relevant context BEFORE the LLM call.
        ctx = session.context(query=user_message)

        # 2. Your harness runs the LLM + your tools. Sonzai is OUT of the loop.
        agent = Agent(
            name="Companion",
            instructions=build_system_prompt(ctx),
            model=model,
            tools=[get_current_time],
        )
        result = Runner.run_sync(agent, user_message)
        send_to_user(result.final_output)

        # 3. Convert the run's items (assistant text + ToolCallItem +
        #    ToolCallOutputItem) into Sonzai's tool-aware shape so
        #    extraction can pick up facts from tool outputs too.
        sonzai_messages = run_result_to_sonzai_messages(user_message, result)

        # 4. Submit. Sync mood ~300ms; deferred extraction 5–15s later.
        session.turn(messages=sonzai_messages)
        return result.final_output

    return turn, session.end

# /context returns a flat dict — read what you need, drop the rest.
def build_system_prompt(ctx: dict) -> str:
    facts = "\n".join(f"- {f.get('atomic_text', '')}" for f in (ctx.get("loaded_facts") or []))
    parts = [
        ctx.get("personality_prompt", "You are a helpful AI companion."),
        f"Personality (Big5): {ctx.get('big5', {})}",
        f"Current mood: {ctx.get('current_mood', {})}",
    ]
    if facts:
        parts.append(f"Relevant memories:\n{facts}")
    return "\n\n".join(parts)

The load-bearing habit
Always call session.context(query=user_msg) before the LLM call —
every turn. That's the closing-the-loop step. Skipping it means the
LLM works from stale state and the value of a memory layer collapses.
Save a roundtrip with fetchNextContext
session.turn() accepts fetchNextContext: { query: nextMessage }
(Python: fetch_next_context={"query": ...}). When set, the response
carries the next /context payload under next_context, so the
client already has turn N+1's context by the time turn N finishes.
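A minimal sketch in Python, assuming the next user message is already known when you submit the current turn (a queued voice utterance, for instance); the exact access style on the turn response depends on the SDK's response object:

turn_result = session.turn(
    messages=exchange_messages,  # the exchange you just finished
    fetch_next_context={"query": next_message},  # the message you already have queued
)
next_ctx = turn_result["next_context"]  # same shape as session.context(); saves the extra roundtrip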
Tool calls flow through to extraction
Sonzai's /turn accepts OpenAI/Anthropic-style tool messages: tool_calls
on assistant messages and role: "tool" results. Forward the full
exchange and the extractor can capture facts that only surfaced inside a
tool output (e.g. "user's last order shipped from Tokyo" from an
order-lookup tool).
session.turn(messages=[
    {"role": "user", "content": "Where did my last order ship from?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "order-lookup", "arguments": "{}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"order_id":"42","origin":"Tokyo","carrier":"DHL"}'},
    {"role": "assistant", "content": "Your last order shipped from Tokyo via DHL."},
])

Sonzai never executes a tool — that's your harness's job. It just reads
the messages you submit. If you're on the OpenAI Agents SDK, see the
demo's run_result_to_sonzai_messages helper — it converts a Runner
result's MessageOutputItem / ToolCallItem / ToolCallOutputItem items
into this shape.
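If you want a starting point before reading the demo, here is a minimal sketch of what such a conversion can look like. Treat the demo helper as the reference: the item classes and attribute names below come from the OpenAI Agents SDK and may shift between releases.

from agents.items import ItemHelpers, MessageOutputItem, ToolCallItem, ToolCallOutputItem

def run_result_to_sonzai_messages(user_message: str, result) -> list[dict]:
    # Sketch only — mirrors the demo helper's intent, not its exact code.
    messages = [{"role": "user", "content": user_message}]
    for item in result.new_items:
        if isinstance(item, ToolCallItem):
            call = item.raw_item  # function tool call: .call_id, .name, .arguments
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": call.call_id, "type": "function",
                    "function": {"name": call.name, "arguments": call.arguments},
                }],
            })
        elif isinstance(item, ToolCallOutputItem):
            messages.append({
                "role": "tool",
                "tool_call_id": item.raw_item["call_id"],
                "content": str(item.output),
            })
        elif isinstance(item, MessageOutputItem):
            messages.append({"role": "assistant", "content": ItemHelpers.text_message_output(item)})
    return messages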
Multimodal: your harness sees pixels, Sonzai sees text
/turn accepts text content only. This is intentional, not a
limitation. Memory is a layer of semantic understanding — the
question Sonzai needs to answer later is "what does this agent know
about this user?", not "what bytes did the LLM see?". Your vision-capable
LLM has already understood the image; pass that understanding to Sonzai
as text, and the memory pipeline can extract facts, habits, and
inventory items from it like any other turn.
The recommended pattern: have your same multimodal LLM produce a short
factual description alongside its warm reply, and embed that
description in the user message you submit to session.turn().
# Your harness: Gemini sees the actual image bytes via the image_url content part.
result = await gemini.chat.completions.create(
    model="gemini-3.1-flash-lite-preview",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_IMAGE_AWARE},  # see below
        {"role": "user", "content": [
            {"type": "text", "text": user_msg},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ],
)
raw = result.choices[0].message.content

# Dual-output: split the reply (shown to user) from the [MEMORY: ...] note.
memory_note, reply = split_memory_note(raw)  # your tiny parser
send_to_user(reply)

# Sonzai sees: the original user text + a description of the image.
# It will extract facts like "user goes to the gym", "wore a black tank top".
session.turn(messages=[
    {"role": "user",
     "content": f"{user_msg}\n\n[Image attached: {memory_note}, URL: {image_url}]"},
    {"role": "assistant", "content": reply},
])

The SYSTEM_PROMPT_IMAGE_AWARE instruction is what makes this work — it
asks the LLM to emit a hidden line like [MEMORY: <factual description>]
after its warm reply. Same LLM call, no second roundtrip, and no
meaningful extra cost or latency. The same pattern works for audio
(send the transcript) and assistant-generated images (describe what you
generated). For the full pattern with all three SDKs, see the deep
guide's multimodal section.
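The snippet above leaves SYSTEM_PROMPT_IMAGE_AWARE and split_memory_note to you. A minimal sketch of both, with the prompt wording purely illustrative:

SYSTEM_PROMPT_IMAGE_AWARE = (
    "You are a warm, attentive companion. When the user attaches an image, "
    "reply to them naturally first, then append one final line of the form "
    "[MEMORY: <one-sentence factual description of the image>]. "
    "Never mention the MEMORY line to the user."
)

def split_memory_note(raw: str) -> tuple[str, str]:
    # Returns (memory_note, reply); falls back to an empty note if the
    # model skipped the [MEMORY: ...] line.
    marker = "[MEMORY:"
    if marker in raw:
        reply, _, tail = raw.partition(marker)
        return tail.rstrip().rstrip("]").strip(), reply.strip()
    return "", raw.strip()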
Tool outputs are multimodal too
If a tool returns a screenshot, file blob, or any non-text payload,
apply the same rule: have your harness summarize what the tool
returned in a one-line text result before forwarding the
role: "tool" message to session.turn().
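A minimal sketch of that rule in practice; describe_screenshot and the surrounding message variables are hypothetical stand-ins for whatever your harness already has:

# A tool returned a screenshot; /turn is text-only, so the harness
# substitutes a one-line description for the binary payload before submitting.
summary = describe_screenshot(screenshot_bytes)  # e.g. "Billing dashboard showing 3 failed invoices"
session.turn(messages=[
    {"role": "user", "content": user_message},
    assistant_tool_call_message,  # the assistant message carrying tool_calls, as above
    {"role": "tool", "tool_call_id": "call_1", "content": summary},
    {"role": "assistant", "content": final_reply},
])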
Skipping local history with recent_turns
If your harness already keeps a message log (most do — Agents SDK,
LangChain, etc.), use that. If you'd rather not maintain one, every
/context response carries recent_turns — the raw messages buffered
by /turn for the current session, in chronological order. Read them
straight off ctx.recent_turns and feed them to your LLM:
ctx = session.context(query=user_message)
history = [{"role": t["role"], "content": t["content"]} for t in (ctx.get("recent_turns") or [])]
reply = your_llm.chat(
    system=build_system_prompt(ctx),
    messages=[*history, {"role": "user", "content": user_message}],
)

The buffer is per-session and text-only — no tool calls, no images, no
system prompts. It's the right shape for a simple chat loop where Sonzai
is the source of truth; if you need richer message structure, keep your
own.
Where to next
Pattern 3: OpenClaw
Drop the @sonzai-labs/openclaw-context plugin into an OpenClaw project and Sonzai becomes the agent's contextEngine — persistent memory, mood, personality, relationships, all under OpenClaw's existing chat loop.
Pattern 5: Standalone Memory (Batch)
One call after the conversation is done. Ship the full transcript to /process (or sessions.end with messages) and let Sonzai extract facts, mood, personality, and habits in the background.