
Pattern 4: Standalone Memory (Real-Time)

You own the LLM and the chat loop. Sonzai owns memory, mood, personality, and relationships. Per turn: sessions.start → loop of session.context() + your LLM + session.turn() → session.end().

You keep your existing chat loop. Before each LLM call, you ask Sonzai for the enriched context for the user's message; after the LLM replies, you submit just that exchange via session.turn(). Mood lands inline (~300–500 ms). Deeper extraction — facts, personality drift, habit detection, goal updates — runs asynchronously 5–15 seconds later in the background. Sonzai never sees your tool execution and never picks your model.

This is the right shape for chat companions, voice agents, agent frameworks (OpenAI Agents SDK, LangChain, LiveKit), and anywhere you already had a working LLM loop in production before adopting Sonzai.

When to use this

  • You already have a production LLM loop with custom tools, evals, prompt templates, or a specific provider.
  • You need fresh per-turn context, not a once-a-conversation pull.
  • You want mood, facts, personality, habits, goals, and relationship signal — without ceding LLM choice or tool execution.

Architecture

┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Your App   │     │   Sonzai API     │     │   Your LLM   │
└──────┬──────┘     └────────┬─────────┘     └──────┬───────┘
       │                     │                      │
       │  sessions.start     │                      │
       │────────────────────>│ (prewarms memory)    │
       │  <── Session ───────│                      │
       │                     │                      │
       │  ─── Per turn ──────────────────────────── │
       │                     │                      │
       │  session.context()  │                      │
       │────────────────────>│                      │
       │  <── enriched ctx ──│                      │
       │    personality, mood│                      │
       │    memories, goals  │                      │
       │                     │                      │
       │  Your LLM loop ─────┼─────────────────────>│
       │  + your tools       │                      │
       │  + your multimodal  │                      │
       │  <── reply ─────────┼──────────────────────│
       │                     │                      │
       │  sendToUser(reply)  (no waiting on Sonzai) │
       │                     │                      │
       │  session.turn()     │                      │
       │────────────────────>│ ⇒ sync mood ~300ms   │
       │  <── mood, status ──│ ⇒ background extract │
       │                     │   (5–15s)            │
       │                     │                      │
       │  ─── Repeat ─────────────────────────────  │
       │                     │                      │
       │  session.end()      │                      │
       │────────────────────>│── consolidate        │
       │                     │   long-term memory   │
       └─────────────────────┴──────────────────────┘

End-to-end snippet

The minimum viable loop with a real harness. The OpenAI Agents SDK owns conversation state, model selection, and tool dispatch. Sonzai sits outside that loop: it supplies the system prompt via session.context() before the run, and ingests the finished exchange via session.turn() after. No OPENAI_API_KEY needed — the Agents SDK is pointed at Gemini's OpenAI-compat endpoint.

import os, uuid
from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel, function_tool, set_tracing_disabled
from sonzai import Sonzai

set_tracing_disabled(True)  # Agents SDK tries to ship traces to OpenAI; we don't have a key.

# Your LLM harness — owns history, tool dispatch, multi-step reasoning.
gemini = AsyncOpenAI(
  base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
  api_key=os.environ["GEMINI_API_KEY"],
)
model = OpenAIChatCompletionsModel(model="gemini-3.1-flash-lite-preview", openai_client=gemini)

@function_tool
def get_current_time() -> str:
  from datetime import datetime, timezone
  return datetime.now(timezone.utc).isoformat(timespec="seconds")

# Sonzai = memory layer. Never sees the LLM client.
sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])

def run_conversation(agent_id: str, user_id: str):
  session = sonzai.agents.sessions.start(
      agent_id,
      user_id=user_id,
      session_id=f"session-{uuid.uuid4().hex[:8]}",
      provider="gemini",                          # default for the deferred-extraction LLM
      model="gemini-3.1-flash-lite-preview",
  )

  def turn(user_message: str) -> str:
      # 1. Fresh, query-relevant context BEFORE the LLM call.
      ctx = session.context(query=user_message)

      # 2. Your harness runs the LLM + your tools. Sonzai is OUT of the loop.
      agent = Agent(
          name="Companion",
          instructions=build_system_prompt(ctx),
          model=model,
          tools=[get_current_time],
      )
      result = Runner.run_sync(agent, user_message)
      send_to_user(result.final_output)

      # 3. Convert the run's items (assistant text + ToolCallItem +
      #    ToolCallOutputItem) into Sonzai's tool-aware shape so
      #    extraction can pick up facts from tool outputs too.
      sonzai_messages = run_result_to_sonzai_messages(user_message, result)

      # 4. Submit. Sync mood ~300ms; deferred extraction 5–15s later.
      session.turn(messages=sonzai_messages)

      return result.final_output

  return turn, session.end


# /context returns a flat dict — read what you need, drop the rest.
def build_system_prompt(ctx: dict) -> str:
  facts = "\n".join(f"- {f.get('atomic_text', '')}" for f in (ctx.get("loaded_facts") or []))
  parts = [
      ctx.get("personality_prompt", "You are a helpful AI companion."),
      f"Personality (Big5): {ctx.get('big5', {})}",
      f"Current mood: {ctx.get('current_mood', {})}",
  ]
  if facts:
      parts.append(f"Relevant memories:\n{facts}")
  return "\n\n".join(parts)

The load-bearing habit

Always call session.context(query=user_msg) before the LLM call — every turn. That's the closing-the-loop step. Skipping it means the LLM works from stale state and the value of a memory layer collapses.

Save a roundtrip with fetchNextContext

session.turn() accepts fetchNextContext: { query: nextMessage } (Python: fetch_next_context={"query": ...}). When set, the response carries the next /context payload under next_context, so the client already has turn N+1's context by the time turn N finishes.

Tool calls flow through to extraction

Sonzai's /turn accepts OpenAI/Anthropic-style tool messages: tool_calls on assistant messages and role: "tool" results. Forward the full exchange and the extractor can capture facts that only surfaced inside a tool output (e.g. "user's last order shipped from Tokyo" from an order-lookup tool).

session.turn(messages=[
    {"role": "user", "content": "Where did my last order ship from?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "order-lookup", "arguments": "{}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"order_id":"42","origin":"Tokyo","carrier":"DHL"}'},
    {"role": "assistant", "content": "Your last order shipped from Tokyo via DHL."},
])

Sonzai never executes a tool — that's your harness's job. It just reads the messages you submit. If you're on the OpenAI Agents SDK, see the demo's run_result_to_sonzai_messages helper — it converts a Runner result's MessageOutputItem / ToolCallItem / ToolCallOutputItem items into this shape.
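If you'd rather not pull in the demo helper, the conversion can be sketched as below. The item and attribute names (new_items, raw_item, call_id, arguments) are assumptions about the Agents SDK's run-result shape, not a verified API; check them against your SDK version before relying on this:

```python
def _get(obj, key, default=""):
    # raw_item may be a dict or a typed object depending on the item kind.
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def run_result_to_sonzai_messages(user_message: str, result) -> list[dict]:
    """Flatten one Agents SDK run into Sonzai's tool-aware message list."""
    messages = [{"role": "user", "content": user_message}]
    for item in getattr(result, "new_items", []):
        kind = getattr(item, "type", "")
        raw = getattr(item, "raw_item", None)
        if kind == "tool_call_item":
            messages.append({
                "role": "assistant", "content": None,
                "tool_calls": [{
                    "id": _get(raw, "call_id"), "type": "function",
                    "function": {"name": _get(raw, "name"),
                                 "arguments": _get(raw, "arguments", "{}")},
                }],
            })
        elif kind == "tool_call_output_item":
            messages.append({
                "role": "tool",
                "tool_call_id": _get(raw, "call_id"),
                "content": str(getattr(item, "output", "")),
            })
        elif kind == "message_output_item":
            # Simplification: reuse the run's final_output as the assistant
            # text instead of extracting it from the message item itself.
            messages.append({"role": "assistant",
                             "content": str(getattr(result, "final_output", ""))})
    return messages
```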

Multimodal: your harness sees pixels, Sonzai sees text

/turn accepts text content only. This is intentional, not a limitation. Memory is a layer of semantic understanding — the question Sonzai needs to answer later is "what does this agent know about this user?", not "what bytes did the LLM see?". Your vision-capable LLM has already understood the image; pass that understanding to Sonzai as text, and the memory pipeline can extract facts, habits, and inventory items from it like any other turn.

The recommended pattern: have your same multimodal LLM produce a short factual description alongside its warm reply, and embed that description in the user message you submit to session.turn().

# Your harness: Gemini sees the actual image bytes via an image_url content part.
result = await gemini.chat.completions.create(
    model="gemini-3.1-flash-lite-preview",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_IMAGE_AWARE},  # see below
        {"role": "user", "content": [
            {"type": "text", "text": user_msg},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ],
)
raw = result.choices[0].message.content

# Dual-output: split the reply (shown to user) from the [MEMORY: ...] note.
memory_note, reply = split_memory_note(raw)   # your tiny parser
send_to_user(reply)

# Sonzai sees: the original user text + a description of the image.
# It will extract facts like "user goes to the gym", "wore a black tank top".
session.turn(messages=[
    {"role": "user",
     "content": f"{user_msg}\n\n[Image attached: {memory_note}, URL: {image_url}]"},
    {"role": "assistant", "content": reply},
])

The SYSTEM_PROMPT_IMAGE_AWARE instruction is what makes this work — it asks the LLM to emit a hidden line like [MEMORY: <factual description>] after its warm reply. Same LLM call, no extra cost or latency, no second roundtrip. The same pattern works for audio (send the transcript) and assistant-generated images (describe what you generated). For the full pattern with all three SDKs, see the deep guide's multimodal section.
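A minimal way to write that parser, assuming the convention sketched above: the model appends a single [MEMORY: ...] marker after its warm reply. Both the helper and the exact marker format are illustrative, not prescribed by the SDK:

```python
import re

# A hypothetical instruction of the kind SYSTEM_PROMPT_IMAGE_AWARE might add:
#   "After your reply, append one line: [MEMORY: <short factual description
#    of any attached image>]."
_MEMORY_RE = re.compile(r"\[MEMORY:\s*(?P<note>.*?)\]\s*", re.DOTALL)


def split_memory_note(raw: str) -> tuple[str, str]:
    """Return (memory_note, reply). If no [MEMORY: ...] marker is present,
    the note is empty and the full text is treated as the reply."""
    match = _MEMORY_RE.search(raw)
    if not match:
        return "", raw.strip()
    # Drop the marker from the user-visible reply, keep the note for Sonzai.
    reply = (raw[:match.start()] + raw[match.end():]).strip()
    return match.group("note").strip(), reply
```

Falling back to an empty note keeps the loop safe when the model forgets the marker: the user still gets the reply, and session.turn() simply receives a turn without an image description.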

Tool outputs are multimodal too

If a tool returns a screenshot, file blob, or any non-text payload, apply the same rule: have your harness summarize what the tool returned in a one-line text result before forwarding the role: "tool" message to session.turn().
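One way to do that reduction; summarize_tool_output is a hypothetical helper (not part of any SDK) that collapses common payload shapes into a single line of text:

```python
import json


def summarize_tool_output(output, max_len: int = 200) -> str:
    """Collapse a tool result into one line of text for session.turn().
    Binary blobs become a size note, dicts/lists become compact JSON,
    and everything is flattened to a single truncated line."""
    if isinstance(output, (bytes, bytearray)):
        # A screenshot or file blob: describe it upstream if the content
        # matters; here we only record that a binary payload existed.
        return f"[binary payload, {len(output)} bytes]"
    if isinstance(output, (dict, list)):
        text = json.dumps(output, separators=(",", ":"), default=str)
    else:
        text = str(output)
    text = " ".join(text.split())  # collapse newlines and runs of spaces
    return text if len(text) <= max_len else text[: max_len - 1] + "…"
```

The one-line result then goes into the role: "tool" message's content field, exactly like the order-lookup example above.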

Skipping local history with recent_turns

If your harness already keeps a message log (most do — Agents SDK, LangChain, etc.), use that. If you'd rather not maintain one, every /context response carries recent_turns — the raw messages buffered by /turn for the current session, in chronological order. Read them straight off the context payload's recent_turns field and feed them to your LLM:

ctx = session.context(query=user_message)
history = [{"role": t["role"], "content": t["content"]} for t in (ctx.get("recent_turns") or [])]
reply = your_llm.chat(
    system=build_system_prompt(ctx),
    messages=[*history, {"role": "user", "content": user_message}],
)

The buffer is per-session and text-only — no tool calls, no images, no system prompts. It's the right shape for a simple chat loop where Sonzai is the source of truth; if you need richer message structure, keep your own.

