
Pattern 1: Memory Middleware (Real-Time)

Per-turn integration. You control the LLM and the tool-calling loop; Sonzai handles what that LLM knows about the user. This guide also covers tool calling and multimodal/image handling.


Open a Session once. For every turn: call session.context({ query }) to pull the enriched user profile, build your system prompt, call your own LLM (with your own tools), then call session.turn({ messages }) to submit just the new exchange. Sync mood updates inline (~300–500ms); deeper extraction (facts, personality, habits) lands asynchronously 5–15 seconds later in the background.

This is the same data model mem0 provides (relevant memories injected before generation), extended with personality evolution, mood tracking, habit detection, goal tracking, proactive outreach scheduling, and relationship dynamics.

┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Your App   │     │   Sonzai API     │     │   Your LLM   │
└──────┬──────┘     └────────┬─────────┘     └──────┬───────┘
       │                     │                      │
       │  sessions.start     │                      │
       │────────────────────>│ (prewarms memory)    │
       │  <── Session ───────│                      │
       │                     │                      │
       │  ─── Per turn ─────────────────────────────│
       │                     │                      │
       │  session.context()  │                      │
       │────────────────────>│                      │
       │  <── enriched ctx ──│                      │
       │    personality, mood│                      │
       │    memories, goals  │                      │
       │                     │                      │
       │  Your LLM loop ─────┼─────────────────────>│
       │  + your tools       │                      │
       │  <── reply ─────────┼──────────────────────│
       │                     │                      │
       │  sendToUser(reply) (no waiting on Sonzai)  │
       │                     │                      │
       │  session.turn()     │                      │
       │────────────────────>│ ⇒ sync mood ~300ms   │
       │  <── mood, status ──│ ⇒ background         │
       │                     │   extraction (5–15s) │
       │                     │                      │
       │  ─── Repeat ───────────────────────────────│
       │                     │                      │
       │  session.end()      │                      │
       │────────────────────>│── consolidate        │
       │                     │   long-term memory   │
       └─────────────────────┴──────────────────────┘

What Sonzai's LLM is used for

session.context() and sessions.start use no Sonzai LLM credits — they are pure reads. session.turn(), /process, and sessions.end({ messages }) use Sonzai's LLM for fact extraction + session summary (light, per-call, billed). Heavy background work — cross-session dedup, clustering, diary, decay — runs on auto-scheduled jobs (8h post-session, daily, weekly) and is billed against the same tenant but not per-call. Your chat LLM is entirely your cost.
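
As a quick orientation, the same breakdown annotated on the calls from this guide (a sketch; the variable names are placeholders):

// Pure reads: no Sonzai LLM credits consumed.
const session = await sonzai.agents.sessions.start(agentId, { userId, sessionId });
const ctx = await session.context({ query: userMsg });

// Light, per-call Sonzai LLM usage (fact extraction + session summary), billed.
await session.turn({ messages: newExchange });
await session.end();   // with { messages }, .end() also runs extraction + a session summary

// Heavy background work (cross-session dedup, clustering, diary, decay) runs on
// auto-scheduled jobs, billed to the tenant rather than per call. Your chat LLM is your own cost.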

Core loop

Open the session once with your provider/model defaults. Then for every turn: get context → call your LLM (running tool calls in your own loop) → submit the turn. End the session when done.

import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });

async function runConversation(agentId: string, userId: string) {
  const sessionId = `session-${Date.now()}`;
  const history: { role: string; content: string }[] = [];

  // Open a Session handle. agentId/userId/sessionId and provider/model
  // defaults live on the handle so you don't repeat them on every call.
  const session = await sonzai.agents.sessions.start(agentId, {
    userId,
    sessionId,
    toolDefinitions: yourTools,                   // optional — register session-scoped tool schemas
    provider: "gemini",                           // optional — default for .turn()
    model: "gemini-3.1-flash-lite-preview",       // optional — default for .turn()
  });

  async function turn(userMessage: string): Promise<string> {
    // Fresh enriched context for this specific message
    const ctx = await session.context({ query: userMessage });

    // Your LLM — swap in any provider you like
    let reply = await yourLLM.chat({
      system: buildSystemPrompt(ctx),
      messages: [...history, { role: "user", content: userMessage }],
      tools: yourTools,
    });

    // Tool-calling loop is entirely yours — Sonzai is OUT of the loop here.
    const toolMessages: any[] = [];
    while (reply.tool_calls?.length) {
      for (const call of reply.tool_calls) {
        const result = await runYourTool(call);
        toolMessages.push(
          { role: "assistant", tool_calls: [call] },
          { role: "tool", tool_call_id: call.id, content: result },
        );
      }
      reply = await yourLLM.chat({
        system: buildSystemPrompt(ctx),
        messages: [...history, { role: "user", content: userMessage }, ...toolMessages],
        tools: yourTools,
      });
    }

    sendToUser(reply.content); // send first; don't block on Sonzai

    // Submit just the new turn. Sync mood ~300ms, deferred extraction
    // (facts, personality, habits) runs asynchronously 5–15s later.
    // Pass the FULL exchange — including tool calls and tool results —
    // so Sonzai can extract facts from tool outputs too.
    const { mood, extraction_id } = await session.turn({
      messages: [
        { role: "user", content: userMessage },
        ...toolMessages,                          // assistant tool_calls + tool results
        { role: "assistant", content: reply.content },
      ],
    });

    history.push({ role: "user", content: userMessage });
    history.push({ role: "assistant", content: reply.content });

    return reply.content;
  }

  return { turn, end: () => session.end() };
}

// The /context response is a flat object — there is no nested
// `profile` / `behavioral` / `memory` envelope.
function buildSystemPrompt(ctx: any): string {
  const facts = (ctx.loaded_facts ?? []).map((f: any) => `- ${f.atomic_text}`).join("\n");
  const goals = (ctx.active_goals ?? []).map((g: any) => g.description).join(", ");
  return `${ctx.personality_prompt ?? "You are a helpful AI companion."}
Personality (Big5): ${JSON.stringify(ctx.big5 ?? {})}
Current mood: ${JSON.stringify(ctx.current_mood ?? {})}
Active goals: ${goals || "none"}
Relevant memories:
${facts || "none yet"}`;
}

Pull fresh context every turn

The single most important habit in Pattern 1 is calling session.context() with the current user message as the query before every LLM call. This is the load-bearing piece that closes the loop: without it, the LLM doesn't get the fresh mood (which lands inline on .turn()) or the freshly extracted facts (which land 5–15 seconds after .turn()).

while (conversationActive) {
  const userMsg = await getUserInput();

  // 1. PULL FRESH CONTEXT — happens every turn, before the LLM call.
  //    ctx is a flat object — no `profile` / `behavioral` / `memory` envelope.
  //    Fields you'll usually read:
  //      ctx.personality_prompt          — agent identity / instructions
  //      ctx.bio, ctx.speech_patterns    — agent identity bits
  //      ctx.big5                        — Big5 trait object
  //      ctx.current_mood                — fresh inline (~300ms after .turn())
  //      ctx.habits, ctx.active_goals    — behavioral state
  //      ctx.loaded_facts                — recalled facts (5-15s lag from extraction)
  //      ctx.proactive_memories          — pending proactive signals
  //      ctx.knowledge.results           — KB hits (only nested key)
  //      ctx.recent_turns                — buffered messages from this session
  const ctx = await session.context({ query: userMsg });

  // 2. Build system prompt from the context layers
  const systemPrompt = renderPromptFromContext(ctx);

  // 3. Run YOUR LLM — Sonzai is OUT of the loop here
  const reply = await yourLLM.chat({
    system: systemPrompt,
    messages: [...history, { role: "user", content: userMsg }],
  });

  // 4. Submit the just-completed turn — sync mood + async deferred extraction
  await session.turn({
    messages: [
      { role: "user", content: userMsg },
      { role: "assistant", content: reply.content },
    ],
  });
}

function renderPromptFromContext(ctx: any): string {
  const parts: string[] = [];
  if (ctx.personality_prompt) parts.push(ctx.personality_prompt);
  if (ctx.big5) parts.push(`Personality (Big5): ${JSON.stringify(ctx.big5)}`);
  if (ctx.speech_patterns?.length) parts.push(`Speech patterns: ${ctx.speech_patterns.join(", ")}`);
  if (ctx.current_mood) parts.push(`Current mood: ${JSON.stringify(ctx.current_mood)}`);
  const facts = (ctx.loaded_facts ?? []).slice(0, 5).map((f: any) => `- ${f.atomic_text ?? ""}`).join("\n");
  if (facts) parts.push(`Relevant memories:\n${facts}`);
  const kb = (ctx.knowledge?.results ?? []).slice(0, 3).map((r: any) => `- ${r.label}: ${(r.content ?? "").slice(0, 120)}`).join("\n");
  if (kb) parts.push(`Knowledge base:\n${kb}`);
  return parts.join("\n\n");
}

Save a roundtrip with fetchNextContext

session.turn() accepts a fetch_next_context={"query": next_user_message} argument (TS: fetchNextContext). When set, the server runs the deferred extraction trigger AND fetches the next /context payload in the same response, returning it under next_context. This eliminates the second roundtrip on the next turn — your client already has the context for turn N+1 by the time turn N has finished. Use this when you can predict the next user query (e.g., for the very next render of context).
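
A minimal sketch of that shape, assuming you already have (or can predict) the next query. The fetchNextContext argument and next_context response field are the ones described above (check your SDK's casing for the response field); everything else is placeholder:

// Turn N: submit the exchange and ask for turn N+1's context in the same roundtrip.
const { mood, next_context } = await session.turn({
  messages: [
    { role: "user", content: userMessage },
    { role: "assistant", content: reply },
  ],
  fetchNextContext: { query: predictedNextQuery },
});

// Turn N+1: reuse next_context instead of calling session.context() again.
const systemPrompt = buildSystemPrompt(next_context);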

Context freshness. Mood updates inline on each .turn() call (~300ms), so the very next .context() reflects the new mood. Personality / facts / inventory land 5–15 seconds after .turn() in the background, so they appear within a turn or two of being mentioned.

Why per-turn. State changes between turns. A user mentioning a new pet on turn 3 means turn 4's context should carry that fact. Skipping .context() between turns means the LLM works from stale state — and the value of a memory layer collapses.

Pass the actual user message as query. session.context() uses the query for memory recall, KB search, and proactive signal selection. Passing the raw user message gives the most relevant pull; passing a static placeholder gives generic context regardless of what the user asked.
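
For example, the only difference between a relevant pull and a generic one is the query you pass (sketch):

// Relevant: recall, KB search, and proactive selection are driven by what the user actually asked.
const ctx = await session.context({ query: userMsg });

// Generic: a static placeholder returns roughly the same context regardless of the question.
const genericCtx = await session.context({ query: "general conversation" });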

Skipping local history with recent_turns

Most agent harnesses (OpenAI Agents SDK, LangChain, LiveKit) own the message log themselves — let them. But if you're rolling a thin LLM loop and would rather not maintain a parallel history array on your side, every /context response carries recent_turns: the raw messages buffered by /turn for the current session, in chronological order. Read them straight off the context payload.

const ctx = await session.context({ query: userMessage });

// Sonzai is the source of truth — no local history list needed.
const history = (ctx.recent_turns ?? []).map((t) => ({
  role: t.role,
  content: t.content,
}));

const reply = await yourLLM.chat({
  system:   buildSystemPrompt(ctx),
  messages: [...history, { role: "user", content: userMessage }],
});

What's in the buffer. Last ~20 messages from the current session only — text content, role, and a server-side timestamp. Capped at 20 turns and scoped to (agent_id, user_id, session_id); cross-session history isn't there (use agents.memory.list_facts for that — facts are the durable form).

What's not in the buffer. No system prompts, no tool_calls arrays, no role: "tool" payloads, no image attachments. The buffer mirrors the narrative you submitted to /turn, not the rich message structure your LLM saw. If your conversation has tool calls or multimodal content the LLM needs to re-read on the next turn, keep your own history.

When the buffer is empty. Right after sessions.start (no turns yet), or in degraded mode if Redis is down — the field is omitted, not zero-length-with-error. Treat ctx.recent_turns ?? [] as a no-op.
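
If you need recall that spans sessions rather than this in-session buffer, query memory directly. A minimal sketch, mirroring the agents.memory.search call used in the tool examples below (the Python SDK's equivalent is agents.memory.list_facts):

// recent_turns is scoped to the current session; durable cross-session recall goes through memory.
const past = await sonzai.agents.memory.search("agent-id", {
  query: "what the user said in earlier sessions about this topic",
  user_id: "user-123",
  limit: 10,
});
const recalled = past.results.map((r) => `- ${r.text}`).join("\n");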

Tool messages flow through to extraction

The /turn schema accepts OpenAI/Anthropic-style tool messages: role: "tool" for tool results and tool_calls arrays on assistant messages. Pass the entire intermediate exchange — Sonzai's extractor reads tool results and can capture facts that only appeared in tool output (e.g. "user's last order shipped from Tokyo" from an order-lookup tool).

await session.turn({
  messages: [
    { role: "user", content: "Where did my last order ship from?" },
    {
      role: "assistant",
      tool_calls: [{ id: "call_1", type: "function", function: { name: "order-lookup", arguments: "{}" } }],
    },
    {
      role: "tool",
      tool_call_id: "call_1",
      content: '{"order_id":"42","origin":"Tokyo","carrier":"DHL"}',
    },
    { role: "assistant", content: "Your last order shipped from Tokyo via DHL." },
  ],
});

Polling deferred extraction

/turn returns immediately after the sync mood pass. The deeper extraction runs asynchronously and reaches done in 5–15s. You can poll the status if you need to gate something on it:

const { extraction_id } = await session.turn({ messages });

// Optional — only poll if you need to wait for facts/personality before doing something
let status = await session.status(extraction_id);
while (status.state !== "done" && status.state !== "failed") {
  await new Promise((r) => setTimeout(r, 1000));
  status = await session.status(extraction_id);
}

Tool calling

Pattern 1 hands the tool-calling loop entirely to you. Sonzai never executes a tool — but it does read tool calls and tool results out of the messages you submit on /turn, so the extractor can capture facts that surfaced inside a tool output. There are two flavors of tools you'll typically wire up.

A. Your own tools

Use whatever your agent framework provides — @function_tool in the OpenAI Agents SDK, tools= on Anthropic, function declarations on Gemini, @tool in LangChain. The pattern is the same: register the tool with your LLM, run the tool-calling loop on your side, and forward the full exchange (including the assistant's tool_calls message and the role: "tool" result message) to session.turn().

from agents import Agent, Runner, function_tool

@function_tool
def get_current_time() -> str:
    """Return the current time."""
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat(timespec="seconds")

agent = Agent(name="Companion", tools=[get_current_time], model=gemini_model)
result = Runner.run_sync(agent, user_msg)

# Build the tool-aware messages array Sonzai expects.
sonzai_messages = [
    {"role": "user", "content": user_msg},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_current_time", "arguments": "{}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "2026-05-07T07:30:00Z"},
    {"role": "assistant", "content": result.final_output},
]
session.turn(messages=sonzai_messages)

When the assistant says "It's 7:30 AM" and the user replies "Set my morning standup for 8", Sonzai's extractor sees the tool's actual output, not just the assistant's paraphrase — and can capture "user prefers 8 AM standups" with the right grounding.

B. Sonzai's capabilities as tools

You can also wrap Sonzai's own REST endpoints as tools your LLM can call mid-turn. The two most useful are knowledge base search and memory search — both let the LLM pull additional context on demand without you having to inject everything up-front through session.context().

// TypeScript — agents.memory.search is available directly
import { Sonzai } from "@sonzai-labs/agents";
import { tool } from "ai";
import { z } from "zod";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });

const kbSearch = tool({
  description: "Search the agent's knowledge base.",
  parameters: z.object({ query: z.string() }),
  execute: async ({ query }) => {
    const res = await sonzai.agents.knowledgeSearch("agent-id", { query, limit: 5 });
    return res.results.map((r) => `- ${r.label}: ${r.content}`).join("\n") || "No matching knowledge.";
  },
});

const memorySearch = tool({
  description: "Search the user's long-term memory.",
  parameters: z.object({ query: z.string() }),
  execute: async ({ query }) => {
    const res = await sonzai.agents.memory.search("agent-id", {
      query,
      user_id: "user-123",
      limit: 5,
    });
    return res.results.map((r) => `- ${r.text}`).join("\n") || "No matching memories.";
  },
});

Why expose Sonzai endpoints as tools?

session.context() returns the most relevant facts for the current query — a strong default. Exposing kb_search and memory_search as tools lets the LLM decide for itself when to dig deeper (e.g., when the user asks "what did I tell you last week about X?"). It's especially useful for agent frameworks that already think in terms of tools.

When the LLM calls these tools, the result lands in your tool-calling loop just like any other tool. Forward the full exchange to session.turn() and Sonzai's extractor will see the search results too — but be aware that re-extracting facts from a memory_search tool result can create echoes (the user's own past fact resurfaces as if it were new). Either skip extraction for those tool messages on your side, or trust the dedup pass.
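
If you'd rather skip extraction for those messages, one client-side approach is to drop the memory_search call/result pairs before forwarding the exchange. A sketch; the helper and the fullExchange variable are hypothetical, and the filtering happens entirely on your side:

// Hypothetical filter: keep tool traffic for extraction, except memory_search echoes.
function stripMemoryEchoes(messages: any[]): any[] {
  const echoIds = new Set(
    messages
      .filter((m) => m.role === "assistant" && m.tool_calls?.length)
      .flatMap((m) => m.tool_calls)
      .filter((c: any) => c.function?.name === "memory_search")
      .map((c: any) => c.id),
  );
  return messages.filter((m) => {
    if (m.role === "tool" && echoIds.has(m.tool_call_id)) return false;
    if (m.role === "assistant" && m.tool_calls?.length && m.tool_calls.every((c: any) => echoIds.has(c.id))) return false;
    return true;
  });
}

await session.turn({ messages: stripMemoryEchoes(fullExchange) });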

For deeper coverage of Sonzai's tool endpoints, see the Tool Integration guide.

What's available as a tool

  • Knowledge base search: agents.knowledge_search(agent_id, query, limit). Useful as an LLM tool: yes — the LLM can look up policies, products, and docs.
  • Memory search: agents.memory.search(agentId, { query, userId }) (TS/Go) or agents.memory.list_facts(agent_id, user_id) (Python). Useful as an LLM tool: yes — the LLM can look up past user statements.
  • Mood / personality / habits / goals reads: agents.get_mood, agents.personality.get, agents.list_habits, agents.list_goals. Mostly inject these via session.context() instead — read-only state changes rarely with the user query.
  • Image generation: generation.generate_image. Possible, but typically your app exposes this as its own UI action, not as an LLM tool.

Working with images & multimodal input

Sonzai's memory pipeline is text-based today. The /turn and /process endpoints accept string content only — DialogueMessage.content is string. Your LLM can be fully multimodal (Gemini, Claude, GPT-4o all accept image URLs and audio natively) but to get image-related facts into Sonzai you need to bridge the multimodal content into text in the messages you send to /turn.

The recommended pattern is dual-output: have your vision-capable LLM produce both (a) the warm reply you show the user and (b) a hidden [MEMORY: ...] line with a detailed factual description. Strip the [MEMORY: ...] line out before showing the user, and embed it in the bridged text you submit to Sonzai.

import OpenAI from "openai";
import { Sonzai } from "@sonzai-labs/agents";

const gemini = new OpenAI({
  baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
  apiKey: process.env.GEMINI_API_KEY!,
});
const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });

const SYSTEM_PROMPT_IMAGE_AWARE = `You are a friendly companion. When the user shares an image, respond warmly
to what's emotionally important to THEM.

After your reply, ALWAYS include a single line:
[MEMORY: <detailed factual description of the image — setting, objects,
people, mood, time of day, what the user appears to be doing>]

The user does NOT see the [MEMORY: ...] line.`;

async function processImageTurn(session: any, userMsg: string, imageUrl: string): Promise<string> {
  const result = await gemini.chat.completions.create({
    model: "gemini-3.1-flash-lite-preview",
    messages: [
      { role: "system", content: SYSTEM_PROMPT_IMAGE_AWARE },
      {
        role: "user",
        content: [
          { type: "text", text: userMsg },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  });
  const raw = result.choices[0].message.content ?? "";

  // Split the dual output
  const m = raw.match(/\[MEMORY:\s*([\s\S]+?)\]/);
  const memoryNote = m ? m[1].trim() : "";
  const reply = raw.replace(/\[MEMORY:[\s\S]+?\]/, "").trim();

  sendToUser(reply);

  await session.turn({
    messages: [
      { role: "user", content: `${userMsg}\n\n[Image attached: ${memoryNote}, URL: ${imageUrl}]` },
      { role: "assistant", content: reply },
    ],
  });
  return reply;
}

Why this pattern:

  • No backend multimodal yet. /turn accepts string content. Text-bridging through your same vision-capable LLM is the cleanest workaround.
  • Why dual-output (vs. a separate vision call). The same LLM call serves both purposes — no extra cost, no extra latency, no second roundtrip. You're already paying for vision on the assistant turn; let it produce the description too.
  • Why a hidden line. Keeps user-facing replies emotionally warm — "Oh you have such nice shoulders!" — while still capturing the factual detail (gym, tank top, mirror, time of day) that memory extraction needs.
  • It's a developer pattern, not a Sonzai field. The [MEMORY: ...] convention is yours to define. Sonzai just sees text. You can use any sentinel — <<MEM>>...<</MEM>>, JSON, whatever your prompt and parser agree on.

Including the URL. Embedding the URL in the bridged text isn't required, but it lets Sonzai later surface the image as a memory artifact ("the photo you shared last week") without re-running vision on the image. Your app keeps using its own image storage; Sonzai just remembers the link as text.

Audio & voice follow the same pattern

Speech-to-text (STT) on your side, send the transcript in messages. Text-to-speech (TTS) is rendered after the assistant text exists, so you forward the assistant text to session.turn() exactly as you would for a text-only chat. See the Voice AI use case below.

Why text-only /turn is the design, not a placeholder

Memory is a layer of semantic understanding. The question Sonzai needs to answer next week is "what does this agent know about this user?" — not "what bytes did the LLM see?". Your vision-capable LLM has already understood the image; text-bridging passes that understanding through to extraction in the form the memory pipeline actually consumes (atomic facts, habits, inventory). Storing raw image bytes server-side would inflate cost without improving recall, and would re-couple your LLM choice to ours. The dual-output pattern keeps your harness fully in charge of perception.

Use case: chat companion (OpenAI Agents SDK + Gemini)

The canonical Pattern 1 example. You bring your own agent harness — here the OpenAI Agents SDK — and route it at Gemini via the OpenAI-compat endpoint, so no OPENAI_API_KEY is ever used. Sonzai sits outside the LLM/tool-calling loop entirely: it supplies the system prompt via session.context() and ingests the finished transcript via session.turn(). The Agents SDK does all multi-step reasoning and tool dispatch on your side; Sonzai does memory.

import os
from openai import AsyncOpenAI
from agents import (
    Agent,
    Runner,
    OpenAIChatCompletionsModel,
    function_tool,
    set_tracing_disabled,
)
from sonzai import Sonzai

# The Agents SDK ships traces to OpenAI by default — disable, since we
# have no OpenAI key and aren't talking to OpenAI's servers at all.
set_tracing_disabled(True)

# Point the Agents SDK's AsyncOpenAI client at Gemini's OpenAI-compat URL.
gemini = AsyncOpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.environ["GEMINI_API_KEY"],
)
model = OpenAIChatCompletionsModel(
    model="gemini-3.1-flash-lite-preview",
    openai_client=gemini,
)

# Sonzai = memory layer only. It never sees the LLM client.
sonzai = Sonzai(api_key=os.environ["SONZAI_API_KEY"])
session = sonzai.agents.sessions.start(
    "agent-id",
    user_id="user-123",
    session_id="session-abc",
)

@function_tool
def get_current_time() -> str:
    """Return the current time."""
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat(timespec="seconds")

while True:
    user_msg = input("You: ")
    if not user_msg:
        break

    # 1) Pull enriched context (mood, personality, relevant facts, …) from Sonzai.
    ctx = session.context(query=user_msg)

    mood = ctx.get("current_mood") or "neutral"
    instructions = f"You are a friendly companion. Current mood: {mood}."

    # 2) Run the Agents SDK loop — it handles tool-calling and multi-step reasoning.
    agent = Agent(
        name="Companion",
        instructions=instructions,
        model=model,
        tools=[get_current_time],
    )
    result = Runner.run_sync(agent, user_msg)
    print(f"Assistant: {result.final_output}")

    # 3) Convert the run's items (assistant text + ToolCallItem + ToolCallOutputItem)
    # into Sonzai's tool-aware messages format. See the demo for the implementation.
    sonzai_messages = run_result_to_sonzai_messages(user_msg, result)

    # 4) Submit the turn. `mood` comes back inline (~300ms); facts / personality /
    # inventory are extracted asynchronously and land 5-15s later.
    turn_result = session.turn(messages=sonzai_messages)
    print(f"  -> mood updated: {turn_result.mood}")

session.end()

What's happening on each turn:

  • Sonzai is out of the LLM loop. The OpenAI Agents SDK runs the model, dispatches tools, and produces result.final_output. Sonzai never sees the LLM client and has no opinion on which model answered.
  • Mood is real-time. session.turn() returns fresh mood inline in ~300ms — you can render it the moment the response arrives.
  • Facts, personality drift, and inventory are deferred (5-15s). They run async under the returned extraction_id. Re-poll agents.memory.list_facts, agents.personality.get, etc. on the next turn; whatever didn't land yet will be there shortly.
  • Tool calls flow through to extraction. Sonzai's tool-aware message format accepts assistant messages with tool_calls plus a tool message carrying the result. The conversion helper packages the Agents SDK's ToolCallItem + ToolCallOutputItem into that shape so extraction can pick up facts from tool outputs too.

Want a working version? See the OpenAI Agents companion demo — a two-pane Streamlit app showing live mood, Big5, recent facts, inventory, and the constellation graph as you chat.

Use case: voice AI assistant

STT → enrich → LLM → TTS. Sonzai holds the memory; you own the audio pipeline. Submit the turn while TTS is synthesizing — sync mood is fast enough not to block, and deferred extraction never blocks.

import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });

async function processVoiceTurn(
  session: any, // Session handle from sonzai.agents.sessions.start
  audioBuffer: Buffer
): Promise<Buffer> {
  // Your STT
  const transcript = await yourSTT.transcribe(audioBuffer);

  // Inject memory into a concise voice-friendly system prompt
  const ctx = await session.context({ query: transcript });

  const systemPrompt = `${ctx.personality_prompt ?? "You are a voice companion."} Keep replies under 2 sentences for voice.
Mood: ${JSON.stringify(ctx.current_mood)}.
Key memory: ${ctx.loaded_facts?.[0]?.atomic_text ?? "none"}.`;

  const reply = await yourLLM.chat({ system: systemPrompt, message: transcript });

  // Submit the turn while TTS synthesizes (run in parallel)
  const [audioResponse] = await Promise.all([
    yourTTS.synthesize(reply),
    session.turn({
      messages: [
        { role: "user", content: transcript },
        { role: "assistant", content: reply },
      ],
    }),
  ]);

  return audioResponse;
}

Use case: agent framework (LangChain / LlamaIndex)

Sonzai injects user context into the agent's system prompt. The framework handles tool calling, multi-step reasoning, and memory of the current conversation; Sonzai handles what the agent knows about the user across sessions. Send the full transcript including any tool messages to session.turn() so extraction can pick up facts from tool results.

import { ChatOpenAI } from "@langchain/openai";
import { SystemMessage, HumanMessage, AIMessage } from "@langchain/core/messages";
import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });
const llm = new ChatOpenAI({ model: "gpt-4o" }).bindTools(yourToolSchemas);

async function agentTurn(
  session: any,
  userInput: string,
  messageHistory: (HumanMessage | AIMessage)[]
): Promise<string> {
  const ctx = await session.context({ query: userInput });

  const messages = [
    new SystemMessage(buildSystemPrompt(ctx)),
    ...messageHistory,
    new HumanMessage(userInput),
  ];

  // Run the agent's full tool-calling loop on your side, then surface
  // every intermediate message (assistant tool_calls + tool results)
  // to Sonzai so it can extract from them.
  const { reply, intermediate } = await runLangchainAgent(llm, messages);

  await session.turn({
    messages: [
      { role: "user", content: userInput },
      ...intermediate,
      { role: "assistant", content: reply },
    ],
  });

  return reply;
}

Use case: multi-LLM router

Route to different models based on task type while Sonzai stitches user memory across all of them. The Session-level provider/model default is just a default — every .turn() can override.

import Anthropic from "@anthropic-ai/sdk";
import { GoogleGenAI } from "@google/genai";
import { Sonzai } from "@sonzai-labs/agents";

const sonzai = new Sonzai({ apiKey: process.env.SONZAI_API_KEY! });
const claude = new Anthropic();
const gemini = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

type TaskType = "creative" | "analytical" | "casual";

function classifyTask(message: string): TaskType {
  if (/write|story|poem|imagine/i.test(message)) return "creative";
  if (/analyze|compare|explain|why/i.test(message)) return "analytical";
  return "casual";
}

async function routedTurn(session: any, userMessage: string): Promise<string> {
  const ctx = await session.context({ query: userMessage });
  const systemPrompt = buildSystemPrompt(ctx);
  const task = classifyTask(userMessage);

  let reply: string;

  if (task === "creative") {
    const response = await claude.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system: systemPrompt,
      messages: [{ role: "user", content: userMessage }],
    });
    reply = response.content[0].type === "text" ? response.content[0].text : "";
  } else {
    const response = await gemini.models.generateContent({
      model: "gemini-2.5-flash",
      contents: [{ role: "user", parts: [{ text: systemPrompt + "\n\n" + userMessage }] }],
    });
    reply = response.text ?? "";
  }

  // Same .turn() call regardless of which chat model answered.
  await session.turn({
    messages: [
      { role: "user", content: userMessage },
      { role: "assistant", content: reply },
    ],
  });

  return reply;
}

Use case: privacy-first (anonymize before LLM)

Redact PII from the enriched context before it reaches your LLM. Only structured extracted facts are stored by Sonzai — never raw text.

async function privacyTurn(session: any, userMessage: string): Promise<string> {
  const ctx = await session.context({ query: userMessage });

  // Scrub PII from facts before they reach your LLM
  const sanitizedFacts = (ctx.loaded_facts ?? []).map((f: any) => ({
    ...f,
    atomic_text: redactPII(f.atomic_text), // your PII redaction logic
  }));

  const sanitizedCtx = { ...ctx, loaded_facts: sanitizedFacts };
  const systemPrompt = buildSystemPrompt(sanitizedCtx);

  const reply = await yourLLM.chat({ system: systemPrompt, message: userMessage });

  // Send unredacted transcript to Sonzai for extraction
  // (Sonzai stores structured facts, not raw text)
  await session.turn({
    messages: [
      { role: "user", content: userMessage },
      { role: "assistant", content: reply },
    ],
  });

  return reply;
}
