Custom LLM
Bring your own model while keeping the full managed experience — built-in tools, streaming, per-message extraction, personality evolution, and all behavioral systems. Sonzai calls your OpenAI-compatible endpoint instead of the default provider.
How It Works
Configure an OpenAI-compatible API endpoint for your project. Sonzai routes all chat generation through your endpoint while handling everything else: context assembly, tool execution, side-effect extraction, memory storage, personality tracking, and consolidation.
Full Managed Experience
Built-in tools (web search, memory recall, image generation, inventory), streaming SSE, per-message side effects — everything works exactly as with our default providers.
Your Model, Your Control
Use fine-tuned models, self-hosted endpoints, or any OpenAI-compatible provider (vLLM, Ollama, Together, Groq, Azure OpenAI, etc.).
Encrypted at Rest
Your API key is encrypted with AES-256 before storage. Only the first 8 characters are visible in the dashboard for identification.
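As an illustration of this scheme (a sketch using Node's built-in crypto module, not Sonzai's actual storage code, whose details are not published), AES-256-GCM encryption plus an 8-character display prefix might look like:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt an API key with AES-256-GCM and keep only an 8-char prefix for display.
// Illustrative sketch; field names and key handling are assumptions.
function storeApiKey(apiKey: string, masterKey: Buffer) {
  const iv = randomBytes(12); // fresh nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(apiKey, "utf8"), cipher.final()]);
  return {
    prefix: apiKey.slice(0, 8), // shown in the dashboard for identification
    iv,
    tag: cipher.getAuthTag(),
    ciphertext,                 // what is actually stored at rest
  };
}

function loadApiKey(stored: ReturnType<typeof storeApiKey>, masterKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, stored.iv);
  decipher.setAuthTag(stored.tag);
  return Buffer.concat([decipher.update(stored.ciphertext), decipher.final()]).toString("utf8");
}

const masterKey = randomBytes(32); // 256-bit key
const stored = storeApiKey("your-api-key-123", masterKey);
console.log(stored.prefix); // "your-api"
console.log(loadApiKey(stored, masterKey)); // "your-api-key-123"
```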
Per-Project Configuration
Each project can have its own custom LLM endpoint. Toggle it on/off without deleting the config.
Custom LLM vs. Standalone Memory
Which one should I use?
Custom LLM is the right choice when you want to use your own model but still want the full Sonzai experience (tools, streaming, per-message extraction). Standalone Memory is for when you need to control the entire chat loop yourself — e.g., for privacy preprocessing, data anonymization, or deep integration with an agent framework. See the Standalone Memory docs for the tradeoffs.
| Feature | Custom LLM | Standalone Memory |
|---|---|---|
| Built-in tools | Full support | Manual only |
| Streaming SSE | Yes | No |
| Per-message extraction | Automatic | Manual /process call |
| Memory prewarming | Yes | No |
| Data preprocessing | No | Full control |
| Agent framework integration | N/A | Full control |
Requirements
Your endpoint must be OpenAI-compatible:
- Accept `POST /chat/completions` (or the equivalent path your base URL resolves to)
- Accept OpenAI chat message format (`messages`, `model`, `temperature`, etc.)
- Return an SSE stream in OpenAI chunk format (`data: {"choices": [...]}`)
- Support the `tools`/`tool_choice` parameters if you want built-in tools to work
Compatible providers include: vLLM, Ollama, Together AI, Groq, Azure OpenAI, Fireworks AI, Anyscale, and any server implementing the OpenAI API spec.
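The chunk format in the requirements above can be checked with a small parser. A minimal sketch (the `parseSseChunk` helper and its types are illustrative, not part of the Sonzai SDK) that extracts content deltas from OpenAI-style `data:` lines:

```typescript
// Parse one SSE line in OpenAI chunk format and return the content delta, if any.
// Returns null for non-data lines and the terminal "data: [DONE]" sentinel.
function parseSseChunk(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  const chunk = JSON.parse(payload) as {
    choices: Array<{ delta?: { content?: string } }>;
  };
  return chunk.choices[0]?.delta?.content ?? null;
}

// Example: three lines as an OpenAI-compatible endpoint would stream them.
const lines = [
  'data: {"choices": [{"delta": {"content": "Hel"}}]}',
  'data: {"choices": [{"delta": {"content": "lo"}}]}',
  "data: [DONE]",
];
const text = lines
  .map(parseSseChunk)
  .filter((d): d is string => d !== null)
  .join("");
console.log(text); // "Hello"
```

If your endpoint emits lines this helper can reassemble, it satisfies the streaming requirement.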
Configuration via Dashboard
In the Sonzai dashboard, open your project settings and find the Custom LLM section:

- Enter your OpenAI-compatible endpoint URL (e.g., `https://api.together.xyz/v1`)
- Paste your API key (encrypted at rest with AES-256)
- Specify the model name (e.g., `meta-llama/Llama-3.1-70B-Instruct`)
- Optionally set a display name for easy identification
- Toggle active/inactive without deleting the config
Configuration via API
Set Configuration

```typescript
// Configure custom LLM for a project
await client.projects.customLlm.set("project-id", {
  endpoint: "https://api.together.xyz/v1",
  apiKey: "your-api-key",
  model: "meta-llama/Llama-3.1-70B-Instruct",
  displayName: "Together Llama 3.1 70B",
  isActive: true,
});
```

Get Configuration
```typescript
const config = await client.projects.customLlm.get("project-id");
if (config.configured) {
  console.log(config.endpoint);     // "https://api.together.xyz/v1"
  console.log(config.apiKeyPrefix); // "your-api" (first 8 chars)
  console.log(config.model);        // "meta-llama/Llama-3.1-70B-Instruct"
  console.log(config.isActive);     // true
}
```

Remove Configuration

```typescript
await client.projects.customLlm.delete("project-id");
```

How Chat Routes Through Your Model
Once configured, here is what happens when a chat request is made:
1. Context assembly — Sonzai builds the 7-layer enriched context (personality, memory, mood, habits, goals, relationships, game state) exactly as with default providers.
2. Tool injection — Built-in tools (`sonzai_memory_recall`, `sonzai_web_search`, etc.) and any custom tools are added to the request.
3. Endpoint call — The request is sent to your configured endpoint with your model name, API key, and the full message history, including the system prompt.
4. Streaming proxy — SSE chunks from your endpoint are streamed back to the client in real time.
5. Post-stream processing — After the stream completes, Sonzai extracts side effects (memory facts, mood changes, personality shifts, habits, tool calls) and stores them, same as with default providers.
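Conceptually, the tool-injection and endpoint-call steps amount to merging your stored config with the assembled context into one OpenAI-style request. A rough sketch (the `buildRequest` helper and its field names are illustrative assumptions, not Sonzai internals):

```typescript
interface CustomLlmConfig {
  endpoint: string; // e.g. "https://api.together.xyz/v1"
  apiKey: string;
  model: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Build the outbound request for an OpenAI-compatible endpoint:
// configured model name, full message history, injected tools, streaming on.
function buildRequest(
  config: CustomLlmConfig,
  messages: ChatMessage[],
  tools: object[],
): { url: string; headers: Record<string, string>; body: object } {
  return {
    url: `${config.endpoint.replace(/\/$/, "")}/chat/completions`,
    headers: {
      Authorization: `Bearer ${config.apiKey}`,
      "Content-Type": "application/json",
    },
    body: { model: config.model, messages, tools, stream: true },
  };
}

const req = buildRequest(
  {
    endpoint: "https://api.together.xyz/v1",
    apiKey: "your-api-key",
    model: "meta-llama/Llama-3.1-70B-Instruct",
  },
  [{ role: "user", content: "hi" }],
  [],
);
console.log(req.url); // "https://api.together.xyz/v1/chat/completions"
```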
Background Job Consistency
Background tasks like fact extraction, memory consolidation, diary generation, and summarization automatically use the same model family you configured. Sonzai tracks the last-used provider/model for each agent and routes background LLM calls accordingly.
Security
- API key encryption — Keys are encrypted with AES-256 before storage. Only the first 8 characters are visible.
- SSRF protection — Endpoint URLs are validated to block localhost, private IPs (10.x, 172.16-31.x, 192.168.x), link-local, and cloud metadata addresses.
- Project-scoped — Each config is scoped to a project. Different projects can use different endpoints.
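A validation rule like the SSRF check described above can be sketched as follows. This is a simplified illustration, not Sonzai's actual implementation; a hostname-based check alone is bypassable (e.g., via DNS rebinding), so a real defense must also resolve DNS and re-check addresses at connect time.

```typescript
// Reject endpoint URLs pointing at localhost, private ranges,
// link-local addresses, or cloud metadata services (simplified sketch).
function isBlockedEndpoint(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable URLs are rejected outright
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return true;
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1") return true;
  if (host === "::1" || host === "[::1]") return true;        // IPv6 loopback
  if (/^10\./.test(host)) return true;                        // 10.0.0.0/8
  if (/^172\.(1[6-9]|2\d|3[01])\./.test(host)) return true;   // 172.16.0.0/12
  if (/^192\.168\./.test(host)) return true;                  // 192.168.0.0/16
  if (/^169\.254\./.test(host)) return true;                  // link-local + AWS metadata
  if (host === "metadata.google.internal") return true;       // GCP metadata
  return false;
}

console.log(isBlockedEndpoint("http://169.254.169.254/latest/meta-data")); // true
console.log(isBlockedEndpoint("http://192.168.1.10:8000/v1")); // true
console.log(isBlockedEndpoint("https://api.together.xyz/v1")); // false
```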
Billing
Custom LLM usage is billed at a flat per-token rate under the `custom_llm` billing model, regardless of which actual model your endpoint serves. Sonzai tracks input/output tokens from your endpoint's usage response. Your own endpoint costs (API fees, compute) are entirely yours.
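For example, if your endpoint's usage response reports token counts, the billed amount is just their sum times the flat rate. A sketch with OpenAI-style usage fields; the rate here is a placeholder, not Sonzai's actual pricing:

```typescript
interface Usage {
  prompt_tokens: number;     // input tokens reported by your endpoint
  completion_tokens: number; // output tokens reported by your endpoint
}

// Placeholder rate for illustration only.
const RATE_PER_TOKEN = 0.000001;

// Flat per-token billing: same rate for input and output tokens.
function billedAmount(usage: Usage): { tokens: number; cost: number } {
  const tokens = usage.prompt_tokens + usage.completion_tokens;
  return { tokens, cost: tokens * RATE_PER_TOKEN };
}

const bill = billedAmount({ prompt_tokens: 1200, completion_tokens: 300 });
console.log(bill.tokens); // 1500
```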