Custom LLM
Bring your own model while keeping the full managed experience — built-in tools, streaming, per-message extraction, personality evolution, and all behavioral systems. Sonzai calls your OpenAI-compatible endpoint instead of the default provider.
How It Works
Configure an OpenAI-compatible API endpoint for your project. Sonzai routes all chat generation through your endpoint while handling everything else: context assembly, tool execution, side-effect extraction, memory storage, personality tracking, and consolidation.
Full Managed Experience
Built-in tools (web search, memory recall, image generation, inventory), streaming SSE, per-message side effects — everything works exactly as with our default providers.
Your Model, Your Control
Use fine-tuned models, self-hosted endpoints, or any OpenAI-compatible provider (vLLM, Ollama, Together, Groq, Azure OpenAI, etc.).
Encrypted at Rest
Your API key is encrypted with AES-256 before storage. Only the first 8 characters are visible in the dashboard for identification.
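As an illustration of this scheme (a sketch using Node's built-in crypto module, not Sonzai's actual storage code, whose details are not published), AES-256-GCM encryption plus an 8-character display prefix might look like:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt an API key with AES-256-GCM and keep only an 8-char prefix for display.
// Illustrative sketch; field names and key handling are assumptions.
function storeApiKey(apiKey: string, masterKey: Buffer) {
  const iv = randomBytes(12); // fresh nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(apiKey, "utf8"), cipher.final()]);
  return {
    prefix: apiKey.slice(0, 8), // shown in the dashboard for identification
    iv,
    tag: cipher.getAuthTag(),
    ciphertext,                 // what is actually stored at rest
  };
}

function loadApiKey(stored: ReturnType<typeof storeApiKey>, masterKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, stored.iv);
  decipher.setAuthTag(stored.tag);
  return Buffer.concat([decipher.update(stored.ciphertext), decipher.final()]).toString("utf8");
}

const masterKey = randomBytes(32); // 256-bit key
const stored = storeApiKey("your-api-key-123", masterKey);
console.log(stored.prefix); // "your-api"
console.log(loadApiKey(stored, masterKey)); // "your-api-key-123"
```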
Per-Project Configuration
Each project can have its own custom LLM endpoint. Toggle it on/off without deleting the config.
Custom LLM vs. Standalone Memory
Which one should I use?
Custom LLM is the right choice when you want to use your own model but still want the full Sonzai experience (tools, streaming, per-message extraction). Standalone Memory is for when you need to control the entire chat loop yourself — e.g., for privacy preprocessing, data anonymization, or deep integration with an agent framework. See the Standalone Memory docs for the tradeoffs.
| Feature | Custom LLM | Standalone Memory |
|---|---|---|
| Built-in tools | Full support | Manual only |
| Streaming SSE | Yes | No |
| Per-message extraction | Automatic | Manual /process call |
| Memory prewarming | Yes | No |
| Data preprocessing | No | Full control |
| Agent framework integration | N/A | Full control |
Requirements
Your endpoint must be OpenAI-compatible:
- Accept `POST /chat/completions` (or the equivalent path your base URL resolves to)
- Accept OpenAI chat message format (`messages`, `model`, `temperature`, etc.)
- Return an SSE stream in OpenAI chunk format (`data: {"choices": [...]}`)
- Support the `tools`/`tool_choice` parameters if you want built-in tools to work
Compatible providers include: vLLM, Ollama, Together AI, Groq, Azure OpenAI, Fireworks AI, Anyscale, and any server implementing the OpenAI API spec.
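The chunk format in the requirements above can be checked with a small parser. A minimal sketch (the `parseSseChunk` helper and its types are illustrative, not part of the Sonzai SDK) that extracts content deltas from OpenAI-style `data:` lines:

```typescript
// Parse one SSE line in OpenAI chunk format and return the content delta, if any.
// Returns null for non-data lines and the terminal "data: [DONE]" sentinel.
function parseSseChunk(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  const chunk = JSON.parse(payload) as {
    choices: Array<{ delta?: { content?: string } }>;
  };
  return chunk.choices[0]?.delta?.content ?? null;
}

// Example: three lines as an OpenAI-compatible endpoint would stream them.
const lines = [
  'data: {"choices": [{"delta": {"content": "Hel"}}]}',
  'data: {"choices": [{"delta": {"content": "lo"}}]}',
  "data: [DONE]",
];
const text = lines
  .map(parseSseChunk)
  .filter((d): d is string => d !== null)
  .join("");
console.log(text); // "Hello"
```

If your endpoint emits lines this helper can reassemble, it satisfies the streaming requirement.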
Configuration via Dashboard
In the Sonzai dashboard, open your project settings and find the Custom LLM section:

- Enter your OpenAI-compatible endpoint URL (e.g., `https://api.together.xyz/v1`)
- Paste your API key (encrypted at rest with AES-256)
- Specify the model name (e.g., `meta-llama/Llama-3.1-70B-Instruct`)
- Optionally set a display name for easy identification
- Toggle active/inactive without deleting the config
Configuration via API
Set Configuration

```typescript
// Configure custom LLM for a project
await client.projects.customLlm.set("project-id", {
  endpoint: "https://api.together.xyz/v1",
  apiKey: "your-api-key",
  model: "meta-llama/Llama-3.1-70B-Instruct",
  displayName: "Together Llama 3.1 70B",
  isActive: true,
});
```

Get Configuration
```typescript
const config = await client.projects.customLlm.get("project-id");
if (config.configured) {
  console.log(config.endpoint);     // "https://api.together.xyz/v1"
  console.log(config.apiKeyPrefix); // "your-api" (first 8 chars)
  console.log(config.model);        // "meta-llama/Llama-3.1-70B-Instruct"
  console.log(config.isActive);     // true
}
```

Remove Configuration

```typescript
await client.projects.customLlm.delete("project-id");
```

How Chat Routes Through Your Model
Once configured, here is what happens when a chat request is made:
1. Context assembly — Sonzai builds the 7-layer enriched context (personality, memory, mood, habits, goals, relationships, game state) exactly as with default providers.
2. Tool injection — Built-in tools (`sonzai_memory_recall`, `sonzai_web_search`, etc.) and any custom tools are added to the request.
3. Endpoint call — The request is sent to your configured endpoint with your model name, API key, and the full message history, including the system prompt.
4. Streaming proxy — SSE chunks from your endpoint are streamed back to the client in real time.
5. Post-stream processing — After the stream completes, Sonzai extracts side effects (memory facts, mood changes, personality shifts, habits, tool calls) and stores them, same as with default providers.
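Conceptually, the tool-injection and endpoint-call steps amount to merging your stored config with the assembled context into one OpenAI-style request. A rough sketch (the `buildRequest` helper and its field names are illustrative assumptions, not Sonzai internals):

```typescript
interface CustomLlmConfig {
  endpoint: string; // e.g. "https://api.together.xyz/v1"
  apiKey: string;
  model: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Build the outbound request for an OpenAI-compatible endpoint:
// configured model name, full message history, injected tools, streaming on.
function buildRequest(
  config: CustomLlmConfig,
  messages: ChatMessage[],
  tools: object[],
): { url: string; headers: Record<string, string>; body: object } {
  return {
    url: `${config.endpoint.replace(/\/$/, "")}/chat/completions`,
    headers: {
      Authorization: `Bearer ${config.apiKey}`,
      "Content-Type": "application/json",
    },
    body: { model: config.model, messages, tools, stream: true },
  };
}

const req = buildRequest(
  {
    endpoint: "https://api.together.xyz/v1",
    apiKey: "your-api-key",
    model: "meta-llama/Llama-3.1-70B-Instruct",
  },
  [{ role: "user", content: "hi" }],
  [],
);
console.log(req.url); // "https://api.together.xyz/v1/chat/completions"
```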
Background Job Consistency
Background tasks like fact extraction, memory consolidation, diary generation, and summarization automatically use the same model family you configured. Sonzai tracks the last-used provider/model for each agent and routes background LLM calls accordingly.
Security
- API key encryption — Keys are encrypted with AES-256 before storage. Only the first 8 characters are visible.
- SSRF protection — Endpoint URLs are validated to block localhost, private IPs (10.x, 172.16-31.x, 192.168.x), link-local, and cloud metadata addresses.
- Project-scoped — Each config is scoped to a project. Different projects can use different endpoints.
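A validation rule like the SSRF check described above can be sketched as follows. This is a simplified illustration, not Sonzai's actual implementation; a hostname-based check alone is bypassable (e.g., via DNS rebinding), so a real defense must also resolve DNS and re-check addresses at connect time.

```typescript
// Reject endpoint URLs pointing at localhost, private ranges,
// link-local addresses, or cloud metadata services (simplified sketch).
function isBlockedEndpoint(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return true; // unparseable URLs are rejected outright
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return true;
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1") return true;
  if (host === "::1" || host === "[::1]") return true;        // IPv6 loopback
  if (/^10\./.test(host)) return true;                        // 10.0.0.0/8
  if (/^172\.(1[6-9]|2\d|3[01])\./.test(host)) return true;   // 172.16.0.0/12
  if (/^192\.168\./.test(host)) return true;                  // 192.168.0.0/16
  if (/^169\.254\./.test(host)) return true;                  // link-local + AWS metadata
  if (host === "metadata.google.internal") return true;       // GCP metadata
  return false;
}

console.log(isBlockedEndpoint("http://169.254.169.254/latest/meta-data")); // true
console.log(isBlockedEndpoint("http://192.168.1.10:8000/v1")); // true
console.log(isBlockedEndpoint("https://api.together.xyz/v1")); // false
```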
Billing
Custom LLM usage is billed at a flat per-token rate under the `custom_llm` billing model, regardless of which actual model your endpoint serves. Sonzai tracks input/output tokens from your endpoint's usage response. Your own endpoint costs (API fees, compute) are entirely yours.
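For example, if your endpoint's usage response reports token counts, the billed amount is just their sum times the flat rate. A sketch with OpenAI-style usage fields; the rate here is a placeholder, not Sonzai's actual pricing:

```typescript
interface Usage {
  prompt_tokens: number;     // input tokens reported by your endpoint
  completion_tokens: number; // output tokens reported by your endpoint
}

// Placeholder rate for illustration only.
const RATE_PER_TOKEN = 0.000001;

// Flat per-token billing: same rate for input and output tokens.
function billedAmount(usage: Usage): { tokens: number; cost: number } {
  const tokens = usage.prompt_tokens + usage.completion_tokens;
  return { tokens, cost: tokens * RATE_PER_TOKEN };
}

const bill = billedAmount({ prompt_tokens: 1200, completion_tokens: 300 });
console.log(bill.tokens); // 1500
```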