Voice
Real-time voice interaction with text-to-speech, speech-to-text, and live duplex streaming.
Voice gives every agent three modes of audio interaction: one-shot text-to-speech for spoken replies, speech-to-text for transcribing user audio, and a live duplex stream for full real-time conversations over a token-authenticated WebSocket. The same agent identity drives all three — same personality, same memory, same tools — so spoken turns are consolidated into the same session as text turns. Pick a voice name, choose an output format, and the Mind Layer handles synthesis, transcription, and turn-taking server-side.
Text-to-Speech (TTS)
Convert text to spoken audio.
const audio = await client.agents.voice.tts("agent-id", {
  text: "Hello! How can I help you today?",
  voiceName: "aria",
  language: "en",
  outputFormat: "mp3",
});
// audio.data contains the audio bytes
Speech-to-Text (STT)
Transcribe audio to text.
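The call below passes its audio in a variable named `base64AudioData`, which suggests the `audio` field expects a base64 string. As a minimal sketch under that assumption, the helper below encodes raw audio bytes and assembles the request object. The names `toBase64` and `buildSttRequest` are hypothetical and not part of the SDK; only the field names `audio`, `audioFormat`, and `language` come from this page.

```typescript
// Hypothetical helpers, not part of the SDK. Assumes the `audio` field
// expects standard (RFC 4648) base64 with padding.
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode bytes as base64 without relying on Node's Buffer or the browser's btoa.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    const triple = (b0 << 16) | (b1 << 8) | b2;
    out += B64_ALPHABET.charAt((triple >> 18) & 63);
    out += B64_ALPHABET.charAt((triple >> 12) & 63);
    // Pad with "=" when the final group has fewer than three bytes.
    out += has1 ? B64_ALPHABET.charAt((triple >> 6) & 63) : "=";
    out += has2 ? B64_ALPHABET.charAt(triple & 63) : "=";
  }
  return out;
}

// Build the options object in the shape used by the STT example on this page.
function buildSttRequest(bytes: Uint8Array, audioFormat: string, language: string) {
  return { audio: toBase64(bytes), audioFormat, language };
}
```

In Node, `Buffer.from(bytes).toString("base64")` produces the same string; the manual encoder is used here only so the sketch runs unchanged in browsers and on the server.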
const result = await client.agents.voice.stt("agent-id", {
  audio: base64AudioData,
  audioFormat: "wav",
  language: "en",
});
console.log(result.text);
Live Voice Streaming
Real-time duplex voice conversation. Get a token, then open a bidirectional stream.
// 1. Get a streaming token
const token = await client.agents.voice.getToken("agent-id", {
  voiceName: "aria",
  userId: "user-123",
});
// 2. Connect to the live stream
const stream = await client.agents.voice.stream(token);
// Send audio chunks
stream.sendAudio(audioChunk);
// Or send text for the agent to speak
stream.sendText("Tell me about your day");
// Receive events
for await (const event of stream) {
  if (event.type === "audio") {
    playAudio(event.data);
  } else if (event.type === "transcript") {
    console.log(event.text);
  }
}
// End session
stream.endSession();
WebSocket Transport
Live streaming is powered by WebSocket and supports real-time duplex audio. The client sends microphone audio chunks upstream while simultaneously receiving synthesized speech and transcripts downstream, enabling natural conversational flow.
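Sending microphone audio upstream usually means slicing a captured buffer into small, fixed-duration frames before handing each one to `stream.sendAudio`. The sketch below shows one way to do that. This page does not specify the expected audio encoding or frame size, so the 16 kHz, 16-bit mono PCM format and 20 ms frame length are assumptions, and `frameBytes` and `chunkAudio` are hypothetical helper names.

```typescript
// Hypothetical helpers, not part of the SDK.
// Assumption: raw 16-bit mono PCM; the page does not document the wire format.

// Number of bytes in one frame of the given duration.
function frameBytes(sampleRate: number, frameMs: number, bytesPerSample = 2): number {
  return Math.floor((sampleRate * frameMs) / 1000) * bytesPerSample;
}

// Split a PCM buffer into frames of `size` bytes; the last frame may be shorter.
function chunkAudio(pcm: Uint8Array, size: number): Uint8Array[] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    // subarray creates a view over the same memory, so no bytes are copied.
    frames.push(pcm.subarray(offset, Math.min(offset + size, pcm.length)));
  }
  return frames;
}

// Usage with the stream object from the example above (not executed here):
//   for (const frame of chunkAudio(pcm, frameBytes(16000, 20))) {
//     stream.sendAudio(frame);
//   }
```

Smaller frames generally reduce latency at the cost of more messages; the right size depends on what the server actually expects.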
Browse Voice Catalog
List available voices.
const voices = await client.voices.list({
  language: "en",
  gender: "female",
});
for (const voice of voices.voices) {
  console.log(voice.name, voice.language, voice.gender);
}
Voice capabilities
Four AgentCapabilities fields describe an agent's voice configuration:
| Field | Type | Description |
|---|---|---|
| voiceGeneration | boolean | Whether voice (TTS) generation is enabled for this agent |
| voiceUnlockedAt | string (ISO 8601) | When voice generation was granted |
| voiceId | string | The voice identifier used by default for this agent's TTS calls |
| voiceTier | number | Numeric tier level for voice quality (higher = higher quality/cost) |
voiceId and voiceTier are read from getCapabilities(). To persist a preferred voice for an agent, store the voiceId from voices.list() and pass it to TTS calls. voiceGeneration is platform-managed and turns on when your plan includes voice capabilities.
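Because voiceGeneration is platform-managed, it can help to check it before attempting a TTS call and to fall back to a default voice when no voiceId is set. The sketch below illustrates that check. The `VoiceCapabilities` interface mirrors the four fields in the table above, with nullability inferred from the example comments on this page; `resolveVoiceName` is a hypothetical helper, and treating voiceId as a valid voiceName is an assumption based on both showing "aria" in the examples.

```typescript
// Hypothetical type and helper, not part of the SDK.
// Field names follow the capabilities table; nullability is an assumption.
interface VoiceCapabilities {
  voiceGeneration: boolean;
  voiceUnlockedAt: string | null;
  voiceId: string | null;
  voiceTier: number | null;
}

// Returns the voice name to use for TTS, or null when voice is not enabled.
// Assumption: the agent's voiceId can be passed as voiceName.
function resolveVoiceName(caps: VoiceCapabilities, fallback: string): string | null {
  if (!caps.voiceGeneration) {
    return null; // Voice is not enabled for this agent; skip the TTS call.
  }
  return caps.voiceId !== null ? caps.voiceId : fallback;
}
```

A caller would skip synthesis when the result is null and otherwise pass the returned name as voiceName.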
// Read voice capability fields
const caps = await client.agents.getCapabilities("agent-id");
console.log(caps.voiceGeneration); // true | false
console.log(caps.voiceId); // "aria" or null
console.log(caps.voiceTier); // 1, 2, etc. or null
console.log(caps.voiceUnlockedAt); // "2024-11-01T00:00:00Z" or null
// Pick a voice and use it for TTS
const voices = await client.voices.list({ language: "en" });
const chosen = voices.voices[0];
const audio = await client.agents.voice.tts("agent-id", {
  text: "Hello!",
  voiceName: chosen.name,
  language: "en",
  outputFormat: "mp3",
});
In Practice
Voice is primarily relevant to companions and enterprise. For task agents, it's usually not needed — but if you're building a phone/IVR flow, the enterprise patterns apply.
Pick a voice that matches the character. Browse voices.list(), shortlist 3-5, and A/B test with real users before committing. The wrong voice kills immersion faster than any other mistake.
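One way to run the A/B test described above is to assign each user to a shortlisted voice deterministically, so the same user always hears the same voice across sessions. The sketch below hashes the user ID with an FNV-1a-style hash over UTF-16 code units and maps it onto the shortlist. `assignVoice` and `fnv1a` are hypothetical helpers, and the voice names other than "aria" are placeholders rather than entries confirmed to exist in the catalog.

```typescript
// Hypothetical helpers, not part of the SDK.

// FNV-1a-style 32-bit hash computed over UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // Coerce to an unsigned 32-bit integer.
}

// Deterministically map a user to one voice in the shortlist.
function assignVoice(userId: string, shortlist: string[]): string {
  if (shortlist.length === 0) {
    throw new Error("shortlist must not be empty");
  }
  return shortlist[fnv1a(userId) % shortlist.length];
}

// "nova" and "sage" are placeholder names for illustration only.
const shortlist = ["aria", "nova", "sage"];
const voiceForUser = assignVoice("user-123", shortlist);
```

The assigned name can then be passed as voiceName to TTS or getToken, and logged alongside engagement metrics to compare variants. Changing the shortlist length reshuffles assignments, so fix the shortlist before the test starts.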
Use duplex for live conversations. WebSocket duplex streams both STT (user input) and TTS (agent reply) in parallel — the natural shape for a live phone-call-style experience. Don't use polling TTS for companions; the latency kills presence.
Tune prosody. Set stability: 0.4-0.6 and clarity: 0.7-0.9 for a warm, expressive read. Pure stability sounds robotic.
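If prosody values come from user settings or per-character configuration, clamping them into the recommended ranges keeps the voice within the warm, expressive band. The sketch below does only that computation. This page does not say which call accepts stability and clarity, so where the resulting values are passed is left open; `Prosody`, `clampTo`, and `warmProsody` are hypothetical names.

```typescript
// Hypothetical type and helpers, not part of the SDK.
// Ranges come from the guidance above: stability 0.4-0.6, clarity 0.7-0.9.
interface Prosody {
  stability: number;
  clarity: number;
}

const STABILITY_RANGE: [number, number] = [0.4, 0.6];
const CLARITY_RANGE: [number, number] = [0.7, 0.9];

// Restrict a value to the inclusive range [min, max].
function clampTo(value: number, range: [number, number]): number {
  return Math.min(Math.max(value, range[0]), range[1]);
}

// Pull arbitrary prosody settings into the recommended ranges.
function warmProsody(input: Prosody): Prosody {
  return {
    stability: clampTo(input.stability, STABILITY_RANGE),
    clarity: clampTo(input.clarity, CLARITY_RANGE),
  };
}
```

For example, a configuration with stability 1.0 would be reduced to 0.6, which avoids the robotic read that maximum stability tends to produce.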