Real-time voice interaction with text-to-speech, speech-to-text, and live duplex streaming.

Voice gives every agent three modes of audio interaction: one-shot text-to-speech for spoken replies, speech-to-text for transcribing user audio, and a live duplex stream for full real-time conversations over a token-authenticated WebSocket. The same agent identity drives all three — same personality, same memory, same tools — so spoken turns are consolidated into the same session as text turns. Pick a voice name, choose an output format, and the Mind Layer handles synthesis, transcription, and turn-taking server-side.

Text-to-Speech (TTS)

Convert text to spoken audio.

const audio = await client.agents.voice.tts("agent-id", {
  text: "Hello! How can I help you today?",
  voiceName: "aria",
  language: "en",
  outputFormat: "mp3",
});
// audio.data contains the audio bytes
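
To play the returned bytes in a browser or serve them over HTTP, you usually need a MIME type that matches the requested outputFormat. The sketch below is a hypothetical helper, not part of the SDK. Only "mp3" appears as an output format on this page; the "wav" and "ogg" entries are assumptions about other formats you might request.

```typescript
// Hypothetical helper: maps an output format name to a standard audio MIME type.
// Only "mp3" is shown as an outputFormat in this guide; "wav" and "ogg" are assumed.
const MIME_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  ogg: "audio/ogg",
};

function mimeTypeFor(outputFormat: string): string {
  const mime = MIME_TYPES[outputFormat.toLowerCase()];
  if (mime === undefined) {
    throw new Error(`Unknown audio format: ${outputFormat}`);
  }
  return mime;
}
```

If audio.data is a binary buffer (an assumption, since its exact type is not stated here), a browser could wrap it as new Blob([audio.data], { type: mimeTypeFor("mp3") }) and play it through an audio element.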

Speech-to-Text (STT)

Transcribe audio to text.

const result = await client.agents.voice.stt("agent-id", {
  audio: base64AudioData,
  audioFormat: "wav",
  language: "en",
});
console.log(result.text);
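
The stt() call above takes the audio as a base64 string. If you start from raw bytes (a recorded buffer or a file read into memory), one way to produce base64AudioData is sketched below. toBase64 is a hypothetical, dependency-free helper; in practice a platform function such as Buffer.from(bytes).toString("base64") in Node.js does the same job.

```typescript
// Hypothetical helper: standard base64 encoding of raw bytes, with "=" padding.
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    // Pack up to three bytes into one 24-bit group; missing bytes count as zero.
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0;
    const b2 = bytes[i + 2] ?? 0;
    const triple = (b0 << 16) | (b1 << 8) | b2;

    // Emit four 6-bit symbols, padding where the input ran out.
    out += BASE64_ALPHABET.charAt((triple >> 18) & 63);
    out += BASE64_ALPHABET.charAt((triple >> 12) & 63);
    out += i + 1 < bytes.length ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : "=";
    out += i + 2 < bytes.length ? BASE64_ALPHABET.charAt(triple & 63) : "=";
  }
  return out;
}
```

The result can then be passed as the audio field, with audioFormat set to match the encoding of the original bytes.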

Live Voice Streaming

Real-time duplex voice conversation. Get a token, then open a bidirectional stream.

// 1. Get a streaming token
const token = await client.agents.voice.getToken("agent-id", {
  voiceName: "aria",
  userId: "user-123",
});

// 2. Connect to live stream
const stream = await client.agents.voice.stream(token);

// Send audio chunks
stream.sendAudio(audioChunk);

// Or send text for the agent to speak
stream.sendText("Tell me about your day");

// Receive events
for await (const event of stream) {
  if (event.type === "audio") {
    playAudio(event.data);
  } else if (event.type === "transcript") {
    console.log(event.text);
  }
}

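// End the session when the conversation is over.
// Code placed after the receive loop runs only once that loop finishes,
// so call this from wherever the conversation ends (for example, a hang-up handler).
stream.endSession();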
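
As a session grows, it can help to keep event handling separate from the receive loop. The sketch below folds events into a small state object. The VoiceEvent shapes are assumptions based only on the two event types shown above ("audio" with data, "transcript" with text); SessionState and applyEvent are hypothetical names.

```typescript
// Assumed event shapes, inferred from the two event types used in this guide.
type VoiceEvent =
  | { type: "audio"; data: Uint8Array }
  | { type: "transcript"; text: string };

// Hypothetical running summary of one live session.
interface SessionState {
  audioBytes: number;
  transcript: string[];
}

// Pure function: returns a new state rather than mutating the old one.
function applyEvent(state: SessionState, event: VoiceEvent): SessionState {
  switch (event.type) {
    case "audio":
      return { ...state, audioBytes: state.audioBytes + event.data.length };
    case "transcript":
      return { ...state, transcript: [...state.transcript, event.text] };
    default:
      // Ignore event types this sketch does not know about.
      return state;
  }
}
```

Inside the receive loop this becomes state = applyEvent(state, event), with playback kept as a separate side effect.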

WebSocket Transport

Live streaming is powered by WebSocket and supports real-time duplex audio. The client sends microphone audio chunks upstream while simultaneously receiving synthesized speech and transcripts downstream, enabling natural conversational flow.
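
Sending audio upstream in small chunks is what keeps a duplex stream responsive. A minimal sketch of splitting a captured buffer into fixed-size chunks follows. chunkAudio is a hypothetical helper, and the chunk size is an assumption you would tune; this page does not state a required frame size.

```typescript
// Hypothetical helper: splits a byte buffer into chunks of at most chunkBytes.
// The returned chunks are views onto the original buffer, not copies.
function chunkAudio(pcm: Uint8Array, chunkBytes: number): Uint8Array[] {
  if (Math.floor(chunkBytes) !== chunkBytes || chunkBytes <= 0) {
    throw new RangeError("chunkBytes must be a positive integer");
  }
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    chunks.push(pcm.subarray(offset, Math.min(offset + chunkBytes, pcm.length)));
  }
  return chunks;
}
```

Each chunk could then be passed to stream.sendAudio(chunk) in order; the final chunk may be shorter than the rest.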

Browse Voice Catalog

List available voices.

const voices = await client.voices.list({
  language: "en",
  gender: "female",
});

for (const voice of voices.voices) {
  console.log(voice.name, voice.language, voice.gender);
}

Voice capabilities

Four AgentCapabilities fields describe an agent's voice configuration:

Field	Type	Description
`voiceGeneration`	`boolean`	Whether voice (TTS) generation is enabled for this agent
`voiceUnlockedAt`	`string (ISO 8601)`	When voice generation was granted
`voiceId`	`string`	The voice identifier used by default for this agent's TTS calls
`voiceTier`	`number`	Numeric tier level for voice quality (higher = higher quality/cost)

voiceId and voiceTier are read from getCapabilities(). To persist a preferred voice for an agent, store the chosen voice's name from voices.list() and pass it as voiceName to TTS calls. voiceGeneration is platform-managed and flips to true when your plan includes voice capabilities.

// Read voice capability fields
const caps = await client.agents.getCapabilities("agent-id");
console.log(caps.voiceGeneration);  // true | false
console.log(caps.voiceId);          // "aria" or null
console.log(caps.voiceTier);        // 1, 2, etc. or null
console.log(caps.voiceUnlockedAt);  // "2024-11-01T00:00:00Z" or null

// Pick a voice and use it for TTS
const voices = await client.voices.list({ language: "en" });
const chosen = voices.voices[0];

const audio = await client.agents.voice.tts("agent-id", {
  text: "Hello!",
  voiceName: chosen.name,
  language: "en",
  outputFormat: "mp3",
});
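
Because voiceGeneration may be false and voiceId may be null, it is worth deciding which voice to use before calling TTS. The sketch below is one way to do that. VoiceCapabilities mirrors the four fields in the table above; resolveVoiceName is a hypothetical helper, and it assumes the agent's voiceId can be passed as voiceName, which the example value "aria" suggests but this page does not confirm.

```typescript
// Assumed shape, mirroring the four voice fields described in the table above.
interface VoiceCapabilities {
  voiceGeneration: boolean;
  voiceUnlockedAt: string | null;
  voiceId: string | null;
  voiceTier: number | null;
}

// Hypothetical helper: returns the voice to use, or null when voice is not enabled.
function resolveVoiceName(caps: VoiceCapabilities, fallback: string): string | null {
  if (!caps.voiceGeneration) {
    return null; // Voice is not enabled for this agent; skip the TTS call.
  }
  // Prefer the agent's default voice, otherwise use the caller's choice.
  return caps.voiceId ?? fallback;
}
```

A caller would skip TTS when the result is null and otherwise pass it as voiceName.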

In Practice

Voice matters most for companion and enterprise agents. Task agents usually do not need it, but if you're building a phone/IVR flow, the enterprise patterns apply.

Pick a voice that matches the character. Browse voices.list(), shortlist 3-5, and A/B test with real users before committing. The wrong voice kills immersion faster than any other mistake.
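
For the A/B test, each user should hear the same voice on every visit. A minimal sketch of deterministic assignment follows. assignVoice is a hypothetical helper, the hash is a simple illustrative one rather than anything the platform provides, and the shortlist would hold names taken from voices.list().

```typescript
// Hypothetical helper: maps a user ID to one voice from the shortlist,
// always returning the same voice for the same user and shortlist.
function assignVoice(userId: string, shortlist: string[]): string {
  if (shortlist.length === 0) {
    throw new Error("shortlist must not be empty");
  }
  // Simple 32-bit polynomial string hash; fine for bucketing, not for security.
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  const index = hash % shortlist.length;
  return shortlist[index] as string;
}
```

The chosen name can be passed as voiceName when requesting a token or calling TTS, and logged alongside whatever engagement metric you compare across variants.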

Use duplex for live conversations. WebSocket duplex streams both STT (user input) and TTS (agent reply) in parallel — the natural shape for a live phone-call-style experience. Don't use polling TTS for companions; the latency kills presence.

Tune prosody. Set stability: 0.4-0.6 and clarity: 0.7-0.9 for a warm, expressive read. Pure stability sounds robotic.
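
One way to keep prosody settings inside the ranges suggested above is to clamp them before use. The sketch below is illustrative only: Prosody and warmProsody are hypothetical names, and this page does not show where stability and clarity are passed, so how you hand the result to the API is left open.

```typescript
// Hypothetical settings object holding the two values named above.
interface Prosody {
  stability: number;
  clarity: number;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Pulls requested values into the suggested ranges:
// stability 0.4 to 0.6, clarity 0.7 to 0.9.
function warmProsody(requested: Prosody): Prosody {
  return {
    stability: clamp(requested.stability, 0.4, 0.6),
    clarity: clamp(requested.clarity, 0.7, 0.9),
  };
}
```

Values already inside the ranges pass through unchanged, so the helper is safe to apply to every request.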
