Generate speech audio files from text using HeyGen's in-house Starfish TTS model via the v3 API. This skill is for standalone audio generation — separate from video creation.
Authentication
All requests require the X-Api-Key header. Set the HEYGEN_API_KEY environment variable.
curl -X GET "https://api.heygen.com/v3/voices?engine=starfish" \
-H "X-Api-Key: $HEYGEN_API_KEY"
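The curl call above reads the key from the environment. A minimal Python sketch of the same idea follows; `heygen_headers` is a hypothetical helper name (not part of any HeyGen SDK) that fails early when HEYGEN_API_KEY is missing instead of sending an unauthenticated request.

```python
import os


def heygen_headers(env=None):
    """Return the auth header dict, raising if the API key is not configured.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    api_key = (env.get("HEYGEN_API_KEY") or "").strip()
    if not api_key:
        raise RuntimeError("HEYGEN_API_KEY is not set")
    return {"X-Api-Key": api_key}
```

Passing the environment in as a parameter keeps the helper easy to exercise without touching real credentials.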
Tool Selection
If HeyGen MCP tools are available (mcp__heygen__*), prefer them over direct HTTP API calls.
| Task | MCP Tool | Fallback (Direct API) |
|---|---|---|
| List TTS voices | mcp__heygen__list_audio_voices | GET /v3/voices?engine=starfish |
| Generate speech audio | mcp__heygen__text_to_speech | POST /v3/voices/speech |
Default Workflow
- List voices with mcp__heygen__list_audio_voices (or GET /v3/voices?engine=starfish)
- Pick a voice matching the desired language, gender, and features
- Call mcp__heygen__text_to_speech (or POST /v3/voices/speech) with text and voice_id
- Use the returned audio_url to download or play the audio
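The "pick a voice" step in the workflow above can be sketched as a small filter over the voice list. This is an illustration only: `pick_voice` is a hypothetical helper, and it assumes each voice is a dict with the language, gender, support_pause, and support_locale fields shown in the voice listing response.

```python
def pick_voice(voices, language, gender=None, need_pause=False, need_locale=False):
    """Return the first voice dict that satisfies every filter, or None.

    Language matching is a case-insensitive substring test, so "english"
    matches a voice whose language field is "English".
    """
    wanted = language.lower()
    for voice in voices:
        if wanted not in voice.get("language", "").lower():
            continue
        if gender and voice.get("gender") != gender:
            continue
        if need_pause and not voice.get("support_pause"):
            continue
        if need_locale and not voice.get("support_locale"):
            continue
        return voice
    return None
```

Returning None rather than raising leaves it to the caller to decide whether to fall back to another language or stop.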
List TTS Voices
Retrieve voices compatible with the Starfish TTS model.
Note: This uses the unified GET /v3/voices endpoint with the engine=starfish filter to return only TTS-compatible voices. Not all video voices support Starfish TTS. The response is paginated — use next_token to fetch additional pages.
Query Parameters
| Param | Type | Description |
|---|---|---|
| engine | string | Filter by engine (use starfish for TTS voices) |
| type | string | public or private |
| language | string | Filter by language |
| gender | string | Filter by gender |
| limit | integer | Results per page, 1-100 |
| token | string | Pagination cursor from next_token |
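To show how the query parameters above combine, here is a small sketch that builds the request URL with the standard library. `build_voices_url` is a hypothetical helper; the accepted values for language and gender (for example "English" versus "en") are not stated here, so treat the sample values as assumptions.

```python
from urllib.parse import urlencode

VOICES_URL = "https://api.heygen.com/v3/voices"


def build_voices_url(language=None, gender=None, voice_type=None, limit=None, token=None):
    """Build a Starfish voice-listing URL, omitting parameters left as None."""
    if limit is not None and not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params = {
        "engine": "starfish",  # always sent so only TTS-compatible voices return
        "type": voice_type,
        "language": language,
        "gender": gender,
        "limit": limit,
        "token": token,
    }
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{VOICES_URL}?{query}"
```

Checking limit locally avoids a round trip for a value the table already rules out.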
curl
curl -X GET "https://api.heygen.com/v3/voices?engine=starfish" \
-H "X-Api-Key: $HEYGEN_API_KEY"
TypeScript
interface AudioVoiceItem {
  voice_id: string;
  name: string;
  language: string;
  gender: "female" | "male" | "unknown";
  preview_audio_url: string | null;
  support_pause: boolean;
  support_locale: boolean;
  type: string;
}

interface TTSVoicesResponse {
  error: null | string;
  data: AudioVoiceItem[];
  has_more: boolean;
  next_token: string | null;
}

// Fetch every Starfish-compatible voice, following next_token until the last page.
async function listTTSVoices(): Promise<AudioVoiceItem[]> {
  const allVoices: AudioVoiceItem[] = [];
  let token: string | null = null;
  do {
    const url = new URL("https://api.heygen.com/v3/voices");
    url.searchParams.set("engine", "starfish");
    if (token) url.searchParams.set("token", token);
    const response = await fetch(url.toString(), {
      headers: { "X-Api-Key": process.env.HEYGEN_API_KEY! },
    });
    const json: TTSVoicesResponse = await response.json();
    if (json.error) {
      throw new Error(json.error);
    }
    allVoices.push(...json.data);
    token = json.next_token;
  } while (token);
  return allVoices;
}
Python
import requests
import os

def list_tts_voices() -> list:
    """Return all Starfish-compatible voices, following pagination."""
    all_voices = []
    token = None
    while True:
        params = {"engine": "starfish"}
        if token:
            params["token"] = token
        response = requests.get(
            "https://api.heygen.com/v3/voices",
            headers={"X-Api-Key": os.environ["HEYGEN_API_KEY"]},
            params=params,
        )
        data = response.json()
        if data.get("error"):
            raise Exception(data["error"])
        all_voices.extend(data["data"])
        # Stop once the API reports that no further pages exist.
        if not data.get("has_more"):
            break
        token = data.get("next_token")
    return all_voices
Response Format
{
  "error": null,
  "data": [
    {
      "voice_id": "f38a635bee7a4d1f9b0a654a31d050d2",
      "name": "Chill Brian",
      "language": "English",
      "gender": "male",
      "preview_audio_url": "https://resource.heygen.ai/text_to_speech/WpSDQvmLGXEqXZVZQiVeg6.mp3",
      "support_pause": true,
      "support_locale": false,
      "type": "public"
    }
  ],
  "has_more": false,
  "next_token": null
}
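Because the response carries has_more and next_token, the pages can also be consumed lazily, which lets a caller stop as soon as a suitable voice turns up. The sketch below separates the paging logic from the HTTP call: `iter_voices` and the `fetch_page` callable are hypothetical names, and `fetch_page(token)` is assumed to return one parsed response dict in the format shown above.

```python
def iter_voices(fetch_page):
    """Yield voices one at a time across all pages.

    `fetch_page(token)` must return a parsed response dict; `token` is None
    for the first page and the previous next_token afterwards.
    """
    token = None
    while True:
        page = fetch_page(token)
        if page.get("error"):
            raise RuntimeError(page["error"])
        yield from page["data"]
        token = page.get("next_token")
        # Also stop when the cursor is missing, to avoid refetching page one.
        if not page.get("has_more") or not token:
            return
```

Injecting the fetch function keeps the loop testable offline; in real use it would wrap a GET to /v3/voices with the token parameter.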
Generate Speech Audio
Convert text to speech audio using a specified voice.
Endpoint
POST https://api.heygen.com/v3/voices/speech
Request Fields
| Field | Type | Req | Description |
|---|---|---|---|
| text | string | Y | Text content to convert (1-5000 characters) |
| voice_id | string | Y | Voice ID from GET /v3/voices?engine=starfish |
| input_type | string | | "text" (default) or "ssml" for full SSML markup |
| speed | number | | Speech speed, 0.5-2.0 (default: 1.0) |
| language | string | | Base language code (e.g., "en", "pt"). Auto-detected if omitted |
| locale | string | | BCP-47 locale for multilingual voices (e.g., "en-US", "pt-BR") |
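The limits in the table above can be checked locally before a request is sent. The following is a sketch under assumptions: `build_tts_payload` is a hypothetical helper, and it counts characters with Python's len (Unicode code points), which may not match how the API counts them.

```python
def build_tts_payload(text, voice_id, input_type="text", speed=1.0, language=None, locale=None):
    """Validate the documented field limits and return the request body dict."""
    if not 1 <= len(text) <= 5000:
        raise ValueError("text must be 1-5000 characters")
    if not voice_id:
        raise ValueError("voice_id is required")
    if input_type not in ("text", "ssml"):
        raise ValueError('input_type must be "text" or "ssml"')
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    payload = {"text": text, "voice_id": voice_id, "input_type": input_type, "speed": speed}
    # Optional fields are only included when the caller sets them.
    if language:
        payload["language"] = language
    if locale:
        payload["locale"] = locale
    return payload
```

Failing locally gives a clearer error message than waiting for the API to reject the request.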
curl
curl -X POST "https://api.heygen.com/v3/voices/speech" \
-H "X-Api-Key: $HEYGEN_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello! Welcome to our product demo.",
"voice_id": "YOUR_VOICE_ID",
"speed": 1.0
}'
TypeScript
interface TTSRequest {
  text: string;
  voice_id: string;
  input_type?: "text" | "ssml";
  speed?: number;
  language?: string;
  locale?: string;
}

interface WordTimestamp {
  word: string;
  start: number;
  end: number;
}

interface TTSResponse {
  error: null | string;
  data: {
    audio_url: string;
    duration: number;
    request_id?: string;
    word_timestamps?: WordTimestamp[];
  };
}

// Send a speech request and return the data payload (audio_url, duration, ...).
async function textToSpeech(request: TTSRequest): Promise<TTSResponse["data"]> {
  const response = await fetch(
    "https://api.heygen.com/v3/voices/speech",
    {
      method: "POST",
      headers: {
        "X-Api-Key": process.env.HEYGEN_API_KEY!,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(request),
    }
  );
  const json: TTSResponse = await response.json();
  if (json.error) {
    throw new Error(json.error);
  }
  return json.data;
}
Python
import requests
import os

def text_to_speech(
    text: str,
    voice_id: str,
    input_type: str = "text",
    speed: float = 1.0,
    language: str | None = None,
    locale: str | None = None,
) -> dict:
    """Generate speech audio and return the response's data object."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speed": speed,
    }
    # Only send optional fields when they differ from the defaults.
    if input_type != "text":
        payload["input_type"] = input_type
    if language:
        payload["language"] = language
    if locale:
        payload["locale"] = locale
    response = requests.post(
        "https://api.heygen.com/v3/voices/speech",
        headers={
            "X-Api-Key": os.environ["HEYGEN_API_KEY"],
            "Content-Type": "application/json",
        },
        json=payload,
    )
    data = response.json()
    if data.get("error"):
        raise Exception(data["error"])
    return data["data"]
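Since text is capped at 5000 characters per request, longer scripts have to be split and sent as several requests. One possible approach is sketched below: `split_text` is a hypothetical helper that prefers sentence boundaries and only cuts mid-sentence when a single sentence exceeds the limit. Joining the resulting audio files is outside the scope of this sketch.

```python
import re


def split_text(text, max_chars=5000):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is too long on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to text_to_speech in order, using the same voice_id so the voice stays consistent across segments.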
Response Format
{
  "error": null,
  "data": {
    "audio_url": "https://resource2.heygen.ai/text_to_speech/.../id=365d46bb.wav",
    "duration": 5.526,
    "request_id": "p38QJ52hfgNlsYKZZmd9",
    "word_timestamps": [
      { "word": "", "start": 0.0, "end": 0.0 },
      { "word": "Hey", "start": 0.079, "end": 0.219 },
      { "word": "there,", "start": 0.239, "end": 0.459 },
      { "word": "", "start": 5.526, "end": 5.526 }
    ]
  }
}
Usage Examples
Basic TTS
const result = await textToSpeech({
  text: "Welcome to our quarterly earnings call.",
  voice_id: "YOUR_VOICE_ID",
});

console.log(`Audio URL: ${result.audio_url}`);
console.log(`Duration: ${result.duration}s`);
With Speed Adjustment
const result = await textToSpeech({
  text: "We're thrilled to announce our newest feature!",
  voice_id: "YOUR_VOICE_ID",
  speed: 1.1,
});
With Language and Locale for Multilingual Voices
const result = await textToSpeech({
  text: "Bem-vindo ao nosso produto.",
  voice_id: "MULTILINGUAL_VOICE_ID",
  language: "pt",
  locale: "pt-BR",
});
With SSML Input
const result = await textToSpeech({
text: 'Hello and welcome!',
voice_id: "YOUR_VOICE_ID",
input_type: "ssml",
});
Find a Voice and Generate Audio
// Pick the first voice whose language matches, then synthesize the text with it.
async function generateSpeech(text: string, language: string): Promise<string> {
  const voices = await listTTSVoices();
  const voice = voices.find(
    (v) => v.language.toLowerCase().includes(language.toLowerCase())
  );

  if (!voice) {
    throw new Error(`No TTS voice found for language: ${language}`);
  }
  const result = await textToSpeech({
    text,
    voice_id: voice.voice_id,
  });
  return result.audio_url;
}

const audioUrl = await generateSpeech("Hello and welcome!", "english");
Pauses with Break Tags
Use SSML-style break tags in your text for pauses:
word <break time="1s"/> word
Rules:
- Use seconds with an s suffix: <break time="1.5s"/>
- Must have spaces before and after the tag
- Self-closing tag format
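The rules above can be applied mechanically when text is assembled from separate phrases. A minimal sketch, assuming the standard SSML form of the tag (`<break time="Ns"/>`): `join_with_pauses` is a hypothetical helper that emits a self-closing tag with a seconds suffix and a space on each side.

```python
def join_with_pauses(segments, pause_seconds=1.0):
    """Join text segments with a break tag that follows the rules above."""
    if pause_seconds <= 0:
        raise ValueError("pause_seconds must be positive")
    # ":g" drops a trailing ".0", so 1.0 renders as "1s" and 1.5 as "1.5s".
    tag = f'<break time="{pause_seconds:g}s"/>'
    cleaned = [segment.strip() for segment in segments if segment.strip()]
    return f" {tag} ".join(cleaned)
```

Stripping each segment first guarantees exactly one space on either side of the tag, whatever whitespace the inputs carried.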
With v3, you can also use input_type: "ssml" for full SSML support, allowing richer markup beyond just break tags:
{
  "text": "Welcome! Let's get started.",
  "voice_id": "YOUR_VOICE_ID",
  "input_type": "ssml"
}
Best Practices
- Use GET /v3/voices?engine=starfish to find compatible voices. The unified /v3/voices endpoint serves all voice types, so the engine=starfish filter is essential for TTS
- Check support_locale before setting a locale; only multilingual voices support locale selection
- Keep speed between 0.8 and 1.2 for natural-sounding output
- Preview voices using the preview_audio_url before generating (may be null for some voices)
- Use word_timestamps in the response for caption syncing or timed text overlays
- Use SSML break tags in your text for pauses: word <break time="1s"/> word
- Use input_type: "ssml" when you need full SSML markup control beyond simple break tags
- Paginate voice listing: the v3 endpoint returns paginated results, so use has_more and next_token to fetch all voices
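Two of the practices above, checking support_locale and keeping speed in the 0.8-1.2 range, can be enforced just before a request is sent. The sketch below is illustrative: `apply_voice_constraints` is a hypothetical helper, and the speed clamp reflects the recommendation in this list, not an API limit (the API itself accepts 0.5-2.0).

```python
def apply_voice_constraints(payload, voice, natural_speed=True):
    """Return a copy of payload adjusted to what the chosen voice supports."""
    adjusted = dict(payload)  # leave the caller's dict untouched
    if not voice.get("support_locale"):
        # Voices that are not multilingual do not support locale selection.
        adjusted.pop("locale", None)
    if natural_speed and "speed" in adjusted:
        adjusted["speed"] = min(1.2, max(0.8, adjusted["speed"]))
    return adjusted
```

Pass natural_speed=False when a deliberately fast or slow delivery is wanted.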