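Version
Fix: updated the TTS file path to /tmp/openclaw/tts-*/, added a latest-file safety check, and added a workspace copy step for compatibility with the message tool.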
ElevenLabs TTS (Text-to-Speech)
Generate expressive voice messages using ElevenLabs v3 with audio tags.
Prerequisites
- ElevenLabs API Key (ELEVENLABS_API_KEY): Required. Get one at elevenlabs.io → Profile → API Keys. Configure in openclaw.json under messages.tts.elevenlabs.apiKey.
- ffmpeg: Required for audio format conversion (MP3 → Opus for WhatsApp compatibility). Must be installed and available on PATH.
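A quick preflight check for the two prerequisites could look like the sketch below. The helper names `have_cmd` and `check_prereqs` are hypothetical and not part of OpenClaw, and the check assumes the key is exported as an environment variable rather than stored only in `openclaw.json`.

```bash
# Hypothetical preflight helpers; not part of OpenClaw or ElevenLabs.

# Succeeds if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Prints one line per missing prerequisite.
# Returns non-zero if anything is missing.
check_prereqs() {
  missing=0
  if [ -z "${ELEVENLABS_API_KEY:-}" ]; then
    echo "missing: ELEVENLABS_API_KEY"
    missing=1
  fi
  if ! have_cmd ffmpeg; then
    echo "missing: ffmpeg"
    missing=1
  fi
  return "$missing"
}
```

Running `check_prereqs || exit 1` at the top of a script stops early when either prerequisite is absent.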
Quick Start Examples
Storytelling (emotional journey):
[soft] It started like any other day... [pause] But something felt different. [nervous] My hands were shaking as I opened the envelope. [gasps] I got in! [excited] I actually got in! [laughs] [happy] This changes everything!
Horror/Suspense (building dread):
[whispers] The house has been empty for years... [pause] At least, that's what they told me. [nervous] But I keep hearing footsteps. [scared] They're getting closer. [gasps] [panicking] The door— it's opening by itself!
Conversation with reactions:
[curious] So what happened at the meeting? [pause] [surprised] Wait, they fired him?! [gasps] [sad] That's terrible... [sighs] He had a family. [thoughtful] I wonder what he'll do now.
Hebrew (romantic moment):
[soft] היא עמדה שם, מול השקיעה... [pause] הלב שלי פעם כל כך חזק. [nervous] לא ידעתי מה להגיד. [hesitates] אני... [breathes] [tender] את יודעת שאני אוהב אותך, נכון?
Spanish (celebration to reflection):
[excited] ¡Lo logramos! [laughs] [happy] No puedo creerlo... [pause] [thoughtful] Fueron tantos años de trabajo. [emotional] [soft] Gracias a todos los que creyeron en mí. [sighs] [content] Valió la pena cada momento.
Configuration (OpenClaw)
In openclaw.json, configure TTS under messages.tts:
{
  "messages": {
    "tts": {
      "provider": "elevenlabs",
      "elevenlabs": {
        "apiKey": "sk_your_api_key_here",
        "voiceId": "pNInz6obpgDQGcFmaJgB",
        "modelId": "eleven_v3",
        "languageCode": "en",
        "voiceSettings": {
          "stability": 0.5,
          "similarityBoost": 0.75,
          "style": 0,
          "useSpeakerBoost": true,
          "speed": 1
        }
      }
    }
  }
}
Getting your API Key:
- Go to https://elevenlabs.io
- Sign up/login
- Click profile → API Keys
- Copy your key
Recommended Voices for v3
These premade voices are optimized for v3 and work well with audio tags:
| Voice | ID | Gender | Accent | Best For |
|---|---|---|---|---|
| Adam | pNInz6obpgDQGcFmaJgB | Male | American | Deep narration, general use |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female | American | Calm narration, conversational |
| Brian | nPczCjzI2devNBz1zQrb | Male | American | Deep narration, podcasts |
| Charlotte | XB0fDUnXU5powFXDhCwa | Female | English-Swedish | Expressive, video games |
| George | JBFqnCBsd6RMkjVDRZzb | Male | British | Raspy narration, storytelling |
- Browse: https://elevenlabs.io/voice-library
- v3-optimized collection: https://elevenlabs.io/app/voice-library/collections/aF6JALq9R6tXwCczjhKH
- API: GET https://api.elevenlabs.io/v1/voices
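The voices endpoint can be queried with `curl`; ElevenLabs authenticates API requests through the `xi-api-key` header. The wrapper name `list_voices` is hypothetical, and the sketch assumes the key is exported as `ELEVENLABS_API_KEY`.

```bash
# Hypothetical wrapper around GET /v1/voices.
# In a non-interactive shell, the ${VAR:?} expansion aborts with the given
# message when the API key is unset or empty, before any request is made.
list_voices() {
  : "${ELEVENLABS_API_KEY:?ELEVENLABS_API_KEY is not set}"
  curl -sS -H "xi-api-key: ${ELEVENLABS_API_KEY}" \
    "https://api.elevenlabs.io/v1/voices"
}
```

The response is JSON, so it can be piped to a JSON processor to pull out voice names and IDs.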
Voice selection tips:
- Use IVC (Instant Voice Clone) or premade voices; PVC (Professional Voice Clone) voices are not yet optimized for v3
- Match voice character to your use case (a whispering voice won't shout well)
- For expressive IVCs, include varied emotional tones in training samples
Model Settings
- Model: eleven_v3 (alpha) - ONLY model supporting audio tags
- Languages: 70+ supported with full audio tag control
Stability Modes
| Mode | Stability | Description |
|---|---|---|
| Creative | 0.3-0.5 | More emotional/expressive, may hallucinate |
| Natural | 0.5-0.7 | Balanced, closest to original voice |
| Robust | 0.7-1.0 | Highly stable, less responsive to tags |
Speed Control
Range: 0.7 (slow) to 1.2 (fast), default 1.0
Values near either end of the range can degrade quality. For pacing, prefer audio tags like [rushed] or [drawn out].
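A speed value can be checked against the documented range before it goes into the config. This is a minimal sketch; `valid_speed` is a hypothetical helper, and it delegates the floating-point comparison to `awk` because POSIX shell arithmetic is integer-only.

```bash
# Succeeds if the argument is a plain decimal number within 0.7-1.2.
valid_speed() {
  awk -v s="$1" 'BEGIN {
    ok = (s ~ /^[0-9]+(\.[0-9]+)?$/) && (s + 0 >= 0.7) && (s + 0 <= 1.2)
    exit !ok
  }'
}
```

For example, `valid_speed 1.0` succeeds while `valid_speed 1.5` fails.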
Critical Rules
Length Limits
- Optimal: <800 characters per segment (best quality)
- Maximum: 10,000 characters (API hard limit)
- Quality degrades with longer text - voice becomes inconsistent
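The limits above can be checked before calling the TTS tool. The sketch below uses two hypothetical helpers: `classify_length` compares the text length to the 800 and 10,000 character thresholds, and `split_segments` does a rough split with `fold`. Note the assumptions: `fold` breaks at spaces, not sentence boundaries, and depending on shell and locale the lengths may be counted in bytes rather than characters for non-ASCII text.

```bash
# Classifies text against the limits above:
#   ok       -> under 800 characters
#   split    -> 800 to 10,000 characters; consider splitting into segments
#   too-long -> over the 10,000-character API limit
classify_length() {
  len=${#1}
  if [ "$len" -gt 10000 ]; then
    echo "too-long"
  elif [ "$len" -ge 800 ]; then
    echo "split"
  else
    echo "ok"
  fi
}

# Reads text on stdin and wraps it into lines of at most 799 columns,
# breaking at spaces. Each output line is one candidate segment.
split_segments() {
  fold -s -w 799
}
```

A split at a space can still land mid-sentence, so segments produced this way are worth reviewing by hand before generation.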
Audio Tags - Best Practices for Natural Sound
How many tags to use:
- 1-2 tags per sentence or phrase (not more!)
- Tags persist until the next tag - no need to repeat
- Overusing tags sounds unnatural and robotic
Where to place tags:
- At emotional transition points
- Before key dramatic moments
- When energy/pace changes
Context matters:
- Write text that matches the tag emotion
- Longer text with context = better interpretation
- Example: "[nervous] I... I'm not sure about this. What if it doesn't work?" works better than "[nervous] Hello."
Combine tags for nuance:
- [nervously][whispers] = nervous whispering
- [excited][laughs] = excited laughter
- Keep combinations to 2 tags max
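Tag usage can be sanity-checked with a couple of small text filters. This is only a sketch: `count_tags` and `has_tag_pileup` are hypothetical helpers, and they treat any bracketed span as a tag, so bracketed text that is not an audio tag is counted as well.

```bash
# Prints the number of bracketed tags such as [excited] in the given text.
count_tags() {
  printf '%s\n' "$1" | grep -o '\[[^]]*]' | wc -l | tr -d ' '
}

# Succeeds if the text contains a run of three or more adjacent tags,
# which exceeds the two-tag combination guideline above.
has_tag_pileup() {
  printf '%s\n' "$1" | grep -Eq '(\[[^]]*] *){3,}'
}
```

Comparing the output of `count_tags` with the number of sentences gives a rough sense of whether a passage is over-tagged.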
Regenerate for best results:
- v3 is non-deterministic - same text = different outputs
- Generate 3+ versions, pick the best
- Small text tweaks can improve results
Match tag to voice:
- Don't use [shouts] on a whispering voice
- Don't use [whispers] on a loud/energetic voice
- Test tags with your chosen voice
SSML Not Supported
v3 does NOT support SSML break tags. Use audio tags and punctuation instead.
Punctuation Effects (use with tags!)
Punctuation enhances audio tags:
- Ellipses (...) → dramatic pauses: [nervous] I... I don't know...
- CAPS → emphasis: [excited] That's AMAZING!
- Dashes (—) → interruptions: [explaining] So what you do is— [interrupting] Wait!
- Question marks → uncertainty: [nervous] Are you sure about this?
- Exclamation! → energy boost: [happy] We did it!
Combine tags + punctuation for maximum effect:
[tired] It was a long day... [sighs] Nobody listens anymore.
WhatsApp Voice Messages
Complete Workflow
- Generate with tts tool (returns Opus in /tmp/openclaw/tts-*/)
- Copy to workspace (message tool only allows workspace paths)
- Send with message tool
- Cleanup - delete the workspace copy
Step-by-Step
1. Generate TTS (add [pause] at end to prevent cutoff):
tts text="[excited] This is amazing! [pause]" channel=whatsapp
2. Find the LATEST file (⚠️ CRITICAL - always use the newest file!):
find /tmp/openclaw/tts-* /tmp/tts-* -name "*.opus" -o -name "*.mp3" -o -name "*.ogg" 2>/dev/null | xargs ls -t | head -1
The tts tool now outputs to /tmp/openclaw/tts-*/ (NOT /tmp/tts-*/).
Old files may exist in /tmp/tts-*/ from previous sessions - never use those!
3. If file is MP3, convert to Opus:
ffmpeg -i /path/to/voice.mp3 -c:a libopus -b:a 64k -vbr on -application voip /path/to/voice.ogg
If already .opus, skip this step.
4. Copy to workspace and send:
cp /tmp/openclaw/tts-xxx/voice.opus ~/.openclaw/workspace/voice-temp.ogg
message action=send channel=whatsapp target="+972..." filePath="/root/.openclaw/workspace/voice-temp.ogg" asVoice=true message=" "
5. Cleanup:
rm /root/.openclaw/workspace/voice-temp.ogg
WhatsApp requires a non-empty message body to send voice notes. Use a single space as the message.
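Steps 2 to 4 can be sketched as shell helpers. The function names are hypothetical, and the `tts` and `message` tools are not called because they are invoked by the agent, not from the shell. Unlike a pipeline through `xargs ls -t`, `latest_audio` prints nothing when no file matches, and it searches only the directories it is given.

```bash
# Prints the most recently modified audio file under the given directories.
# Prints nothing if no matching file exists.
latest_audio() {
  find "$@" -type f \( -name '*.opus' -o -name '*.mp3' -o -name '*.ogg' \) \
    -exec ls -t {} + 2>/dev/null | head -n 1
}

# Succeeds if the file still needs MP3 -> Opus conversion.
needs_conversion() {
  case "$1" in
    *.mp3) return 0 ;;
    *) return 1 ;;
  esac
}

# Converts (if needed) or copies the file into the workspace directory,
# then prints the path of the staged copy.
stage_voice_note() {
  src=$1
  workspace=$2
  dest="$workspace/voice-temp.ogg"
  if needs_conversion "$src"; then
    ffmpeg -y -loglevel error -i "$src" \
      -c:a libopus -b:a 64k -vbr on -application voip "$dest"
  else
    cp "$src" "$dest"
  fi && printf '%s\n' "$dest"
}

# Example:
#   file=$(latest_audio /tmp/openclaw/tts-*)
#   [ -n "$file" ] && stage_voice_note "$file" "$HOME/.openclaw/workspace"
```

Passing only `/tmp/openclaw/tts-*` keeps stale files in `/tmp/tts-*/` out of the search entirely.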
Why Opus?
| Format | iOS | Android | Transcribe |
|---|---|---|---|
| MP3 | ✅ Works | ❌ May fail | ❌ No |
| Opus (.ogg) | ✅ Works | ✅ Works | ✅ Yes |
- Works on all devices (iOS + Android)
- Supports WhatsApp's transcribe button
Audio Cutoff Fix
ElevenLabs sometimes cuts off the last word. Always add [pause] or ... at the end:
[excited] This is amazing! [pause]
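The trailing pause can be added automatically before text is handed to the TTS tool. A minimal sketch, with `ensure_trailing_pause` as a hypothetical helper; it assumes the text has no trailing whitespace after an existing `[pause]` or `...`.

```bash
# Appends " [pause]" unless the text already ends with [pause] or "...".
ensure_trailing_pause() {
  case "$1" in
    *"[pause]" | *"...") printf '%s\n' "$1" ;;
    *) printf '%s [pause]\n' "$1" ;;
  esac
}
```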
Long-Form Audio (Podcasts)
For content >800 chars:
- Split into short segments (<800 chars each)
- Generate each with tts tool
- Concatenate with ffmpeg:
cat > list.txt << EOF
file '/path/file1.mp3'
file '/path/file2.mp3'
EOF
ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3
- Convert to Opus for WhatsApp
- Send as single voice message
Important: Don't mention "part 2" or "chapter" - keep it seamless.
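For more than a couple of segments, the concat list can be generated instead of typed by hand. The helper `make_concat_list` is hypothetical, and the sketch assumes segment paths contain no single quotes or newlines, since those would need escaping in the list file.

```bash
# Writes an ffmpeg concat-demuxer list for the given files to stdout,
# one "file '<path>'" line per argument, in the order given.
make_concat_list() {
  for f in "$@"; do
    printf "file '%s'\n" "$f"
  done
}

# Example (requires ffmpeg; segment paths are placeholders):
#   make_concat_list /path/file1.mp3 /path/file2.mp3 > list.txt
#   ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3
```

Argument order determines playback order, so segments should be passed in the sequence they were written.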
Multi-Speaker Dialogue
v3 can handle multiple characters in one generation:
Jessica: [whispers] Did you hear that?
Chris: [interrupting] —I heard it too!
Jessica: [panicking] We need to hide!
Dialogue tags: [interrupting], [overlapping], [cuts in], [interjecting]
Audio Tags Quick Reference
| Category | Tags | When to Use |
|---|---|---|
| Emotions | [excited], [happy], [sad], [angry], [nervous], [curious] | Main emotional state - use 1 per section |
| Delivery | [whispers], [shouts], [soft], [rushed], [drawn out] | Volume/speed changes |
| Reactions | [laughs], [sighs], [gasps], [clears throat], [gulps] | Natural human moments - sprinkle sparingly |
| Pacing | [pause], [hesitates], [stammers], [breathes] | Dramatic timing |
| Character | [French accent], [British accent], [robotic tone] | Character voice shifts |
| Dialogue | [interrupting], [overlapping], [cuts in] | Multi-speaker conversations |
- Emotions: [excited], [nervous], [sad], [happy]
- Reactions: [laughs], [sighs], [whispers]
- Pacing: [pause]
Less reliable (test and regenerate):
- Sound effects: [explosion], [gunshot]
- Accents: results vary by voice
Full tag list: See references/audio-tags.md
Troubleshooting
Tags read aloud?
- Verify using eleven_v3 model
- Use IVC/premade voices, not PVC
- Simplify tags (no "tone" suffix)
- Increase text length (250+ chars)
Voice inconsistent?
- Segment is too long - split at <800 chars
- Regenerate (v3 is non-deterministic)
- Try lower stability setting
WhatsApp won't play?
- Convert to Opus format (see above)
No emotion despite tags?
- Voice may not match tag style
- Try Creative stability mode (0.5)
- Add more context around the tag