Pocket TTS Complete Documentation
v0.1.0
Generate speech from text using Kyutai Pocket TTS: lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on an M4 MacBook Air.
Pocket TTS
Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.

When to Use

- Generating speech from text on a CPU, without a GPU
- Voice cloning from audio samples
- Streaming audio generation (low latency)
- Local TTS without API dependencies
- Real-time speech synthesis (~6x faster than real-time)

Key Features

- 100M parameters - Small, efficient model
- CPU-optimized - No GPU needed, uses only 2 cores
- ~6x real-time - Fast generation on modern CPUs
- ~200ms latency - To first audio chunk (streaming)
- Voice cloning - From 3-10s audio samples
- 24kHz mono WAV - High-quality output
- English only - More languages planned

Installation

    pip install pocket-tts
    # or
    uv add pocket-tts

CLI Commands

Generate Speech

    # Basic generation (default voice)
    pocket-tts generate --text "Hello world"

    # Custom voice (local file, URL, or safetensors)
    pocket-tts generate --voice ./my_voice.wav
    pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
    pocket-tts generate --voice ./voice.safetensors

    # Quality tuning
    pocket-tts generate --temperature 0.7 --lsd-decode-steps 3

See docs/generate.md for the full CLI reference.
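
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown above. The helper name `build_generate_command` is hypothetical, and combining `--text` with the other flags in a single call is an assumption, since the examples above show them separately.

```python
import shlex


def build_generate_command(text, voice=None, temperature=None, lsd_decode_steps=None):
    """Build an argument list for `pocket-tts generate` (hypothetical helper)."""
    cmd = ["pocket-tts", "generate", "--text", text]
    if voice is not None:
        cmd += ["--voice", str(voice)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        cmd += ["--lsd-decode-steps", str(lsd_decode_steps)]
    return cmd


cmd = build_generate_command(
    "Hello world", voice="./my_voice.wav", temperature=0.7, lsd_decode_steps=3
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided `pocket-tts` is installed and on the PATH. Using a list rather than a single string avoids shell quoting problems with the text argument.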

Start Web Server

    # Start FastAPI server with web UI
    pocket-tts serve

    # Custom host/port
    pocket-tts serve --host localhost --port 8080

See docs/serve.md for server options.

Export Voice Embeddings

Convert audio files to .safetensors for faster loading:

    # Single file
    pocket-tts export-voice voice.mp3 voice.safetensors

    # Batch conversion
    pocket-tts export-voice voices/ embeddings/ --truncate

See docs/export_voice.md for export options.

Python API

Basic Usage

    from pocket_tts import TTSModel
    import scipy.io.wavfile

    # Load model
    model = TTSModel.load_model()

    # Get voice state
    voice = model.get_state_for_audio_prompt(
        "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
    )

    # Generate audio
    audio = model.generate_audio(voice, "Hello world!")

    # Save
    scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())
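
The Output Format section describes 24 kHz mono, 16-bit PCM WAV files. As a sketch of what writing that format involves, the helper below uses only the Python standard library. It is not part of `pocket_tts`: the name `write_pcm16_wav` is hypothetical, and it assumes the samples are floats in the range [-1.0, 1.0], which this document does not state.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples (assumed in [-1.0, 1.0]) as a 16-bit PCM WAV file."""
    # Clamp each sample, then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

With the model loaded, a call might look like `write_pcm16_wav("output.wav", audio.tolist(), model.sample_rate)`, assuming `audio` is a one-dimensional tensor.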

Load Model

    model = TTSModel.load_model(
        config="b6369a24",      # Model variant
        temp=0.7,               # Temperature (0.5-1.0)
        lsd_decode_steps=1,     # Generation steps (1-5)
        eos_threshold=-4.0      # End-of-sequence threshold
    )
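
The comments in the snippet above give ranges for `temp` and `lsd_decode_steps`. One way to catch out-of-range values before loading the model is to validate them up front, as in the sketch below. The `LoadSettings` class is hypothetical, and treating the documented ranges as hard limits is an assumption; the library may accept values outside them.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoadSettings:
    """Hypothetical container for the tuning arguments shown above."""

    temp: float = 0.7
    lsd_decode_steps: int = 1
    eos_threshold: float = -4.0

    def __post_init__(self):
        # Ranges come from the comments in the Load Model snippet.
        if not 0.5 <= self.temp <= 1.0:
            raise ValueError(f"temp {self.temp} is outside 0.5-1.0")
        if not 1 <= self.lsd_decode_steps <= 5:
            raise ValueError(
                f"lsd_decode_steps {self.lsd_decode_steps} is outside 1-5"
            )


settings = LoadSettings(temp=0.8, lsd_decode_steps=3)
kwargs = asdict(settings)  # plain dict of keyword arguments
```

The resulting dictionary could then be passed on as `TTSModel.load_model(**kwargs)`.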

Voice State

    # From audio file/URL
    voice = model.get_state_for_audio_prompt("./voice.wav")
    voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

    # From safetensors (fast loading)
    voice = model.get_state_for_audio_prompt("./voice.safetensors")

Streaming Generation

    # Stream audio chunks
    for chunk in model.generate_audio_stream(voice, "Long text..."):
        # Process/save/play each chunk as generated
        print(f"Chunk: {chunk.shape[0]} samples")
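
To check the latency and real-time figures quoted in this document on your own machine, the stream can be timed while it is consumed. The sketch below works on any iterable of chunks that support `len()`. The function name `consume_stream` and the stand-in generator are hypothetical, and the chunk size of 1920 samples is arbitrary, not a property of the model.

```python
import time


def consume_stream(chunks, sample_rate):
    """Drain an iterable of audio chunks and report basic timing figures."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)  # first dimension, like chunk.shape[0]
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    return {
        "first_chunk_latency": first_chunk_latency,
        "audio_seconds": audio_seconds,
        # Seconds of audio produced per second of wall-clock time.
        "realtime_factor": audio_seconds / elapsed if elapsed > 0 else float("inf"),
    }


def fake_stream():
    """Stand-in generator so the sketch runs without the model installed."""
    for _ in range(3):
        yield [0.0] * 1920


stats = consume_stream(fake_stream(), sample_rate=24000)
```

With the real model, the first argument would be `model.generate_audio_stream(voice, text)` and the second `model.sample_rate`. A real-time factor near 6 would match the figure quoted for an M4 MacBook Air.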

Multi-Voice Management

    # Preload multiple voices
    voices = {
        "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
        "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
    }

    # Use different voices
    audio1 = model.generate_audio(voices["casual"], "Hey there!")
    audio2 = model.generate_audio(voices["announcer"], "Breaking news!")

See docs/python-api.md for the complete API reference.
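
Preloading every voice up front costs time even for voices that are never used. An alternative is to load each voice on first use and keep the result, as sketched below. The `VoiceCache` class is hypothetical and takes the loader as an argument, so it runs here with a stand-in function. It assumes a voice state can be reused across calls, as the dictionary example above suggests.

```python
class VoiceCache:
    """Load each voice state on first use and reuse it afterwards."""

    def __init__(self, loader):
        self._loader = loader  # e.g. model.get_state_for_audio_prompt
        self._sources = {}     # name -> file path or URL
        self._states = {}      # name -> loaded voice state

    def register(self, name, source):
        self._sources[name] = source

    def get(self, name):
        if name not in self._states:
            self._states[name] = self._loader(self._sources[name])
        return self._states[name]


# Stand-in loader that records calls, used only to demonstrate the caching.
calls = []


def fake_loader(source):
    calls.append(source)
    return f"state:{source}"


cache = VoiceCache(fake_loader)
cache.register("announcer", "./announcer.safetensors")
cache.get("announcer")
cache.get("announcer")  # served from the cache; the loader is not called again
```

In real use the cache would be built as `VoiceCache(model.get_state_for_audio_prompt)` and queried with `model.generate_audio(cache.get("announcer"), "Breaking news!")`.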

Available Voices

Pre-made voices from hf://kyutai/tts-voices/:

- alba-mackenna/casual.wav (default, female)
- jessica-jian/casual.wav (female)
- voice-donations/Selfie.wav (male, marius)
- voice-donations/Butter.wav (male, javert)
- ears/p010/freeform_speech_01.wav (male, jean)
- vctk/p244_023.wav (female, fantine)
- vctk/p262_023.wav (female, eponine)
- vctk/p303_023.wav (female, azelma)
Or clone any voice from your own audio samples.
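
The voice paths listed above are relative to the voice repository, so the full URL has to be assembled before it is passed to the model. A minimal sketch of that lookup follows. The names `VOICE_REPO`, `VOICE_PATHS`, and `voice_url` are hypothetical. The keys for the first two voices are taken from their directory names because the list gives no short name for them.

```python
VOICE_REPO = "hf://kyutai/tts-voices/"

# Relative paths copied from the list above, keyed by the names given there.
VOICE_PATHS = {
    "alba-mackenna": "alba-mackenna/casual.wav",   # default voice
    "jessica-jian": "jessica-jian/casual.wav",
    "marius": "voice-donations/Selfie.wav",
    "javert": "voice-donations/Butter.wav",
    "jean": "ears/p010/freeform_speech_01.wav",
    "fantine": "vctk/p244_023.wav",
    "eponine": "vctk/p262_023.wav",
    "azelma": "vctk/p303_023.wav",
}


def voice_url(name):
    """Return the full hf:// URL for a listed voice; raises KeyError if unknown."""
    return VOICE_REPO + VOICE_PATHS[name]


url = voice_url("marius")
```

The returned string can be passed to `model.get_state_for_audio_prompt(url)` or to the `--voice` flag of the CLI.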

Voice Cloning Tips

- Clean audio - Remove background noise (for example with Adobe Podcast Enhance)
- Length - 3-10 seconds of speech is ideal
- Quality - Input quality affects output quality
- Format - WAV, MP3, and other common audio formats are supported

Performance Tips

- CPU-only - A GPU provides no speedup (the model is too small, and batch size is 1)
- 2 cores - Uses only 2 CPU cores
- Streaming - Low latency (about 200ms to first chunk)
- Safetensors - Pre-process voices to .safetensors for instant loading

Output Format

All commands output WAV files:

- Sample rate: 24 kHz
- Channels: Mono
- Bit depth: 16-bit PCM

Links

- GitHub
- Tech Report
- Paper (arXiv)
- HuggingFace Model
- Voice Repository
- Live Demo