Pocket TTS Complete Documentation

v0.1.0

Generate speech from text using Kyutai Pocket TTS: a lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on an M4 MacBook Air.


by @leonaaardob·MIT-0


License

MIT-0


Free to use, modify, and redistribute; no attribution required.


Runtime Dependencies

No special dependencies

Install Command


Official:

    npx clawhub@latest install lb-pocket-tts-skill

Mirror (available):

    npx clawhub@latest install lb-pocket-tts-skill --registry https://cn.longxiaskill.com


Skill Documentation

Pocket TTS

Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.

When to Use

- Generating speech from text on CPU without a GPU
- Voice cloning from audio samples
- Streaming audio generation (low latency)
- Local TTS without API dependencies
- Real-time speech synthesis (~6x faster than real-time)

Key Features

- 100M parameters - Small, efficient model
- CPU-optimized - No GPU needed, uses only 2 cores
- ~6x real-time - Fast generation on modern CPUs
- ~200ms latency - To first audio chunk (streaming)
- Voice cloning - From 3-10s audio samples
- 24kHz mono WAV - High-quality output
- English only - More languages planned

Installation

    pip install pocket-tts
    # or
    uv add pocket-tts

CLI Commands

Generate Speech

    # Basic generation (default voice)
    pocket-tts generate --text "Hello world"

    # Custom voice (local file, URL, or safetensors)
    pocket-tts generate --voice ./my_voice.wav
    pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
    pocket-tts generate --voice ./voice.safetensors

    # Quality tuning
    pocket-tts generate --temperature 0.7 --lsd-decode-steps 3

See docs/generate.md for the full CLI reference.
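When these commands are driven from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical helper (`build_generate_command` is not part of pocket-tts) that uses only the flags shown above; it assumes `--text` can be combined with `--voice` and the tuning flags in a single call, which the examples above do not show explicitly.

```python
import shlex


def build_generate_command(text, voice=None, temperature=None, lsd_decode_steps=None):
    """Return the argv list for a `pocket-tts generate` call (hypothetical helper)."""
    argv = ["pocket-tts", "generate", "--text", text]
    if voice is not None:
        argv += ["--voice", str(voice)]
    if temperature is not None:
        argv += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        argv += ["--lsd-decode-steps", str(lsd_decode_steps)]
    return argv


cmd = build_generate_command(
    "Hello world", voice="./my_voice.wav", temperature=0.7, lsd_decode_steps=3
)
# Print a shell-quoted form for logging or copy-paste.
print(shlex.join(cmd))
# To actually run it (requires pocket-tts on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains quotes or spaces.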

Start Web Server

    # Start FastAPI server with web UI
    pocket-tts serve

    # Custom host/port
    pocket-tts serve --host localhost --port 8080

See docs/serve.md for server options.

Export Voice Embeddings

Convert audio files to .safetensors for faster loading:

    # Single file
    pocket-tts export-voice voice.mp3 voice.safetensors

    # Batch conversion
    pocket-tts export-voice voices/ embeddings/ --truncate

See docs/export_voice.md for export options.
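If per-file control is preferred over the built-in batch mode (for example, to skip some files or log each conversion), the single-file form can be generated for each input. This is a minimal sketch: `plan_export_commands` and `AUDIO_SUFFIXES` are hypothetical names, and the suffix filter only covers the formats this document mentions.

```python
from pathlib import PurePosixPath

# Formats named in this document; extend as needed (assumption).
AUDIO_SUFFIXES = {".wav", ".mp3"}


def plan_export_commands(audio_files, out_dir):
    """Map each audio file to a `pocket-tts export-voice <in> <out>` argv list."""
    commands = []
    for name in sorted(audio_files):
        src = PurePosixPath(name)
        if src.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip anything that is not a known audio file
        dst = PurePosixPath(out_dir) / (src.stem + ".safetensors")
        commands.append(["pocket-tts", "export-voice", str(src), str(dst)])
    return commands


for argv in plan_export_commands(["voices/a.wav", "voices/b.mp3"], "embeddings"):
    print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` once pocket-tts is installed.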

Python API

Basic Usage

    from pocket_tts import TTSModel
    import scipy.io.wavfile

    # Load model
    model = TTSModel.load_model()

    # Get voice state
    voice = model.get_state_for_audio_prompt(
        "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
    )

    # Generate audio
    audio = model.generate_audio(voice, "Hello world!")

    # Save
    scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())
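`scipy.io.wavfile.write` picks the sample format from the array dtype, so a floating-point array is stored as a floating-point WAV rather than 16-bit PCM. Where 16-bit mono PCM is wanted from the Python API as well, the following standard-library sketch is one option. It assumes the samples are floats in [-1.0, 1.0], which this document does not state; `write_pcm16_wav` is a hypothetical helper, not part of pocket-tts.

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write float samples (assumed within [-1.0, 1.0]) as 16-bit mono PCM."""
    pcm = array.array("h")
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

With the objects from the example above, a call might look like `write_pcm16_wav("output.wav", audio.tolist(), model.sample_rate)`, assuming `audio` is a one-dimensional tensor.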

Load Model

    model = TTSModel.load_model(
        config="b6369a24",     # Model variant
        temp=0.7,              # Temperature (0.5-1.0)
        lsd_decode_steps=1,    # Generation steps (1-5)
        eos_threshold=-4.0,    # End-of-sequence threshold
    )
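The ranges in the comments above can be checked before the model is loaded, so that a bad setting fails early. This is a sketch only: `check_generation_params` is a hypothetical helper, the ranges are copied from the comments above, and whether the library itself enforces them is not stated here.

```python
def check_generation_params(temp=0.7, lsd_decode_steps=1):
    """Raise ValueError if a value falls outside the ranges noted above."""
    if not 0.5 <= temp <= 1.0:
        raise ValueError(f"temp should be within 0.5-1.0, got {temp}")
    if not (isinstance(lsd_decode_steps, int) and 1 <= lsd_decode_steps <= 5):
        raise ValueError(
            f"lsd_decode_steps should be an integer within 1-5, got {lsd_decode_steps}"
        )
    return {"temp": temp, "lsd_decode_steps": lsd_decode_steps}


params = check_generation_params(temp=0.7, lsd_decode_steps=3)
print(params)
```

The returned dictionary can be unpacked into the call, for example `TTSModel.load_model(**params)`.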

Voice State

    # From audio file/URL
    voice = model.get_state_for_audio_prompt("./voice.wav")
    voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

    # From safetensors (fast loading)
    voice = model.get_state_for_audio_prompt("./voice.safetensors")

Streaming Generation

    # Stream audio chunks
    for chunk in model.generate_audio_stream(voice, "Long text..."):
        # Process/save/play each chunk as generated
        print(f"Chunk: {chunk.shape[0]} samples")
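To check the latency and real-time figures quoted in this document on a given machine, the stream can be timed while it is consumed. The sketch below works on any iterable of chunks; `consume_stream` is a hypothetical helper, and the `num_samples` callback is an assumption so that it works with plain lists as well as tensors.

```python
import time


def consume_stream(chunks, sample_rate=24000, num_samples=len):
    """Drain an iterable of audio chunks and report simple timing statistics."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            # Time from starting iteration until the first chunk arrives.
            first_chunk_latency = time.perf_counter() - start
        total_samples += num_samples(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    return {
        "first_chunk_latency": first_chunk_latency,
        "audio_seconds": audio_seconds,
        # Above 1.0 means audio is produced faster than it plays back.
        "real_time_factor": audio_seconds / elapsed if elapsed > 0 else float("inf"),
    }
```

With the streaming loop above, a call might look like `consume_stream(model.generate_audio_stream(voice, "Long text..."), model.sample_rate, num_samples=lambda c: c.shape[0])`.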

Multi-Voice Management

    # Preload multiple voices
    voices = {
        "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
        "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
    }

    # Use different voices
    audio1 = model.generate_audio(voices["casual"], "Hey there!")
    audio2 = model.generate_audio(voices["announcer"], "Breaking news!")
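Preloading every voice up front costs startup time when only some of them are used. A small cache that loads a voice on first request is one alternative. This is a sketch: `VoiceCache` is a hypothetical class, and it takes the loader as a parameter so that it does not depend on pocket-tts directly.

```python
class VoiceCache:
    """Load each voice state at most once, on first use."""

    def __init__(self, loader):
        self._loader = loader  # e.g. model.get_state_for_audio_prompt
        self._states = {}

    def get(self, source):
        # Load and remember the state the first time a source is requested.
        if source not in self._states:
            self._states[source] = self._loader(source)
        return self._states[source]


# Stand-in loader for demonstration; a real one would return a voice state.
calls = []
cache = VoiceCache(lambda src: calls.append(src) or f"state:{src}")
cache.get("./announcer.safetensors")
cache.get("./announcer.safetensors")
print(len(calls))
```

In practice the cache would be built as `VoiceCache(model.get_state_for_audio_prompt)` and the result of `cache.get(...)` passed to the generation call as in the examples above.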

See docs/python-api.md for the complete API reference.

Available Voices

Pre-made voices from hf://kyutai/tts-voices/:

- alba-mackenna/casual.wav (default, female)
- jessica-jian/casual.wav (female)
- voice-donations/Selfie.wav (male, marius)
- voice-donations/Butter.wav (male, javert)
- ears/p010/freeform_speech_01.wav (male, jean)
- vctk/p244_023.wav (female, fantine)
- vctk/p262_023.wav (female, eponine)
- vctk/p303_023.wav (female, azelma)

Or clone any voice from your own audio samples.
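The pre-made voice paths above can be kept in a lookup table so that code refers to short names instead of full URLs. In this sketch the paths and the names in parentheses come from the list above; the keys `alba-mackenna` and `jessica-jian` are taken from the directory names, and `voice_url` is a hypothetical helper.

```python
VOICE_REPO = "hf://kyutai/tts-voices/"

# Short name -> path inside the voice repository (paths from the list above).
PREMADE_VOICES = {
    "alba-mackenna": "alba-mackenna/casual.wav",
    "jessica-jian": "jessica-jian/casual.wav",
    "marius": "voice-donations/Selfie.wav",
    "javert": "voice-donations/Butter.wav",
    "jean": "ears/p010/freeform_speech_01.wav",
    "fantine": "vctk/p244_023.wav",
    "eponine": "vctk/p262_023.wav",
    "azelma": "vctk/p303_023.wav",
}


def voice_url(name):
    """Return the full hf:// URL for a pre-made voice, or raise KeyError."""
    return VOICE_REPO + PREMADE_VOICES[name]


print(voice_url("marius"))
```

The returned string is in the same form as the `hf://` URLs used elsewhere in this document.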

Voice Cloning Tips

- Clean audio - Remove background noise (use Adobe Podcast Enhance)
- Length - 3-10 seconds of speech is ideal
- Quality - Input quality affects output quality
- Format - WAV, MP3, or any common audio format supported

Performance Tips

- CPU-only - GPU provides no speedup (model too small, batch size 1)
- 2 cores - Uses only 2 CPU cores efficiently
- Streaming - Low latency (~200ms to first chunk)
- Safetensors - Pre-process voices to .safetensors for instant loading

Output Format

All commands output WAV files:

- Sample rate: 24 kHz
- Channels: Mono
- Bit depth: 16-bit PCM

Links

- GitHub
- Tech Report
- Paper (arXiv)
- HuggingFace Model
- Voice Repository
- Live Demo
