Audio Speaker Tools
v1.0.0
Speaker separation, voice comparison, and audio processing tools. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks including: (1) separating speakers from audio files via Demucs and pyannote diarization, (2) comparing voice samples for speaker verification or voice clone quality assessment using Resemblyzer, (3) extracting audio segments, (4) preparing samples for ElevenLabs voice cloning, or (5) validating speaker diarization results.
Audio Speaker Tools
Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.
Overview
This skill provides three main workflows:
- Speaker separation: extract per-speaker audio from multi-speaker recordings
- Voice comparison: measure speaker similarity between two audio files
- Audio processing: segment extraction and voice isolation
Prerequisites
Setup Virtual Environment
Run once to create the venv and install dependencies:
    bash scripts/setup_venv.sh
Default venv location: ./.venv
Requirements:
- Python 3.9+
- ffmpeg (brew install ffmpeg)
- HuggingFace token (set as env var HF_TOKEN)
Scripts
- Speaker Separation: diarize_and_slice_mps.py
Separate speakers from multi-speaker audio:
    # Basic usage
    HF_TOKEN=$TOKEN \
    /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir /path/to/output \
      --prefix MyShow

    # With speaker constraints
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir ./out \
      --min-speakers 2 \
      --max-speakers 5 \
      --pad-ms 100
Process:
1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files
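The first step of the process can be sketched as an ffmpeg argument list. This is a minimal illustration, not the script's actual invocation (which is not shown here): the helper name build_convert_cmd and the file paths are hypothetical, while -ac, -ar, and -c:a are standard ffmpeg options.

```python
def build_convert_cmd(src, dst, sample_rate=16000):
    """Return an ffmpeg argument list that converts src to mono PCM WAV.

    Hypothetical helper: the real script may build its command differently.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # any input format ffmpeg can decode
        "-ac", "1",                # downmix to a single channel (mono)
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # 16-bit PCM, the usual WAV encoding
        str(dst),
    ]


cmd = build_convert_cmd("audio.mp3", "work/audio_16k.wav")
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would perform the conversion, provided ffmpeg is on PATH.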
Output:
- _speaker1.wav, _speaker2.wav, etc., each preceded by the --prefix value, for example MyShow_speaker1.wav (one per detected speaker)
- diarization.rttm (time-stamped speaker segments)
- segments.jsonl (JSON segments metadata)
- meta.json (pipeline info and speaker index)
Important:
- Always pass the HF token via the HF_TOKEN env var, never as a CLI arg
- MPS first, CPU fallback: the script prefers the Metal GPU and falls back to CPU if unavailable
- Default output: ./separated/
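To sanity-check a diarization run, the diarization.rttm file can be summarized into total speaking time per speaker. The sketch below assumes the standard RTTM layout (onset in the fourth field, duration in the fifth, speaker label in the eighth). The sample lines and the SPEAKER_00 style labels are illustrative, not output from this script.

```python
from collections import defaultdict


def speaking_time(rttm_text):
    """Sum segment durations per speaker label from RTTM text."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])
        speaker = fields[7]
        totals[speaker] += duration
    return dict(totals)


# Illustrative sample, not real output.
sample = """\
SPEAKER meeting 1 0.000 2.500 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER meeting 1 2.500 1.250 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER meeting 1 3.750 4.000 <NA> <NA> SPEAKER_00 <NA> <NA>
"""

totals = speaking_time(sample)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.2f}s")
```

A speaker with only a second or two of total speech is often a spurious cluster, which is one reason to try --min-speakers and --max-speakers.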
- Voice Comparison: compare_voices.py
Measure similarity between two voice samples using Resemblyzer:
    # Basic comparison
    python scripts/compare_voices.py \
      --audio1 sample1.wav \
      --audio2 sample2.wav

    # JSON output
    python scripts/compare_voices.py \
      --audio1 reference.wav \
      --audio2 clone.wav \
      --threshold 0.85 \
      --json

    # Exit code = 0 if pass, 1 if fail
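How compare_voices.py computes its score is not shown here. A common approach with Resemblyzer is to take the cosine similarity of two speaker embeddings and compare it to the threshold. The sketch below assumes that approach and uses short toy vectors in place of real embeddings. The function names, and the choice of >= for a pass, are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length vector has no direction")
    return dot / (norm_a * norm_b)


def exit_code(score, threshold=0.85):
    """Mirror the documented convention: 0 on pass, 1 on fail (>= is assumed)."""
    return 0 if score >= threshold else 1


# Toy vectors standing in for speaker embeddings.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
different = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, exit_code(same))
print(different, exit_code(different))
```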
Scores:
- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)
Use cases:
- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (verify speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)
See: references/scoring-guide.md for detailed interpretation
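The score bands listed under Scores can be written as a small lookup, which is handy when post-processing many comparisons. The function name interpret_score is hypothetical. Scores between 0.84 and 0.85 are treated as "likely same speaker", which is an assumption about how the bands meet.

```python
def interpret_score(score):
    """Map a similarity score to the bands documented under Scores."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"


for value in (0.91, 0.80, 0.42):
    print(f"{value:.2f}: {interpret_score(value)}")
```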
- Audio Trimming
Use ffmpeg directly for segment extraction:
    # Extract 10-second segment starting at 5 seconds
    ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

    # Extract vocals only with Demucs (before diarization)
    demucs --two-stems vocals --out ./separated input.mp3
Workflows
Workflow 1: Extract Clean Voice Sample for Cloning
Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning
    # 1. Separate speakers
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input podcast.mp3 --outdir ./out --prefix Podcast

    # 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

    # 3. Select best sample (5-30s, clean speech)
    ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

    # 4. Upload to ElevenLabs as instant voice clone
See: references/elevenlabs-cloning.md for best practices
Workflow 2: Validate Voice Clone Quality
Goal: Measure how well a cloned voice matches the original
    # 1. Generate test audio with ElevenLabs clone
    # (done via ElevenLabs web UI or API)

    # 2. Compare clone vs. reference
    python scripts/compare_voices.py \
      --audio1 original_sample.wav \
      --audio2 elevenlabs_clone.wav \
      --threshold 0.85 \
      --json

    # 3. Interpret score:
    # 0.85+ = excellent, publish-ready
    # 0.80-0.84 = acceptable, may need tweaking
    # < 0.80 = poor, try different sample or settings
See: references/scoring-guide.md for troubleshooting low scores
Workflow 3: Multi-Speaker Conversation Analysis
Goal: Separate and identify speakers in a conversation
    # 1. Run diarization
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input meeting.mp3 --outdir ./out --prefix Meeting

    # 2. Check detected speakers (meta.json)
    cat out/meta.json

    # 3. Compare speaker pairs to confirm separation
    python scripts/compare_voices.py \
      --audio1 out/Meeting_speaker1.wav \
      --audio2 out/Meeting_speaker2.wav

    # Expected: < 0.75 if separation worked correctly
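With more than two detected speakers, every pair needs checking. The sketch below builds one compare_voices.py command per unordered pair, using only the --audio1 and --audio2 flags shown above. The helper name and the third speaker file are hypothetical.

```python
from itertools import combinations


def comparison_commands(speaker_files, script="scripts/compare_voices.py"):
    """Build one comparison command per unordered pair of speaker files."""
    return [
        ["python", script, "--audio1", first, "--audio2", second]
        for first, second in combinations(sorted(speaker_files), 2)
    ]


files = [
    "out/Meeting_speaker1.wav",
    "out/Meeting_speaker2.wav",
    "out/Meeting_speaker3.wav",  # hypothetical third speaker
]
cmds = comparison_commands(files)
for cmd in cmds:
    print(" ".join(cmd))
```

Each command can be passed to subprocess.run. A pair scoring 0.75 or higher suggests that one real speaker was split into two clusters.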
Technical Notes
Device Acceleration
- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available
To force CPU for diarization: --device cpu
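The "MPS first, CPU fallback" behavior can be sketched as follows. This illustrates the general PyTorch pattern and is not taken from the script itself. The function name pick_device is hypothetical, and the sketch returns "cpu" when PyTorch is not installed.

```python
def pick_device(requested=None):
    """Return the requested device, else "mps" when available, else "cpu"."""
    if requested is not None:
        return requested  # an explicit choice, e.g. --device cpu, wins
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch in this environment
    # Older PyTorch builds have no torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
print(pick_device("cpu"))
```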
Audio Formats
- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)
HuggingFace Token
- Required for: pyannote speaker diarization
- Access: You must accept the terms of the gated repo pyannote/speaker-diarization-3.1 on HF
- Storage: Any secure secrets
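Because the token is passed only through the HF_TOKEN environment variable, a wrapper can fail early with a clear message when it is missing. This is a minimal sketch. The function name read_hf_token and the example token value are hypothetical.

```python
import os


def read_hf_token(env=None):
    """Return the HuggingFace token from HF_TOKEN, or raise if it is unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before running diarization"
        )
    return token


# Demonstrated with a plain dict so the example does not depend on the real environment.
print(read_hf_token({"HF_TOKEN": "hf_example"}))
```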