Audio Speaker Tools

v1.0.0

Speaker separation, voice comparison, and audio processing tools. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks, including: (1) separating speakers from audio files via Demucs and pyannote diarization, (2) comparing voice samples for speaker verification or voice clone quality assessment using Resemblyzer, (3) extracting audio segments, (4) preparing samples for ElevenLabs voice cloning, or (5) validating speaker diarization results.

by @cmfinlan·MIT-0

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install audio-speaker-tools

Mirror (available): npx clawhub@latest install audio-speaker-tools --registry https://cn.longxiaskill.com

Skill documentation

Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
2. Voice comparison - Measure speaker similarity between two audio files
3. Audio processing - Segment extraction and voice isolation

Prerequisites

Set Up Virtual Environment

Run once to create the venv and install dependencies:

bash scripts/setup_venv.sh

Default venv location: ./.venv

Requirements:

- Python 3.9+
- ffmpeg (brew install ffmpeg)
- HuggingFace token (set as env var HF_TOKEN)

Scripts
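Before running either script, it can help to confirm that the requirements listed above are in place. The following is a minimal sketch, not part of the skill itself: the function name `check_prerequisites` and its injectable parameters are hypothetical, chosen so the check can be exercised without touching the real environment.

```python
import os
import shutil
import sys


def check_prerequisites(env=None, which=shutil.which, version=None):
    """Return a list of problems; an empty list means every requirement is met.

    `env`, `which`, and `version` are hypothetical injection points for testing;
    by default the real environment, PATH lookup, and interpreter are used.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 9):
        problems.append("Python 3.9+ is required")
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The check reads HF_TOKEN from the environment only, in line with the guidance to keep the token out of command-line arguments.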

Speaker Separation: diarize_and_slice_mps.py

Separate speakers from multi-speaker audio:

    # Basic usage
    HF_TOKEN= \
    /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir /path/to/output \
      --prefix MyShow

    # With speaker constraints
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir ./out \
      --min-speakers 2 \
      --max-speakers 5 \
      --pad-ms 100

Process:

1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files

Output:

- _speaker1.wav, _speaker2.wav, etc. (one per detected speaker)
- diarization.rttm (time-stamped speaker segments)
- segments.jsonl (JSON segments metadata)
- meta.json (pipeline info and speaker index)

Important:

- Always pass the HF token via the HF_TOKEN env var, never as a CLI argument
- MPS first, CPU fallback - The script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: ./separated/
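The diarization.rttm output can be summarized with a few lines of Python. The sketch below assumes the file follows the standard RTTM layout (`SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <label> ...`); the speaker labels in the sample text are illustrative and not taken from the script's actual output.

```python
from collections import defaultdict


def speaking_time(rttm_text):
    """Sum segment durations (in seconds) per speaker label from RTTM text."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Standard RTTM: field 4 is the segment duration, field 7 the speaker label.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return dict(totals)


# Illustrative RTTM content (assumed labels, not real output).
example = """\
SPEAKER audio 1 0.00 2.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.50 1.25 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 3.75 4.00 <NA> <NA> SPEAKER_00 <NA> <NA>
"""
print(speaking_time(example))
```

Per-speaker totals like these make it easier to pick which speaker file is likely to contain enough speech for a cloning sample.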

Voice Comparison: compare_voices.py

Measure similarity between two voice samples using Resemblyzer:

    # Basic comparison
    python scripts/compare_voices.py \
      --audio1 sample1.wav \
      --audio2 sample2.wav

    # JSON output
    python scripts/compare_voices.py \
      --audio1 reference.wav \
      --audio2 clone.wav \
      --threshold 0.85 \
      --json

    # Exit code = 0 if pass, 1 if fail

Scores:

- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)
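The score bands above can be expressed as a small helper. This is a sketch under two assumptions that the source does not state: that the score is a cosine similarity between two speaker embeddings, and that values between 0.84 and 0.85 fall into the "likely same speaker" band. The function names are hypothetical and are not part of compare_voices.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def interpret_score(score):
    """Map a similarity score to the bands listed in this document."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        # Includes the unspecified 0.84-0.85 gap (an assumption).
        return "likely same speaker"
    return "different speakers"


print(interpret_score(cosine_similarity([1.0, 0.0], [1.0, 0.0])))
```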

Use cases:

- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation

Audio Trimming

Use ffmpeg directly for segment extraction:

    # Extract a 10-second segment starting at 5 seconds
    ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

    # Extract vocals only with Demucs (before diarization)
    demucs --two-stems vocals --out ./separated input.mp3
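When trimming many segments, the ffmpeg call above can be wrapped in Python. The sketch below only rebuilds the same argument list shown in the ffmpeg example; the helper names `trim_command` and `trim` are hypothetical.

```python
import subprocess


def trim_command(src, dst, start_s, duration_s):
    """Build the ffmpeg argument list for a stream-copy trim."""
    return [
        "ffmpeg",
        "-i", src,
        "-ss", str(start_s),    # start offset in seconds
        "-t", str(duration_s),  # segment length in seconds
        "-c", "copy",           # copy the stream without re-encoding
        dst,
    ]


def trim(src, dst, start_s, duration_s):
    """Run ffmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(trim_command(src, dst, start_s, duration_s), check=True)


print(" ".join(trim_command("input.mp3", "output.mp3", 5, 10)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Because `-c copy` does not re-encode, cut points may land on packet boundaries rather than on the exact timestamps requested.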

Workflows

Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

    # 1. Separate speakers
    HF_TOKEN= python scripts/diarize_and_slice_mps.py \
      --input podcast.mp3 --outdir ./out --prefix Podcast

    # 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

    # 3. Select best sample (5-30s, clean speech)
    ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

    # 4. Upload to ElevenLabs as instant voice clone

See: references/elevenlabs-cloning.md for best practices

Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

    # 1. Generate test audio with ElevenLabs clone
    #    (done via ElevenLabs web UI or API)

    # 2. Compare clone vs. reference
    python scripts/compare_voices.py \
      --audio1 original_sample.wav \
      --audio2 elevenlabs_clone.wav \
      --threshold 0.85 \
      --json

    # 3. Interpret score:
    #    0.85+ = excellent, publish-ready
    #    0.80-0.84 = acceptable, may need tweaking
    #    < 0.80 = poor, try different sample or settings

See: references/scoring-guide.md for troubleshooting low scores

Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

    # 1. Run diarization
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input meeting.mp3 --outdir ./out --prefix Meeting

    # 2. Check detected speakers (meta.json)
    cat out/meta.json

    # 3. Compare speaker pairs to confirm separation
    python scripts/compare_voices.py \
      --audio1 out/Meeting_speaker1.wav \
      --audio2 out/Meeting_speaker2.wav

    # Expected: < 0.75 if separation worked correctly
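With more than two detected speakers, every pair needs a comparison. The sketch below enumerates the pairs and builds one compare_voices.py command per pair, using only the `--audio1` and `--audio2` flags shown in this document; the function name `comparison_commands` is hypothetical.

```python
from itertools import combinations


def comparison_commands(speaker_files):
    """Return one compare_voices.py command (as an argument list) per speaker pair."""
    commands = []
    for first, second in combinations(sorted(speaker_files), 2):
        commands.append([
            "python", "scripts/compare_voices.py",
            "--audio1", first,
            "--audio2", second,
        ])
    return commands


files = [
    "out/Meeting_speaker2.wav",
    "out/Meeting_speaker1.wav",
    "out/Meeting_speaker3.wav",
]
for command in comparison_commands(files):
    print(" ".join(command))
```

Each command can then be run with `subprocess.run`, and any pair scoring 0.75 or higher flagged as a possible split of a single speaker.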

Technical Notes

Device Acceleration

- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available

To force CPU for diarization: --device cpu

Audio Formats

- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)

HuggingFace Token

- Required for: pyannote speaker diarization
- Access: Must accept gated repo pyannote/speaker-diarization-3.1 on HF
- Storage: Any secure secrets
