Smart Audio Analyzer (transcription, speaker identification, scene detection, structured notes)

Author: JoJowillwater

v1.2.1

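An all-in-one audio analysis tool that supports transcription, voiceprint-based speaker identification, automatic scene detection (meetings, interviews, training, lectures, and so on), and structured meeting notes. It supports multiple engines (AssemblyAI, Whisper, and Gemini) and matches voiceprints persistently across recording sessions.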


by @jojowillwater (JoJowillwater)·MIT-0

Data analysis · AI model access · Automation · Code generation · Developer tools


License: MIT-0

Last updated: 2026/4/12

Security scan

VirusTotal: suspicious

OpenClaw: safe (high confidence)

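The skill's code and instructions match its stated purpose (transcription, speaker identification, scene detection, and note generation), but note the errors in the reported metadata and the privacy trade-off: audio is uploaded to cloud ASR by default unless local Whisper is forced.

Assessment and recommendations

The skill appears to work as advertised: it transcribes audio, matches speaker voiceprints locally, detects the scene, and produces structured notes. Before installing, note two things: (1) The registry metadata does not list the API keys that are actually required (ASSEMBLYAI_API_KEY / GEMINI_API_KEY / OPENAI_API_KEY); confirm the metadata and environment variables with the publisher. (2) By default, audio is uploaded to third-party ASR and summarization services. If your data is sensitive, set ASR_ENGINE=whisper and install local Whisper and ffmpeg, or avoid providing cloud API keys. Verify where voiceprints are stored (references/voice-db.json) and make sure the permissions and backup policy meet your privacy requirements. Additional precautions: review the scripts before running npm install, do the first run in a sandbox or isolated environment, restrict API key permissions, and confirm speaker identities only after you understand how voiceprint updates work. ...

Detailed analysis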

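ℹ Purpose and capabilities

The skill's name and description (transcription, speaker identification, scene detection, structured notes) are consistent with the included files (scripts/analyze.js and scripts/voiceprint.py) and with the declared npm and Python dependencies. However, the registry metadata at the top of the report lists no required environment variables or credentials, while SKILL.md explicitly requires ASSEMBLYAI_API_KEY, GEMINI_API_KEY, or OPENAI_API_KEY (env_any_of). This metadata mismatch is an inconsistency that should be resolved before trusting an automated install.

ℹ Instruction scope

SKILL.md instructs the agent to run node analyze.js on incoming audio, read and write references/voice-profiles.md and references/voice-db.json, load scene templates, and update profiles after the user confirms. The code does read and write these files. It also uploads audio to third-party ASR and summarization services (AssemblyAI/Gemini/OpenAI/OpenRouter) unless the local Whisper fallback is used. All of this is within the stated purpose, but it has privacy implications because audio is sent to external hosts. The bootstrap snippet also instructs the agent to invoke the script automatically for audio files. That is expected for a processing skill, but worth noting.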

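✓ Install mechanism

Nothing unusual about the installation: SKILL.md instructs you to run 'cd scripts && npm install', which matches the provided package.json and package-lock.json (assemblyai, dotenv, openai). Nothing is downloaded from a personal server by default. An optional ONNX model download is referenced (github.com/wenet-e2e/wespeaker/releases). This is a well-known release host, the download is optional, and if used it writes a model file to disk.

⚠ Credential requirements

SKILL.md requires one of ASSEMBLYAI_API_KEY, GEMINI_API_KEY, or OPENAI_API_KEY, and optionally WESPEAKER_MODEL/ASR_ENGINE. The code reads ASSEMBLYAI_API_KEY, GEMINI_API_KEY, and OPENAI_API_KEY, and uses OPENAI_BASE_URL/OPENAI_API_KEY for summarization. The registry metadata shown earlier reports no required environment variables or primary credential. This is inconsistent and concerning, because the skill cannot run without these keys (and it will upload audio). In addition, uploading audio to cloud ASR is an inherent part of the design, so users should consider whether they are comfortable sending recordings to external services. voiceprint.py stores voiceprints locally (references/voice-db.json), which matches the privacy notice in SKILL.md.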

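✓ Persistence and permissions

The skill does not request 'always: true' and does not modify other skills. It writes and updates files in its own workspace (references/voice-profiles.md and references/voice-db.json) to maintain speaker profiles, which is the expected behavior for this feature. Autonomous agent invocation is left at the default (allowed), but no privilege escalation is involved.

⚠ scripts/analyze.js:177

Shell command execution detected (child_process).

Security is layered; review the code before running it.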

License

MIT-0: free to use, modify, and redistribute, with no attribution required.

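Runtime dependencies

No special dependencies

Versions

latest · v1.2.1 · 2026/3/16

v1.2.1: Fixed the privacy notice (audio is uploaded to cloud ASR; voiceprint embeddings are not) and changed the hard-coded model path to a relative path.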


Install command

Official: npx clawhub@latest install audio-analyzer

Mirror: npx clawhub@latest install audio-analyzer --registry https://cn.clawhub-mirror.com

Skill documentation

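(Because the original is long and contains many code blocks and command-line instructions that need no translation, only an abridged SKILL.md is given below, keeping the key information.)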

The only audio skill with persistent voice profiles. Transcription is just the first step: it also knows who is speaking, detects the scene, and generates structured notes from a template.

What Makes This Different

Feature	This Skill	Others
Transcription	✅ AssemblyAI (default) + Whisper + Gemini	✅ Usually one engine
Speaker ID by voiceprint	✅ Persistent profiles across sessions	❌ None
Scene auto-detection	✅ 5 built-in scenes + extensible	❌ One-size-fits-all
Structured output	✅ Scene-specific templates	⚠️ Generic summary
Multi-language	✅ Chinese + English	Varies

Quick Start

# 1. Install
cd skills/audio-analyzer/scripts && npm install

# 2. Configure (pick ONE; AssemblyAI recommended)
cp .env.example .env
# Edit .env: set ASSEMBLYAI_API_KEY

# 3. Run
node analyze.js /path/to/recording.m4a

Zero-config alternative: If no API key is set, it will attempt local Whisper or Gemini fallback.

Installation

# 1. Place it under workspace/skills/
cp -r audio-analyzer /path/to/.openclaw/workspace/skills/

# 2. Install dependencies
cd skills/audio-analyzer/scripts && npm install

# 3. Configure the ASR engine (pick one; AssemblyAI recommended)
cp .env.example .env
# Edit .env and fill in ASSEMBLYAI_API_KEY

# 4. Multi-agent setups: each agent's workspace needs its own copy

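Bootstrap snippet

Add the following to your agent's bootstrap.md:

## Audio file handling

When an audio file (.m4a/.mp3/.wav/.ogg/.flac) is received, it must be processed as follows:

1. Run cd /skills/audio-analyzer/scripts && node analyze.js <absolute path to audio file> for transcription and speaker separation
2. Read the transcript and determine the scene automatically from its content (or use the scene the user specifies)
3. Read skills/audio-analyzer/references/scenes/<scene>.md to load the template
4. Read skills/audio-analyzer/references/voice-profiles.md and compare against the voice profiles
5. Generate structured notes from the template
6. Confirm speaker identities with the user and update the voice profiles

Do not try to process audio files with tools such as summarize, pdf, or image.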

Core Pipeline

Audio File → Transcribe + Speaker Separation → Voice Profile Matching
→ Scene Detection → Load Template → Generate Notes → Update Profiles

Step 1: Transcribe

cd scripts && node analyze.js <file path>

ASR Engine Priority:

AssemblyAI (default, best quality) — needs ASSEMBLYAI_API_KEY
Gemini — needs GEMINI_API_KEY or OpenRouter key
Whisper (local) — needs whisper installed locally
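
The priority order above can be sketched as a small selection function. This is an illustration only, not the logic in analyze.js: the function name `pick_asr_engine` is hypothetical, treating `ASR_ENGINE` as a hard override is an assumption, and so is detecting local Whisper through a `whisper` executable on PATH. The OpenRouter route for Gemini is left out because the source does not name its environment variable.

```python
import shutil
from typing import Mapping, Optional


def pick_asr_engine(env: Mapping[str, str],
                    whisper_available: Optional[bool] = None) -> Optional[str]:
    """Return the first usable engine in the documented priority order, or None."""
    if whisper_available is None:
        # Assumption: local Whisper is detected via a `whisper` executable on PATH.
        whisper_available = shutil.which("whisper") is not None

    # Assumption: a non-empty ASR_ENGINE forces that engine regardless of keys.
    forced = env.get("ASR_ENGINE", "").strip().lower()
    if forced:
        return forced

    if env.get("ASSEMBLYAI_API_KEY"):
        return "assemblyai"  # documented default
    if env.get("GEMINI_API_KEY"):
        return "gemini"
    if whisper_available:
        return "whisper"     # local, no upload
    return None              # caller should print setup instructions
```

Calling `pick_asr_engine(os.environ)` before any upload makes it easy to log which engine will receive the audio, which matters for the privacy trade-off between cloud ASR and local Whisper.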

Output:

_transcript.txt — timestamped dialogue with speaker labels
_raw.json — raw JSON with speaker metadata

Step 2: Speaker Identification

Cross-references references/voice-profiles.md:

Read all known voice profiles (speech patterns, content patterns)
Analyze each speaker against profiles
Match rules:

- High confidence → auto-label with name
- Partial match → label as "possibly XXX" with evidence
- No match → label as "Unknown Speaker"

Ask user to confirm
Update profiles after confirmation
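
The three match rules can be expressed as a simple labeling function. This is a sketch under assumptions: the source does not say how confidence is measured in the text-based pipeline, so the numeric `score` in [0, 1], both thresholds, and the name `label_speaker` are hypothetical.

```python
from typing import Optional

# Assumed thresholds; the source gives no numeric values.
HIGH_CONFIDENCE = 0.8
PARTIAL_MATCH = 0.5


def label_speaker(best_name: Optional[str], score: float) -> str:
    """Map the best profile candidate and its score to a transcript label."""
    if best_name is not None and score >= HIGH_CONFIDENCE:
        return best_name                  # high confidence: auto-label
    if best_name is not None and score >= PARTIAL_MATCH:
        return f"possibly {best_name}"    # partial match: needs evidence and confirmation
    return "Unknown Speaker"              # no match
```

Whatever label comes out, the pipeline still asks the user to confirm before any profile is updated.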

Step 3: Scene Detection

Auto-detects based on transcript content:

Scene	Typical Keywords	Template
🚣 Rowing Training	stroke rate, pace, catch, drive	`scenes/rowing.md`
💼 Work Meeting	project, deadline, requirements, bug	`scenes/meeting.md`
🎤 Interview	user pain points, use case, feedback	`scenes/interview.md`
🎓 Talk/Lecture	welcome, today's topic, Q&A	`scenes/talk.md`
📝 General	(fallback)	`scenes/general.md`

Override manually: node analyze.js file.m4a meeting
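
One way to picture keyword-based detection with a manual override is the sketch below. It is not the skill's implementation, which may rely on the agent's own judgment of the transcript. The keyword lists come from the table above, while the counting rule and the name `detect_scene` are assumptions.

```python
from typing import Optional

# Keywords copied from the scene table; scene names match the template file stems.
SCENE_KEYWORDS = {
    "rowing": ["stroke rate", "pace", "catch", "drive"],
    "meeting": ["project", "deadline", "requirements", "bug"],
    "interview": ["user pain points", "use case", "feedback"],
    "talk": ["welcome", "today's topic", "q&a"],
}


def detect_scene(transcript: str, override: Optional[str] = None) -> str:
    """Return the scene with the most keyword hits, or "general" if none hit."""
    if override:
        return override  # mirrors: node analyze.js file.m4a meeting
    text = transcript.lower()
    # Naive substring counting: "pace" also matches "space", so treat as a rough signal.
    counts = {scene: sum(text.count(kw) for kw in kws)
              for scene, kws in SCENE_KEYWORDS.items()}
    best = max(counts, key=lambda scene: counts[scene])
    return best if counts[best] > 0 else "general"
```

Ties go to the scene listed first in `SCENE_KEYWORDS`, so the order of that dictionary acts as a tie-breaker.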

Step 4-5: Generate Structured Notes

Loads scene-specific template → generates structured output with key points, action items, and insights.

Step 6: Update Voice Profiles

After user confirms speaker identities, updates references/voice-profiles.md:

New person → add entry (role, speech patterns, content patterns)
Known person → refine description
Shared across all scenes and future recordings

Extending Scenes

Add a new .md file in references/scenes/:

references/scenes/
├── rowing.md      # 🚣 Rowing Training
├── meeting.md     # 💼 Work Meeting
├── interview.md   # 🎤 Interview
├── talk.md        # 🎓 Talk/Lecture
└── general.md     # 📝 General (fallback)
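
Because each scene is just a Markdown file, template lookup can be sketched as a path check with a fallback to general.md. The helper names `available_scenes` and `load_scene_template` are hypothetical, and the fallback-on-missing-file behavior is an assumption based on general.md being described as the fallback scene.

```python
from pathlib import Path
from typing import List


def available_scenes(scenes_dir: Path) -> List[str]:
    """List scene names, one per .md file in the scenes directory."""
    return sorted(p.stem for p in scenes_dir.glob("*.md"))


def load_scene_template(scenes_dir: Path, scene: str) -> str:
    """Read <scene>.md, falling back to general.md when the scene has no template."""
    candidate = scenes_dir / f"{scene}.md"
    if not candidate.is_file():
        candidate = scenes_dir / "general.md"  # assumed fallback
    return candidate.read_text(encoding="utf-8")
```

With this layout, adding a scene means dropping one new file into references/scenes/; no code change is needed for it to show up in `available_scenes`.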

Requirements

Node.js 18+
At least ONE of: AssemblyAI key, Gemini key, or local Whisper
cd scripts && npm install

Error Handling

Situation	Response
API quota exceeded	"Transcription service unavailable, check API quota"
File > 100MB	Warn user: estimated 5-10 min processing
Empty transcript	"No speech detected in audio"
Network error	"Connection error, please retry"
No ASR engine available	List setup instructions for each engine
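
Two rows of this table lend themselves to simple pre- and post-checks. The sketch below is illustrative only: the function names are hypothetical, reading "100MB" as 100 × 1024 × 1024 bytes is an assumption, and the warning wording paraphrases the table rather than quoting the skill's actual output.

```python
from typing import Optional

# Assumption: "100MB" means binary megabytes.
LARGE_FILE_BYTES = 100 * 1024 * 1024


def large_file_warning(size_bytes: int) -> Optional[str]:
    """Return a warning for files over the documented size limit, else None."""
    if size_bytes > LARGE_FILE_BYTES:
        return "Large file: estimated 5-10 min processing"
    return None


def transcript_error(transcript: str) -> Optional[str]:
    """Return the documented message when the transcript has no content."""
    if not transcript.strip():
        return "No speech detected in audio"
    return None
```

In practice the size would come from `os.path.getsize(path)` before the upload starts, so the user sees the warning before waiting on the ASR service.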

Advanced: Voiceprint Extraction (Optional)

The skill includes an optional voiceprint.py tool for embedding-based speaker identification using ONNX neural models. This is separate from the text-based voice profile matching in the core pipeline.

What it does

Extracts speaker audio segments using ffmpeg
Computes 256-dim speaker embeddings via WeSpeaker ONNX model
Stores embeddings locally in references/voice-db.json
Matches new speakers against stored embeddings (cosine similarity)
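
The matching step can be sketched in plain Python as follows. This is a minimal illustration of cosine-similarity matching, not the code in voiceprint.py: the in-memory `db` mapping of name to embedding is a hypothetical stand-in for whatever schema references/voice-db.json actually uses, and the 0.7 threshold is an assumption.

```python
import math
from typing import Mapping, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: Sequence[float],
               db: Mapping[str, Sequence[float]],
               threshold: float = 0.7) -> Tuple[Optional[str], float]:
    """Return (name, score) of the closest stored embedding, or (None, score) below threshold."""
    if not db:
        return None, 0.0
    scores = {name: cosine_similarity(query, emb) for name, emb in db.items()}
    name = max(scores, key=lambda n: scores[n])
    if scores[name] < threshold:
        return None, scores[name]  # assumed cutoff; treat as unknown speaker
    return name, scores[name]
```

Real embeddings would be 256-dimensional vectors from the WeSpeaker model; the two-dimensional vectors in a quick test behave the same way because cosine similarity depends only on direction, not length.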

Setup (optional — core skill works without this)

# 1. Install Python dependencies
pip install numpy librosa onnxruntime

# 2. Install ffmpeg
apt install ffmpeg   # or: brew install ffmpeg

# 3. Download WeSpeaker model
mkdir -p ~/.openclaw/models/wespeaker
# Download cnceleb_resnet34_LM.onnx from:
# https://github.com/wenet-e2e/wespeaker/releases
# Set: export WESPEAKER_MODEL=~/.openclaw/models/wespeaker/cnceleb_resnet34_LM.onnx

Usage

# Extract voiceprints from a transcribed recording
python3 voiceprint.py extract recording.m4a recording_raw.json

# Enroll a known speaker
python3 voiceprint.py enroll "JoJo" jojo_sample.m4a

# Identify speaker in new audio
python3 voiceprint.py identify unknown.m4a

Privacy Notice

All voice embeddings are stored locally in references/voice-db.json
Voice embeddings are never sent externally
Audio files ARE uploaded to cloud ASR (AssemblyAI/Gemini) for transcription. For fully offline operation, use local Whisper
Speaker identity updates require explicit user confirmation
To delete all voiceprint data: rm references/voice-db.json

Voice Profiles (Text-Based)

See references/voice-profiles.md. Shared across all scenes — same person is recognized regardless of context. This is the lightweight alternative that works without the ONNX model.

Source: ClawHub · Chinese localization: 龙虾技能库
