Speech to Text Transcription
v1
Transcribe audio and video files to text with speaker detection, timestamps, and format conversion.
Setup
On first use, read setup.md and start helping with transcription needs.
When to Use
User has audio or video files that need transcription. The agent handles local files, URLs, voice memos, podcasts, interviews, meetings, and lectures.
Architecture
Memory lives in ~/speech-to-text-transcription/. See memory-template.md for structure.
~/speech-to-text-transcription/
├── memory.md      # Provider preferences, defaults
├── transcripts/   # Saved transcriptions
└── temp/          # Processing workspace
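A minimal sketch of creating this layout, assuming nothing beyond the directory names listed above. The function name `init_workspace` is hypothetical and not part of the skill; the optional argument exists only so the layout can be created somewhere other than the home directory.

```shell
# Create the memory layout described above. Safe to run repeatedly.
# init_workspace is an illustrative name, not something the skill defines.
init_workspace() {
  root="${1:-$HOME/speech-to-text-transcription}"
  mkdir -p "$root/transcripts" "$root/temp" || return 1
  # Create memory.md only if it is missing, so saved preferences are kept
  [ -f "$root/memory.md" ] || : > "$root/memory.md"
  printf '%s\n' "$root"
}
```

Calling `init_workspace` with no argument targets `~/speech-to-text-transcription/` and prints the path it used.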
Quick Reference
| Topic | File |
|-------|------|
| Setup process | setup.md |
| Memory template | memory-template.md |
Core Rules
- Detect File Type First
Before transcription, identify the input:
Local file path → verify it exists, check the format
URL → download to temp, then process
Meeting recording → likely needs speaker diarization
Voice memo → usually single speaker, shorter
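The first two checks above could be sketched as a small shell function. This is an illustration only: the name `classify_input` and the extension lists are assumptions, not something the skill specifies, and telling a meeting recording from a voice memo still needs context from the user.

```shell
# classify_input PATH_OR_URL -> prints url, audio, video, unknown, or missing
# Hypothetical helper; the extension lists below are illustrative assumptions.
classify_input() {
  case "$1" in
    http://*|https://*) echo "url" ;;
    *)
      if [ ! -f "$1" ]; then
        echo "missing"
        return 1
      fi
      # Lowercase the extension so .MP3 and .mp3 are treated alike
      ext="$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')"
      case "$ext" in
        mp3|wav|m4a|flac|ogg) echo "audio" ;;
        mp4|mkv|mov|webm) echo "video" ;;
        *) echo "unknown" ;;
      esac
      ;;
  esac
}
```

A result of `url` would lead to the download step, and `video` to extracting the audio track with ffmpeg before transcription.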
- Choose Provider Based on Context
- Handle Long Audio
Files over 25MB or 2 hours:
Split into chunks (use ffmpeg)
Process each chunk
Merge transcripts with proper timestamps
Never attempt a single upload for large files
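As a rough sketch of the threshold check, assuming "25MB" means 25 × 1024 × 1024 bytes (the source does not say which unit is meant). The names `needs_split`, `MAX_BYTES`, and `MAX_SECONDS` are hypothetical.

```shell
# Thresholds taken from the rule above: 25 MB or 2 hours.
# Assumption: MB is read as MiB here; adjust if the provider counts differently.
MAX_BYTES=$((25 * 1024 * 1024))
MAX_SECONDS=$((2 * 60 * 60))

# needs_split BYTES SECONDS -> exit status 0 if the file should be chunked
needs_split() {
  [ "$1" -gt "$MAX_BYTES" ] || [ "$2" -gt "$MAX_SECONDS" ]
}

# Usage with a real file (ffprobe ships with ffmpeg):
#   bytes=$(($(wc -c < audio.mp3)))
#   secs=$(ffprobe -v error -show_entries format=duration -of csv=p=0 audio.mp3 | cut -d. -f1)
#   needs_split "$bytes" "$secs" && echo "split first"
```

Keeping the check separate from the measurement means the same function works whether the size and duration come from ffprobe or from metadata the user supplies.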
- Preserve Context
After transcription:
Ask if the user wants the transcript saved
Suggest a filename based on content
Offer to extract action items or a summary
- Output Formats
Default to plain text. Offer alternatives:
.txt: clean text, no timestamps
.srt / .vtt: subtitles with timing
.json: structured with word-level timing
.md: formatted with speaker labels
Common Traps
Assuming one provider works for all → Whisper has no speaker diarization, AssemblyAI needs an API key
Uploading huge files directly → Timeouts, memory errors. Split first.
Ignoring audio quality → Noisy audio needs preprocessing (ffmpeg noise reduction)
Not checking language → Whisper auto-detects but can fail on mixed-language content
Losing speaker context → Multi-speaker content without diarization becomes unusable
Requirements
Required: ffmpeg (for audio processing)
Optional API keys (only if using cloud providers):
OPENAI_API_KEY: for OpenAI Whisper API
ASSEMBLYAI_API_KEY: for AssemblyAI (speaker diarization)
DEEPGRAM_API_KEY: for Deepgram (real-time)
Local Whisper works without any API keys.
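One way to turn the key list above into a suggestion is sketched below. The function `choose_provider` and its `diarize` / `realtime` / `plain` arguments are hypothetical, and the mapping is only an assumption drawn from the annotations next to each key. It proposes a provider; it does not upload anything, and a cloud provider should still be used only after the user chooses it.

```shell
# choose_provider [diarize|realtime|plain] -> prints a suggested provider
# Illustrative only: falls back to local Whisper whenever no matching key is set.
choose_provider() {
  need="${1:-plain}"
  if [ "$need" = "diarize" ] && [ -n "${ASSEMBLYAI_API_KEY:-}" ]; then
    echo "assemblyai"
  elif [ "$need" = "realtime" ] && [ -n "${DEEPGRAM_API_KEY:-}" ]; then
    echo "deepgram"
  else
    # Local Whisper needs no API key, so it is the safe default
    echo "local-whisper"
  fi
}
```

Defaulting to local Whisper keeps audio on the machine unless a specific need and a configured key both point elsewhere.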
Provider Quick Reference
Local Whisper (No API Key)
# Install
pip install openai-whisper

# Basic transcription
whisper audio.mp3 --model base --output_format txt

# With timestamps
whisper audio.mp3 --model medium --output_format srt

Models: tiny (fast) → base → small → medium → large (accurate)
OpenAI Whisper API
curl -X POST https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@audio.mp3" \
  -F model="whisper-1"
AssemblyAI (Speaker Diarization)
# Upload
curl -X POST https://api.assemblyai.com/v2/upload \
  -H "Authorization: $ASSEMBLYAI_API_KEY" \
  --data-binary @audio.mp3

# Transcribe with speakers
curl -X POST https://api.assemblyai.com/v2/transcript \
  -H "Authorization: $ASSEMBLYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"audio_url": "URL", "speaker_labels": true}'
Audio Preprocessing
Extract Audio from Video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Reduce Noise
ffmpeg -i noisy.wav -af "afftdn=nf=-25" clean.wav

Split Long Audio
# Split into 10-minute chunks
ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
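When chunk transcripts are merged, each chunk's timestamps need to be shifted by where that chunk starts. A minimal sketch, assuming the `chunk_%03d.mp3` naming and 600-second segments from the command above. The helpers `chunk_offset` and `to_hms` are hypothetical names. Because `-c copy` cuts on packet boundaries, real chunk lengths can differ slightly from 600 seconds, so these offsets are approximate; reading each chunk's duration with ffprobe would be more exact.

```shell
CHUNK_SECONDS=600

# chunk_offset chunk_003.mp3 -> start of that chunk in seconds
chunk_offset() {
  n="${1##*_}"   # e.g. "003.mp3"
  n="${n%%.*}"   # e.g. "003"
  # Strip leading zeros so the index is never parsed as octal
  n="$(printf '%s' "$n" | sed 's/^0*//')"
  echo $(( ${n:-0} * CHUNK_SECONDS ))
}

# to_hms SECONDS -> HH:MM:SS
to_hms() {
  printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}
```

For example, a line stamped at 65 seconds inside `chunk_003.mp3` would be reported at `to_hms $(( $(chunk_offset chunk_003.mp3) + 65 ))` in the merged transcript.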
Security & Privacy
Data that stays local:
Transcripts in ~/speech-to-text-transcription/transcripts/
Local Whisper processes entirely on-device
Data that leaves your machine (if using APIs):
Audio file sent to the chosen provider (OpenAI, AssemblyAI, Deepgram)
Transcript returned and stored locally
This skill does NOT:
Store API keys in plain text (use environment variables)
Auto-upload without confirmation
Retain files on external servers after processing
External Endpoints
| Endpoint | Data Sent | Purpose |
|----------|-----------|---------|
| api.openai.com/v1/audio | Audio file | Whisper API transcription |
| api.assemblyai.com/v2 | Audio file | AssemblyAI transcription |
| api.deepgram.com/v1 | Audio stream | Deepgram transcription |
Only called when the user explicitly chooses a cloud provider. Local Whisper sends nothing.
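The "no upload without confirmation" rule could be enforced with a small gate like the one below. This is a sketch under assumptions: `confirm_upload` is a hypothetical helper, and the prompt wording is illustrative. Anything other than an explicit yes, including empty input or a closed stdin, counts as a refusal.

```shell
# confirm_upload PROVIDER FILE -> exit status 0 only on an explicit yes
# Hypothetical helper; the prompt goes to stderr so stdout stays clean.
confirm_upload() {
  printf 'Send %s to %s? [y/N] ' "$2" "$1" >&2
  read -r answer || return 1
  case "$answer" in
    y|Y|yes|YES) return 0 ;;
    *) return 1 ;;
  esac
}
```

A cloud call would then be written as `confirm_upload openai audio.mp3 && curl ...`, so the request is never sent unless the user agreed.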
Trust
When you use a cloud transcription provider, audio data is sent to OpenAI, AssemblyAI, or Deepgram. Only install if you trust these services with your audio. For sensitive content, use local Whisper.
Related Skills
Install with clawhub install if the user confirms:
audio: General audio processing
ffmpeg: Video and audio conversion
podcast: Podcast creation and editing
Feedback
If useful: clawhub star speech-to-text-transcription
Stay updated: clawhub sync