audio to text and video to text
v1.0.0
Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text, including MP3, MP4, WAV, M4A, OGG, WebM, MOV, AVI, FLAC, and more. Trigger this skill for any request involving: "transcribe", "convert audio to text", "speech to text", "get transcript of", "extract audio from video", "meeting notes from recording", "subtitles", "captions", or similar. Also trigger when the user uploads or references a media file and asks what was said, discussed, or mentioned in it. If unsure whether audio/video transcription is involved, use this skill.
Transcription Skill
Converts audio and video files into clean, readable text using OpenAI's Whisper API, with ffmpeg for media handling.
Overview
This skill handles the full pipeline:
- Media extraction: use ffmpeg to strip audio from video files and convert it to a Whisper-compatible format
- Chunking: split large files into overlapping segments that stay under the API's 25 MB limit
- Transcription: send each chunk to OpenAI's Whisper API
- Assembly: merge chunk transcripts, adjusting timestamps, into a single clean output
- Post-processing: optionally clean up with Claude (punctuation, speaker labels, summaries)
Requirements
- ffmpeg must be installed (run which ffmpeg to verify; it's usually pre-installed in claude.ai's environment)
- OpenAI API key stored in the environment as OPENAI_API_KEY; the user must provide this
- Python packages: openai, pydub (install via pip if needed)
Quick Start
When a user provides a media file, run the transcription script:
# Install dependencies if missing
pip install openai pydub --break-system-packages -q
# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
  --input "/path/to/media/file" \
  --output "/mnt/user-data/outputs/transcript.txt" \
  --api-key "$OPENAI_API_KEY"
See scripts/transcribe.py for the full implementation.
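As a rough illustration of the transcription step, the sketch below sends a single file to the API. It is not the contents of scripts/transcribe.py: the function name `transcribe_file` is hypothetical, and it assumes the 1.x `openai` Python SDK, whose client exposes `audio.transcriptions.create`. The client is passed in as an argument so the function can be exercised with a stub.

```python
from pathlib import Path


def transcribe_file(client, path, model="whisper-1", language=None, prompt=None):
    """Send one audio file to the transcription endpoint and return its text.

    `client` is assumed to be an `openai.OpenAI` instance (1.x SDK).
    `language` and `prompt` are only sent when provided, so the API
    falls back to auto-detection and no context hint by default.
    """
    kwargs = {"model": model}
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt
    # The endpoint expects a binary file handle, not a path string.
    with Path(path).open("rb") as audio:
        response = client.audio.transcriptions.create(file=audio, **kwargs)
    return response.text
```

With the real SDK this would be called as `transcribe_file(OpenAI(), "chunk.mp3", language="en")`, where `OpenAI()` reads OPENAI_API_KEY from the environment.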
Supported Formats
| Category | Formats |
| --- | --- |
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |
ffmpeg handles extraction from any of these.
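One way the extraction step could look is sketched below. The helper names and the chosen settings (mono, 16 kHz, 64 kbps MP3) are assumptions for illustration, not values taken from the script; the ffmpeg flags themselves (`-y`, `-i`, `-vn`, `-ac`, `-ar`, `-b:a`) are standard.

```python
import subprocess


def build_extract_command(src, dst, sample_rate=16000, bitrate="64k"):
    """Return an ffmpeg argument list that writes a mono audio track to dst."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # input media file (audio or video)
        "-vn",                     # drop any video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the given rate in Hz
        "-b:a", bitrate,           # target audio bitrate
        str(dst),                  # output container is inferred from the extension
    ]


def extract_audio(src, dst):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_extract_command(src, dst), check=True, capture_output=True)
    return dst
```

Keeping the command builder separate from the `subprocess.run` call makes the arguments easy to inspect or log before anything is executed.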
Options & Flags
| Flag | Default | Description |
| --- | --- | --- |
| --model | whisper-1 | Whisper model to use (whisper-1, gpt-4o-transcribe) |
| --language | auto-detect | ISO 639-1 language code (e.g. en, ar, fr) |
| --format | txt | Output format: txt, srt, vtt, json |
| --timestamps | off | Include timestamps in output |
| --chunk-size | 20 | Max chunk size in MB (must be ≤ 25) |
| --prompt | none | Context hint to improve accuracy (e.g. domain vocab) |
Output Formats
- txt: plain text, ideal for most uses
- srt: SubRip subtitle format (for video players)
- vtt: WebVTT format (for web video)
- json: full Whisper JSON with segments and timestamps
Step-by-Step Workflow
- Check for the file
Ask the user to upload the file or provide a local path. Check:
ls /mnt/user-data/uploads/
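The same check can be done from Python when the result needs filtering. The sketch below lists only files whose extension appears in the supported formats table; `find_media_files` is a hypothetical helper, not part of the script.

```python
from pathlib import Path

# Extensions taken from the supported formats table.
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".ogg", ".flac", ".aac", ".opus", ".wma"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm", ".wmv", ".m4v"}
SUPPORTED_EXTS = AUDIO_EXTS | VIDEO_EXTS


def find_media_files(directory):
    """Return supported media files directly inside `directory`, sorted by path.

    Returns an empty list if the directory does not exist. The extension
    match is case-insensitive, so MEETING.MP4 is accepted.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTS
    )
```

An empty result is a cue to ask the user for the file or its path before going further.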
- Check ffmpeg and install deps
- Get the API key
If OPENAI_API_KEY is not set in the environment, ask the user:
"Please provide your OpenAI API key. It starts with sk-. You can get one at https://platform.openai.com/api-keys"
- Run the script
- Post-process (optional but recommended)
After transcription, offer to:
  - Clean up punctuation/formatting with Claude
  - Summarize the content
  - Extract action items, speakers, or key topics
  - Translate to another language
Use the transcript text directly in the conversation for these steps.
Handling Large Files
The script automatically splits files > 20 MB into overlapping chunks (with a 1-second overlap for continuity). Each chunk is transcribed separately and the results are merged.
For very long recordings (> 1 hour), warn the user that it may take a few minutes, and show progress.
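The chunking and merging logic can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function names are hypothetical, audio is assumed to be constant-bitrate when converting a size limit into seconds, and overlap duplicates are resolved by dropping any segment that starts before the previously kept segment ends.

```python
def chunk_seconds(max_mb, bitrate_kbps):
    """Approximate seconds of constant-bitrate audio that fit in max_mb."""
    return (max_mb * 1024 * 1024 * 8) / (bitrate_kbps * 1000)


def plan_chunks(duration_s, chunk_s, overlap_s=1.0):
    """Return (start, end) windows covering the recording with overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back so adjacent chunks share audio
    return windows


def merge_segments(chunks):
    """Shift per-chunk segment times onto the global timeline.

    `chunks` is a list of (chunk_start, segments); each segment is a dict
    with 'start', 'end' and 'text', timed relative to its own chunk.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            start = seg["start"] + chunk_start
            end = seg["end"] + chunk_start
            if merged and start < merged[-1]["end"]:
                continue  # falls inside the overlap already covered
            merged.append({"start": start, "end": end, "text": seg["text"]})
    return merged
```

For example, a 25-second recording with 10-second chunks and a 1-second overlap yields windows starting at 0, 9 and 18 seconds. The number of windows also gives a natural unit for reporting progress on long recordings.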
Error Handling
| Error | Fix |
| --- | --- |
| AuthenticationError | Invalid API key; ask the user to verify it |
| RateLimitError | Wait 60s and retry, or use --chunk-size 10 |
| InvalidRequestError: file too large | Reduce --chunk-size below 25 |
| ffmpeg not found | sudo apt install ffmpeg or brew install ffmpeg |
| No audio stream found | File may be corrupt or in the wrong format |
Example Interaction
User: "Can you transcribe this meeting recording?" [uploads meeting.mp4]
→ Check file exists in /mnt/user-data/uploads/
→ Run transcribe.py on it
→ Save transcript to /mnt/user-data/outputs/
→ present_files() to the user
→ Offer to summarize or extract action items
Notes for OpenClaw.AI
- Always save output to /mnt/user-data/outputs/ so users can download it
- Use present_files() to share the transcript file with the user after saving
- For business users, suggest the srt or vtt format if they're adding captions to video
- The --prompt flag is useful for technical/domain-specific content: pass a few domain keywords to improve accuracy
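For the srt suggestion above, here is a minimal sketch of how timestamped segments might be rendered as SubRip text. The helper names are hypothetical, and the input is assumed to be a list of dicts with 'start', 'end' (in seconds) and 'text', similar to the segments in the json output.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the SubRip timestamp layout."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments as numbered SubRip cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

WebVTT uses a very similar cue layout, with a WEBVTT header line and a period instead of a comma before the milliseconds.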