Speech to Text Transcription

Transcribe audio and video files to text with speaker detection, timestamps, and format conversion.


by @ivangdavila (Iván)·MIT-0



License

MIT-0


Free to use, modify, and redistribute, with no attribution required.


Runtime dependencies

No special dependencies

Install command


Official: npx clawhub@latest install speech-to-text-transcription

Mirror: npx clawhub@latest install speech-to-text-transcription --registry https://cn.longxiaskill.com


Skill documentation

Setup

On first use, read setup.md and start helping with transcription needs.

When to Use

User has audio or video files that need transcription. The agent handles local files, URLs, voice memos, podcasts, interviews, meetings, and lectures.

Architecture

Memory lives in ~/speech-to-text-transcription/. See memory-template.md for structure.

~/speech-to-text-transcription/
├── memory.md        # Provider preferences, defaults
├── transcripts/     # Saved transcriptions
└── temp/            # Processing workspace
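As a minimal sketch of preparing this layout, the helper below creates the two directories if they are missing. The function name `ensure_workspace` is hypothetical, and memory.md is deliberately left alone because its structure comes from memory-template.md.

```python
from pathlib import Path
from typing import Optional


def ensure_workspace(base: Optional[Path] = None) -> Path:
    """Create transcripts/ and temp/ under the workspace root if they do not exist."""
    # Default root matches the layout described above; pass `base` to use another location.
    root = base if base is not None else Path.home() / "speech-to-text-transcription"
    for sub in ("transcripts", "temp"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling it repeatedly is safe, since existing directories are left untouched.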

Quick Reference

| Topic | File |
| --- | --- |
| Setup process | setup.md |
| Memory template | memory-template.md |

Core Rules

Detect File Type First

Before transcription, identify the input:

- Local file path → verify it exists, check the format
- URL → download to temp, then process
- Meeting recording → likely needs speaker diarization
- Voice memo → usually single speaker, shorter
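The first two checks can be sketched in Python roughly as follows. The function name `classify_input` and the extension list are assumptions for illustration, not part of the skill; distinguishing a meeting recording from a voice memo still needs a human or content-based hint.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed set of extensions; adjust to whatever your ffmpeg build can decode.
MEDIA_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".mp4", ".mkv", ".mov", ".webm"}


def classify_input(source: str) -> str:
    """Return 'url', 'local', 'missing', or 'unsupported' for a transcription input."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"  # caller should download to temp/ before processing
    path = Path(source).expanduser()
    if not path.is_file():
        return "missing"
    if path.suffix.lower() not in MEDIA_EXTENSIONS:
        return "unsupported"
    return "local"
```

The extension check is only a quick filter; a file with a media extension can still fail to decode.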

Choose Provider Based on Context

| Scenario | Best Provider | Why |
| --- | --- | --- |
| Quick local transcription | Whisper (local) | No API key, free, private |
| High accuracy needed | OpenAI Whisper API | Best quality |
| Speaker identification | AssemblyAI | Native diarization |
| Real-time/streaming | Deepgram | Low latency |
| Long content (>2 hours) | Split + batch | Avoid timeouts |
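One way to encode the table is a small selection function. This is a sketch: the function name, the flag names, and the precedence order (real-time first, then speakers, then accuracy) are assumptions, since the table does not say which need wins when several apply.

```python
def choose_provider(needs_speakers: bool = False,
                    realtime: bool = False,
                    high_accuracy: bool = False) -> str:
    """Map the scenarios in the table to a provider name."""
    if realtime:
        return "deepgram"       # low latency
    if needs_speakers:
        return "assemblyai"     # native diarization
    if high_accuracy:
        return "openai"         # OpenAI Whisper API
    return "whisper-local"      # no API key, free, private
```

Long content is handled separately by splitting, so it does not change the provider choice here.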

Handle Long Audio

Files over 25MB or 2 hours:

- Split into chunks (use ffmpeg)
- Process each chunk
- Merge transcripts with proper timestamps
- Never attempt a single upload for large files
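The threshold check and the timestamp merge can be sketched as below. The `Segment` type and function names are hypothetical, and the merge assumes every chunk is exactly `chunk_seconds` long; when chunks are cut with stream copy the real boundaries can drift slightly, so measured chunk durations are safer if precise timing matters.

```python
from dataclasses import dataclass

MAX_BYTES = 25 * 1024 * 1024   # 25MB threshold from the rule above
MAX_SECONDS = 2 * 60 * 60      # 2-hour threshold from the rule above


@dataclass
class Segment:
    start: float  # seconds from the start of the chunk (or file, after merging)
    end: float
    text: str


def needs_split(size_bytes: int, duration_s: float) -> bool:
    """True when the file exceeds either the size or the duration threshold."""
    return size_bytes > MAX_BYTES or duration_s > MAX_SECONDS


def merge_chunks(chunks: list[list[Segment]], chunk_seconds: float) -> list[Segment]:
    """Shift each chunk's segments by its position so times refer to the original file."""
    merged = []
    for index, segments in enumerate(chunks):
        offset = index * chunk_seconds
        for seg in segments:
            merged.append(Segment(seg.start + offset, seg.end + offset, seg.text))
    return merged
```

For example, a segment starting at 1.0 s in the second 600-second chunk ends up at 601.0 s in the merged transcript.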

Preserve Context

After transcription:

- Ask if the user wants the transcript saved
- Suggest a filename based on content
- Offer to extract action items or a summary

Output Formats

Default to plain text. Offer alternatives:

- .txt: clean text, no timestamps
- .srt / .vtt: subtitles with timing
- .json: structured, with word-level timing
- .md: formatted with speaker labels

Common Traps

- Assuming one provider works for all → Whisper has no built-in diarization; AssemblyAI needs an API key
- Uploading huge files directly → Timeouts, memory errors. Split first.
- Ignoring audio quality → Noisy audio needs preprocessing (ffmpeg noise reduction)
- Not checking language → Whisper auto-detects but can fail on mixed-language content
- Losing speaker context → Multi-speaker content without diarization becomes unusable

Requirements

Required: ffmpeg (for audio processing)

Optional API keys (only if using cloud providers):

- OPENAI_API_KEY: for OpenAI Whisper API
- ASSEMBLYAI_API_KEY: for AssemblyAI (speaker diarization)
- DEEPGRAM_API_KEY: for Deepgram (real-time)

Local Whisper works without any API keys.
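A quick preflight check for these requirements might look like the sketch below. The function names are hypothetical; the environment variable names are the ones listed above.

```python
import os
import shutil

PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "assemblyai": "ASSEMBLYAI_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
}


def ffmpeg_available() -> bool:
    """True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def available_cloud_providers(env=None) -> list[str]:
    """List cloud providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]
```

An empty result simply means only local Whisper is usable, which needs no key.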

Provider Quick Reference

Local Whisper (No API Key)

    # Install
    pip install openai-whisper

    # Basic transcription
    whisper audio.mp3 --model base --output_format txt

    # With timestamps
    whisper audio.mp3 --model medium --output_format srt

Models: tiny (fast) → base → small → medium → large (accurate)
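When driving the whisper CLI from a script, it can help to validate the model and format before launching it. The sketch below builds the same command as an argument list; the helper names are hypothetical, and the format tuple covers only the formats this document mentions.

```python
import subprocess

MODELS = ("tiny", "base", "small", "medium", "large")
OUTPUT_FORMATS = ("txt", "srt", "vtt", "json")


def whisper_command(audio_path: str, model: str = "base", output_format: str = "txt") -> list[str]:
    """Build the argument list for the whisper CLI, rejecting unknown options early."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    return ["whisper", audio_path, "--model", model, "--output_format", output_format]


def transcribe_locally(audio_path: str, **options) -> None:
    """Run whisper and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(whisper_command(audio_path, **options), check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.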

OpenAI Whisper API

    curl -X POST https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: multipart/form-data" \
      -F file="@audio.mp3" \
      -F model="whisper-1"

AssemblyAI (Speaker Diarization)

    # Upload
    curl -X POST https://api.assemblyai.com/v2/upload \
      -H "Authorization: $ASSEMBLYAI_API_KEY" \
      --data-binary @audio.mp3

    # Transcribe with speakers
    curl -X POST https://api.assemblyai.com/v2/transcript \
      -H "Authorization: $ASSEMBLYAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"audio_url": "URL", "speaker_labels": true}'
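Once a diarized transcript has finished processing, its speaker turns can be rendered as labeled lines. The sketch below assumes the completed transcript JSON carries an `utterances` list whose items have `speaker`, `start` (in milliseconds), and `text` fields; verify that shape against the provider's current documentation before relying on it. The function names are hypothetical.

```python
def format_timestamp(ms: int) -> str:
    """Convert milliseconds to HH:MM:SS, dropping the fractional second."""
    total_seconds = ms // 1000
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def format_utterances(utterances: list[dict]) -> str:
    """Render one '[time] Speaker X: text' line per utterance."""
    lines = []
    for u in utterances:
        lines.append(f"[{format_timestamp(u['start'])}] Speaker {u['speaker']}: {u['text']}")
    return "\n".join(lines)
```

This output suits the .md or .txt formats; subtitle formats need end times and millisecond precision as well.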

Audio Preprocessing

Extract Audio from Video

    ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Reduce Noise

    ffmpeg -i noisy.wav -af "afftdn=nf=-25" clean.wav

Split Long Audio

    # Split into 10-minute chunks
    ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
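The three ffmpeg steps above can be wrapped for scripted use. This is a sketch with hypothetical helper names; each function returns the same arguments as the corresponding command, and `run` executes one of them.

```python
import subprocess


def extract_audio_cmd(video: str, wav: str) -> list[str]:
    """16 kHz mono PCM WAV, with the video stream dropped."""
    return ["ffmpeg", "-i", video, "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", wav]


def denoise_cmd(src: str, dst: str, noise_floor_db: int = -25) -> list[str]:
    """FFT-based denoising with the afftdn filter."""
    return ["ffmpeg", "-i", src, "-af", f"afftdn=nf={noise_floor_db}", dst]


def split_cmd(src: str, pattern: str = "chunk_%03d.mp3", chunk_seconds: int = 600) -> list[str]:
    """Cut into fixed-length chunks without re-encoding."""
    return ["ffmpeg", "-i", src, "-f", "segment", "-segment_time", str(chunk_seconds),
            "-c", "copy", pattern]


def run(cmd: list[str]) -> None:
    """Execute a command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

A typical sequence would be `run(extract_audio_cmd("video.mp4", "audio.wav"))` followed by denoising or splitting as needed.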

Security & Privacy

Data that stays local:

- Transcripts in ~/speech-to-text-transcription/transcripts/
- Local Whisper processes entirely on-device

Data that leaves your machine (if using APIs):

- Audio file sent to the chosen provider (OpenAI, AssemblyAI, Deepgram)
- Transcript returned and stored locally

This skill does NOT:

- Store API keys in plain text (use environment variables)
- Auto-upload without confirmation
- Retain files on external servers after processing

External Endpoints

| Endpoint | Data Sent | Purpose |
| --- | --- | --- |
| api.openai.com/v1/audio | Audio file | Whisper API transcription |
| api.assemblyai.com/v2 | Audio file | AssemblyAI transcription |
| api.deepgram.com/v1 | Audio stream | Deepgram transcription |

Only called when the user explicitly chooses a cloud provider. Local Whisper sends nothing.
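The no-upload-without-confirmation rule can be enforced with a small gate in front of any network call. This is an illustrative sketch: the function name, the "whisper-local" label, and the prompt wording are assumptions.

```python
from typing import Callable


def may_send(provider: str, filename: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only if nothing leaves the machine or the user explicitly agrees."""
    if provider == "whisper-local":
        return True  # local Whisper: no data is sent anywhere
    answer = ask(f"Send {filename} to {provider}? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" on an empty or unrecognized answer keeps an accidental keypress from uploading audio.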

Trust

By using cloud transcription providers, audio data is sent to OpenAI, AssemblyAI, or Deepgram. Only install if you trust these services with your audio. For sensitive content, use local Whisper.

Related Skills

If the user confirms, install any of these with clawhub install:

- audio: General audio processing
- ffmpeg: Video and audio conversion
- podcast: Podcast creation and editing

Feedback

- If useful: clawhub star speech-to-text-transcription
- Stay updated: clawhub sync
