Video Captions

Name: Video Captions
Rating: 2

生成 professional captions and subtitles with multi-engine transcription, word-level timing, styling pre设置s, and burn-in.

2· 1.1k·0 当前·0 累计

by @ivangdavila (Iván)·MIT-0

开发工具代码生成视频处理

下载技能包

License

MIT-0

License

MIT-0

可自由使用、修改和再分发，无需署名。

查看条款 ↗

运行时依赖

无特殊依赖

安装命令

点击复制

官方npx clawhub@latest install video-captions

镜像加速npx clawhub@latest install video-captions --registry https://cn.longxiaskill.com 镜像可用

需要定制？告诉我你的需求 →

技能文档

When to Use

User needs captions or subtitles for video content. 代理 handles transcription, timing, 格式化ting, styling, translation, and burn-in across all major 格式化s and 平台s.

Quick Reference Topic File Transcription engines engines.md 输出格式化s 格式化s.md Styling pre设置s styling.md 平台 requirements 平台s.md Core Rules

Engine Selection by 上下文

Scenario Engine Why Default (recommended) Whisper local 100% offline, no data leaves machine 应用le Silicon MLX Whisper Native acceleration, still local Word timestamps whisper-timestamped DTW alignment, still local

Default: Whisper local (turbo 模型). See engines.md for optional cloud alternatives.

格式化 Selection by 平台

平台格式化 Notes YouTube VTT or SRT VTT preferred Netflix/Pro TTML Strict timing rules Social (TikTok, IG) Burn-in (ASS) Embedded in video General SRT Universal compatibility Karaoke/effects ASS Advanced styling

Ask user's tar获取平台 if not specified.

Professional Timing Standards

Netflix-compliant (default):

Min duration: 5/6 second (0.833s) Max duration: 7 seconds Max chars/line: 42 Max lines: 2 Gap between subtitles: 2+ frames

Social media:

Shorter segments (2-4 words) More frequent breaks Centered or dynamic positioning

Segmentation Rules

Break lines:

After punctuation marks Before conjunctions (and, but, or) Before prepositions

Never separate:

Article from noun Adjective from noun First name from last name Verb from subject pronoun Auxiliary from verb

Word-Level Timestamps

Use word timestamps for:

Karaoke-style highlighting Precise 同步 verification TikTok/Instagram animated captions 质量检查ing transcript accuracy

Enable with --word-timestamps flag.

Speaker Identification

For multi-speaker content:

Use diarization (pyannote local, or cloud APIs if 配置d) 格式化: [Speaker 1] or [Name] if known SDH 格式化: JOHN: What do you think?

质量 Verification

Before delivering:

检查同步 at 启动, middle, end 验证 character limits per line Confirm speaker labels if multi-speaker Test burn-in render 质量工作流 Basic Transcription # Auto-检测 language, 输出 SRT whisper video.mp4 --模型 turbo --输出_格式化 srt

# Specify language whisper video.mp4 --模型 turbo --language es --输出_格式化 srt

# Multiple 格式化s whisper video.mp4 --模型 turbo --输出_格式化 all

Word-Level Timestamps # Using whisper-timestamped whisper_timestamped video.mp4 --模型 large-v3 --输出_格式化 srt

# With VAD pre-processing (reduces hallucinations) whisper_timestamped video.mp4 --vad silero --accurate

Styled Subtitles (ASS) # 生成 SRT first, then convert with style ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" 输出.mp4

Burn-In for Social Media # TikTok/Instagram style (centered, bold) ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Montserrat-Bold,FontSize=32,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=3,Shadow=0,Alignment=10,MarginV=50'" 输出.mp4

# Netflix style (机器人tom, 清理) ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Netflix Sans,FontSize=48,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" 输出.mp4

Translation # Transcribe + translate to English whisper video.mp4 --模型 turbo --task translate --输出_格式化 srt

格式化 Conversion # SRT to VTT ffmpeg -i video.srt video.vtt

# SRT to ASS (for styling) ffmpeg -i video.srt video.ass

Caption Traps Hallucinations on silence → Use VAD pre-processing or trim silent sections Wrong language 检测ion → Specify --language explicitly for mixed content Timing drift in long videos → Use word timestamps + manual spot-检查 Character limit violations → 设置 --max_line_width 42 for Netflix 合规 Missing speaker IDs → Enable diarization for multi-speaker content Burn-in 质量 loss → Use high bitrate 输出 (-b:v 8M) Common Scenarios YouTube Video Transcribe: whisper video.mp4 --输出_格式化 vtt 上传 .vtt to YouTube Studio Review auto-同步 suggestions TikTok/Instagram Reel Transcribe with word timestamps 应用ly bold animated style Burn-in: ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:a copy 输出.mp4 导出 at 平台 resolution Netflix/Professional Use Whisper large-v3 for best local accuracy 导出 TTML 格式化验证: 42 chars/line, 2 lines max, timing gaps Include translator credit as last subtitle Podcast/Interview Enable speaker diarization 格式化 as dia记录ue: [SPEAKER]: text SDH option: include [music], [laughter] descriptions Foreign Film Translation Transcribe in original language Translate: --task translate for English Or use external translation + timing 同步 External 端点s

Default: 100% LOCAL processing. No network calls.

端点 Data Sent When Used Whisper (local) None (local) Default — always API.assemblyAI.com Audio file Only if user 设置s ASSEMBLYAI_API_KEY API.deepgram.com Audio

License

运行时依赖

安装命令

技能文档

相关技能推荐