midasheng-audio-generate: immersive audio scene generation from text

Name: midasheng-audio-generate (immersive audio scene generation from text)
Rating: 1 (3 reviews)
Author: Junbo Zhang

v1.1.5

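Generates immersive audio scenes containing speech, sound effects, music, and environmental sounds from text descriptions. Developed by Xiaomi and Shanghai Jiao Tong University. Accepts multilingual input and outputs WAV audio.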

by @jimbozhang (Junbo Zhang)·MIT-0

Audio processing · AI model access · Developer tools · API tools

License

MIT-0

Last updated

2026/3/20

Security scan

VirusTotal

Harmless

OpenClaw

Safe

medium confidence

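The skill's instructions and external API usage are consistent with an audio generation tool. However, user prompts are sent to an external Xiaomi endpoint, the data retention policy is unknown, and there are minor metadata inconsistencies. Review before installing.

Assessment and recommendations

This skill is a thin client that sends text to an audio generation service hosted by Xiaomi. Before installing: (1) do not send personal information or confidential text, since SKILL.md explicitly warns that data is sent to an external service and may be retained; (2) confirm that you accept the endpoint host (llmplus.ai.xiaomi.com) and its privacy and retention policies, because the skill does not document retention or authentication; (3) note the minor metadata mismatch; (4) test with non-sensitive prompts; (5) if you need stronger privacy, consider a local or self-hosted alternative. If you need higher assurance, ask the publisher for explicit data retention and authentication details, or choose a skill with documented privacy guarantees. ...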

Detailed analysis

✓ Purpose and capabilities

The skill claims to convert text into immersive audio and its runtime instructions perform exactly that: craft a structured prompt and POST it to a remote audio-generation API. The required functionality (prompt engineering + curl call to the service) matches the described purpose. Minor inconsistency: the registry metadata lists no required binaries while SKILL.md lists 'requirements: curl' (the curl command is used in the instructions).

ℹ Instruction scope

Instructions are narrowly focused: they direct the agent to build a structured, lowercased prompt and send it to the specified API endpoint, and to optionally check a queue-status endpoint. The instructions do not ask the agent to read local files, other env vars, or system state. Important privacy note in SKILL.md: user-provided prompts are transmitted to an external endpoint and data retention is unknown — the skill explicitly warns not to include PII or sensitive content.

✓ Install mechanism

This is an instruction-only skill with no install spec and no code to write to disk, which is the lowest-risk install model. The only runtime requirement is using curl (per SKILL.md), which is a normal CLI tool for making HTTP requests.

✓ Credential requirements

The skill does not request any environment variables, credentials, or config paths. That is proportional to its described purpose. Note: the SKILL.md declares the remote endpoint accepts unauthenticated requests (authentication: none); if the endpoint actually requires credentials, the skill might fail or prompt for extra setup not declared here.

✓ Persistence and privileges

The skill is not marked always:true and does not request persistent privileges or attempt to modify other skills or system-wide config. It can be invoked by the agent normally (default).

Security is layered. Review the code before running it.

License

MIT-0

Free to use, modify, and redistribute. No attribution required.

Runtime dependencies

No special dependencies

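Version

latest · v1.1.5 · 2026/3/19

Minor update to the skill description: it now lists Xiaomi and Shanghai Jiao Tong University as the developers at the top, making attribution clearer. No functional or technical changes. Behavior and API usage are unchanged.

● Harmless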

Install command

Official: npx clawhub@latest install midasheng-audio-generate

Mirror: npx clawhub@latest install midasheng-audio-generate --registry https://cn.clawhub-mirror.com

Skill documentation

# midasheng-audio-generate

Audio scene generation from text descriptions. Generates WAV audio with speech, sound effects, music, and environmental sounds.

1. Trigger

Use this skill when the user requests audio, sound effects, or music generation based on a text description.

2. Execution Steps

Step 1: Design the Audio Scene (Prompt Refinement)

Before calling the API, you must act as an expert Audio Scene Architect and Foley Designer. Deeply understand the user's natural language input (which may be in any language) and translate it into a highly structured tagged string based on real-world acoustic logic and scene realism.

Crucial Generation Rules:

Scene Enrichment: Do not merely copy the user's input! Act as a sound designer and logically enrich the scene.
Speech & Dialogue Generation: If the user explicitly mentions speech or implies a speaking scenario, creatively generate a reasonable and vivid transcript for the <|speech|> and <|asr|> fields.
Strict ASR Formatting: For the <|asr|> tag, output only the raw spoken text. Do not include any speaker labels or narration, such as “man:”, “speaker1:”, or “a man says”.
Omit Missing Elements: If any element is not relevant, directly omit its corresponding tag.
Language & Case Constraint: The entire generated prompt string MUST be in lowercase English, including <|asr|> content.
Strict Output: Output ONLY the formatted tagged string internally for the next step.
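
The lowercase constraint above is easy to check mechanically before the prompt is sent. The following is a minimal sketch, not part of the skill itself: the helper names `normalize_prompt` and `is_lowercase_prompt` are hypothetical, and the sample string only illustrates a tag followed by text, since the exact tag layout is not specified here.

```shell
# Hypothetical helper: lowercase a tagged prompt string before sending it.
normalize_prompt() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: succeed only if the prompt contains no uppercase letters.
is_lowercase_prompt() {
  case "$1" in
    *[[:upper:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example use (the tag layout shown is an assumption):
#   prompt=$(normalize_prompt '<|asr|>Hello There')
#   is_lowercase_prompt "$prompt" && echo "ok to send"
```

This only covers letter case. It does not verify that the text is English or that the `<|asr|>` content is free of speaker labels.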

Step 2: Execute Command

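# Replace TAGGED_PROMPT with the string produced in Step 1 and OUTPUT.wav with the desired output path.
curl -X POST "https://llmplus.ai.xiaomi.com/dasheng/audio/gen" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"TAGGED_PROMPT\"}" \
  -o OUTPUT.wav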
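
Because the prompt is embedded in a JSON string, any double quote or backslash in it has to be escaped before the request is sent. The sketch below wraps the Step 2 command in a function and assumes a single-line prompt with no control characters. The names `json_escape`, `build_payload`, and `generate_audio` are hypothetical, and the `-sS --fail` flags are an addition for clearer error reporting rather than something the skill prescribes.

```shell
# Escape backslashes and double quotes so the prompt is safe inside a JSON string.
# Assumes a single-line prompt with no control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body with the single "text" field used by the Step 2 command.
build_payload() {
  printf '{"text": "%s"}' "$(json_escape "$1")"
}

# $1: tagged prompt from Step 1, $2: output path (hypothetical, e.g. scene.wav).
generate_audio() {
  curl -sS --fail -X POST "https://llmplus.ai.xiaomi.com/dasheng/audio/gen" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")" \
    -o "$2"
}

# Example use (sends the prompt to the external endpoint, so avoid sensitive text):
#   generate_audio "$prompt" scene.wav
```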

3. Queue Status

Query Command

curl -X POST "https://llmplus.ai.xiaomi.com/metrics?path=/dasheng/audio/gen"

Returned Fields

active: Number of currently active requests
avg_latency_ms: Average processing latency (milliseconds)
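Estimated wait time = active × avg_latency_ms (the result is in milliseconds; divide by 1000 to compare against the status levels, which are given in seconds)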
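
As an illustration of the formula above, the sketch below extracts the two fields and multiplies them. It assumes the metrics endpoint returns a flat JSON object with integer values for `active` and `avg_latency_ms`. That response shape is not documented here, and the helpers `json_number` and `estimated_wait_ms` are hypothetical.

```shell
# Pull an integer field out of a flat JSON object.
# Assumed response shape; this is not a general JSON parser.
json_number() {
  printf '%s' "$1" | sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Estimated wait in milliseconds: active requests times average latency.
estimated_wait_ms() {
  echo $(( $1 * $2 ))
}

# Example use with the query command (the response format is an assumption):
#   resp=$(curl -sS -X POST "https://llmplus.ai.xiaomi.com/metrics?path=/dasheng/audio/gen")
#   estimated_wait_ms "$(json_number "$resp" active)" "$(json_number "$resp" avg_latency_ms)"
```

If `jq` is available, it is a more robust way to read the fields than the `sed` expression used here.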

When to Call

When the IM is about to time out but the audio generation service has not returned a result: check the queue status and inform the user, asking them to check back later.
When the user asks about task progress later but the service still hasn't returned: Check the latest queue status and report it back to the user.

Status Levels

🟢 active=0 or estimated wait <5s → Service idle
🟡 Estimated wait 5-30s → Slight queue
🔴 Estimated wait >30s → Queue is long, recommend trying again later
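
The three levels can be expressed as a small classifier. This is a sketch under stated assumptions: the function name `status_level` and its output strings are hypothetical, inputs are integers, and a wait of exactly 30s is treated as a slight queue.

```shell
# $1: active request count, $2: avg_latency_ms (both integers).
# Prints one of: idle, slight queue, long queue.
status_level() {
  sl_active=$1
  sl_wait_ms=$(( $1 * $2 ))   # estimated wait in milliseconds
  if [ "$sl_active" -eq 0 ] || [ "$sl_wait_ms" -lt 5000 ]; then
    echo "idle"
  elif [ "$sl_wait_ms" -le 30000 ]; then
    echo "slight queue"
  else
    echo "long queue"
  fi
}

# Example use:
#   status_level 3 4000   # 12000 ms estimated wait
```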

Data source: ClawHub · Chinese localization: 龙虾技能库
