📦 Sglang Amd Bench
v0.1.0
Benchmark sglang serving performance on AMD Instinct GPUs (MI355X, MI300X, MI308X) with various parallel configurations (TP, DP, EP). Covers throughput/laten...
SGLang AMD Benchmark
Benchmark sglang LLM serving on AMD Instinct GPUs across parallel configurations (TP/DP/EP) and workload shapes (ISL/OSL/Concurrency). This skill runs in mix mode (non-disaggregated): prefill and decode happen on the same GPUs. It produces a performance baseline and suggests config-level optimizations.
Run Rules (non-negotiable)
These rules apply to every benchmark run in this skill. (A profiling-stage-separation rule exists in the broader sglang-run guidance but is intentionally omitted here, since this skill does not profile.)
Rule 1: Do NOT modify the sglang/aiter/mori environment
Never run pip install, pip uninstall, pip install --upgrade, or any equivalent reinstall command for sglang, aiter, mori, flydsl, or any related kernel/runtime package, even if a workload fails or imports look broken. The user's environments are hand-tuned dev installs (typically pip install -e .); a naive reinstall will silently overwrite local patches and destroy hours of work.
If the environment looks broken (missing module, version mismatch, ABI error, import crash), stop and report the symptom to the user. Let the user decide whether to reinstall.
What you CAN do without asking:
- Inspect versions: pip show sglang, python -c "import sglang; print(sglang.__file__)"
- Read source files in the editable install
- Set environment variables for the run
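The read-only inspection allowed above can also be done from Python. The following is a minimal sketch, not part of this skill: the helper name `describe_install` is hypothetical, and the distribution names `aiter` and `mori` are assumed to match the import names. It only reads installed metadata and never installs, upgrades, or removes anything.

```python
import json
from importlib import metadata


def describe_install(package: str) -> dict:
    """Report version and editable status of a package without modifying it."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return {"package": package, "installed": False}

    # pip records direct and editable installs in direct_url.json (PEP 610).
    # The file is usually absent for ordinary index installs.
    raw = dist.read_text("direct_url.json")
    editable = False
    if raw:
        editable = bool(json.loads(raw).get("dir_info", {}).get("editable", False))
    return {
        "package": package,
        "installed": True,
        "version": dist.version,
        "editable": editable,
    }


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust to the actual environment.
    for name in ("sglang", "aiter", "mori"):
        print(describe_install(name))
```

If a package shows up as not installed or non-editable when an editable install was expected, that is a symptom to report to the user, not something to repair.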
What you MUST ask before doing:
- pip install / pip uninstall / pip install -U for any package above
- git checkout / git pull inside the editable source directories
- Modifying files inside sglang/, aiter/, mori/ source trees
Rule 2: Always preserve server logs when launching an sglang server
Whenever you start an sglang server, redirect stdout+stderr to a real file. Never let server output go only to the terminal or to /dev/null. The Bash tool's run_in_background: true buffer is not a substitute; still redirect to a file.
In this skill, serve.sh writes to $LOG_DIR/server_.log automatically; that is what satisfies this rule, and what wait_for_server.py (Rule 3) reads.
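For a server started outside serve.sh, the same rule can be followed from Python. This is a minimal sketch under stated assumptions: `launch_with_log` is a hypothetical helper, not something shipped with this skill, and the command list is whatever launch command the caller already uses.

```python
import subprocess
from pathlib import Path


def launch_with_log(cmd: list[str], log_path: str) -> subprocess.Popen:
    """Start cmd in the background with stdout and stderr appended to log_path."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    log_file = open(path, "ab")
    # stderr=STDOUT merges both streams into the same file, in order of arrival.
    proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child process holds its own copy of the descriptor
    return proc
```

The returned `Popen` handle lets the caller terminate the server later, while the log file stays on disk regardless of how the process ends.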
Rule 3: Wait for the server with the bundled monitor, don't blind-sleep
After launching an sglang server, startup typically takes a few minutes (model load, weight shard, kernel warmup, graph capture; AITER may JIT-compile CK kernels for several minutes on first launch). Do not sleep 300 and hope. Use the bundled monitor, which polls the log and returns the moment the outcome is known:
# After 3-0 deploys it, the script lives at /sgl-workspace/wait_for_server.py inside the container.
python3 /sgl-workspace/wait_for_server.py "$SERVER_LOG"
# exit codes:
#   0 READY:   saw "The server is fired up and ready to roll"
#   1 CRASHED: saw "Traceback"
#   2 HUNG:    log's last line + line count unchanged for >5 min
#   3 TIMEOUT: overall timeout (default 30 min) exceeded
#   4 ERROR:   log file unreadable / never appeared
Source lives at scripts/wait_for_server.py in this skill's directory; 3-0 copies it to /sgl-workspace/ alongside serve.sh / bench.sh. Detection logic:
- Success: substring The server is fired up and ready to roll appears.
- Crash: substring Traceback appears.
- Hang: each poll records (line_count, last_non_empty_line) of the log; unchanged for ≥5 minutes (--stall-seconds) → treated as failed.
Tunable flags: --success, --failure, --stall-seconds, --overall-timeout, --poll-seconds. Bump --stall-seconds consciously if a specific config genuinely has long quiet periods (e.g. very large weight downloads, prolonged AITER JIT).
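The detection logic described above can be sketched as a small state tracker. This is an illustration only and not the source of wait_for_server.py: the names `snapshot` and `StallDetector` are hypothetical, and checking the success marker before the crash marker is an assumption about ordering. The caller passes in the current log text and a timestamp on each poll, which keeps the logic free of real sleeping.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

READY_MARKER = "The server is fired up and ready to roll"
CRASH_MARKER = "Traceback"


def snapshot(log_text: str) -> Tuple[int, str]:
    """Return (line_count, last_non_empty_line) for the current log contents."""
    lines = log_text.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    return len(lines), (non_empty[-1] if non_empty else "")


@dataclass
class StallDetector:
    stall_seconds: float = 300.0  # mirrors the 5-minute default described above
    _last: Optional[Tuple[int, str]] = None
    _since: float = 0.0

    def poll(self, log_text: str, now: float) -> str:
        """Classify one poll as READY, CRASHED, HUNG, or PENDING."""
        if READY_MARKER in log_text:
            return "READY"
        if CRASH_MARKER in log_text:
            return "CRASHED"
        snap = snapshot(log_text)
        if snap != self._last:
            # The log moved: remember the new snapshot and restart the stall clock.
            self._last, self._since = snap, now
            return "PENDING"
        if now - self._since >= self.stall_seconds:
            return "HUNG"
        return "PENDING"
```

Tracking both the line count and the last non-empty line means a log that rewrites its final line (for example a progress counter) still counts as activity.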
On CRASHED / HUNG / TIMEOUT / ERROR: stop and report the log tail to the user; do NOT silently restart.
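One way to act on the monitor's exit code is to map it to a status and, for anything other than READY, collect the log tail for the user. This is a minimal sketch: `log_tail` and `summarize` are hypothetical helpers, and the 40-line tail length is an arbitrary assumption. The exit-code mapping is taken from the list above.

```python
from collections import deque

# Exit codes as documented for wait_for_server.py above.
EXIT_STATUS = {0: "READY", 1: "CRASHED", 2: "HUNG", 3: "TIMEOUT", 4: "ERROR"}


def log_tail(log_path: str, n: int = 40) -> str:
    """Return the last n lines of the log, or a note if it cannot be read."""
    try:
        with open(log_path, "r", errors="replace") as fh:
            return "".join(deque(fh, maxlen=n))
    except OSError as exc:
        return f"<could not read {log_path}: {exc}>\n"


def summarize(exit_code: int, log_path: str) -> str:
    """Build the message to show the user; this never restarts anything."""
    status = EXIT_STATUS.get(exit_code, f"UNKNOWN({exit_code})")
    if status == "READY":
        return "Server is READY; proceed to the benchmark."
    return f"Server wait ended with {status}. Log tail:\n{log_tail(log_path)}"
```

The function deliberately stops at producing a report, leaving the decision to relaunch with the user.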
Important Notes
- This skill covers mix mode only (no PD-disaggregation). Prefill and decode run on the same GPUs.
- serve.sh sets SGLANG_USE_AITER=1 automatically. bench.sh sets PYTHONPATH for sglang's benchmark module automatically. No need to set these manually.
- Use dummy weights by default (LOAD_DUMMY=1). Dummy weights are sufficient for benchmarking throughput, latency, and parallel config comparison; real weights produce the same performance characteristics. Only use LOAD_DUMMY=0 if the user explicitly asks for real weights. Real weights take much longer to load (10+ minutes for large models) and are rarely needed for config benchmarking.
- --random-range-ratio 1.0 ensures exact ISL/OSL lengths (no variation) for reproducible benchmarks.
- bench.sh uses num_prompts = concurrency * 2; this is handled by the script automatically.
- Between configs, fully kill the sglang server and wait for GPU memory to be freed before relaunching.
- If a benchmark run fails or hangs, check GPU memory usage with rocm-smi and server health with the /health endpoint.
Key Metrics
Every benchmark collects these metrics per (ISL, OSL, Concurrency) combination:
| Metric | Unit | Description |
| --- | --- | --- |
| TTFT | ms | Time To First Token: latency from request to first token |
| TPOT | ms | Time Per Output Token: average inter-token latency |
| Input throughput | tok/s | Input tokens processed per second acros