📦 Aiter Ck Gemm Tune — AITER CK GEMM Tuning
v0.1.0: Tunes AITER's CK GEMM and fused MoE kernels for specific model shapes on AMD GPUs. Covers shape discovery from inference logs, baseline benchmarking, kernel t...
AITER CK GEMM & MoE Tuning
A skill for tuning AITER's Composable Kernel (CK) GEMM and fused MoE kernels to achieve better performance for specific model shapes. The tuning workflow is a multi-step process: discover the environment, capture shapes, run baseline benchmarks, tune kernels, and compare results. The workflow supports both the regular GEMM variants (a8w8, bf16, etc.) and the moe_2stages variant for fused MoE kernels used in Mixture-of-Experts models.
Background
AITER (AI Tensor Engine for ROCm) is AMD's high-performance operator library for LLM inference on ROCm/AMD GPUs. It provides optimized kernels for common operations in transformer models, most critically GEMM (General Matrix Multiply), which dominates the compute in LLM inference (linear projections, attention, MLP/FFN layers, MoE expert computations).
Composable Kernel (CK) is AMD's open-source library of GPU kernel primitives. CK provides templated, composable building blocks for writing high-performance GPU kernels. AITER uses CK to implement its GEMM kernels, with many kernel variants optimized for different quantization schemes (INT8, FP4, BF16) and memory layouts (blockscale, byte-pair reshuffle, batched, MoE).
Why tuning matters: Each CK GEMM kernel has many implementation variants (tile sizes, pipeline configurations, split-K strategies). The optimal variant depends on the specific GEMM shape (M, N, K) and the GPU hardware (number of compute units). AITER's tuning process benchmarks all candidate kernel configurations for each shape and selects the fastest one. Shapes come from specific model architectures: for example, a Llama 70B model produces different (N, K) pairs than a DeepSeek V3 model. The M dimension corresponds to the batch/token count and varies at runtime, so tuning sweeps M as powers of 2 to cover all realistic batch sizes.
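The sweep described above can be sketched as a small enumeration. This is an illustrative sketch, not AITER's actual sweep code; the upper bound of 16384 and the example (N, K) pairs are assumptions chosen for demonstration.

```python
# Sketch: enumerate the M values a power-of-2 tuning sweep might cover for
# each (N, K) pair of a model. Bounds and shapes below are illustrative.
def m_sweep(max_m: int = 16384) -> list[int]:
    """Powers of 2 from 1 up to max_m, covering realistic batch/token counts."""
    ms = []
    m = 1
    while m <= max_m:
        ms.append(m)
        m *= 2
    return ms

# One tuning job per (M, N, K) combination for a model's (N, K) pairs.
nk_pairs = [(8192, 8192), (28672, 8192)]  # hypothetical Llama-70B-style projections
jobs = [(m, n, k) for (n, k) in nk_pairs for m in m_sweep()]
print(len(jobs))  # 15 M values x 2 (N, K) pairs = 30
```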
How it fits into the inference stack: Inference frameworks like sglang and vllm call into AITER for their GEMM operations. When AITER encounters a shape that hasn't been tuned, it falls back to a default kernel configuration and logs a warning. The tuning workflow in this skill captures those untuned shapes and finds optimal kernel configurations for them.
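Capturing untuned shapes amounts to appending new (M, N, K) rows to the variant's untuned CSV without duplicating rows already present. A minimal sketch, assuming a simple M,N,K column layout; the real untuned CSVs may carry additional columns, so check the actual file header before relying on this.

```python
# Sketch: append newly observed (untuned) GEMM shapes to an untuned-shapes
# CSV, skipping duplicates. The M,N,K column layout is an assumption.
import csv
from pathlib import Path

def record_untuned_shapes(csv_path: Path, shapes: list[tuple[int, int, int]]) -> int:
    """Add unseen (M, N, K) rows to csv_path; return how many were added."""
    seen = set()
    if csv_path.exists():
        with csv_path.open(newline="") as f:
            for row in csv.DictReader(f):
                seen.add((int(row["M"]), int(row["N"]), int(row["K"])))
    new_rows = [s for s in shapes if s not in seen]
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        w = csv.writer(f)
        if write_header:
            w.writerow(["M", "N", "K"])  # header only on first creation
        w.writerows(new_rows)
    return len(new_rows)
```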
Supported Kernel Variants
Each variant follows the same tuning workflow pattern. The table below maps each variant to its key files (all paths relative to the aiter repo root):
| Variant | Tuning Script | Untuned CSV | Tuned CSV | Test File | README |
|---|---|---|---|---|---|
| a8w8 | csrc/ck_gemm_a8w8/gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_gemm.csv | aiter/configs/a8w8_tuned_gemm.csv | op_tests/test_gemm_a8w8.py | csrc/ck_gemm_a8w8/README.md |
| a8w8_blockscale | csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale_tune.py | aiter/configs/a8w8_blockscale_untuned_gemm.csv | aiter/configs/a8w8_blockscale_tuned_gemm.csv | op_tests/test_gemm_a8w8_blockscale.py | csrc/ck_gemm_a8w8_blockscale/README.md |
| a8w8_bpreshuffle | csrc/ck_gemm_a8w8_bpreshuffle/gemm_a8w8_bpreshuffle_tune.py | aiter/configs/a8w8_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_bpreshuffle/README.md |
| a8w8_blockscale_bpreshuffle | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/gemm_a8w8_blockscale_bpreshuffle_tune.py | aiter/configs/a8w8_blockscale_bpreshuffle_untuned_gemm.csv | aiter/configs/a8w8_blockscale_bpreshuffle_tuned_gemm.csv | (none) | csrc/ck_gemm_a8w8_blockscale_bpreshuffle/README.md |
| a4w4_blockscale | csrc/ck_gemm_a4w4_blockscale/gemm_a4w4_blockscale_tune.py | aiter/configs/a4w4_blockscale_untuned_gemm.csv | aiter/configs/a4w4_blockscale_tuned_gemm.csv | op_tests/test_gemm_a4w4.py | csrc/ck_gemm_a4w4_blockscale/README.md |
| batched_a8w8 | csrc/ck_batched_gemm_a8w8/batched_gemm_a8w8_tune.py | aiter/configs/a8w8_untuned_batched_gemm.csv | aiter/configs/a8w8_tuned_batched_gemm.csv | op_tests/test_batched_gemm_a8w8.py | csrc/ck_batched_gemm_a8w8/README.md |
| batched_bf16 | csrc/ck_batched_gemm_bf16/batched_gemm_bf16_tune.py | aiter/configs/bf16_untuned_batched_gemm.csv | aiter/configs/bf16_tuned_batched_gemm.csv | op_tests/test_batched_gemm_bf16.py | csrc/ck_batched_gemm_bf16/README.md |
| moe_2stages | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_tune.py | aiter/configs/untuned_fmoe.csv | aiter/configs/tuned_fmoe.csv | op_tests/test_moe_2stage.py | csrc/ck_gemm_moe_2stages_codegen/README.md |

Log Files
The skill records outputs from Steps 2, 3, and 4 to log files under $AITER_PATH/tuning_logs/. Use this naming convention:
$AITER_PATH/tuning_logs/<variant>_bench_before_<timestamp>.log # Step 2: baseline benchmark
$AITER_PATH/tuning_logs/<variant>_tuning_<timestamp>.log # Step 3: tuning process
$AITER_PATH/tuning_logs/<variant>_bench_after_<timestamp>.log # Step 4: post-tuning benchmark
For example:
tuning_logs/a8w8_blockscale_bench_before_20260321_143022.log
tuning_logs/a8w8_blockscale_tuning_20260321_150000.log
tuning_logs/a8w8_blockscale_bench_after_20260321_160515.log
Create the tuning_logs/ directory if it doesn't exist. For interactive commands (Steps 2 and 4), use 2>&1 | tee <log> to show output in real time while logging. For long-runn
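The `2>&1 | tee <log>` pattern can also be reproduced from Python when driving these commands from a script. This is a minimal tee-like sketch (the `run_and_log` helper is hypothetical): it streams a command's combined stdout/stderr to the console while appending it to a log file.

```python
# Sketch: tee-like runner -- stream combined stdout/stderr to the console
# while also persisting every line to a log file.
import subprocess
from pathlib import Path

def run_and_log(cmd: list[str], log_file: Path) -> int:
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:   # lines arrive as the command produces them
            print(line, end="")    # real-time console output
            log.write(line)        # ...and append to the log file
    return proc.returncode
```

For example, `run_and_log(["python", "op_tests/test_gemm_a8w8.py"], log)` would mirror the tee-based invocation for a Step 2 benchmark.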