vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe ¶
Utility helpers for the NVFP4 + FlashInfer fused-MoE path.
interleave_linear_and_gate ¶
Interleave the gate and linear weight rows for the CuteDSL wrapper.
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
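The row interleaving described above can be illustrated with a minimal pure-Python sketch. It is not the vLLM implementation: the helper name `interleave_halves`, the `group_size` parameter, and the assumption that the gate rows occupy the first half of the input are all hypothetical, and the real function operates on tensors rather than lists.

```python
def interleave_halves(rows, group_size=1):
    """Interleave the first and second halves of `rows` in blocks of `group_size`.

    Hypothetical helper for illustration only; assumes the first half holds
    the gate rows and the second half holds the linear rows.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    half, remainder = divmod(len(rows), 2)
    if remainder or half % group_size:
        raise ValueError("row count must be a multiple of 2 * group_size")

    first, second = rows[:half], rows[half:]
    out = []
    for start in range(0, half, group_size):
        # Emit one block from the first half, then the matching block
        # from the second half.
        out.extend(first[start:start + group_size])
        out.extend(second[start:start + group_size])
    return out


# Labels stand in for weight rows: g* = gate rows, l* = linear rows.
rows = ["g0", "g1", "g2", "g3", "l0", "l1", "l2", "l3"]
interleaved = interleave_halves(rows, group_size=2)
print(interleaved)
```

With `group_size=2`, blocks of two gate rows alternate with blocks of two linear rows. The block size and ordering a given kernel expects should be checked against the source file.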
is_flashinfer_fp4_cutlass_moe_available ¶
is_flashinfer_fp4_cutlass_moe_available() -> bool
Return True when the FlashInfer CUTLASS NVFP4 kernels can be used.
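A caller might use this check to choose a code path, as in the sketch below. The import path and function name come from this page. The `backend` labels and the `ImportError` fallback are assumptions added for illustration so that the snippet also runs where vLLM is not installed.

```python
try:
    from vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe import (
        is_flashinfer_fp4_cutlass_moe_available,
    )
except ImportError:
    # vLLM (or one of its dependencies) is not installed in this environment.
    fp4_cutlass_available = False
else:
    fp4_cutlass_available = bool(is_flashinfer_fp4_cutlass_moe_available())

# Hypothetical backend labels, used only for this illustration.
backend = "flashinfer-cutlass" if fp4_cutlass_available else "fallback"
print(f"NVFP4 MoE backend: {backend}")
```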
prepare_nvfp4_moe_layer_for_flashinfer_cutedsl ¶
prepare_nvfp4_moe_layer_for_flashinfer_cutedsl(
    layer: FusedMoE,
    w13: Tensor,
    w13_scale: Tensor,
    w13_scale_2: Tensor,
    a13_scale: Tensor,
    w2: Tensor,
    w2_scale: Tensor,
    w2_scale_2: Tensor,
    a2_scale: Tensor,
) -> tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
Prepare weights for the CuteDSL wrapper-based NVFP4 MoE backend.
Converts the weight scale factors to the MMA layout expected by CuteDslMoEWrapper and interleaves the w13 gate/linear rows.
reorder_w1w3_to_w3w1 ¶
Re-order the concatenated [w1, w3] tensors to [w3, w1].
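Conceptually, the re-ordering swaps the two halves of the concatenated dimension. The pure-Python sketch below shows the idea on a list of row labels. The helper `swap_halves` is hypothetical and is not the vLLM function, whose signature is not shown on this page and which works on tensors.

```python
def swap_halves(rows):
    """Return `rows` with its first and second halves exchanged.

    Hypothetical helper for illustration only: the input is treated as the
    concatenation [w1, w3] and the result as [w3, w1].
    """
    half, remainder = divmod(len(rows), 2)
    if remainder:
        raise ValueError("row count must be even to split into w1 and w3")
    return rows[half:] + rows[:half]


# Labels stand in for weight rows of the two projections.
w1_w3 = ["w1_r0", "w1_r1", "w3_r0", "w3_r1"]
w3_w1 = swap_halves(w1_w3)
print(w3_w1)
```

Applying the swap twice returns the original order, so the same helper converts in either direction.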