vllm.model_executor.layers.quantization.utils.flashinfer_utils ¶
FlashinferMoeBackend ¶
Bases: Enum
align_fp8_moe_weights_for_fi ¶
align_fp8_moe_weights_for_fi(
w13: Tensor, w2: Tensor, is_act_and_mul: bool
) -> tuple[Tensor, Tensor, int]
Pad intermediate size so FlashInfer kernels' alignment constraints hold.
Some FlashInfer FP8 MoE kernels require the (gated) intermediate size used for GEMM to be divisible by a small alignment value. When this is not satisfied (e.g. with certain tensor-parallel sizes), we pad the gate/up and down projection weights along the intermediate dim.
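As a rough illustration of the padding arithmetic described above, the sketch below rounds the intermediate size up to a multiple of a hypothetical ALIGNMENT constant and pads the weights accordingly; the actual alignment value and the (num_experts, 2 * intermediate, hidden) / (num_experts, hidden, intermediate) layouts assumed for w13 / w2 should be checked against the source.

import torch
import torch.nn.functional as F

ALIGNMENT = 8  # assumption; the real constant lives in flashinfer_utils.py

def pad_intermediate_dim(
    w13: torch.Tensor, w2: torch.Tensor, is_act_and_mul: bool
) -> tuple[torch.Tensor, torch.Tensor, int]:
    # w13: (E, 2 * I, H) for gated activations, (E, I, H) otherwise; w2: (E, H, I)
    num_gates = 2 if is_act_and_mul else 1
    intermediate = w13.shape[1] // num_gates
    padded = -(-intermediate // ALIGNMENT) * ALIGNMENT  # round up to a multiple
    pad = padded - intermediate
    if pad == 0:
        return w13, w2, intermediate
    # Pad each gate/up chunk separately so the two halves stay contiguous.
    chunks = [F.pad(c, (0, 0, 0, pad)) for c in torch.chunk(w13, num_gates, dim=1)]
    w13 = torch.cat(chunks, dim=1)
    w2 = F.pad(w2, (0, pad))  # pad the last (intermediate) dim of w2
    return w13, w2, padded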
apply_fi_trtllm_fp8_per_tensor_moe ¶
apply_fi_trtllm_fp8_per_tensor_moe(
layer: Module,
hidden_states: Tensor,
router_logits: Tensor,
routing_bias: Tensor | None,
top_k: int,
num_expert_group: int | None,
topk_group: int | None,
global_num_experts: int,
apply_router_weight_on_input: bool,
) -> Tensor
build_flashinfer_fp8_cutlass_moe_prepare_finalize ¶
build_flashinfer_fp8_cutlass_moe_prepare_finalize(
moe: FusedMoEConfig | None,
use_deepseek_fp8_block_scale: bool = False,
) -> FusedMoEPrepareAndFinalize
Create a FlashInfer CUTLASS fused-MoE prepare-and-finalize kernel
calculate_tile_tokens_dim ¶
get_flashinfer_moe_backend ¶
get_flashinfer_moe_backend() -> FlashinferMoeBackend
is_flashinfer_supporting_global_sf ¶
is_flashinfer_supporting_global_sf(
backend: FlashinferMoeBackend | None,
) -> bool
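A short usage sketch combining the backend getter with the capability check above; only the two documented signatures are assumed, and what a global scale factor entails for each backend is defined in the source.

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    FlashinferMoeBackend,
    get_flashinfer_moe_backend,
    is_flashinfer_supporting_global_sf,
)

backend: FlashinferMoeBackend = get_flashinfer_moe_backend()
# Branch on whether the selected backend accepts a global scale factor.
use_global_sf = is_flashinfer_supporting_global_sf(backend)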
make_fp8_moe_alpha_scales_for_fi ¶
make_fp8_moe_alpha_scales_for_fi(
w13_scale: Tensor,
w13_input_scale: Tensor,
w2_scale: Tensor,
w2_input_scale: Tensor,
) -> tuple[Tensor, Tensor]
prepare_fp8_moe_layer_for_fi ¶
prepare_fp8_moe_layer_for_fi(
layer: Module,
w13: Tensor,
w2: Tensor,
w13_scale: Tensor,
w13_input_scale: Tensor | None,
w2_scale: Tensor,
w2_input_scale: Tensor | None,
is_trtllm: bool = False,
) -> tuple[Tensor, Tensor, Tensor]
Convert FP8 MoE weights to the FlashInfer kernel format
Note that for the TRT-LLM backend we update the model state dict with the scale format needed for those kernels, and for the per-tensor path we update the layer's intermediate size if the weights needed padding.
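A hedged sketch of how this helper might be invoked after weight loading; the layer attribute names below are illustrative, and the exact contents of the returned tuple are defined in the source.

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    prepare_fp8_moe_layer_for_fi,
)

def convert_fp8_moe_layer_for_flashinfer(layer, is_trtllm: bool):
    # `layer` is a fused-MoE module whose FP8 weights and scales were just
    # loaded; the w13_*/w2_* attribute names are assumptions for this sketch.
    return prepare_fp8_moe_layer_for_fi(
        layer,
        w13=layer.w13_weight,
        w2=layer.w2_weight,
        w13_scale=layer.w13_weight_scale,
        w13_input_scale=layer.w13_input_scale,
        w2_scale=layer.w2_weight_scale,
        w2_input_scale=layer.w2_input_scale,
        is_trtllm=is_trtllm,  # the TRT-LLM path also registers its scales
    )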
register_scales_for_trtllm_fp8_per_tensor_moe ¶
register_scales_for_trtllm_fp8_per_tensor_moe(
layer: Module,
w13_scale: Tensor,
w13_input_scale: Tensor,
w2_scale: Tensor,
w2_input_scale: Tensor,
) -> None
Register the scales required by the FlashInfer TRT-LLM FP8 MoE kernel
rotate_weights_for_fi_trtllm_fp8_per_tensor_moe ¶
Shuffle weights into the FlashInfer TRT-LLM format
select_cutlass_fp8_gemm_impl ¶
select_cutlass_fp8_gemm_impl(
moe: FusedMoEConfig | None,
quant_config: FusedMoEQuantConfig,
out_dtype: dtype | None = None,
use_deepseek_fp8_block_scale: bool = False,
) -> FusedMoEPermuteExpertsUnpermute
Return a GEMM experts implementation for fused-MoE layers
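To show how this builder pairs with build_flashinfer_fp8_cutlass_moe_prepare_finalize above, here is a hedged sketch; moe and quant_config are taken as already-constructed configs, and the comment about how the two pieces are combined describes the usual modular fused-MoE flow rather than a specific call site.

import torch
from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    build_flashinfer_fp8_cutlass_moe_prepare_finalize,
    select_cutlass_fp8_gemm_impl,
)

def build_cutlass_fp8_moe_pieces(moe, quant_config):
    # Sketch: pair the prepare/finalize op with a CUTLASS FP8 experts impl.
    prepare_finalize = build_flashinfer_fp8_cutlass_moe_prepare_finalize(moe)
    experts = select_cutlass_fp8_gemm_impl(
        moe, quant_config, out_dtype=torch.bfloat16)
    # In vLLM these are typically combined into a modular fused-MoE kernel:
    # prepare/finalize quantizes and dispatches tokens, the experts impl runs
    # the grouped CUTLASS FP8 GEMMs, and the results are combined afterwards.
    return prepare_finalize, experts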