vllm.model_executor.layers.fused_moe.triton_cutlass_moe
TritonOrCutlassExperts
Bases: FallbackExperts
Uses the Cutlass experts implementation, falling back to Triton for low-latency shapes on SM100.
Source code in vllm/model_executor/layers/fused_moe/triton_cutlass_moe.py
__init__
__init__(
    e: int,
    n: int,
    k: int,
    out_dtype: dtype | None,
    quant_config: FusedMoEQuantConfig,
    device: dtype,
)
_select_experts_impl
_select_experts_impl(
    hidden_states: Tensor, w1: Tensor, w2: Tensor
) -> FusedMoEPermuteExpertsUnpermute
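The class description suggests a dispatch between a primary (Cutlass) and a fallback (Triton) implementation based on the input shape. A minimal, library-free sketch of that pattern follows. `ExpertsStub`, `FallbackSelector`, and the token cutoff of 8 are hypothetical names and values chosen for illustration; they are not taken from the vLLM API.

```python
from dataclasses import dataclass


@dataclass
class ExpertsStub:
    """Hypothetical stand-in for an experts implementation."""

    name: str


class FallbackSelector:
    """Toy model of a primary/fallback experts pair (not vLLM code)."""

    # Assumed cutoff for "low latency" (small-batch) shapes; illustrative only.
    SMALL_BATCH_MAX_TOKENS = 8

    def __init__(
        self, primary: ExpertsStub, fallback: ExpertsStub, is_sm100: bool
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.is_sm100 = is_sm100

    def select(self, num_tokens: int) -> ExpertsStub:
        # Use the fallback only for small batches on the assumed SM100 device;
        # every other case goes to the primary implementation.
        if self.is_sm100 and num_tokens <= self.SMALL_BATCH_MAX_TOKENS:
            return self.fallback
        return self.primary


selector = FallbackSelector(
    primary=ExpertsStub("cutlass"),
    fallback=ExpertsStub("triton"),
    is_sm100=True,
)
small_batch_choice = selector.select(num_tokens=4)
large_batch_choice = selector.select(num_tokens=512)
```

In this sketch the decision depends only on the number of input tokens and a device flag, which keeps the selection cheap enough to evaluate on every forward call.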
workspace_shapes
workspace_shapes(
    M: int,
    N: int,
    K: int,
    topk: int,
    global_num_experts: int,
    local_num_experts: int,
    expert_tokens_meta: ExpertTokensMetadata | None,
    activation: str,
) -> tuple[
    tuple[int, ...], tuple[int, ...], tuple[int, ...]
]
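Because the wrapper can hand work to either of two implementations, the workspace shapes it reports presumably need to come from whichever implementation will actually run for the given `M`. The sketch below illustrates that delegation under stated assumptions: `StubExperts`, `DispatchingExperts`, the `pad_to` sizing rule, the `small_m` cutoff, and the meaning assigned to the three returned tuples are all hypothetical, and the signature is reduced to four of the parameters listed above.

```python
Shape = tuple[int, ...]


class StubExperts:
    """Hypothetical experts backend with made-up workspace sizing."""

    def __init__(self, name: str, pad_to: int) -> None:
        self.name = name
        self.pad_to = pad_to  # illustrative row padding, not a real vLLM rule

    def workspace_shapes(
        self, M: int, N: int, K: int, topk: int
    ) -> tuple[Shape, Shape, Shape]:
        # Round the M * topk rows up to a multiple of pad_to.
        rows = -(-M * topk // self.pad_to) * self.pad_to
        # Assumed meaning: two scratch buffers, then the output shape.
        return (rows, max(N, K)), (rows, N), (M, K)


class DispatchingExperts:
    """Delegates shape queries to the backend chosen for a given M."""

    def __init__(
        self,
        primary: StubExperts,
        fallback: StubExperts,
        is_sm100: bool,
        small_m: int = 8,  # assumed small-batch cutoff
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.is_sm100 = is_sm100
        self.small_m = small_m

    def _pick(self, M: int) -> StubExperts:
        if self.is_sm100 and M <= self.small_m:
            return self.fallback
        return self.primary

    def workspace_shapes(
        self, M: int, N: int, K: int, topk: int
    ) -> tuple[Shape, Shape, Shape]:
        # Same predicate as the execution path, so buffers match the kernel.
        return self._pick(M).workspace_shapes(M, N, K, topk)


experts = DispatchingExperts(
    primary=StubExperts("cutlass", pad_to=128),
    fallback=StubExperts("triton", pad_to=1),
    is_sm100=True,
)
small_shapes = experts.workspace_shapes(M=4, N=256, K=512, topk=2)
large_shapes = experts.workspace_shapes(M=64, N=256, K=512, topk=2)
```

Routing both the shape query and the execution path through one predicate (`_pick` here) avoids allocating buffers sized for one backend and then running the other.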