vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe
TritonOrDeepGemmExperts
Bases: FallbackExperts
DeepGEMM-based experts with a fallback to Triton kernels for low-latency shapes.
__init__
__init__(quant_config: FusedMoEQuantConfig)
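A minimal construction sketch based on the signature above. The class and module paths come from this page; FUSED_MOE_UNQUANTIZED_CONFIG is an assumed helper for obtaining an unquantized FusedMoEQuantConfig and may differ across vLLM versions.

# Hedged construction sketch. Only the constructor signature shown above is
# documented; FUSED_MOE_UNQUANTIZED_CONFIG is an assumption about how a
# default (unquantized) FusedMoEQuantConfig is built in your vLLM tree.
from vllm.model_executor.layers.fused_moe.config import (
    FUSED_MOE_UNQUANTIZED_CONFIG,  # assumption: default unquantized config
)
from vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe import (
    TritonOrDeepGemmExperts,
)

# Per the signature above, construction only needs the quantization config;
# the DeepGEMM-vs-Triton choice is deferred until input shapes are known.
experts = TritonOrDeepGemmExperts(quant_config=FUSED_MOE_UNQUANTIZED_CONFIG)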
_select_experts_impl
_select_experts_impl(
    hidden_states: Tensor, w1: Tensor, w2: Tensor
) -> FusedMoEPermuteExpertsUnpermute
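The name and return type suggest a per-call dispatch: inspect the activation and weight shapes, then return either the DeepGEMM or the Triton FusedMoEPermuteExpertsUnpermute implementation. The sketch below is a hypothetical illustration of that fallback pattern; the DEEP_GEMM_MIN_TOKENS threshold and the standalone function are invented for the example and are not vLLM's actual selection criteria.

import torch

# Hypothetical shape-based dispatch, illustrating the fallback described in
# the class docstring: small (low-latency) batches go to Triton, larger
# batches amortize DeepGEMM's grouped-GEMM launch overhead. The threshold
# is an invented stand-in for whatever check the real method performs.
DEEP_GEMM_MIN_TOKENS = 128  # assumption, not a vLLM constant


def select_experts_impl(hidden_states: torch.Tensor,
                        w1: torch.Tensor,
                        w2: torch.Tensor,
                        deep_gemm_expert,
                        triton_expert):
    M = hidden_states.size(0)  # tokens in this batch
    if deep_gemm_expert is not None and M >= DEEP_GEMM_MIN_TOKENS:
        return deep_gemm_expert
    return triton_expert  # low-latency shapes fall back to Triton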
workspace_shapes
workspace_shapes(
    M: int,
    N: int,
    K: int,
    topk: int,
    global_num_experts: int,
    local_num_experts: int,
    expert_tokens_meta: ExpertTokensMetadata | None,
    activation: str,
) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...]]
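Because the backend is picked per call, the workspace reported here plausibly has to fit whichever implementation ends up running. The sketch below illustrates that "max over both backends" idea, assuming the three returned tuples are two intermediate workspace shapes plus the output shape; the element-wise max, the helper name, and the assumption that both backends report same-rank shapes are illustrative, not the verbatim implementation.

# Illustrative only: size each scratch buffer so that either backend fits.
# The (workspace1, workspace2, output) reading of the three tuples and the
# element-wise max are assumptions about the convention, not vLLM's code.
def combined_workspace_shapes(deep_gemm_expert, triton_expert,
                              M: int, N: int, K: int, topk: int,
                              global_num_experts: int,
                              local_num_experts: int,
                              expert_tokens_meta, activation: str):
    args = (M, N, K, topk, global_num_experts, local_num_experts,
            expert_tokens_meta, activation)
    dg_ws1, dg_ws2, dg_out = deep_gemm_expert.workspace_shapes(*args)
    tr_ws1, tr_ws2, tr_out = triton_expert.workspace_shapes(*args)
    assert dg_out == tr_out  # both backends compute the same output shape
    # Take the larger extent per dimension so one allocation serves both
    # (assumes both backends report workspaces of the same rank).
    ws1 = tuple(max(a, b) for a, b in zip(dg_ws1, tr_ws1))
    ws2 = tuple(max(a, b) for a, b in zip(dg_ws2, tr_ws2))
    return ws1, ws2, dg_out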