vllm.model_executor.layers.quantization.utils.quant_utils ¶
This file is used for /tests and /benchmarks
kDynamic128Scale module-attribute ¶
kDynamic128Scale = ScaleDesc(
float32, False, GroupShape(1, 128)
)
kFp8Dynamic128Sym module-attribute ¶
kFp8Dynamic128Sym = QuantKey(
FP8_DTYPE, kDynamic128Scale, symmetric=True
)
kFp8Dynamic64Sym module-attribute ¶
kFp8Dynamic64Sym = QuantKey(
FP8_DTYPE, kDynamic64Scale, symmetric=True
)
kFp8DynamicTensorSym module-attribute ¶
kFp8DynamicTensorSym = QuantKey(
FP8_DTYPE, kDynamicTensorScale, symmetric=True
)
kFp8DynamicTokenSym module-attribute ¶
kFp8DynamicTokenSym = QuantKey(
FP8_DTYPE, kDynamicTokenScale, symmetric=True
)
kFp8StaticTensorSym module-attribute ¶
kFp8StaticTensorSym = QuantKey(
FP8_DTYPE, kStaticTensorScale, symmetric=True
)
kNvfp4GroupScale module-attribute ¶
kNvfp4GroupScale = ScaleDesc(
FP8_DTYPE, False, GroupShape(1, 16)
)
kNvfp4Quant module-attribute ¶
kNvfp4Quant = QuantKey(
FP4_DTYPE,
scale=kNvfp4GroupScale,
scale2=kStaticTensorScale,
)
GroupShape ¶
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
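As an illustrative sketch, a per-block group shape is just a two-element shape, and the common per-tensor / per-token shapes are exposed as static members; the member names PER_TENSOR and PER_TOKEN below are assumptions inferred from the description above.

```python
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape

# One scale per (1, 128) block of elements, as used by kDynamic128Scale above.
block_128 = GroupShape(1, 128)

# Static members for the common shapes mentioned in the description
# (member names assumed, not confirmed by this page).
per_tensor = GroupShape.PER_TENSOR
per_token = GroupShape.PER_TOKEN
```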
QuantKey dataclass ¶
Class for identifying the type of quantization.

- dtype: quantized data type
- scale: scale descriptor
- scale2: second-level scale descriptor
- symmetric: symmetric if True, asymmetric if False
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
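A minimal sketch that rebuilds the kFp8Dynamic128Sym key above from its parts; importing FP8_DTYPE from this module is an assumption (in the source it may be defined elsewhere).

```python
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    FP8_DTYPE, QuantKey, kDynamic128Scale, kFp8Dynamic128Sym)

# FP8 data, dynamic float32 scales per (1, 128) group, symmetric quantization.
key = QuantKey(FP8_DTYPE, kDynamic128Scale, symmetric=True)
assert key == kFp8Dynamic128Sym  # QuantKey is a dataclass, so equality is by value
```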
ScaleDesc dataclass ¶
Class for describing a single quantization scaling factor.

- dtype: data type of the scale
- static: static scale if True, dynamic if False
- group_shape: group shape of the scale
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
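For example, the kDynamic128Scale attribute above corresponds to a dynamic float32 scale computed per (1, 128) group; a brief sketch:

```python
import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape, ScaleDesc)

# static=False means the scale is computed dynamically at runtime.
dynamic_128 = ScaleDesc(torch.float32, False, GroupShape(1, 128))
print(dynamic_128)  # __str__ produces a compact textual description
```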
__str__ ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
_GroupShape ¶
_normalize_quant_group_shape ¶
_normalize_quant_group_shape(
x: Tensor, group_shape: GroupShape
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
awq_pack ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
convert_bf16_scales_to_fp8 ¶
Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
convert_packed_uint4b8_to_signed_int4_inplace ¶
Convert uint4b8 values (packed into int32) to signed int4, in place
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
cutlass_fp4_supported ¶
cutlass_fp4_supported() -> bool
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
get_and_maybe_dequant_weights ¶
get_and_maybe_dequant_weights(
layer: LinearBase, out_dtype: dtype = float32
)
Return layer's unquantized weights in [out, in] layout
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
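A hedged usage sketch (the LinearBase import path is assumed): because the weights come back in [out, in] layout, a reference projection is a plain matmul against the transpose.

```python
import torch
from vllm.model_executor.layers.linear import LinearBase  # assumed import path
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    get_and_maybe_dequant_weights)


def reference_forward(layer: LinearBase, x: torch.Tensor) -> torch.Tensor:
    # Dequantized weights in [out_features, in_features] layout.
    w = get_and_maybe_dequant_weights(layer, out_dtype=torch.float32)
    return x.to(torch.float32) @ w.t()
```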
get_attribute_fallback ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
get_fp8_min_max ¶
Get the min and max values for FP8 quantization.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
get_pack_factor ¶
gptq_pack ¶
gptq_quantize_weights ¶
gptq_quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int,
act_order: bool,
test_perm: Tensor | None = None,
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
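A hedged sketch of GPTQ-style reference quantization; the return order (reference weights, quantized weights, scales, group index, permutation) and the vllm.scalar_type.scalar_types import are assumptions, not confirmed by this page.

```python
import torch
from vllm.scalar_type import scalar_types  # assumed import path
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    gptq_quantize_weights)

w = torch.randn(4096, 4096, dtype=torch.half, device="cuda")
# Assumed return order: reference (dequantized) weights, quantized weights,
# per-group scales, group index, and the activation-order permutation.
w_ref, w_q, w_s, g_idx, rand_perm = gptq_quantize_weights(
    w, scalar_types.uint4b8, group_size=128, act_order=False)
```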
group_broadcast ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
is_layer_skipped ¶
is_layer_skipped(
prefix: str,
ignored_layers: list[str],
fused_mapping: Mapping[
str, list[str]
] = MappingProxyType({}),
*,
skip_with_substr: bool = False,
) -> bool
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
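A small usage sketch; the fused_mapping argument presumably lets fused modules (e.g. gate/up projections) be matched against their individually listed shards, but the example below only uses exact-name matching.

```python
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    is_layer_skipped)

# Layer names the quantization config marks as left unquantized.
ignored = ["lm_head", "model.layers.0.mlp.down_proj"]

if is_layer_skipped("model.layers.0.mlp.down_proj", ignored):
    print("layer stays unquantized")
```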
pack_cols ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
pack_quantized_values_into_int32 ¶
pack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
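A round-trip sketch paired with unpack_quantized_values_into_int32 below; the scalar_types import path and the uint4b8 type are assumptions.

```python
import torch
from vllm.scalar_type import scalar_types  # assumed import path
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    pack_quantized_values_into_int32, unpack_quantized_values_into_int32)

wtype = scalar_types.uint4b8  # 4-bit values, so 8 of them fit in one int32
w_q = torch.randint(0, 16, (128, 256), dtype=torch.int32)

packed = pack_quantized_values_into_int32(w_q, wtype, packed_dim=0)  # (16, 256)
unpacked = unpack_quantized_values_into_int32(packed, wtype, packed_dim=0)
assert torch.equal(w_q, unpacked)  # the round trip recovers the original values
```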
pack_rows ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
permute_rows ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
quantize_weights ¶
quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int | None,
zero_points: bool = False,
ref_zero_points_after_scales: bool = False,
)
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
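A hedged sketch; the return order (reference weights, quantized weights, scales, zero points) is an assumption based on how this helper is typically used in tests.

```python
import torch
from vllm.scalar_type import scalar_types  # assumed import path
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    quantize_weights)

w = torch.randn(512, 512, dtype=torch.half, device="cuda")
# Assumed return order: dequantized reference weights, quantized weights,
# per-group scales, and zero points (None when zero_points=False).
w_ref, w_q, w_s, w_zp = quantize_weights(
    w, scalar_types.uint4b8, group_size=128, zero_points=False)
```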
scaled_dequantize ¶
scaled_dequantize(
x_q: Tensor,
x_s: Tensor,
group_shape: GroupShape | None = None,
out_dtype: dtype = float32,
) -> Tensor
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
scaled_quantize ¶
scaled_quantize(
x: Tensor,
group_shape: GroupShape,
quant_dtype: dtype,
compute_dtype: dtype | None = None,
) -> tuple[Tensor, Tensor]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input tensor to quantize | required |
| group_shape | GroupShape | Shape of quantization groups | required |
| quant_dtype | dtype | Target quantized dtype (e.g., torch.float8_e4m3fn) | required |
| compute_dtype | dtype \| None | Optional dtype for intermediate computations. If None, uses input dtype. Use torch.float32 for higher precision. | None |
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
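A round-trip sketch combining scaled_quantize with scaled_dequantize above, using the signatures shown on this page; exact numerics depend on the FP8 dtype chosen.

```python
import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape, scaled_dequantize, scaled_quantize)

x = torch.randn(64, 256, dtype=torch.bfloat16)
group_shape = GroupShape(1, 128)

# Quantize to FP8 with one scale per (1, 128) group, accumulating in float32.
x_q, x_s = scaled_quantize(
    x, group_shape, torch.float8_e4m3fn, compute_dtype=torch.float32)

# Dequantize back; the result approximates x up to FP8 rounding error.
x_dq = scaled_dequantize(x_q, x_s, group_shape, out_dtype=torch.float32)
```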
sort_weights ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
swizzle_blockscale ¶
Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
Parameters¶
scale: torch.Tensor
Returns¶
torch.Tensor: The swizzled tensor with the same logical shape as scale.
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
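A usage sketch; the (rows, cols // 16) scale layout and the FP8 scale dtype are assumptions based on the NVFP4 group shape defined at the top of this page.

```python
import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    swizzle_blockscale)

# One FP8 scale per group of 16 FP4 elements (hypothetical layout).
block_scales = torch.randn(128, 256).to(torch.float8_e4m3fn)

# Pad and block-interleave into the layout the CUTLASS / FlashInfer
# kernels expect; per the docstring, the logical shape is preserved.
swizzled = swizzle_blockscale(block_scales)
```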
unpack_cols ¶
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
unpack_quantized_values_into_int32 ¶
unpack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)