vllm.v1.kv_cache_interface ¶
AttentionSpec dataclass ¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
ChunkedLocalAttentionSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
attention_chunk_size: int,
) -> None
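As a minimal sketch of constructing this spec from the signature above (assuming `dtype` is a `torch.dtype`; the values below are placeholders, not recommendations):

```python
import torch

from vllm.v1.kv_cache_interface import ChunkedLocalAttentionSpec

# Everything after block_size is keyword-only, per the signature above.
spec = ChunkedLocalAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,        # assumed to be a torch.dtype
    attention_chunk_size=8192,  # local-attention chunk length, in tokens
)
```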
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
Source code in vllm/v1/kv_cache_interface.py
CrossAttentionSpec dataclass ¶
Bases: AttentionSpec
KV cache spec for cross-attention layers in encoder-decoder models.
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
) -> None
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
Source code in vllm/v1/kv_cache_interface.py
EncoderOnlyAttentionSpec dataclass ¶
FullAttentionSpec dataclass ¶
Bases: AttentionSpec
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window layers are regarded as full attention by the KV cache manager (blocks are allocated for all tokens), while still being computed as sliding window attention in the model runner. In this case, we use FullAttentionSpec and record the sliding window size.
Source code in vllm/v1/kv_cache_interface.py
sliding_window class-attribute instance-attribute ¶
sliding_window: int | None = None
Defaults to None, meaning sliding window attention is not used.
__init__ ¶
__init__(
block_size: int,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
head_size_v: int | None = None,
sliding_window: int | None = None,
attention_chunk_size: int | None = None,
) -> None
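Tying this to the class docstring above, a hedged sketch of recording a sliding window size in a FullAttentionSpec when the hybrid allocator is disabled (placeholder values):

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

# The KV cache manager treats this layer as full attention (blocks for all
# tokens), but the recorded window lets the model runner still compute
# sliding window attention.
spec = FullAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.bfloat16,
    sliding_window=4096,  # None (the default) means no sliding window
)
```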
__post_init__ ¶
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
Source code in vllm/v1/kv_cache_interface.py
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
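The exact signature of merge is not shown on this page; assuming it accepts a plain list of specs, as the docstring suggests, usage might look like:

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

common = dict(num_kv_heads=8, head_size=128, dtype=torch.bfloat16)
spec_a = FullAttentionSpec(block_size=16, sliding_window=4096, **common)
spec_b = FullAttentionSpec(block_size=16, **common)

# Assumed call shape: a list of specs in, a single merged spec out.
merged = FullAttentionSpec.merge([spec_a, spec_b])
```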
merge_window_sizes classmethod ¶
Source code in vllm/v1/kv_cache_interface.py
KVCacheConfig dataclass ¶
The KV cache configuration of a model.
Source code in vllm/v1/kv_cache_interface.py
kv_cache_groups instance-attribute ¶
kv_cache_groups: list[KVCacheGroupSpec]
The KV cache groups of the model. For models with only one type of attention, there is a single group that contains all layers. For models with multiple types of attention, there will be multiple groups; see _get_kv_cache_config_uniform_page_size for more details.
kv_cache_tensors instance-attribute ¶
kv_cache_tensors: list[KVCacheTensor]
How the model runner should initialize the KV cache tensors for each layer.
__init__ ¶
__init__(
num_blocks: int,
kv_cache_tensors: list[KVCacheTensor],
kv_cache_groups: list[KVCacheGroupSpec],
) -> None
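A hedged sketch of assembling a KVCacheConfig; the KVCacheTensor and KVCacheGroupSpec constructor arguments used here (size, shared_by, layer_names, kv_cache_spec) are assumptions, since their fields are not listed on this page:

```python
import torch

from vllm.v1.kv_cache_interface import (
    FullAttentionSpec,
    KVCacheConfig,
    KVCacheGroupSpec,
    KVCacheTensor,
)

spec = FullAttentionSpec(
    block_size=16, num_kv_heads=8, head_size=128, dtype=torch.bfloat16
)

# Field names below are assumed, not confirmed by this page; check the
# dataclass definitions in vllm/v1/kv_cache_interface.py.
config = KVCacheConfig(
    num_blocks=1024,
    kv_cache_tensors=[
        KVCacheTensor(size=1 << 30, shared_by=["model.layers.0.self_attn"]),
    ],
    kv_cache_groups=[
        KVCacheGroupSpec(
            layer_names=["model.layers.0.self_attn"], kv_cache_spec=spec
        ),
    ],
)
```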
KVCacheGroupSpec dataclass ¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
Source code in vllm/v1/kv_cache_interface.py
KVCacheSpec dataclass ¶
A base class for specifying the KV cache format of one layer.
Source code in vllm/v1/kv_cache_interface.py
copy_with_new_block_size ¶
Create a new KVCacheSpec from self but replacing the block size.
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
|---|---|
| int | The KV cache size in bytes |
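To make the accounting concrete, here is illustrative arithmetic for one plausible per-block cost (separate K and V tensors per token slot); this is a sketch, not the library's exact implementation:

```python
import torch

def approx_page_size_bytes(
    block_size: int, num_kv_heads: int, head_size: int, dtype: torch.dtype
) -> int:
    # K and V tensors, each of shape [block_size, num_kv_heads, head_size].
    return 2 * block_size * num_kv_heads * head_size * dtype.itemsize

# 16 tokens/block, 8 KV heads, head size 128, fp16 -> 64 KiB per block.
print(approx_page_size_bytes(16, 8, 128, torch.float16))  # 65536
```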
merge classmethod ¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheTensor dataclass ¶
A class for specifying how the workers should initialize the KV cache.
Source code in vllm/v1/kv_cache_interface.py
MLAAttentionSpec dataclass ¶
Bases: FullAttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
head_size_v: int | None = None,
sliding_window: int | None = None,
attention_chunk_size: int | None = None,
cache_dtype_str: str | None = None,
) -> None
merge classmethod ¶
Source code in vllm/v1/kv_cache_interface.py
MambaSpec dataclass ¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
SinkFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
sink_len: int | None = None,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
head_size_v: int | None = None,
sliding_window: int | None = None,
attention_chunk_size: int | None = None,
) -> None
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
*,
num_kv_heads: int,
head_size: int,
dtype: dtype,
page_size_padded: int | None = None,
sliding_window: int,
) -> None
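Unlike FullAttentionSpec, here sliding_window is a required keyword argument. A minimal sketch with placeholder values:

```python
import torch

from vllm.v1.kv_cache_interface import SlidingWindowSpec

spec = SlidingWindowSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    sliding_window=4096,  # required here: window length in tokens
)
```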
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
Source code in vllm/v1/kv_cache_interface.py
UniformTypeKVCacheSpecs dataclass ¶
Bases: KVCacheSpec
A KV cache spec for multiple layers with the same type of attention. Here, "same type" means the layers always need the same number of token slots. For example, sliding window attention layers with different window sizes are not the same type and should not be merged into one UniformTypeKVCacheSpecs.
Source code in vllm/v1/kv_cache_interface.py
from_specs classmethod ¶
from_specs(
kv_cache_specs: dict[str, KVCacheSpec],
) -> Self | None
Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec, or None if they do not.
Source code in vllm/v1/kv_cache_interface.py
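Per the signature, from_specs takes a mapping from layer name to spec and may return None, so callers should branch on the result. A short sketch (layer names are illustrative):

```python
import torch

from vllm.v1.kv_cache_interface import (
    FullAttentionSpec,
    UniformTypeKVCacheSpecs,
)

layer_specs = {
    name: FullAttentionSpec(
        block_size=16, num_kv_heads=8, head_size=128, dtype=torch.bfloat16
    )
    for name in ("model.layers.0.self_attn", "model.layers.1.self_attn")
}

uniform = UniformTypeKVCacheSpecs.from_specs(layer_specs)
if uniform is None:
    # Mixed spec types (e.g. different sliding window sizes):
    # fall back to per-group handling.
    ...
```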
is_uniform_type classmethod ¶
is_uniform_type(
kv_cache_specs: dict[str, KVCacheSpec],
) -> bool
Whether all layers have the same type of KV cache spec.
Source code in vllm/v1/kv_cache_interface.py
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int