vllm.model_executor.models.gemma3n_audio_utils ¶
Lightweight utility functions for Gemma3n audio processing.
This module is separate from gemma3n_mm.py to avoid heavy CUDA dependencies, making it testable without a full vLLM build.
adjust_audio_features_to_expected_length ¶
adjust_audio_features_to_expected_length(
audio_features: Tensor,
expected_tokens: int,
audio_padding_embs: Tensor,
) -> tuple[Tensor, int]
Adjust audio features to expected token length via padding or truncation.
The Gemma3nProcessor assumes all audio is ~30s in length and inserts a fixed number of audio soft tokens into the text. However, the audio preprocessing and encoder do not guarantee exactly that many soft tokens; they may produce fewer (for shorter audio) or more (for longer audio, or due to BOA/EOA special tokens).
This function handles both cases:

- Fewer tokens: pad with the provided padding embeddings
- More tokens: truncate to the expected count
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
audio_features | Tensor | Audio embeddings tensor of shape (batch_size, seq_len, embed_dim) | required |
expected_tokens | int | The expected number of audio tokens (e.g., 188) | required |
audio_padding_embs | Tensor | Padding embeddings tensor of shape (1, 1, embed_dim) | required |
Returns:
| Type | Description |
|---|---|
tuple[Tensor, int] | Tuple of the adjusted audio features, of shape (batch_size, expected_tokens, embed_dim), and a token count |
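The pad-or-truncate behavior described above can be sketched as follows. This is a hypothetical reimplementation, not the actual vLLM source: the assumption that the returned `int` is the pre-adjustment token count is mine, as is the use of `expand` to broadcast the padding embedding.

```python
import torch


def adjust_audio_features_to_expected_length(
    audio_features: torch.Tensor,
    expected_tokens: int,
    audio_padding_embs: torch.Tensor,
) -> tuple[torch.Tensor, int]:
    """Sketch: pad or truncate audio features along the sequence dim."""
    batch_size, seq_len, embed_dim = audio_features.shape
    if seq_len < expected_tokens:
        # Fewer tokens: broadcast the (1, 1, embed_dim) padding
        # embedding across the missing positions and append it.
        pad = audio_padding_embs.expand(
            batch_size, expected_tokens - seq_len, embed_dim
        )
        adjusted = torch.cat([audio_features, pad], dim=1)
    else:
        # More tokens (or exact match): truncate to the expected count.
        adjusted = audio_features[:, :expected_tokens, :]
    # Assumption: the int is the pre-adjustment token count.
    return adjusted, seq_len
```

For example, a 100-token batch adjusted to the processor's 188 expected tokens comes back as shape `(batch_size, 188, embed_dim)`, with positions 100 onward filled by the padding embedding.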