vllm.multimodal.audio
MONO_AUDIO_SPEC module-attribute
AudioEmbeddingMediaIO
Source code in vllm/multimodal/audio.py
__init__
encode_base64
load_base64
load_bytes
load_file
AudioMediaIO
Bases: MediaIO[tuple[NDArray, float]]
AudioResampler
Resample audio data to a target sample rate.
__init__
resample
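
To illustrate what resampling to a target sample rate means, here is a minimal sketch using linear interpolation in NumPy. It is not the method `AudioResampler` uses (this page does not describe its algorithm or arguments). The function name `resample_linear_sketch` and its parameters are hypothetical, and the sketch applies no anti-aliasing filter, so it is suitable for illustration only.

```python
import numpy as np


def resample_linear_sketch(
    audio: np.ndarray, orig_sr: float, target_sr: float
) -> np.ndarray:
    """Resample a 1D signal by linear interpolation (illustrative only)."""
    if orig_sr == target_sr:
        # Nothing to do when the rates already match.
        return audio
    # Number of output samples that covers the same duration.
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Timestamps (in seconds) of the input and output samples.
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    # Evaluate the input signal at the output timestamps.
    return np.interp(t_out, t_in, audio)


# One second of a 5 Hz sine wave sampled at 16 kHz, resampled to 8 kHz.
one_second = np.sin(2 * np.pi * 5 * np.arange(16000) / 16000)
downsampled = resample_linear_sketch(one_second, orig_sr=16000, target_sr=8000)
```

Halving the sample rate halves the number of samples while the duration stays at one second.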
AudioSpec dataclass
Specification for target audio format.
This dataclass defines the expected audio format for a model's feature extractor. It is used to normalize audio data before processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| `target_channels` | `int \| None` | Number of output channels. `None` means passthrough (no normalization). 1 = mono, 2 = stereo, etc. |
| `channel_reduction` | `ChannelReduction` | Method used to reduce channels when the input has more channels than the target. Only used when reducing channels. |
__init__
__init__(
target_channels: int | None = 1,
channel_reduction: ChannelReduction = MEAN,
) -> None
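
The fields and defaults above can be mirrored in a small stand-alone sketch. The names `AudioSpecSketch` and `ChannelReductionSketch` are hypothetical stand-ins, not the vLLM classes. Only the `MEAN` member is taken from the signature above; `FIRST` is an assumed extra member included purely to show that the reduction method is selectable.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ChannelReductionSketch(Enum):
    """Stand-in for the reduction-method enum (illustrative only)."""

    MEAN = auto()   # default shown in the documented signature
    FIRST = auto()  # hypothetical member, not taken from this page


@dataclass
class AudioSpecSketch:
    """Stand-in mirroring the two documented attributes and their defaults."""

    target_channels: Optional[int] = 1
    channel_reduction: ChannelReductionSketch = ChannelReductionSketch.MEAN


# Default spec: reduce to mono by averaging channels.
mono_spec = AudioSpecSketch()
# None means passthrough: leave the channel layout untouched.
passthrough_spec = AudioSpecSketch(target_channels=None)
# Two-channel target; the reduction method only matters for inputs
# with more than two channels.
stereo_spec = AudioSpecSketch(target_channels=2)
```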
ChannelReduction
Method to reduce multi-channel audio to target channels.
normalize_audio
Normalize audio to the specified format.
This function handles channel reduction for multi-channel audio, supporting both numpy arrays and torch tensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `audio` | `NDArray[floating] \| Tensor` | Input audio data. Can be a 1D array/tensor `(time,)`, which is already mono; a 2D array/tensor `(channels, time)`, the standard format from torchaudio; or a 2D array/tensor `(time, channels)`, the format from soundfile (auto-detected and transposed if time > channels). | required |
| `spec` | `AudioSpec` | `AudioSpec` defining the target format. | required |
Returns:
| Type | Description |
|---|---|
| `NDArray[floating] \| Tensor` | Normalized audio in the same type as the input (numpy or torch). For mono output (`target_channels=1`), returns a 1D array/tensor. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If audio has unsupported dimensions or channel expansion is requested (e.g., mono to stereo). |
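
The mono case described above can be sketched in plain NumPy. This is not the vLLM implementation: `to_mono_sketch` is a hypothetical helper that handles only the mean reduction to one channel. The layout check (transpose when the first axis is longer than the second) is an assumed reading of the "time > channels" note in the parameter description.

```python
import numpy as np


def to_mono_sketch(audio: np.ndarray) -> np.ndarray:
    """Reduce audio to a 1D mono signal by averaging channels (illustrative only)."""
    if audio.ndim == 1:
        # Shape (time,): already mono, return unchanged.
        return audio
    if audio.ndim != 2:
        raise ValueError(f"Unsupported audio shape: {audio.shape}")
    # Assumed heuristic: a longer first axis means (time, channels),
    # so transpose to the (channels, time) layout.
    if audio.shape[0] > audio.shape[1]:
        audio = audio.T
    # Average over the channel axis, giving shape (time,).
    return audio.mean(axis=0)


# The same stereo clip in both layouts: one silent channel, one channel of ones.
stereo_channels_first = np.stack([np.zeros(8), np.ones(8)])  # (channels, time)
stereo_time_first = stereo_channels_first.T                  # (time, channels)

mono_a = to_mono_sketch(stereo_channels_first)
mono_b = to_mono_sketch(stereo_time_first)
```

Both layouts produce the same 1D result. A shape heuristic like this misreads very short clips that have fewer samples than channels, so callers that know the layout are safer passing `(channels, time)` explicitly.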