
Notes on Whisper Audio Processing

Kevin 吴嘉文 · About 10 min · Knowledge Notes · NLP · AIGC · LLM · Agent

Audio Formats and Encoding

  1. What is sound?
  • Physically: vibrations of air molecules → a sound-pressure wave that varies over time.
  • Analog signal: a continuous waveform, continuous in both time and amplitude.

But computers can only work with discrete numbers, so the signal must be digitized via "sampling + quantization".


  2. Sampling and quantization (the first step of digitization)

The nature of sound: a continuous analog signal

  • Speech, music, and other sounds are air-pressure waves that vary continuously over time.
  • This gives an analog signal curve (time-continuous, amplitude-continuous).

Microphone: sound wave → voltage signal

  • The microphone diaphragm vibrates with the sound pressure, turning changes in air pressure into changes in voltage.
  • The voltage signal is still a continuous analog signal, e.g. 0.1 V → 0.15 V → 0.2 V ….

Sampling: discretizing time

  • An analog-to-digital converter (ADC) reads the instantaneous voltage at fixed intervals.
  • Sampling rate (f_s) = the number of points taken per second.
  • CD audio: 44.1 kHz ⇒ 44,100 points per second;
  • Speech recognition: 16 kHz ⇒ 16,000 points per second.

According to the Nyquist sampling theorem, the sampling rate must be ≥ 2 × the highest frequency in the signal to allow distortion-free reconstruction. Human hearing tops out around 20 kHz ⇒ CDs use 44.1 kHz (slightly above 2 × 20 kHz).
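
As a quick check of the Nyquist condition, here is a minimal NumPy sketch (the tone frequencies are arbitrary examples): a 3 kHz tone sampled at 16 kHz keeps its spectral peak, while a 9 kHz tone, which lies above the 8 kHz Nyquist limit, aliases down to 16 − 9 = 7 kHz.

```python
import numpy as np

fs = 16_000                      # sampling rate (Hz)
t = np.arange(fs) / fs           # 1 second of sample times

for f in (3_000, 9_000):         # 9 kHz is above the 8 kHz Nyquist limit
    x = np.sin(2 * np.pi * f * t)
    spectrum = np.abs(np.fft.rfft(x))
    peak_hz = np.argmax(spectrum) * fs / len(x)
    print(f"{f} Hz tone -> spectral peak at {peak_hz:.0f} Hz")

# Expected output (approximately):
#   3000 Hz tone -> spectral peak at 3000 Hz
#   9000 Hz tone -> spectral peak at 7000 Hz   (aliased: 16000 - 9000)
```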

Quantization: discretizing amplitude

  • The voltage at each sample point is a continuous value; the computer must represent it with a finite number of bits.
  • Quantization bit depth:
  • 8-bit ⇒ 256 levels;
  • 16-bit ⇒ 65,536 levels (CD quality);
  • 24-bit ⇒ 16,777,216 levels (common in recording studios).

Example:

  • True voltage = 0.12345 V
  • With 16-bit quantization ⇒ it might be stored as the nearest discrete level, e.g. "0.12340 V".
  • The higher the bit depth ⇒ the closer the reconstruction is to the original, and the smaller the quantization noise (error).
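
A minimal sketch of that round trip, assuming a signal range of −1 V to +1 V and a simple uniform quantizer (not any particular ADC): quantize the value to the nearest level and compare the error across bit depths.

```python
import numpy as np

def quantize(x, bits, full_scale=1.0):
    """Round x to the nearest level of a signed uniform quantizer over [-full_scale, +full_scale]."""
    step = full_scale / 2 ** (bits - 1)   # e.g. 1/32768 V per step for 16-bit
    return np.round(x / step) * step

x = 0.12345                               # "true" voltage
for bits in (8, 16, 24):
    q = quantize(x, bits)
    print(f"{bits:2d}-bit: stored as {q:.8f} V (error {abs(q - x):.2e} V)")
# Higher bit depth -> smaller quantization error.
```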

Encoding: writing to a file

  • After sampling + quantization you have a sequence of integers (PCM data).
  • This is packed into a file container (e.g. WAV, AIFF) together with its parameters: sampling rate, bit depth, and channel count.

Example (16 kHz, 16-bit, mono, 1 second of audio):

  • Total samples = 16,000;
  • Each sample is 16 bits = 2 bytes;
  • File size ≈ 16,000 × 2 = 32 KB.
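
To make the arithmetic concrete, here is a small sketch using NumPy and Python's standard-library wave module (the 440 Hz tone and the file name tone.wav are just placeholders): it synthesizes 1 second of audio at 16 kHz, quantizes it to 16-bit PCM, and writes a mono WAV whose PCM payload is the expected 32,000 bytes (plus a 44-byte header on disk).

```python
import wave
import numpy as np

fs = 16_000                                   # sampling rate (Hz)
t = np.arange(fs) / fs                        # 1 second
signal = 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz tone, amplitude 0.5

# Quantize float samples in [-1, 1] to 16-bit signed integers (PCM).
pcm = (signal * 32767).astype(np.int16)

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit = 2 bytes per sample
    f.setframerate(fs)       # 16 kHz
    f.writeframes(pcm.tobytes())

print(pcm.nbytes)            # 32000 bytes of PCM data (~32 KB + WAV header on disk)
```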

  3. Audio file formats (File Format)

An audio file format generally has two parts: a container format (file container) + a codec.

  1. Container format
  • The file's outer shell: it stores the audio data plus metadata such as sampling rate, channel count, and bitrate.
  • Common containers: WAV, AIFF, MP4, MKV.
  2. Codec (Coder-Decoder)
  • Determines how the audio data is compressed and stored.
  • Falls into uncompressed, lossless compression, and lossy compression.
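
As an illustration, the container metadata can be read with ffprobe, which ships with ffmpeg; this sketch assumes ffprobe is on the PATH and uses a placeholder file name, and the exact fields returned depend on the file.

```python
import json
import subprocess

# Ask ffprobe for the container/stream metadata as JSON.
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "audio.mp3"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
stream = info["streams"][0]
print(stream["codec_name"], stream["sample_rate"], stream["channels"])
```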

  4. Audio encoding methods (Encoding)

Uncompressed

  • PCM: the most common; almost all audio processing converts to PCM first.
  • Formats: WAV (Windows), AIFF (Mac).
  • Pros: highest quality; cons: large files.

Lossless compression

  • Compresses without discarding information; decoding reproduces the original PCM exactly.
  • Common: FLAC, ALAC (Apple), APE.
  • Pros: roughly half the size of uncompressed; cons: still fairly large.

Lossy compression

  • Discards information the human ear is insensitive to, compressing with a psychoacoustic model.
  • Common: MP3, AAC, OGG Vorbis, Opus.
  • Pros: small size, well suited to transmission; cons: irreversible, quality depends on the bitrate.
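
A rough sketch of moving between these categories with the ffmpeg CLI (file names are placeholders; ffmpeg must be installed): the FLAC output can be decoded back to the exact original PCM, whereas the 192 kbps MP3 cannot.

```python
import subprocess

# Lossless: WAV (PCM) -> FLAC; decoding the FLAC gives back the exact PCM samples.
subprocess.run(["ffmpeg", "-y", "-i", "input.wav", "-c:a", "flac", "output.flac"], check=True)

# Lossy: WAV (PCM) -> MP3 at 192 kbps; information discarded here cannot be recovered.
subprocess.run(["ffmpeg", "-y", "-i", "input.wav", "-b:a", "192k", "output.mp3"], check=True)
```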

  5. Quick reference: common audio formats

| Format | Encoding | Characteristics | Typical use |
| --- | --- | --- | --- |
| WAV | PCM (uncompressed) | High fidelity, large files | Professional recording, audio processing |
| AIFF | PCM (uncompressed) | Apple's standard, similar to WAV | Professional audio workstations |
| FLAC | Lossless | ~50% smaller, no quality loss | Hi-fi music collections |
| ALAC | Lossless | Apple's FLAC | iTunes, iOS |
| MP3 | Lossy | Most popular, widely compatible | Music streaming |
| AAC | Lossy | More efficient than MP3 | YouTube, iTunes, iPhone |
| OGG (Vorbis) | Lossy | Open-source MP3 alternative | Games, open-source apps |
| Opus | Lossy | Adaptive bitrate, very low latency | Zoom, Discord, WebRTC |

  6. Key audio parameters
  • Sampling rate: determines the frequency range (8 kHz sounds like a telephone, 44.1 kHz is CD, 48 kHz is video).
  • Bitrate: the amount of data per second (kbps).
  • Common MP3 bitrates: 128 kbps, 192 kbps, 320 kbps;
  • At the same bitrate, AAC/Opus usually sound better.
  • Channels: mono, stereo, surround (5.1, 7.1).
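
For uncompressed PCM these parameters fully determine the bitrate: bitrate = sampling rate × bit depth × channels. A quick sanity check with the values above:

```python
def pcm_bitrate_kbps(sample_rate, bit_depth, channels):
    """Uncompressed PCM bitrate in kbit/s."""
    return sample_rate * bit_depth * channels / 1000

print(pcm_bitrate_kbps(44_100, 16, 2))   # CD audio: 1411.2 kbps
print(pcm_bitrate_kbps(16_000, 16, 1))   # 16 kHz mono speech: 256.0 kbps
```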

  7. Codec pipeline

Encoding

  • Analog → digital (sampling + quantization) → compression (optional) → stored as a file.

Decoding

  • File → decompression → PCM waveform → sound card → analog signal → headphones/speakers.

  8. Signals, the Fourier transform, and Mel spectrograms; see this article: https://zhuanlan.zhihu.com/p/198900624

Whisper Audio Encoder

Following the source code behind from transformers import WhisperProcessor, WhisperForConditionalGeneration, Whisper applies the preprocessing steps below to raw speech (an end-to-end usage sketch follows this list).

  1. The audio file is usually first converted to a NumPy array. Only mono-channel audio is supported; the resulting NumPy array has a shape like (60075,).

    import numpy as np
    import subprocess
    
    def ffmpeg_read(bpayload: bytes, sampling_rate: int) -> np.array:
        """
        Helper function to read an audio file through ffmpeg.
        """
        ar = f"{sampling_rate}"
        ac = "1"
        format_for_conversion = "f32le"
        ffmpeg_command = [
            "ffmpeg",
            "-i",
            "pipe:0",
            "-ac",
            ac,
            "-ar",
            ar,
            "-f",
            format_for_conversion,
            "-hide_banner",
            "-loglevel",
            "quiet",
            "pipe:1",
        ]
    
        try:
            with subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE) as ffmpeg_process:
                output_stream = ffmpeg_process.communicate(bpayload)
        except FileNotFoundError as error:
            raise ValueError("ffmpeg was not found but is required to load audio files from filename") from error
        out_bytes = output_stream[0]
        audio = np.frombuffer(out_bytes, np.float32)
        if audio.shape[0] == 0:
            raise ValueError(
                "Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has "
                "a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote "
                "URL, ensure that the URL is the full address to  **download**  the audio file."
            )
        return audio
    
  2. By default, batched inputs are padded to max_length.

    • In the WhisperProcessor, preprocessing pads according to self.n_samples = chunk_length * sampling_rate, where chunk_length=30 and sampling_rate=16000.
  3. Compute the log-mel spectrogram from the input.

    1. Split the input waveform into frames of size frame_length; adjacent frames overlap by frame_length - hop_length samples.

      # split waveform into frames of frame_length size
      num_frames = int(1 + np.floor((waveform.size - frame_length) / hop_length))

      num_frequency_bins = (fft_length // 2) + 1 if onesided else fft_length
      spectrogram = np.empty((num_frames, num_frequency_bins), dtype=np.complex64)
      # num_frequency_bins = 201; num_frames depends on the input length
      
    2. Each frame is multiplied by a window function and placed into a buffer of size fft_length.

      spectrogram = np.empty((num_batches, num_frames, num_frequency_bins), dtype=np.complex64)
      
      # rfft is faster than fft
      fft_func = np.fft.rfft if onesided else np.fft.fft
      buffer = np.zeros((num_batches, fft_length))
      
      for frame_idx in range(num_frames):
          timestep = frame_idx * hop_length
          # frame_length = 400, hop_length = 160
          buffer[:, :frame_length] = padded_waveform_batch[:, timestep : timestep + frame_length]
          buffer[:, :frame_length] *= window
         
      
    3. Apply the discrete Fourier transform (DFT) to each windowed frame.

      spectrogram[:, frame_idx] = fft_func(buffer)
      if mel_filters is not None:
          spectrogram = np.maximum(mel_floor, np.dot(mel_filters.T, spectrogram))
          # mel_filters.shape = (201, 80); the spectrogram is transposed so its
          # frequency axis (201) comes first before this step, giving 80 mel bins per frame
      
    4. Stack the results to obtain the spectrogram, with a shape like [2, 80, 3000].
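
Putting the steps together, here is a hedged end-to-end sketch (the checkpoint openai/whisper-tiny and the file sample.wav are placeholders; it assumes transformers, torch, and ffmpeg are installed and reuses the ffmpeg_read helper from step 1): read the file into a mono 16 kHz float array, run the processor, check the (1, 80, 3000) log-mel features, and decode.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Step 1: decode the file to a mono 16 kHz float32 array (ffmpeg_read as defined above).
with open("sample.wav", "rb") as f:
    audio = ffmpeg_read(f.read(), sampling_rate=16000)    # shape like (60075,)

# Steps 2-3: pad/truncate to 30 s and compute the log-mel spectrogram.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)                        # torch.Size([1, 80, 3000])

# The encoder consumes the log-mel features; the decoder generates the transcript.
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```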

For reference, the Whisper preprocessor from Hugging Face transformers (WhisperFeatureExtractor):

from typing import Optional, Union

import numpy as np

from ... import is_torch_available
from ...audio_utils import mel_filter_bank, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, logging


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class WhisperFeatureExtractor(SequenceFeatureExtractor):
    r"""
    Constructs a Whisper feature extractor.

    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
    most of the main methods. Users should refer to this superclass for more information regarding those methods.

    This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the `Short Time
    Fourier Transform` which should match pytorch's `torch.stft` equivalent.

    Args:
        feature_size (`int`, *optional*, defaults to 80):
            The feature dimension of the extracted features.
        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
        hop_length (`int`, *optional*, defaults to 160):
            Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients.
        chunk_length (`int`, *optional*, defaults to 30):
            The maximum number of chunks of `sampling_rate` samples used to trim and pad longer or shorter audio
            sequences.
        n_fft (`int`, *optional*, defaults to 400):
            Size of the Fourier transform.
        padding_value (`float`, *optional*, defaults to 0.0):
            Padding value used to pad the audio. Should correspond to silences.
        dither (`float`, *optional*, defaults to 0.0):
            Adds dithering. In other words, adds a small Gaussian noise to each frame.
            E.g. use 0.0001 to add dithering with a normal distribution centered
            around 0.0 with standard deviation 0.0001 (assuming [-1,+1] range of raw_speech).
            The value 0.0 means no dithering.
            Dithering has similar effect as `spectrogram(mel_floor=...)`. It reduces
            the high log_mel_fbank values for signals with hard-zero sections,
            when VAD cutoff is present in the signal.
    """

    model_input_names = ["input_features"]

    def __init__(
        self,
        feature_size=80,
        sampling_rate=16000,
        hop_length=160,
        chunk_length=30,
        n_fft=400,
        padding_value=0.0,
        dither=0.0,
        return_attention_mask=False,  # pad inputs to max length with silence token (zero) and no attention mask
        **kwargs,
    ):
        super().__init__(
            feature_size=feature_size,
            sampling_rate=sampling_rate,
            padding_value=padding_value,
            return_attention_mask=return_attention_mask,
            **kwargs,
        )
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.chunk_length = chunk_length
        self.n_samples = chunk_length * sampling_rate
        self.nb_max_frames = self.n_samples // hop_length
        self.sampling_rate = sampling_rate
        self.dither = dither
        self.mel_filters = mel_filter_bank(
            num_frequency_bins=1 + n_fft // 2,
            num_mel_filters=feature_size,
            min_frequency=0.0,
            max_frequency=8000.0,
            sampling_rate=sampling_rate,
            norm="slaney",
            mel_scale="slaney",
        )

    def _np_extract_fbank_features(self, waveform_batch: np.array, device: str) -> np.ndarray:
        """
        Compute the log-mel spectrogram of the provided audio, gives similar results to Whisper's original torch
        implementation with 1e-5 tolerance.
        """
        if device != "cpu":
            raise ValueError(
                f"Got device `{device}` for feature extraction, but feature extraction on CUDA accelerator "
                "devices requires torch, which is not installed. Either set `device='cpu'`, or "
                "install torch according to the official instructions: https://pytorch.org/get-started/locally/"
            )
        log_spec_batch = []
        for waveform in waveform_batch:
            log_spec = spectrogram(
                waveform,
                window_function(self.n_fft, "hann"),
                frame_length=self.n_fft,
                hop_length=self.hop_length,
                power=2.0,
                dither=self.dither,
                mel_filters=self.mel_filters,
                log_mel="log10",
            )
            log_spec = log_spec[:, :-1]
            log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
            log_spec = (log_spec + 4.0) / 4.0
            log_spec_batch.append(log_spec)
        log_spec_batch = np.array(log_spec_batch)
        return log_spec_batch

    def _torch_extract_fbank_features(self, waveform: np.array, device: str = "cpu") -> np.ndarray:
        """
        Compute the log-mel spectrogram of the audio using PyTorch's GPU-accelerated STFT implementation with batching,
        yielding results similar to cpu computing with 1e-5 tolerance.
        """
        waveform = torch.from_numpy(waveform).to(device, torch.float32)
        window = torch.hann_window(self.n_fft, device=device)

        # Note: it would be better to dither the chunked waveform,
        # so overlapping signal does not get the same dithering.
        # But, chunking is happening inside pytorch, so it is here.
        if self.dither != 0.0:
            waveform += self.dither * torch.randn(waveform.shape, dtype=waveform.dtype, device=waveform.device)

        stft = torch.stft(waveform, self.n_fft, self.hop_length, window=window, return_complex=True)
        magnitudes = stft[..., :-1].abs() ** 2

        mel_filters = torch.from_numpy(self.mel_filters).to(device, torch.float32)
        mel_spec = mel_filters.T @ magnitudes

        log_spec = torch.clamp(mel_spec, min=1e-10).log10()
        if waveform.dim() == 2:
            max_val = log_spec.max(dim=2, keepdim=True)[0].max(dim=1, keepdim=True)[0]
            log_spec = torch.maximum(log_spec, max_val - 8.0)
        else:
            log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        if device != "cpu":
            log_spec = log_spec.detach().cpu()
        return log_spec.numpy()


    def __call__(
        self,
        raw_speech: Union[np.ndarray, list[float], list[np.ndarray], list[list[float]]],
        truncation: bool = True,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_attention_mask: Optional[bool] = None,
        padding: Optional[str] = "max_length",
        max_length: Optional[int] = None,
        sampling_rate: Optional[int] = None,
        do_normalize: Optional[bool] = None,
        device: Optional[str] = "cpu",
        return_token_timestamps: Optional[bool] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s). Implementation uses PyTorch for
        the STFT computation if available, otherwise a slower NumPy based one.

        Args:
            raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
                values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
                stereo, i.e. single float per timestep.
            truncation (`bool`, *optional*, default to `True`):
                Activates truncation to cut input sequences longer than *max_length* to *max_length*.
            pad_to_multiple_of (`int`, *optional*, defaults to None):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific feature_extractor's default.

                [What are attention masks?](../glossary#attention-mask)

                <Tip>

                For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
                bugs.

                </Tip>

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
                pipeline.
            padding_value (`float`, *optional*, defaults to 0.0):
                The value that is used to fill the padding values / vectors.
            do_normalize (`bool`, *optional*, defaults to `False`):
                Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly
                improve the performance of the model.
            device (`str`, *optional*, defaults to `'cpu'`):
                Specifies the device for computation of the log-mel spectrogram of audio signals in the
                `_torch_extract_fbank_features` method. (e.g., "cpu", "cuda")
            return_token_timestamps (`bool`, *optional*, defaults to `None`):
                Deprecated. Use `return_attention_mask` instead from which the number of frames can be inferred.

                Whether or not to return the number of frames of the input raw_speech.
                These num_frames can be used by the model to compute word level timestamps.
        """
        if sampling_rate is not None:
            if sampling_rate != self.sampling_rate:
                raise ValueError(
                    f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
                    f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
                    f" was sampled with {self.sampling_rate} and not {sampling_rate}."
                )
        else:
            logger.warning(
                f"It is strongly recommended to pass the `sampling_rate` argument to `{self.__class__.__name__}()`. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )

        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
        if is_batched_numpy and len(raw_speech.shape) > 2:
            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
        is_batched = is_batched_numpy or (
            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float32)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float32)

        # always return batch
        if not is_batched:
            raw_speech = [np.asarray([raw_speech]).T]

        batched_speech = BatchFeature({"input_features": raw_speech})

        # convert into correct format for padding

        padded_inputs = self.pad(
            batched_speech,
            padding=padding,
            max_length=max_length if max_length else self.n_samples,
            truncation=truncation,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask or do_normalize,
        )

        # make sure list is in array format
        input_features = padded_inputs.get("input_features").transpose(2, 0, 1)

        extract_fbank_features = (
            self._torch_extract_fbank_features if is_torch_available() else self._np_extract_fbank_features
        )
        input_features = extract_fbank_features(input_features[0], device)

        if isinstance(input_features[0], list):
            padded_inputs["input_features"] = [np.asarray(feature, dtype=np.float32) for feature in input_features]

        else:
            padded_inputs["input_features"] = input_features

        if return_token_timestamps is not None:
            logger.warning_once(
                f"`return_token_timestamps` is deprecated for {self.__class__.__name__} and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it."
            )
            padded_inputs["num_frames"] = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]

        return padded_inputs


__all__ = ["WhisperFeatureExtractor"]
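
For completeness, a short usage sketch of the feature extractor on its own (the checkpoint name is an example; the input is just 5 seconds of silence): it pads to 30 seconds and returns log-mel features of shape (1, 80, 3000).

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

audio = np.zeros(5 * 16000, dtype=np.float32)             # 5 s of silence at 16 kHz
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)                       # (1, 80, 3000): padded to 30 s
```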