
Notes on Whisper Audio Processing

Kevin 吴嘉文 · About 10 min · Knowledge Notes · NLP · AIGC · LLM · Agent

Audio Formats and Encoding

  1. What is sound?
  • Physically: vibrations of air molecules → a sound-pressure wave that varies over time.
  • Analog signal: a continuous waveform, continuous in both time and amplitude.

But computers can only work with discrete numbers, so the signal must be digitized via "sampling + quantization".


  2. Sampling and quantization (the first step of digitization)

The nature of sound: a continuous analog signal

  • Speech, music, and other sounds are air-pressure waves that vary continuously over time.
  • This gives an analog signal curve (time-continuous, amplitude-continuous).

Microphone: sound wave → voltage signal

  • The microphone diaphragm vibrates with the sound pressure, turning changes in air pressure into changes in voltage.
  • The voltage signal is still a continuous analog signal, e.g. 0.1 V → 0.15 V → 0.2 V ….

Sampling: discretizing time

  • An analog-to-digital converter (ADC) reads the instantaneous voltage at fixed intervals.
  • Sampling rate (f_s) = the number of points taken per second.
  • CD audio: 44.1 kHz ⇒ 44,100 points per second;
  • Speech recognition: 16 kHz ⇒ 16,000 points per second.

According to the Nyquist sampling theorem, the sampling rate must be ≥ 2 × the highest frequency in the signal to allow distortion-free reconstruction. Human hearing tops out around 20 kHz ⇒ CDs use 44.1 kHz (slightly above 2 × 20 kHz).
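
As a quick check of the Nyquist condition, here is a minimal NumPy sketch (the tone frequencies are arbitrary examples): a 3 kHz tone sampled at 16 kHz keeps its spectral peak, while a 9 kHz tone, which lies above the 8 kHz Nyquist limit, aliases down to 16 − 9 = 7 kHz.

```python
import numpy as np

fs = 16_000                      # sampling rate (Hz)
t = np.arange(fs) / fs           # 1 second of sample times

for f in (3_000, 9_000):         # 9 kHz is above the 8 kHz Nyquist limit
    x = np.sin(2 * np.pi * f * t)
    spectrum = np.abs(np.fft.rfft(x))
    peak_hz = np.argmax(spectrum) * fs / len(x)
    print(f"{f} Hz tone -> spectral peak at {peak_hz:.0f} Hz")

# Expected output (approximately):
#   3000 Hz tone -> spectral peak at 3000 Hz
#   9000 Hz tone -> spectral peak at 7000 Hz   (aliased: 16000 - 9000)
```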

Quantization: discretizing amplitude

  • The voltage at each sample point is a continuous value; the computer must represent it with a finite number of bits.
  • Quantization bit depth:
  • 8-bit ⇒ 256 levels;
  • 16-bit ⇒ 65,536 levels (CD quality);
  • 24-bit ⇒ 16,777,216 levels (common in recording studios).

Example:

  • True voltage = 0.12345 V
  • With 16-bit quantization ⇒ it might be stored as the nearest discrete level, e.g. "0.12340 V".
  • The higher the bit depth ⇒ the closer the reconstruction is to the original, and the smaller the quantization noise (error).
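
A minimal sketch of that round trip, assuming a signal range of −1 V to +1 V and a simple uniform quantizer (not any particular ADC): quantize the value to the nearest level and compare the error across bit depths.

```python
import numpy as np

def quantize(x, bits, full_scale=1.0):
    """Round x to the nearest level of a signed uniform quantizer over [-full_scale, +full_scale]."""
    step = full_scale / 2 ** (bits - 1)   # e.g. 1/32768 V per step for 16-bit
    return np.round(x / step) * step

x = 0.12345                               # "true" voltage
for bits in (8, 16, 24):
    q = quantize(x, bits)
    print(f"{bits:2d}-bit: stored as {q:.8f} V (error {abs(q - x):.2e} V)")
# Higher bit depth -> smaller quantization error.
```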

Encoding: writing to a file

  • After sampling + quantization you have a sequence of integers (PCM data).
  • This is packed into a file container (e.g. WAV, AIFF) together with its parameters: sampling rate, bit depth, and channel count.

Example (16 kHz, 16-bit, mono, 1 second of audio):

  • Total samples = 16,000;
  • Each sample is 16 bits = 2 bytes;
  • File size ≈ 16,000 × 2 = 32 KB.
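
To make the arithmetic concrete, here is a small sketch using NumPy and Python's standard-library wave module (the 440 Hz tone and the file name tone.wav are just placeholders): it synthesizes 1 second of audio at 16 kHz, quantizes it to 16-bit PCM, and writes a mono WAV whose PCM payload is the expected 32,000 bytes (plus a 44-byte header on disk).

```python
import wave
import numpy as np

fs = 16_000                                   # sampling rate (Hz)
t = np.arange(fs) / fs                        # 1 second
signal = 0.5 * np.sin(2 * np.pi * 440 * t)    # 440 Hz tone, amplitude 0.5

# Quantize float samples in [-1, 1] to 16-bit signed integers (PCM).
pcm = (signal * 32767).astype(np.int16)

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit = 2 bytes per sample
    f.setframerate(fs)       # 16 kHz
    f.writeframes(pcm.tobytes())

print(pcm.nbytes)            # 32000 bytes of PCM data (~32 KB + WAV header on disk)
```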

  3. Audio file formats (File Format)

An audio file format generally has two parts: a container format (file container) + a codec.

  1. Container format
  • The file's outer shell: it stores the audio data plus metadata such as sampling rate, channel count, and bitrate.
  • Common containers: WAV, AIFF, MP4, MKV.
  2. Codec (Coder-Decoder)
  • Determines how the audio data is compressed and stored.
  • Falls into uncompressed, lossless compression, and lossy compression.
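
As an illustration, the container metadata can be read with ffprobe, which ships with ffmpeg; this sketch assumes ffprobe is on the PATH and uses a placeholder file name, and the exact fields returned depend on the file.

```python
import json
import subprocess

# Ask ffprobe for the container/stream metadata as JSON.
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_format", "-show_streams", "audio.mp3"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)
stream = info["streams"][0]
print(stream["codec_name"], stream["sample_rate"], stream["channels"])
```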

  4. Audio encoding methods (Encoding)

Uncompressed

  • PCM: the most common; almost all audio processing converts to PCM first.
  • Formats: WAV (Windows), AIFF (Mac).
  • Pros: highest quality; cons: large files.

Lossless compression

  • Compresses without discarding information; decoding reproduces the original PCM exactly.
  • Common: FLAC, ALAC (Apple), APE.
  • Pros: roughly half the size of uncompressed; cons: still fairly large.

Lossy compression

  • Discards information the human ear is insensitive to, compressing with a psychoacoustic model.
  • Common: MP3, AAC, OGG Vorbis, Opus.
  • Pros: small size, well suited to transmission; cons: irreversible, quality depends on the bitrate.
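
A rough sketch of moving between these categories with the ffmpeg CLI (file names are placeholders; ffmpeg must be installed): the FLAC output can be decoded back to the exact original PCM, whereas the 192 kbps MP3 cannot.

```python
import subprocess

# Lossless: WAV (PCM) -> FLAC; decoding the FLAC gives back the exact PCM samples.
subprocess.run(["ffmpeg", "-y", "-i", "input.wav", "-c:a", "flac", "output.flac"], check=True)

# Lossy: WAV (PCM) -> MP3 at 192 kbps; information discarded here cannot be recovered.
subprocess.run(["ffmpeg", "-y", "-i", "input.wav", "-b:a", "192k", "output.mp3"], check=True)
```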

  5. Quick reference: common audio formats

| Format | Encoding | Characteristics | Typical use |
| --- | --- | --- | --- |
| WAV | PCM (uncompressed) | High fidelity, large files | Professional recording, audio processing |
| AIFF | PCM (uncompressed) | Apple's standard, similar to WAV | Professional audio workstations |
| FLAC | Lossless | ~50% smaller, no quality loss | Hi-fi music collections |
| ALAC | Lossless | Apple's FLAC | iTunes, iOS |
| MP3 | Lossy | Most popular, widely compatible | Music streaming |
| AAC | Lossy | More efficient than MP3 | YouTube, iTunes, iPhone |
| OGG (Vorbis) | Lossy | Open-source MP3 alternative | Games, open-source apps |
| Opus | Lossy | Adaptive bitrate, very low latency | Zoom, Discord, WebRTC |

  6. Key audio parameters
  • Sampling rate: determines the frequency range (8 kHz sounds like a telephone, 44.1 kHz is CD, 48 kHz is video).
  • Bitrate: the amount of data per second (kbps).
  • Common MP3 bitrates: 128 kbps, 192 kbps, 320 kbps;
  • At the same bitrate, AAC/Opus usually sound better.
  • Channels: mono, stereo, surround (5.1, 7.1).
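
For uncompressed PCM these parameters fully determine the bitrate: bitrate = sampling rate × bit depth × channels. A quick sanity check with the values above:

```python
def pcm_bitrate_kbps(sample_rate, bit_depth, channels):
    """Uncompressed PCM bitrate in kbit/s."""
    return sample_rate * bit_depth * channels / 1000

print(pcm_bitrate_kbps(44_100, 16, 2))   # CD audio: 1411.2 kbps
print(pcm_bitrate_kbps(16_000, 16, 1))   # 16 kHz mono speech: 256.0 kbps
```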

  7. Codec pipeline

Encoding

  • Analog → digital (sampling + quantization) → compression (optional) → stored as a file.

Decoding

  • File → decompression → PCM waveform → sound card → analog signal → headphones/speakers.

  8. Signals, the Fourier transform, and Mel spectrograms; see this article: https://zhuanlan.zhihu.com/p/198900624

Whisper Audio Encoder

Following the source code behind from transformers import WhisperProcessor, WhisperForConditionalGeneration, Whisper applies the preprocessing steps below to raw speech (an end-to-end usage sketch follows this list).

  1. The audio file is usually first converted to a NumPy array. Only mono-channel audio is supported; the resulting NumPy array has a shape like (60075,).

    import numpy as np
    import subprocess
    
    def ffmpeg_read(bpayload: bytes, sampling_rate: int) -> np.array:
        """
        Helper function to read an audio file through ffmpeg.
        """
        ar = f"{sampling_rate}"
        ac = "1"
        format_for_conversion = "f32le"
        ffmpeg_command = [
            "ffmpeg",
            "-i",
            "pipe:0",
            "-ac",
            ac,
            "-ar",
            ar,
            "-f",
            format_for_conversion,
            "-hide_banner",
            "-loglevel",
            "quiet",
            "pipe:1",
        ]
    
        try:
            with subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE) as ffmpeg_process:
                output_stream = ffmpeg_process.communicate(bpayload)
        except FileNotFoundError as error:
            raise ValueError("ffmpeg was not found but is required to load audio files from filename") from error
        out_bytes = output_stream[0]
        audio = np.frombuffer(out_bytes, np.float32)
        if audio.shape[0] == 0:
            raise ValueError(
                "Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has "
                "a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote "
                "URL, ensure that the URL is the full address to  **download**  the audio file."
            )
        return audio
    
  2. By default, batched inputs are padded to max_length.

    • In the WhisperProcessor, preprocessing pads according to self.n_samples = chunk_length * sampling_rate, where chunk_length=30 and sampling_rate=16000.
  3. Compute the log-mel spectrogram from the input.

    1. Split the input waveform into frames of size frame_length; adjacent frames overlap by frame_length - hop_length samples.

      # split waveform into frames of frame_length size
      num_frames = int(1 + np.floor((waveform.size - frame_length) / hop_length))

      num_frequency_bins = (fft_length // 2) + 1 if onesided else fft_length
      spectrogram = np.empty((num_frames, num_frequency_bins), dtype=np.complex64)
      # num_frequency_bins = 201; num_frames depends on the input length
      
    2. Each frame is multiplied by a window function and placed into a buffer of size fft_length.

      spectrogram = np.empty((num_batches, num_frames, num_frequency_bins), dtype=np.complex64)
      
      # rfft is faster than fft
      fft_func = np.fft.rfft if onesided else np.fft.fft
      buffer = np.zeros((num_batches, fft_length))
      
      for frame_idx in range(num_frames):
          timestep = frame_idx * hop_length
          # frame_length = 400, hop_length = 160
          buffer[:, :frame_length] = padded_waveform_batch[:, timestep : timestep + frame_length]
          buffer[:, :frame_length] *= window
         
      
    3. Apply the discrete Fourier transform (DFT) to each windowed frame.

      spectrogram[:, frame_idx] = fft_func(buffer)
      if mel_filters is not None:
          spectrogram = np.maximum(mel_floor, np.dot(mel_filters.T, spectrogram))
          # mel_filters.shape = (201, 80); the spectrogram is transposed so its
          # frequency axis (201) comes first before this step, giving 80 mel bins per frame
      
    4. Stack the results to obtain the spectrogram, with a shape like [2, 80, 3000].
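
Putting the steps together, here is a hedged end-to-end sketch (the checkpoint openai/whisper-tiny and the file sample.wav are placeholders; it assumes transformers, torch, and ffmpeg are installed and reuses the ffmpeg_read helper from step 1): read the file into a mono 16 kHz float array, run the processor, check the (1, 80, 3000) log-mel features, and decode.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Step 1: decode the file to a mono 16 kHz float32 array (ffmpeg_read as defined above).
with open("sample.wav", "rb") as f:
    audio = ffmpeg_read(f.read(), sampling_rate=16000)    # shape like (60075,)

# Steps 2-3: pad/truncate to 30 s and compute the log-mel spectrogram.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)                        # torch.Size([1, 80, 3000])

# The encoder consumes the log-mel features; the decoder generates the transcript.
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```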

For reference, the Whisper preprocessor from Hugging Face transformers (WhisperFeatureExtractor):

from typing import Optional, Union

import numpy as np

from ... import is_torch_available
from ...audio_utils import mel_filter_bank, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, logging


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class WhisperFeatureExtractor(SequenceFeatureExtractor):
    r"""
    Constructs a Whisper feature extractor.

    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
    most of the main methods. Users should refer to this superclass for more information regarding those methods.

    This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the `Short Time
    Fourier Transform` which should match pytorch's `torch.stft` equivalent.

    Args:
        feature_size (`int`, *optional*, defaults to 80):
            The feature dimension of the extracted features.
        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
        hop_length (`int`, *optional*, defaults to 160):
            Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients.
        chunk_length (`int`, *optional*, defaults to 30):
            The maximum number of chunks of `sampling_rate` samples used to trim and pad longer or shorter audio
            sequences.
        n_fft (`int`, *optional*, defaults to 400):
            Size of the Fourier transform.
        padding_value (`float`, *optional*, defaults to 0.0):
            Padding value used to pad the audio. Should correspond to silences.
        dither (`float`, *optional*, defaults to 0.0):
            Adds dithering. In other words, adds a small Gaussian noise to each frame.
            E.g. use 0.0001 to add dithering with a normal distribution centered
            around 0.0 with standard deviation 0.0001 (assuming [-1,+1] range of raw_speech).
            The value 0.0 means no dithering.
            Dithering has similar effect as `spectrogram(mel_floor=...)`. It reduces
            the high log_mel_fbank values for signals with hard-zero sections,
            when VAD cutoff is present in the signal.
    """

    model_input_names = ["input_features"]

    def __init__(
        self,
        feature_size=80,
        sampling_rate=16000,
        hop_length=160,
        chunk_length=30,
        n_fft=400,
        padding_value=0.0,
        dither=0.0,
        return_attention_mask=False,  # pad inputs to max length with silence token (zero) and no attention mask
        **kwargs,
    ):
        super().__init__(
            feature_size=feature_size,
            sampling_rate=sampling_rate,
            padding_value=padding_value,
            return_attention_mask=return_attention_mask,
            **kwargs,
        )
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.chunk_length = chunk_length
        self.n_samples = chunk_length * sampling_rate
        self.nb_max_frames = self.n_samples // hop_length
        self.sampling_rate = sampling_rate
        self.dither = dither
        self.mel_filters = mel_filter_bank(
            num_frequency_bins=1 + n_fft // 2,
            num_mel_filters=feature_size,
            min_frequency=0.0,
            max_frequency=8000.0,
            sampling_rate=sampling_rate,
            norm="slaney",
            mel_scale="slaney",
        )

    def _np_extract_fbank_features(self, waveform_batch: np.array, device: str) -> np.ndarray:
        """
        Compute the log-mel spectrogram of the provided audio, gives similar results to Whisper's original torch
        implementation with 1e-5 tolerance.
        """
        if device != "cpu":
            raise ValueError(
                f"Got device `{device}` for feature extraction, but feature extraction on CUDA accelerator "
                "devices requires torch, which is not installed. Either set `device='cpu'`, or "
                "install torch according to the official instructions: https://pytorch.org/get-started/locally/"
            )
        log_spec_batch = []
        for waveform in waveform_batch:
            log_spec = spectrogram(
                waveform,
                window_function(self.n_fft, "hann"),
                frame_length=self.n_fft,
                hop_length=self.hop_length,
                power=2.0,
                dither=self.dither,
                mel_filters=self.mel_filters,
                log_mel="log10",
            )
            log_spec = log_spec[:, :-1]
            log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
            log_spec = (log_spec + 4.0) / 4.0
            log_spec_batch.append(log_spec)
        log_spec_batch = np.array(log_spec_batch)
        return log_spec_batch

    def _torch_extract_fbank_features(self, waveform: np.array, device: str = "cpu") -> np.ndarray:
        """
        Compute the log-mel spectrogram of the audio using PyTorch's GPU-accelerated STFT implementation with batching,
        yielding results similar to cpu computing with 1e-5 tolerance.
        """
        waveform = torch.from_numpy(waveform).to(device, torch.float32)
        window = torch.hann_window(self.n_fft, device=device)

        # Note: it would be better to dither the chunked waveform,
        # so overlapping signal does not get the same dithering.
        # But, chunking is happening inside pytorch, so it is here.
        if self.dither != 0.0:
            waveform += self.dither * torch.randn(waveform.shape, dtype=waveform.dtype, device=waveform.device)

        stft = torch.stft(waveform, self.n_fft, self.hop_length, window=window, return_complex=True)
        magnitudes = stft[..., :-1].abs() ** 2

        mel_filters = torch.from_numpy(self.mel_filters).to(device, torch.float32)
        mel_spec = mel_filters.T @ magnitudes

        log_spec = torch.clamp(mel_spec, min=1e-10).log10()
        if waveform.dim() == 2:
            max_val = log_spec.max(dim=2, keepdim=True)[0].max(dim=1, keepdim=True)[0]
            log_spec = torch.maximum(log_spec, max_val - 8.0)
        else:
            log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        if device != "cpu":
            log_spec = log_spec.detach().cpu()
        return log_spec.numpy()


    def __call__(
        self,
        raw_speech: Union[np.ndarray, list[float], list[np.ndarray], list[list[float]]],
        truncation: bool = True,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_attention_mask: Optional[bool] = None,
        padding: Optional[str] = "max_length",
        max_length: Optional[int] = None,
        sampling_rate: Optional[int] = None,
        do_normalize: Optional[bool] = None,
        device: Optional[str] = "cpu",
        return_token_timestamps: Optional[bool] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s). Implementation uses PyTorch for
        the STFT computation if available, otherwise a slower NumPy based one.

        Args:
            raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
                values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
                stereo, i.e. single float per timestep.
            truncation (`bool`, *optional*, default to `True`):
                Activates truncation to cut input sequences longer than *max_length* to *max_length*.
            pad_to_multiple_of (`int`, *optional*, defaults to None):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific feature_extractor's default.

                [What are attention masks?](../glossary#attention-mask)

                <Tip>

                For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
                bugs.

                </Tip>

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
                pipeline.
            padding_value (`float`, *optional*, defaults to 0.0):
                The value that is used to fill the padding values / vectors.
            do_normalize (`bool`, *optional*, defaults to `False`):
                Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly
                improve the performance of the model.
            device (`str`, *optional*, defaults to `'cpu'`):
                Specifies the device for computation of the log-mel spectrogram of audio signals in the
                `_torch_extract_fbank_features` method. (e.g., "cpu", "cuda")
            return_token_timestamps (`bool`, *optional*, defaults to `None`):
                Deprecated. Use `return_attention_mask` instead from which the number of frames can be inferred.

                Whether or not to return the number of frames of the input raw_speech.
                These num_frames can be used by the model to compute word level timestamps.
        """
        if sampling_rate is not None:
            if sampling_rate != self.sampling_rate:
                raise ValueError(
                    f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
                    f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
                    f" was sampled with {self.sampling_rate} and not {sampling_rate}."
                )
        else:
            logger.warning(
                f"It is strongly recommended to pass the `sampling_rate` argument to `{self.__class__.__name__}()`. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )

        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
        if is_batched_numpy and len(raw_speech.shape) > 2:
            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
        is_batched = is_batched_numpy or (
            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float32)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float32)

        # always return batch
        if not is_batched:
            raw_speech = [np.asarray([raw_speech]).T]

        batched_speech = BatchFeature({"input_features": raw_speech})

        # convert into correct format for padding

        padded_inputs = self.pad(
            batched_speech,
            padding=padding,
            max_length=max_length if max_length else self.n_samples,
            truncation=truncation,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask or do_normalize,
        )

        # make sure list is in array format
        input_features = padded_inputs.get("input_features").transpose(2, 0, 1)

        extract_fbank_features = (
            self._torch_extract_fbank_features if is_torch_available() else self._np_extract_fbank_features
        )
        input_features = extract_fbank_features(input_features[0], device)

        if isinstance(input_features[0], list):
            padded_inputs["input_features"] = [np.asarray(feature, dtype=np.float32) for feature in input_features]

        else:
            padded_inputs["input_features"] = input_features

        if return_token_timestamps is not None:
            logger.warning_once(
                f"`return_token_timestamps` is deprecated for {self.__class__.__name__} and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it."
            )
            padded_inputs["num_frames"] = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]

        return padded_inputs


__all__ = ["WhisperFeatureExtractor"]
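
For completeness, a short usage sketch of the feature extractor on its own (the checkpoint name is an example; the input is just 5 seconds of silence): it pads to 30 seconds and returns log-mel features of shape (1, 80, 3000).

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

audio = np.zeros(5 * 16000, dtype=np.float32)             # 5 s of silence at 16 kHz
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)                       # (1, 80, 3000): padded to 30 s
```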