Notes on Whisper Audio Processing
Audio Formats and Encoding
- What is sound?
- Physical level: vibrations of air molecules → a wave of sound pressure changing over time.
- Analog signal: a continuous waveform, continuous in both time and amplitude.
But computers can only handle discrete numbers, so the signal has to be digitized via "sampling + quantization".
- Sampling and quantization (the first step of digitization)
Sound is fundamentally a continuous analog signal
- Speech, music, and other sounds are waves in which air pressure varies continuously over time.
- This is an analog signal curve (time-continuous, amplitude-continuous).
Microphone: sound wave → voltage signal
- The microphone diaphragm vibrates with the sound pressure, converting changes in air pressure into changes in voltage.
- The voltage signal is still a continuous analog signal, e.g. 0.1 V → 0.15 V → 0.2 V ….
Sampling: discretizing time
- An analog-to-digital converter (ADC) reads the instantaneous voltage at regular intervals.
- Sampling rate = how many samples are taken per second.
- CD audio: 44.1 kHz ⇒ 44,100 samples per second;
- Speech recognition: 16 kHz ⇒ 16,000 samples per second.
According to the Nyquist sampling theorem, the sampling rate must be at least 2 × the highest frequency in the signal for distortion-free reconstruction. Human hearing extends to roughly 20 kHz ⇒ CDs use 44.1 kHz (slightly above 2 × 20 kHz).
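A minimal NumPy sketch of sampling, assuming a pure 440 Hz tone stands in for the analog signal:

```python
import numpy as np

sampling_rate = 16_000                        # 16 kHz, as used for speech recognition
t = np.arange(sampling_rate) / sampling_rate  # sample instants for 1 second
samples = np.sin(2 * np.pi * 440 * t)         # 440 Hz tone, values in [-1, 1]

print(samples.shape)                          # (16000,): 16,000 samples per second
# Nyquist: a 16 kHz sampling rate can only represent content below 8 kHz.
```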
Quantization: discretizing amplitude
- The voltage at each sample point is continuous; the computer has to represent it with a finite number of bits.
- Quantization bit depth:
- 8-bit ⇒ 256 levels;
- 16-bit ⇒ 65,536 levels (CD quality);
- 24-bit ⇒ 16,777,216 levels (common in recording studios).
Example:
- True voltage = 0.12345 V
- With 16-bit quantization ⇒ it might be stored as the discrete level corresponding to "0.12340 V".
- The higher the bit depth ⇒ the closer the reconstruction is to the original, and the smaller the quantization error (noise).
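A small sketch of 16-bit quantization and the error it introduces (the signal here is an illustrative sine, not real audio):

```python
import numpy as np

signal = np.sin(2 * np.pi * 440 * np.arange(16_000) / 16_000)   # floats in [-1, 1]

# Quantize to 16-bit signed integers (65,536 levels), then map back to floats.
quantized = np.round(signal * 32767).astype(np.int16)
reconstructed = quantized / 32767.0

print(np.max(np.abs(signal - reconstructed)))   # ~1.5e-5, the quantization noise floor
```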
Encoding: writing to a file
- After sampling + quantization, you have a sequence of integers (PCM data).
- This is packed into a file container (e.g. WAV, AIFF) along with parameters: sampling rate, bit depth, channel count.
Example (16 kHz, 16-bit, mono, 1 second of audio):
- Total samples = 16,000;
- 16 bits per sample = 2 bytes;
- File size ≈ 16,000 × 2 = 32 KB.
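That estimate can be checked by writing raw PCM into a WAV container with Python's standard-library `wave` module (the file name is arbitrary):

```python
import wave
import numpy as np

sampling_rate = 16_000
samples = np.sin(2 * np.pi * 440 * np.arange(sampling_rate) / sampling_rate)  # 1 s tone
pcm = (samples * 32767).astype(np.int16)       # 16-bit quantization

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)                          # mono
    f.setsampwidth(2)                          # 16 bit = 2 bytes per sample
    f.setframerate(sampling_rate)              # 16 kHz
    f.writeframes(pcm.tobytes())               # 16,000 x 2 = 32,000 bytes of PCM data
```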
- Audio file formats (File Format)
An audio file format generally has two parts: a container format (file container) + an encoding format (codec).
- Container
- Acts as the file's shell: it holds the data plus metadata such as sampling rate, channel count, and bitrate.
- Common containers: WAV, AIFF, MP4, MKV.
- Codec (Coder-Decoder)
- Determines how the audio data is compressed and stored.
- Falls into uncompressed, lossless, and lossy compression.
- Audio encoding methods (Encoding)
Uncompressed
- PCM: the most common; almost all audio processing first converts to PCM.
- Formats: WAV (Windows), AIFF (Mac).
- Pros: highest quality; cons: large files.
Lossless compression
- Compressed without losing information; decodes back to the original PCM exactly.
- Common: FLAC, ALAC (Apple), APE.
- Pros: roughly half the size of uncompressed; cons: still fairly large.
Lossy compression
- Discards information the human ear is insensitive to, using a psychoacoustic model.
- Common: MP3, AAC, OGG Vorbis, Opus.
- Pros: small files, good for transmission; cons: irreversible, and quality depends on the bitrate.
- Quick reference for common audio formats
| Format | Encoding | Characteristics | Typical use |
| --- | --- | --- | --- |
| WAV | PCM (uncompressed) | High fidelity, large files | Professional recording, audio processing |
| AIFF | PCM (uncompressed) | Apple's standard, similar to WAV | Professional audio workstations |
| FLAC | Lossless | ~50% smaller, no quality loss | Hi-fi music collections |
| ALAC | Lossless | Apple's counterpart to FLAC | iTunes, iOS |
| MP3 | Lossy | Most popular, widely compatible | Music streaming |
| AAC | Lossy | More efficient than MP3 | YouTube, iTunes, iPhone |
| OGG (Vorbis) | Lossy | Open-source MP3 alternative | Games, open-source apps |
| Opus | Lossy | Variable bitrate, very low latency | Zoom, Discord, WebRTC |
- Key audio parameters
- Sampling rate: determines the frequency range (8 kHz sounds like a telephone, 44.1 kHz for CD, 48 kHz for video).
- Bitrate: the amount of data per second (kbps); see the quick calculation after this list.
- e.g. MP3 commonly uses 128 kbps, 192 kbps, or 320 kbps;
- AAC/Opus usually sound better at the same bitrate.
- Channels: mono, stereo, surround (5.1, 7.1).
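As a rough back-of-the-envelope comparison of uncompressed PCM versus a typical MP3 bitrate:

```python
# Uncompressed CD-quality PCM: sampling_rate * bit_depth * channels
cd_bitrate = 44_100 * 16 * 2        # = 1,411,200 bit/s, i.e. ~1411 kbps
mp3_bitrate = 128_000               # a typical 128 kbps MP3
print(cd_bitrate / mp3_bitrate)     # ~11x less data after lossy compression
```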
- Codec Pipeline
Encoding:
- analog → digital (sampling + quantization) → compression (optional) → stored as a file.
Decoding:
- file → decompression → PCM waveform → sound card → analog signal → headphones/speakers.
- Signals, the Fourier transform, and Mel spectrograms: see https://zhuanlan.zhihu.com/p/198900624
Whisper Audio Encoder
Following the source code behind `from transformers import WhisperProcessor, WhisperForConditionalGeneration`, Whisper applies the following preprocessing to raw speech.
The audio file usually has to be converted to a NumPy array first. Only mono-channel audio is supported, and the resulting NumPy array has a shape like `(60075,)`.

```python
import numpy as np
import subprocess


def ffmpeg_read(bpayload: bytes, sampling_rate: int) -> np.array:
    """
    Helper function to read an audio file through ffmpeg.
    """
    ar = f"{sampling_rate}"
    ac = "1"
    format_for_conversion = "f32le"
    ffmpeg_command = [
        "ffmpeg",
        "-i",
        "pipe:0",
        "-ac",
        ac,
        "-ar",
        ar,
        "-f",
        format_for_conversion,
        "-hide_banner",
        "-loglevel",
        "quiet",
        "pipe:1",
    ]

    try:
        with subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE) as ffmpeg_process:
            output_stream = ffmpeg_process.communicate(bpayload)
    except FileNotFoundError as error:
        raise ValueError("ffmpeg was not found but is required to load audio files from filename") from error
    out_bytes = output_stream[0]
    audio = np.frombuffer(out_bytes, np.float32)
    if audio.shape[0] == 0:
        raise ValueError(
            "Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has "
            "a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote "
            "URL, ensure that the URL is the full address to **download** the audio file."
        )
    return audio
```
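A minimal usage sketch (the file path is hypothetical; any format ffmpeg can decode will do):

```python
with open("speech.mp3", "rb") as f:
    audio_bytes = f.read()

audio = ffmpeg_read(audio_bytes, sampling_rate=16000)   # mono float32 waveform at 16 kHz
print(audio.shape, audio.dtype)                         # e.g. (60075,) float32
```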
By default, during batched processing, inputs are padded according to max-length.
- In the Whisper Processor, preprocessing pads according to `self.n_samples = chunk_length * sampling_rate`, where `chunk_length=30` and `sampling_rate=16000`.
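In other words, every example is brought to a fixed length of 30 s × 16,000 Hz = 480,000 samples. A rough sketch of what that padding amounts to (not the processor's actual code path):

```python
import numpy as np

n_samples = 30 * 16_000                               # 480,000 samples per 30-second chunk
audio = np.zeros(60_075, dtype=np.float32)            # e.g. a ~3.75 s clip

padded = np.pad(audio, (0, n_samples - len(audio)))   # pad the tail with zeros (silence)
print(padded.shape)                                   # (480000,)
```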
The log-mel spectrogram is then computed from the padded input.
The input waveform is split into a series of frames of size `frame_length`; adjacent frames overlap by `frame_length - hop_length` samples.

```python
# split waveform into frames of frame_length size
num_frames = int(1 + np.floor((waveform.size - frame_length) / hop_length))

num_frequency_bins = (fft_length // 2) + 1 if onesided else fft_length
spectrogram = np.empty((num_frames, num_frequency_bins), dtype=np.complex64)
# num_frequency_bins = 201, num_frames depends on the input length
```
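Plugging Whisper's numbers into that formula (a worked example; the exact count also depends on the center padding applied inside the STFT):

```python
import numpy as np

frame_length = 400       # n_fft = 400 samples = 25 ms at 16 kHz
hop_length = 160         # 10 ms hop
waveform_size = 480_000  # one padded 30-second chunk

num_frames = int(1 + np.floor((waveform_size - frame_length) / hop_length))
print(num_frames)  # 2998 without centering; with center padding the STFT yields
                   # 3001 frames, and the last one is dropped, leaving 3000
```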
Each frame is multiplied by a window function and then placed into a buffer of size `fft_length`.

```python
spectrogram = np.empty((num_batches, num_frames, num_frequency_bins), dtype=np.complex64)
# rfft is faster than fft
fft_func = np.fft.rfft if onesided else np.fft.fft

buffer = np.zeros((num_batches, fft_length))
for frame_idx in range(num_frames):
    timestep = frame_idx * hop_length
    # frame_length = 400, hop_length = 160
    buffer[:, :frame_length] = padded_waveform_batch[:, timestep : timestep + frame_length]
    buffer[:, :frame_length] *= window
```
A discrete Fourier transform (DFT) is applied to each windowed frame, and the result is mapped onto the mel scale with a filter bank.

```python
    # inside the frame loop above
    spectrogram[:, frame_idx] = fft_func(buffer)

if mel_filters is not None:
    spectrogram = np.maximum(mel_floor, np.dot(mel_filters.T, spectrogram))
    # mel_filters.shape=(201, 80)
    # spectrogram.shape=(x, 201)
```
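The mel filter bank itself is built once in the extractor's `__init__` (see the listing below); a sketch of the equivalent standalone call via `transformers.audio_utils`:

```python
from transformers.audio_utils import mel_filter_bank

mel_filters = mel_filter_bank(
    num_frequency_bins=1 + 400 // 2,   # 201 frequency bins from a 400-point FFT
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=8000.0,              # Nyquist frequency for 16 kHz audio
    sampling_rate=16000,
    norm="slaney",
    mel_scale="slaney",
)
print(mel_filters.shape)               # (201, 80)
```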
The results are stacked to obtain the spectrogram, with a shape such as `[2, 80, 3000]`.
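Those dimensions break down as a quick sanity check:

```python
batch_size = 2                             # two clips in the batch
n_mels = 80                                # number of mel filters (feature_size)
num_frames = 480_000 // 160                # 30 s at 16 kHz divided by hop_length = 3000
print([batch_size, n_mels, num_frames])    # [2, 80, 3000]
```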
For reference, the Whisper preprocessor (transformers' `WhisperFeatureExtractor`):

```python
from typing import Optional, Union
import numpy as np
from ... import is_torch_available
from ...audio_utils import mel_filter_bank, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, logging
if is_torch_available():
import torch
logger = logging.get_logger(__name__)
class WhisperFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a Whisper feature extractor.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the `Short Time
Fourier Transform` which should match pytorch's `torch.stft` equivalent.
Args:
feature_size (`int`, *optional*, defaults to 80):
The feature dimension of the extracted features.
sampling_rate (`int`, *optional*, defaults to 16000):
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
hop_length (`int`, *optional*, defaults to 160):
Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients.
chunk_length (`int`, *optional*, defaults to 30):
The maximum number of chunks of `sampling_rate` samples used to trim and pad longer or shorter audio
sequences.
n_fft (`int`, *optional*, defaults to 400):
Size of the Fourier transform.
padding_value (`float`, *optional*, defaults to 0.0):
Padding value used to pad the audio. Should correspond to silences.
dither (`float`, *optional*, defaults to 0.0):
Adds dithering. In other words, adds a small Gaussian noise to each frame.
E.g. use 0.0001 to add dithering with a normal distribution centered
around 0.0 with standard deviation 0.0001 (assuming [-1,+1] range of raw_speech).
The value 0.0 means no dithering.
Dithering has similar effect as `spectrogram(mel_floor=...)`. It reduces
the high log_mel_fbank values for signals with hard-zero sections,
when VAD cutoff is present in the signal.
"""
model_input_names = ["input_features"]
def __init__(
self,
feature_size=80,
sampling_rate=16000,
hop_length=160,
chunk_length=30,
n_fft=400,
padding_value=0.0,
dither=0.0,
return_attention_mask=False, # pad inputs to max length with silence token (zero) and no attention mask
**kwargs,
):
super().__init__(
feature_size=feature_size,
sampling_rate=sampling_rate,
padding_value=padding_value,
return_attention_mask=return_attention_mask,
**kwargs,
)
self.n_fft = n_fft
self.hop_length = hop_length
self.chunk_length = chunk_length
self.n_samples = chunk_length * sampling_rate
self.nb_max_frames = self.n_samples // hop_length
self.sampling_rate = sampling_rate
self.dither = dither
self.mel_filters = mel_filter_bank(
num_frequency_bins=1 + n_fft // 2,
num_mel_filters=feature_size,
min_frequency=0.0,
max_frequency=8000.0,
sampling_rate=sampling_rate,
norm="slaney",
mel_scale="slaney",
)
def _np_extract_fbank_features(self, waveform_batch: np.array, device: str) -> np.ndarray:
"""
Compute the log-mel spectrogram of the provided audio, gives similar results to Whisper's original torch
implementation with 1e-5 tolerance.
"""
if device != "cpu":
raise ValueError(
f"Got device `{device}` for feature extraction, but feature extraction on CUDA accelerator "
"devices requires torch, which is not installed. Either set `device='cpu'`, or "
"install torch according to the official instructions: https://pytorch.org/get-started/locally/"
)
log_spec_batch = []
for waveform in waveform_batch:
log_spec = spectrogram(
waveform,
window_function(self.n_fft, "hann"),
frame_length=self.n_fft,
hop_length=self.hop_length,
power=2.0,
dither=self.dither,
mel_filters=self.mel_filters,
log_mel="log10",
)
log_spec = log_spec[:, :-1]
log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
log_spec_batch.append(log_spec)
log_spec_batch = np.array(log_spec_batch)
return log_spec_batch
def _torch_extract_fbank_features(self, waveform: np.array, device: str = "cpu") -> np.ndarray:
"""
Compute the log-mel spectrogram of the audio using PyTorch's GPU-accelerated STFT implementation with batching,
yielding results similar to cpu computing with 1e-5 tolerance.
"""
waveform = torch.from_numpy(waveform).to(device, torch.float32)
window = torch.hann_window(self.n_fft, device=device)
# Note: it would be better to dither the chunked waveform,
# so overlapping signal does not get the same dithering.
# But, chunking is happening inside pytorch, so it is here.
if self.dither != 0.0:
waveform += self.dither * torch.randn(waveform.shape, dtype=waveform.dtype, device=waveform.device)
stft = torch.stft(waveform, self.n_fft, self.hop_length, window=window, return_complex=True)
magnitudes = stft[..., :-1].abs() ** 2
mel_filters = torch.from_numpy(self.mel_filters).to(device, torch.float32)
mel_spec = mel_filters.T @ magnitudes
log_spec = torch.clamp(mel_spec, min=1e-10).log10()
if waveform.dim() == 2:
max_val = log_spec.max(dim=2, keepdim=True)[0].max(dim=1, keepdim=True)[0]
log_spec = torch.maximum(log_spec, max_val - 8.0)
else:
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
if device != "cpu":
log_spec = log_spec.detach().cpu()
return log_spec.numpy()
def __call__(
self,
raw_speech: Union[np.ndarray, list[float], list[np.ndarray], list[list[float]]],
truncation: bool = True,
pad_to_multiple_of: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
return_attention_mask: Optional[bool] = None,
padding: Optional[str] = "max_length",
max_length: Optional[int] = None,
sampling_rate: Optional[int] = None,
do_normalize: Optional[bool] = None,
device: Optional[str] = "cpu",
return_token_timestamps: Optional[bool] = None,
**kwargs,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s). Implementation uses PyTorch for
the STFT computation if available, otherwise a slower NumPy based one.
Args:
raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`):
The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
stereo, i.e. single float per timestep.
truncation (`bool`, *optional*, default to `True`):
Activates truncation to cut input sequences longer than *max_length* to *max_length*.
pad_to_multiple_of (`int`, *optional*, defaults to None):
If set will pad the sequence to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
`>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
return_attention_mask (`bool`, *optional*):
Whether to return the attention mask. If left to the default, will return the attention mask according
to the specific feature_extractor's default.
[What are attention masks?](../glossary#attention-mask)
<Tip>
For Whisper models, `attention_mask` should always be passed for batched inference, to avoid subtle
bugs.
</Tip>
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
sampling_rate (`int`, *optional*):
The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
pipeline.
padding_value (`float`, *optional*, defaults to 0.0):
The value that is used to fill the padding values / vectors.
do_normalize (`bool`, *optional*, defaults to `False`):
Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly
improve the performance of the model.
device (`str`, *optional*, defaults to `'cpu'`):
Specifies the device for computation of the log-mel spectrogram of audio signals in the
`_torch_extract_fbank_features` method. (e.g., "cpu", "cuda")
return_token_timestamps (`bool`, *optional*, defaults to `None`):
Deprecated. Use `return_attention_mask` instead from which the number of frames can be inferred.
Whether or not to return the number of frames of the input raw_speech.
These num_frames can be used by the model to compute word level timestamps.
"""
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
f" was sampled with {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
f"It is strongly recommended to pass the `sampling_rate` argument to `{self.__class__.__name__}()`. "
"Failing to do so can result in silent errors that might be hard to debug."
)
is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
if is_batched_numpy and len(raw_speech.shape) > 2:
raise ValueError(f"Only mono-channel audio is supported for input to {self}")
is_batched = is_batched_numpy or (
isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
)
if is_batched:
raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
elif not is_batched and not isinstance(raw_speech, np.ndarray):
raw_speech = np.asarray(raw_speech, dtype=np.float32)
elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
raw_speech = raw_speech.astype(np.float32)
# always return batch
if not is_batched:
raw_speech = [np.asarray([raw_speech]).T]
batched_speech = BatchFeature({"input_features": raw_speech})
# convert into correct format for padding
padded_inputs = self.pad(
batched_speech,
padding=padding,
max_length=max_length if max_length else self.n_samples,
truncation=truncation,
pad_to_multiple_of=pad_to_multiple_of,
return_attention_mask=return_attention_mask or do_normalize,
)
# make sure list is in array format
input_features = padded_inputs.get("input_features").transpose(2, 0, 1)
extract_fbank_features = (
self._torch_extract_fbank_features if is_torch_available() else self._np_extract_fbank_features
)
input_features = extract_fbank_features(input_features[0], device)
if isinstance(input_features[0], list):
padded_inputs["input_features"] = [np.asarray(feature, dtype=np.float32) for feature in input_features]
else:
padded_inputs["input_features"] = input_features
if return_token_timestamps is not None:
logger.warning_once(
f"`return_token_timestamps` is deprecated for {self.__class__.__name__} and will be removed in Transformers v5. Use `return_attention_mask` instead, as the number of frames can be inferred from it."
)
padded_inputs["num_frames"] = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]
return padded_inputs
__all__ = ["WhisperFeatureExtractor"]
```
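A minimal end-to-end usage sketch (random noise stands in for a real mono waveform):

```python
import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor()         # defaults: 80 mel bins, 16 kHz, 30 s chunks
speech = np.random.randn(60_075).astype(np.float32)   # placeholder mono waveform

inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="np")
print(inputs["input_features"].shape)                 # (1, 80, 3000)
```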