DiffWave

Posted Dec 2, 2024 Updated Dec 2, 2024

8 min read

DiffWave

ICLR 2021. Paper Demo Unofficial Github Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
Computer Science and Engineering, UCSD | NVIDIA | Baidu Research
30 Mar 2021

%% DDPM 수식적으로 깊게 공부 후 꼭 다시 읽기!!!

💽 Key Takeaways

1️⃣ non-Autoregressive → high-dim waveform을 병렬로 합성 가능
2️⃣ flow-based model과 달리 Latent와 data 사이에 bijection을 유지하지 않아도 된다.
3️⃣ 다목적 모델: Conditional 및 unconditional waveform 모두 생성 가능

1. Introduction

기존 audio 합성 모델

likelihood-based models, 특히 autoregressive models와 low-based models가 오디오 합성에서 주로 사용되었다. 이들은 단순한 학습 목표와 실제 데이터의 파형 세부 사항을 잘 모델링하는 능력 덕분에 주목받았다.
- Autoregressive model(예: WaveNet)은 unconditional 환경에서 made-up word-like sound나 품질이 낮은 샘플을 합성하는 경향이 있음.
- 대부분의 waveform model은 informative local conditioner(예: mel-spectrogram, aligned linguistic feature)를 사용하여 오디오를 합성함.
그 외에도 flow-based models, variational auto-encoder (VAE) 기반 모델, generative adversarial network (GAN) 기반 모델 등이 있다. 이러한 모델들은 학습을 위해 보조 손실이 필요한 경우가 많다.

Diffusion probabilistic model

Markov chain을 사용하여 isotropic Gaussian과 같은 단순한 분포를 복잡한 데이터 분포로 점진적으로 변환한다.
Data likelihood는 intractable하므로, variational lower bound (ELBO)를 최적화하여 모델을 학습한다.
Denoising score matching과 같은 parameterization도 좋은 성능을 보임.
Diffusion model은 learnable parameter 없이 diffusion process를 통해 whitened latent를 얻을 수 있음.
- VAE나 GAN처럼 additional NN 필요X
- 두 네트워크의 joint training에서 발생할 수 있는 posterior collapse나 mode collapse 문제를 피할 수 있어 고품질 합성에 유리함.

DiffWave

목표: Raw audio 합성
Non-autoregressive 구조로 high-dimensional waveform을 병렬로 합성할 수 있음.
Flow-based models와 달리 latent와 data 간의 bijection을 유지할 필요가 없으며, architecture constraints가 없어 유연한 모델.
Auxiliary loss 없이 single ELBO-based training objective만을 사용.
Diffusion process: 훈련 데이터에서 noise를 추가하여 ‘whitened latent’를 얻는 과정. 이 과정에서는 NN이 필요 없고, VAE나 GAN처럼 posterior collapse나 mode collapse를 방지할 수 있어 고품질의 오디오 합성 가능
Non-autoregressive 하고 Markov chain을 통해 white noise signal을 waveform으로 변환하는 모델
데이터 likelihood에 대한 변분 경계(variational bound)를 최적화하여 학습됨
Mel-spectrogram을 조건으로 하는 신경망 vocoder, 클래스 조건부 생성(class-conditional generation), 무조건 생성(unconditional generation) 작업에서 사용 가능
DiffWave의 주요 장점:
1. Non-autoregressive: 고차원의 waveform을 병렬로 생성할 수 있어 효율적
2. 유연한 구조: Flow-based model처럼 구조적 제약이 없고, 고품질 음성을 생성할 수 있음
3. 단일 ELBO 기반의 objective function: Auxiliary loss 없이 학습 가능
4. Conditional / Unconditional waveform generation 모두에서 뛰어난 성능을 보인다.

DiffWave 성능

WaveNet 기반의 feed-forward, bidirectional dilated convolution 아키텍처 사용.
Conditional과 Unconditional waveform generation에서 뛰어난 성능 발휘.
Small DiffWave:
- 2.64M 파라미터.
- V100 GPU에서 real-time보다 5배 빠른 음성 합성 가능.

2. Diffusion Probabilistic Model

Diffusion process

$$q(x_1,⋯,x_T	x_0)=∏^T_{t=1}q(x_t	x_{t−1})$$
$$q(x_t	x_{t−1})=\mathcal{N}(x_t;\sqrt{1−β_t}x_{t−1},β_t, I)$$

Reverse process $p_{latent}(x_T)=\mathcal{N}(0,I)$ $∏^T_{t=1}p_θ(x_{t−1}|x_t)$
Parameterization

$\mu_θ(x_t,t)=\frac{1}{α_t}(x_t−\frac{β_t}{\sqrt{1-\bar{a_t}}}ϵ_θ(x_t,t))$ $σ_θ(xt,t)=\tilde{\beta}^{1/2}_t$

where $α_t=1−β_t, \bar{α_t} = ∏^T_{s=1}α_s, \tilde{\beta_t}=\frac{1-\bar{\alpha_{t-1}}}{1-\bar{\alpha_{t}}}\beta_t$

Object Function $\min_θL_{unweighted}(θ)=\mathop{\mathbb{E}}_{x_0,ϵ,t}|| ϵ−ϵ_θ(\sqrt{\bar{α_t} }x_0+\sqrt{1-\bar{α_t} }ϵ,t||^2_2$
Algorithm Pseudo-code

3. DiffWave Architecture

Diffusion-step Embedding

모델은 여러 $t$에 대해 각기 다른 $ϵ_θ(⋅, t)$를 출력해야 하므로, 입력으로 diffusion step $t$를 포함하는 것이 중요
- 각 $t$에 대해 128차원 인코딩 벡터를 사용
- transformer의 positional embedding과 비슷한 형태이다. $t_{embedding}=[\sin⁡(10^\frac{0×4}{63}t),…,\sin⁡(10^\frac{63×4}{63}t),\cos⁡(10^\frac{0×4}{63}t),…,\cos⁡(10^\frac{63×4}{63}t)]$
그 후 3개의 FC layer가 인코딩에 적용됨. 처음 두 FC는 모든 residual layer 간에 파라미터를 공유함
마지막 residual layer별 FC는 두 번째 FC의 출력을 $C$ 차원 임베딩 벡터로 매핑.
이 임베딩 벡터는 길이에 따라 broadcast되고, 모든 residual layer의 입력에 추가됨.

Conditional Generation

Local conditioner:
본 논문에서는 DiffWave를 mel spectrogram을 조건으로 하는 vocoder로 테스트.
- transposed 2D convolutions을 사용해 mel spectrogram을 waveform과 동일한 길이로 upsampling.
- 각 residual layer의 dilated convolution에 바이어스 항으로 추가하기 위해 Conv1x1 레이어가 mel-band를 $2C$ 개의 채널로 매핑.
Global conditioner:
많은 생성 작업에서 조건부 정보는 글로벌한 개별 레이블(예: speaker ID, word ID)로 제공됨.
- 모든 실험에서 128 차원의 공유 임베딩을 사용.
- 각 residual layer에서 128을 $2C$ 개의 채널로 매핑하기 위해 Conv1x1 레이어를 적용하고, dilated convolution 후 임베딩을 바이어스 항으로 추가

Unconditional Generation

Unconditional 생성:
조건 정보 없이 일관된 발화를 생성해야 함.
- 네트워크의 출력 단위가 발화의 길이 $L$보다 큰 receptive field 크기 $r$을 갖는 것이 중요.
- 실제로 $r ≥ 2L$이 필요하며, 가장 왼쪽과 오른쪽 출력 unit은 전체 $L$ 차원 입력을 포함하는 receptive field를 가짐.
Receptive field 계산: Dilated convolution layer 스택에서 출력의 receptive field 크기는 최대 $r = (k - 1) \sum_i d_i + 1$로 계산됨.
- $k$: kernel size
- $d_i$: $i$번째 residual layer에서의 dilation.
- ex. 30개의 layer로 이루어진 dilated convolution의 receptive field 크기:
  $r = 6139$ → 16kHz 오디오의 약 0.38초 길이
- 레이어 수와 dilation cycle을 더 늘릴 수 있지만, 더 깊은 레이어와 더 큰 dilation cycle은 품질 저하를 초래할 수 있음 (특히 WaveNet에서).
- DiffWave는 출력 $x_0$의 receptive field를 확대하는 데 이점이 있음.
- Reverse process에서 $x_T$에서 $x_0$까지 반복하며 receptive field 크기가 $T × r$까지 증가할 수 있어, DiffWave는 unconditional 생성에 적합.

Reference

Paper Review

Audio

This post is licensed under CC BY 4.0 by the author.

💽 Key Takeaways

1. Introduction

기존 audio 합성 모델

Diffusion probabilistic model

DiffWave

DiffWave의 주요 장점:

DiffWave 성능

2. Diffusion Probabilistic Model

3. DiffWave Architecture

Diffusion-step Embedding

Conditional Generation

Unconditional Generation

Reference

Trending Tags