Tacotron 2

Posted Dec 2, 2024 Updated Dec 2, 2024

6 min read

Tacotron 2

🍉 Key Takeaways

1️⃣ Seq2Seq with Attention을 사용한 End2End TTS 시스템을 제안하였다.
2️⃣ Modified WaveNet은 MOL에 사용될 parameter를 생성하고, 이를 바탕으로 waveform을 생성한다.
3️⃣Teacher-forcing을 사용하여 학습 효율을 증가시켰다.

1. Overall Architecture

순차적인 두 개의 stage들로 이루어져 있다. 첫 번째 stage에서는 전처리를 거친 input text로부터 Mel-Spectrogram을 생성한다(Tacotron2). 두 번째 stage에서는 이전 stage에서 생성한 Mel-Spectrogram으로 waveform, 즉 음성을 합성한다(변형된 WaveNet). 이중 Tacotron2는 첫 번째 stage를 가리킨다.

Mel-spectrogram

Tacotron 2 논문에서는 stage 1과 2를 연결하는 intermediate feature representation으로 low-level acoustic representation인 Mel-spectrogram을 선택했다. 저자들은 Mel-spectrogram을 왜 선택했을까? 첫 번째 이유는 Mel-spectrogram보다 더 smoother하기 때문이다. 두 번째 이유는 squared error loss를 이용하여 모델을 학습시키기가 더 쉽기 때문이다. Mel-spectrogram은 각 frame에서 phase가 바뀌어도 변하지 않기(invariant) 때문이다.

그리고 Mel-spectrogram은 인간의 청력에 따라 spectrogram을 scale한 것이다. 따라서 고주파수 대에 주로 존재하는 noise나 마찰음(fricatives) 등에 덜 주목하게 하는 효과가 있다.

Griffin-Lim Algorithm

Mel-spectrogram의 바탕이 되는 linear spectorgram은 phase information를 버린다는 단점이 있다. 이렇게 버려진 phase information은 Griffin-Lim 알고리즘으로 추측할 수 있다. Griffin-Lim 알고리즘은 Neural Vocoder 이전에 사용되던 traditional vocoder이다. 여기서 vocoder는 voice encoder의 줄임말이다. Griffin-Lim 알고리즘은 rule-based로, 음성을 합성할 때 오직 Mel-spectrogram으로 계산된 STFT magnitude 값만 이용한다.

Griffin-Lim 알고리즘은 원본 음성(GT)와 합성한 음성의 STFT magnitude의 Mean Squared Error(MSE)가 최소화되는 방향으로 학습한다. 즉, phase information은 time-domain의 정보이고, Griffin-Lim 알고리즘은 Inverse STFT를 통해 time-domain으로 변환한다.

2. Pre-processing

text-voice data를 character-Mel spectrogram pair로 변환시켜야 한다.
각 character에는 one-hot encoding을 적용한다.

예를 들어 ‘i love watermelon.’이라는 input text가 있으면 이 sequence는 ‘i’, ‘ ‘, ‘l’, ‘o’, ‘v’, …, ‘n’, ‘.’의 character로 분리되고, 각각의 character는 하나의 one-hot vector로 표현된다.

3. Tacotron 2

Tacotron 2는 Seq2Seq Tacotron-style model이다. 크게 Encoder, Attention module, 그리고 Decoder의 3가지 파트로 나뉜다.

Input: character sequence -> Output: mel spectrogram
Seq2Seq: RNN + Attention

<문장, 음성> 쌍으로 이루어진 데이터만 있으면 End2End로 학습 가능하다.

cf) Tacotron

1️⃣ vocode the resulting magnitude spectrograms: Griffin-Lim algorithm
2️⃣ phase estimation: inverse STFT
기존 linguistic and acoustic features 대신 단순하게 하나의 NN를 이용해서 데이터로부터 한 번에 feature를 뽑아낸다.

3.1 Encoder

character sequence를 hidden feature representation, 즉 Mel-spectrogram으로 변환한다.

3.2 Location Sensitive Attention

Location Sensitive Attention은 additive attention mechanism을 확장하여 고안되었다. 이로써 additional feature로 이전 decoder time step의 cumulative attention weights를 사용할 수 있다.

3.3 Decoder

Decoder는 autoregressive recurrent neural network, 즉 RNN이다. Decoder는 2개의 task를 병렬적으로 수행한다.

task 1: Encoder부터 생성된 Mel-spectrogram을 참고하여 Mel-spectrogram을 예측한다. 단위 시간 한 번에 하나의 frame을 예측한다.
task 2: stop token prediction

Predict Mel-Spectrogram

pre-net
- 2개의 FC로 이루어져 있다. 각 FC는 256 hidden ReLU units로 구성된다.
- attention 학습에 필수적인 information bottleneck으로 동작한다.
postnet
- 수렴을 돕기 위해 post-net 전후에 summed mean squared error(MSE)를 최소화시켰다.

“CBHG” stacks과 GRU를 사용한 기존 Tacotron과 다르게 vanilla LSTM과 CNN을 사용했다. 또한 reduction factor를 사용하지 않아, 각 Decoder step은 하나의 spectrogram frame에 대응하게 된다.

stop token prediction

sigmoid function을 사용한다.
모델이 dynamically하게 종료 시점을 정할 수 있다.
첫 번째 frame의 threshold는 0.5이다.

4. modified WaveNet vocoder

Input: mel-spectorgram -> Output: waveforms(time-domain)
WaveNet:

original version은 softmax layer를 이용하여 $-2^{15}~2^{15}+1$의 discretized buckets에 대한 확률을 추출한다. 그리고 이를 바탕으로 waveform을 생성한다. 이에 반해, 저자들은 10개의 logistic distributions mixture를 이용하였다. loss는 GT의 negative log-likelihood로 계산된다.

5. Training

Teacher-forcing 이용: stage 1을 먼저 학습시킨 후, stage 2를 학습시킨다. 각 stage에서 input으로 이전 시점의 데이터가 아닌 GT를 사용한다.

Reference

Img Source

Paper Review

Audio

This post is licensed under CC BY 4.0 by the author.