PromptTTS2

Posted Dec 2, 2024 Updated Dec 2, 2024

9 min read

PromptTTS2

Paper Link

🍕 Key Takeaways

1️⃣ Diffusion 기반의 Variation NW로 reference representation을 모델링
2️⃣ 음성에 text prompt를 자동으로 라벨링하는 파이프라인 개발

Introduction

Speech vs. Text

음성은 텍스트보다 더 많은 정보를 전달할 수 있다. 같은 단어라도 다양한 목소리로 발음하면 서로 다른 정보를 전달할 수 있기 때문이다.

Traditional TTS vs. Text-based TTS

traditional TTS: 음성 프롬프트(Reference Speech)에 의존하여 음성 변이를 표현한다.
Text-based TTS
- Text Prompts(설명) 사용
- 음성 프롬프트를 찾기 어려운 경우나 존재하지 않을 때 유용

Challenges of Text-based TTS

One-to-Many Problem

Speech는 음성 변이(voice variability)를 자세하게 포함하고 있어, 텍스트 프롬프트는 음성의 모든 특징을 포착할 수 없음.
같은 text prompt로 여러 가지 음성 샘플을 생성할 수 있음.
이 문제는 TTS 모델 훈련을 어렵게 만들어 과적합(overfitting)이나 모드 붕괴(mode collapse)로 이어질 수 있음.
현재까지 One-to-Many 문제를 해결하기 위한 구체적인 방법X

Data-Scale Problem

Text prompt로 음성을 설명하는 데이터셋을 구성하는 것이 어려움.
High Cost
텍스트 프롬프트 데이터셋은 20K 문장 정도로 상대적으로 작거나 공개되지 않음.

2. Overview

구성 요소

1️⃣ Variation NW
- Diffusion 모델 사용
- prompt representation $(P_1, …, P_M)$을 조건으로 reference representation $(R_1, …, R_N)$ 예측
2️⃣ Style Module
- Text Prompt Encoder
  - BERT-based model
  - text prompt의 hidden representation 추출
- Reference Speech
  - reference speech encoder를 사용하여 text prompt에서 다루지 못하는 음성 변화를 모델링 → One-to-many mapping 문제 해결
- Cross attention
  - Prompt hidden과 Reference hidden에 각각 적용됨
  - fixed length representation 추출
3️⃣ TTS Module
- 음성을 합성한다. Style Module에 의해 음성 특성이 제어됨
- 음성을 phonemes(음소)에서 합성할 수 있는 어떤 TTS backbone이라도 사용 가능

Inference phase

without reference speech
- only text prompt provided
훈련된 variation network를 이용해 text prompt representation을 기반으로 reference representation $(R_1, \dots, R_N)$을 예측한다.

3. Variation NW

Goal: prompt representation $(P_1, …, P_M)$을 조건으로 reference representation $(R_1, …, R_N)$ 예측

Diffusion Model 사용

Diffusion model로 reference representation 모델링
Diffusion model
- a robust capability in modeling multimodal distributions and complex data spaces
- 이 모델은 또한 variation NW가 Gaussian noise에서 다양한 voice variability를 샘플링할 수 있게 한다.

Diffusion Process

Forward Diffusion:
- 주어진 reference representation $z_0$ → Gaussian noise로 변환
- 이 과정은 noise schedule $\beta_t$를 따른다: $\frac{d{z_t}}{d_t}=−\frac{1}2{\beta_t}z_t+\sqrt{\beta_t} dw_t, \quad t \in [0, 1]$
Denoising Process:
- noisy representation $z_t$ → reference representation $z_0$

\[\frac{d{z_t}}{d_t}=−\frac{1}2(z_t+∇\log⁡p_t(z_t) )\beta_t, \quad t \in [0, 1]\]

Training phase

Training Goal: noisy data의 log-density gradient $∇\log⁡p_t(z_t)$ 추정

Variation NW 아키텍처

Transformer Encoder 기반
Input (3가지)
- prompt representation $(P_1, …, P_M)$
- noised reference representation $(R^t_1, …, R^t_M)$
- diffusion step $t$
Output
- 원본 reference representation $z_0$에 해당하는 hidden representation
  - L1 loss로 최적화된다.
모델이 diffusion step $t$를 더 잘 인식할 수 있도록 FiLM을 Transformer Encoder의 각 레이어에 적용한다.

Inference phase

Prompt Representation 추출:
- style module을 사용하여 text prompt에서 prompt representation을 추출한다.
Reference Representation 예측:
- prompt representation을 조건으로 reference representation을 예측
- Gaussian noise에서 denoising 수행한다.
Concatenation
- the prompt representation are concatenated with the reference representation to guide the TTS module through cross attention

4. Text Prompt Generation Pipeline w/ LLM

![Overview of our prompt generation pipeline. We first recognize…

Download Scientific Diagram](https://www.researchgate.net/publication/373715169/figure/fig1/AS:11431281186866909@1694056309869/Overview-of-our-prompt-generation-pipeline-We-first-recognize-attribute-from-speech-with.jpg)

1️⃣ SLU: 음성에서 속성(예: 성별, 감정, 나이 등)을 인식하여 label 태깅
2️⃣ LLM: 태깅된 label을 기반으로 text prompt 생성

LLM Part

LLM 부분을 더 자세히 살펴보자. LLM 부분은 크게 4단계로 이루어진다.

Stage 1) Keyword Construction

SLU: 음성의 attribute를 인식하고, 각 attribute에 대해 여러 class를 인식한다.
LLM: 각 class에 대해 여러 keyword를 생성한다.
ex. “성별” attribute는 “남성”과 “여성” class를 가지고, “남성” class의 keyword는 “man”, “he” 등이 될 수 있다.

Stage 2) Sentence Construction

for 문장의 다양성
LLM은 각 attribute에 대해 여러 문장을 생성하도록 지시된다.
LLM은 attribute를 설명할 때 placeholder(예: “[Gender]”)를 사용하여 문장을 작성한다.

Stage 3) Sentence Combination

여러 attribute를 설명하는 텍스트 프롬프트가 필요하므로, 2단계에서 생성된 문장을 결합한다.
LLM은 여러 attribute가 결합된 새로운 문장을 생성하도록 지시된다.
사용자가 TTS 시스템에 제공하는 텍스트 프롬프트는 반드시 형식에 맞는 문장이 아닐 수 있기 때문에, LLM은 다양성을 더하기 위해 구문을 결합한 문장도 생성한다.

Stage 4) Dataset Instantiation

위의 세 단계를 통해 생성된 결과들은 최종 텍스트 프롬프트 데이터셋을 형성하며, 이는 음성 데이터셋과 함께 사용된다.
음성 데이터 $S$에 대해 SLU 모델로 각 attribute에 클래스를 태깅한 후, 각 attribute에 대한 문장을 선택한다.
attribute에 해당하는 keyword를 문장에서 placeholder에 삽입하여 최종 텍스트 프롬프트를 생성한다.

5. Experiment

Datasets

Speech Dataset: Multilingual LibriSpeech (MLS)의 영어 하위 집합을 사용한다. 이 데이터셋은 44K 시간 분량의 전사된 speech 데이터를 포함하고 있으며, LibriVox audiobooks에서 수집되었다.
Text Prompt Dataset: PromptSpeech (Guo et al., 2023)를 사용하여, pitch, gender, volume, speed 등 네 가지 속성을 설명하는 20K개의 text prompts가 포함되어 있다.
Generated Prompts: LLM (GPT-3.5-TURBO)를 활용해 20K개의 text prompts를 생성한다.
Test Set: PromptSpeech의 test set은 1305개의 text prompts로 구성되어 있다.
Attribute Recognition: SLU model을 이용해 gender는 공개된 모델을 사용하고, 나머지 속성들(피치, 볼륨, 속도)은 digital signal processing tools을 통해 인식한다.

Experiment Details

이 논문에서는 TTS backbone으로 NaturalSpeech 2를 선택했다.
Reference Speech Encoder와 Variation Network의 레이어 수는 각각 6과 12로 설정되며, hidden size는 512이다.
Style Module의 query number $M, N$은 모두 8로 설정된다.
TTS Backbone과 Text Prompt Encoder는 각각 NaturalSpeech 2 와 PromptTTS의 설정을 따른다.

6. Result

Attribute Control Accuracy

PromptTTS 2는 baseline systems와 비교하여 모든 속성에 대해 더 높은 정확도를 보였다. 평균적으로 1.79% 향상된 성능을 기록하였다.

Speech Quality

PromptTTS 2는 MOS (Mean Opinion Score)와 CMOS (Comparative MOS) 테스트에서 baseline systems보다 더 높은 speech quality를 달성하였다.

Paper Review

Audio

This post is licensed under CC BY 4.0 by the author.

🍕 Key Takeaways

Introduction

Speech vs. Text

Traditional TTS vs. Text-based TTS

Challenges of Text-based TTS

One-to-Many Problem

Data-Scale Problem

2. Overview

구성 요소

Inference phase

3. Variation NW

Diffusion Model 사용

Diffusion Process

Training phase

Variation NW 아키텍처

Inference phase

4. Text Prompt Generation Pipeline w/ LLM

LLM Part

Stage 1) Keyword Construction

Stage 2) Sentence Construction

Stage 3) Sentence Combination

Stage 4) Dataset Instantiation

5. Experiment

Datasets

Experiment Details

6. Result

Attribute Control Accuracy

Speech Quality

Trending Tags