PromptTTS 2

Paper Link

Key Takeaways

  • 1️⃣ A diffusion-based variation network (Variation NW) models the reference representation.
  • 2️⃣ A pipeline automatically labels speech with text prompts.

1. Introduction

Speech vs. Text

Speech can convey more information than text: the same word, spoken in different voices, can carry different information.

Traditional TTS vs. Text-based TTS

  • Traditional TTS: relies on a speech prompt (reference speech) to express voice variability.
  • Text-based TTS
    • Uses text prompts (descriptions)
    • Useful when a suitable speech prompt is hard to find or does not exist

Challenges of Text-based TTS

One-to-Many Problem

  • Speech carries fine-grained voice variability, so a text prompt cannot capture every characteristic of a voice.
  • As a result, the same text prompt can correspond to many different speech samples.
  • This makes TTS model training difficult and can lead to overfitting or mode collapse.
  • So far, no concrete method has been proposed to address the one-to-many problem.

Data-Scale Problem

  • Building a dataset of text prompts that describe speech is difficult.
  • High cost
  • Existing text prompt datasets are either relatively small (around 20K sentences) or not publicly available.

2. Overview

Components

  • 1️⃣ Variation NW
    • Uses a diffusion model
    • Predicts the reference representation $(R_1, \dots, R_N)$ conditioned on the prompt representation $(P_1, \dots, P_M)$
  • 2️⃣ Style Module
    • Text Prompt Encoder
      • BERT-based model
      • Extracts the hidden representation of the text prompt
    • Reference Speech
      • A reference speech encoder models the voice variability that the text prompt does not cover → addresses the one-to-many mapping problem
    • Cross attention
      • Applied separately to the prompt hidden and the reference hidden
      • Extracts fixed-length representations
  • 3️⃣ TTS Module
    • Synthesizes speech, with the voice characteristics controlled by the Style Module
    • Any TTS backbone that can synthesize speech from phonemes can be used
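
The cross-attention step in the Style Module can be illustrated with a minimal sketch. It assumes single-head dot-product attention and leaves out the learned query/key/value projections a real implementation would have; the names `cross_attention_pool`, `queries`, and `hidden` are hypothetical. The only point is that a fixed number of query vectors turns a variable-length hidden sequence into a fixed-length representation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries, hidden):
    """Each query attends over `hidden` (a list of vectors) and yields one vector."""
    dim = len(queries[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * h for q, h in zip(query, frame)) / math.sqrt(dim)
            for frame in hidden
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * frame[j] for w, frame in zip(weights, hidden)) for j in range(dim)]
        )
    return pooled


rng = random.Random(0)
dim, num_queries = 4, 8  # toy sizes chosen for illustration
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
short_seq = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
long_seq = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

print(len(cross_attention_pool(queries, short_seq)))  # 8
print(len(cross_attention_pool(queries, long_seq)))   # 8
```

Both calls return 8 vectors even though the inputs have 5 and 50 frames, which is what lets the prompt and reference representations have fixed lengths $M$ and $N$.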

Inference phase

  • without reference speech
    • only text prompt provided
  • The trained variation network predicts the reference representation $(R_1, \dots, R_N)$ from the text prompt representation.

3. Variation NW

  • Goal: predict the reference representation $(R_1, \dots, R_N)$ conditioned on the prompt representation $(P_1, \dots, P_M)$

Using a Diffusion Model

  • The reference representation is modeled with a diffusion model.
  • Diffusion model
    • Has a robust capability for modeling multimodal distributions and complex data spaces
    • It also lets the variation network sample diverse voice variability from Gaussian noise.

Diffusion Process

  • Forward Diffusion:
    • Transforms a given reference representation $z_0$ → Gaussian noise
    • The process follows the noise schedule $\beta_t$: \(\mathrm{d}z_t = -\frac{1}{2}\beta_t z_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}w_t, \quad t \in [0, 1]\)
  • Denoising Process:
    • noisy representation $z_t$ → reference representation $z_0$
\[\mathrm{d}z_t = -\frac{1}{2}\left(z_t + \nabla \log p_t(z_t)\right)\beta_t\,\mathrm{d}t, \quad t \in [0, 1]\]
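
To make the forward process concrete, here is a small sketch that compares the closed-form distribution of $z_t$ given $z_0$ with a direct Euler-Maruyama simulation of the forward SDE. The linear schedule $\beta_t = \beta_{min} + (\beta_{max} - \beta_{min})t$ and its endpoint values are assumptions made for illustration (the post does not state the schedule), and all function names are hypothetical.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear schedule, not taken from the post


def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t


def integrated_beta(t):
    """Integral of beta(s) for s in [0, t]."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def marginal(z0, t):
    """Mean and standard deviation of z_t given z_0 under the forward SDE."""
    mean = z0 * math.exp(-0.5 * integrated_beta(t))
    std = math.sqrt(1.0 - math.exp(-integrated_beta(t)))
    return mean, std


def simulate_forward(z0, steps=200, rng=random):
    """Euler-Maruyama simulation of dz = -0.5*beta*z dt + sqrt(beta) dw on [0, 1]."""
    dt = 1.0 / steps
    z = z0
    for k in range(steps):
        b = beta(k * dt)
        z += -0.5 * b * z * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
    return z


rng = random.Random(0)
samples = [simulate_forward(3.0, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(marginal(3.0, 1.0))  # closed form: mean near 0, std near 1
print(mean, var)           # the simulation should land close to that
```

Under this assumed schedule, a scalar $z_0 = 3$ is almost completely forgotten by $t = 1$, which is why sampling can start from standard Gaussian noise.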

Training phase

  • Training Goal: estimate the gradient of the log-density of the noisy data, $\nabla \log p_t(z_t)$

Variation NW Architecture

  • Based on a Transformer encoder
  • Input (three items)
    • prompt representation $(P_1, \dots, P_M)$
    • noised reference representation $(R^t_1, \dots, R^t_N)$
    • diffusion step $t$
  • Output
    • The hidden representation corresponding to the original reference representation $z_0$
      • Optimized with an L1 loss.
  • FiLM is applied to every layer of the Transformer encoder so that the model is better aware of the diffusion step $t$.
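
As a rough illustration of the two pieces named above, the sketch below applies FiLM (a per-feature scale and shift) to a hidden sequence and computes an L1 loss. How the scale and shift are derived from the diffusion step is an assumption here: `step_to_film` uses simple linear functions of $t$ as a hypothetical stand-in for whatever learned projection the model actually uses.

```python
def film(hidden, scale, shift):
    """Apply FiLM: scale and shift every feature at every position."""
    return [[g * h + b for h, g, b in zip(row, scale, shift)] for row in hidden]


def step_to_film(t, scale_weights, shift_weights):
    """Hypothetical conditioning: linear functions of the diffusion step t."""
    scale = [1.0 + w * t for w in scale_weights]
    shift = [w * t for w in shift_weights]
    return scale, shift


def l1_loss(predicted, target):
    """Mean absolute error between two sequences of vectors."""
    diffs = [abs(p - q) for pr, tr in zip(predicted, target) for p, q in zip(pr, tr)]
    return sum(diffs) / len(diffs)


hidden = [[1.0, 2.0], [3.0, 4.0]]
scale, shift = step_to_film(0.5, scale_weights=[2.0, 0.0], shift_weights=[0.0, -2.0])
modulated = film(hidden, scale, shift)
print(modulated)                   # [[2.0, 1.0], [6.0, 3.0]]
print(l1_loss(modulated, hidden))  # 1.5
```

In training, `l1_loss` would compare the network output with the clean reference representation $z_0$ rather than with the unmodulated input used in this toy call.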

Inference phase

  • Prompt Representation extraction:
    • The style module extracts the prompt representation from the text prompt.
  • Reference Representation prediction:
    • The reference representation is predicted conditioned on the prompt representation.
    • Denoising starts from Gaussian noise.
  • Concatenation
    • The prompt representation is concatenated with the reference representation to guide the TTS module through cross attention.
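
The inference steps above can be sketched as a simple Euler integration of the denoising equation from $t = 1$ down to $t = 0$. This is a toy under several assumptions: the linear noise schedule is assumed, the score is recovered from a $z_0$ prediction through the Gaussian form of $p(z_t \mid z_0)$, and `toy_predictor` is a hypothetical stand-in for the trained variation network that ignores its inputs and always returns a fixed target.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear noise schedule


def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t


def integrated_beta(t):
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def sample_reference(predict_z0, prompt, dim, steps=1000, seed=0):
    """Integrate the denoising ODE from t=1 (Gaussian noise) down to t=0."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps, 0, -1):
        t = k * dt
        alpha = math.exp(-0.5 * integrated_beta(t))
        variance = 1.0 - math.exp(-integrated_beta(t))
        z0_hat = predict_z0(z, prompt, t)
        # Score of N(alpha * z0_hat, variance) evaluated at z.
        score = [-(zi - alpha * z0i) / variance for zi, z0i in zip(z, z0_hat)]
        # One Euler step backwards in time for dz = -0.5 * (z + score) * beta dt.
        z = [zi + 0.5 * (zi + si) * beta(t) * dt for zi, si in zip(z, score)]
    return z


def toy_predictor(z_t, prompt, t):
    """Stand-in for the trained network: always predicts a fixed target."""
    return [0.5, -1.0, 2.0]


prompt_repr = [0.1, 0.2]  # placeholder prompt representation (flat toy vector)
reference_repr = sample_reference(toy_predictor, prompt_repr, dim=3)
conditioning = prompt_repr + reference_repr  # concatenated, then fed to the TTS module
print([round(v, 2) for v in reference_repr])
```

With a real network the prediction would depend on the noisy input, the prompt representation, and the step, and different noise seeds would yield different voices for the same prompt.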

4. Text Prompt Generation Pipeline w/ LLM

![Overview of our prompt generation pipeline. We first recognize…](https://www.researchgate.net/publication/373715169/figure/fig1/AS:11431281186866909@1694056309869/Overview-of-our-prompt-generation-pipeline-We-first-recognize-attribute-from-speech-with.jpg)

  • 1️⃣ SLU: recognizes attributes (e.g., gender, emotion, age) from speech and tags the speech with labels
  • 2️⃣ LLM: generates text prompts from the tagged labels

LLM Part

Let's take a closer look at the LLM part, which consists of four stages.

Stage 1) Keyword Construction

  • SLU: recognizes the attributes of the speech, each of which has several classes.
  • LLM: generates several keywords for each class.
  • e.g., the “gender” attribute has the classes “male” and “female”, and keywords for the “male” class could be “man”, “he”, and so on.

Stage 2) Sentence Construction

  • Purpose: sentence diversity
  • The LLM is instructed to generate multiple sentences for each attribute.
  • The LLM writes each sentence with a placeholder (e.g., “[Gender]”) where the attribute is described.

Stage 3) Sentence Combination

  • Text prompts need to describe several attributes at once, so the sentences generated in Stage 2 are combined.
  • The LLM is instructed to generate new sentences that combine multiple attributes.
  • Because the text prompts users give to a TTS system are not necessarily well-formed sentences, the LLM also generates prompts that combine phrases, for added diversity.

Stage 4) Dataset Instantiation

  • The outputs of the three stages above form the final text prompt dataset, which is used together with the speech dataset.
  • For a speech sample $S$, the SLU model tags a class for each attribute, and then a sentence is selected for each attribute.
  • The keyword matching each attribute is inserted into the placeholder in the sentence to produce the final text prompt.
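
The instantiation step can be pictured with a short sketch. The keyword table and sentence templates below are made-up stand-ins for what Stages 1 to 3 would produce, and `instantiate_prompt` is a hypothetical helper, not code from the paper.

```python
import random

# Stage 1 (hypothetical output): keywords for each class of each attribute.
KEYWORDS = {
    "Gender": {"male": ["man", "gentleman"], "female": ["woman", "lady"]},
    "Speed": {"fast": ["quickly", "rapidly"], "slow": ["slowly", "unhurriedly"]},
}

# Stages 2-3 (hypothetical output): templates with one placeholder per attribute.
TEMPLATES = [
    "Please generate the voice of a [Gender] speaking [Speed].",
    "A [Gender] who talks [Speed].",
]


def instantiate_prompt(labels, rng):
    """Pick a template and fill each placeholder with a keyword for the tagged class."""
    prompt = rng.choice(TEMPLATES)
    for attribute, tagged_class in labels.items():
        keyword = rng.choice(KEYWORDS[attribute][tagged_class])
        prompt = prompt.replace(f"[{attribute}]", keyword)
    return prompt


# Labels an SLU model might assign to one utterance (made-up example).
slu_labels = {"Gender": "male", "Speed": "slow"}
print(instantiate_prompt(slu_labels, random.Random(0)))
```

Because both the template and the keyword are sampled, the same SLU labels can yield many different text prompts, which is where the dataset's diversity comes from.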

5. Experiment

Datasets

  • Speech Dataset: the English subset of Multilingual LibriSpeech (MLS), which contains 44K hours of transcribed speech collected from LibriVox audiobooks.
  • Text Prompt Dataset: PromptSpeech (Guo et al., 2023), which contains 20K text prompts describing four attributes: pitch, gender, volume, and speed.
  • Generated Prompts: an LLM (GPT-3.5-TURBO) is used to generate 20K text prompts.
  • Test Set: the PromptSpeech test set consists of 1305 text prompts.
  • Attribute Recognition: for the SLU step, gender is recognized with a publicly available model, and the remaining attributes (pitch, volume, speed) are recognized with digital signal processing tools.
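
To give a feel for what signal-processing-based attribute recognition could look like, here is a minimal sketch for the volume attribute only. It is an assumption-laden illustration: the post does not say which tools or thresholds were used, so the RMS measure, the dB thresholds, and the function names are all hypothetical.

```python
import math


def rms_db(samples):
    """Root-mean-square level of a waveform in dB relative to full scale (1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))


def volume_class(samples, low_db=-30.0, high_db=-15.0):
    """Map the RMS level to a coarse class; both thresholds are made up."""
    level = rms_db(samples)
    if level < low_db:
        return "low"
    if level > high_db:
        return "high"
    return "normal"


def sine(amplitude, freq_hz=220.0, sample_rate=16000, seconds=0.5):
    """Synthetic test tone standing in for a speech waveform."""
    count = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


print(volume_class(sine(0.01)))  # low
print(volume_class(sine(0.1)))   # normal
print(volume_class(sine(0.8)))   # high
```

Pitch and speed could be bucketed the same way from an estimated fundamental frequency and a phonemes-per-second rate, again with thresholds that would have to be chosen for the dataset.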

Experiment Details

  • The paper uses NaturalSpeech 2 as the TTS backbone.
  • The reference speech encoder and the variation network have 6 and 12 layers, respectively, with a hidden size of 512.
  • The query numbers $M$ and $N$ in the style module are both set to 8.
  • The TTS backbone and the text prompt encoder follow the settings of NaturalSpeech 2 and PromptTTS, respectively.
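
For reference, the settings above could be collected in a small configuration object like the one below. The class and field names are hypothetical, and treating 512 as the feature size of the conditioning vectors is an assumption; only the numeric values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTTS2Config:
    """Hyperparameters listed above; field names are hypothetical."""
    reference_encoder_layers: int = 6
    variation_network_layers: int = 12
    hidden_size: int = 512
    num_prompt_queries: int = 8     # M
    num_reference_queries: int = 8  # N
    tts_backbone: str = "NaturalSpeech 2"


config = PromptTTS2Config()
# (tokens, features) of the concatenated prompt + reference representation,
# assuming each of the M + N vectors has `hidden_size` features.
conditioning_shape = (
    config.num_prompt_queries + config.num_reference_queries,
    config.hidden_size,
)
print(conditioning_shape)  # (16, 512)
```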

6. Result

Attribute Control Accuracy

PromptTTS 2 achieved higher accuracy than the baseline systems on every attribute, with an average improvement of 1.79%.

Speech Quality

PromptTTS 2 also achieved higher speech quality than the baseline systems in the MOS (Mean Opinion Score) and CMOS (Comparative MOS) tests.

This post is licensed under CC BY 4.0 by the author.

© Su. Some rights reserved.
