Whisper

Paper Link | GitHub Link

๐Ÿ Key Takeaways

  • 1️⃣ The amount and type of training data can matter more than the model architecture. In this paper, a plain Transformer is trained on a very large dataset in a weakly supervised fashion.
  • 2️⃣ WER tends to halve whenever the amount of training data grows 16x (a rough power-law reading of this is sketched right after this list).
  • 3️⃣ The unsupervised pretraining & fine-tuning pipeline is prone to picking up dataset-specific quirks. To avoid this, the fine-tuning stage is dropped and performance is measured only zero-shot.
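A quick back-of-the-envelope reading of takeaway 2️⃣ (the exponent below is just the arithmetic implied by that statement, not a number quoted from the paper): if WER halves for every 16x increase in data, the trend behaves like a power law in the dataset size $N$,

$$\mathrm{WER}(N) \propto N^{-\log_{16} 2} = N^{-1/4}, \qquad \mathrm{WER}(16N) = 16^{-1/4}\,\mathrm{WER}(N) = \tfrac{1}{2}\,\mathrm{WER}(N).$$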

1. Problems with Previous Unsupervised Pre-training

Previous unsupervised pre-training, exemplified by Wav2Vec, raised the quality of the audio encoder considerably, but the decoder could not keep up: it had no way to map the audio representations to text. Supervised fine-tuning was therefore essential. Since supervised fine-tuning requires labels, it is very expensive, and it degrades the model's usefulness and robustness. In other words, the dataset ends up dictating the model's behavior (dataset-specific quirks).

To address this, the authors drew on the following prior observations:

  • Unlike Wav2Vec, which relies on a large unlabeled dataset, supervised models trained on smaller labeled datasets showed higher performance.
  • In computer vision, weakly supervised learning on large datasets improved robustness and generalization.

The authors therefore decided to experiment with weakly supervised learning on a very large dataset.

2. Approach

2.1 Dataset Preprocessing

The authors used a massive dataset totaling 680,000 hours, far beyond Wav2Vec's 60,000 hours. Of this, 117,000 hours cover speech in 96 languages, and 125,000 hours of audio paired with English translations are also included.

They deliberately skipped standardization and ITN (Inverse Text Normalization), preprocessing steps widely used in earlier work, which helps improve robustness. Instead, the authors filtered out machine-generated data with several heuristics; typical signals are transcripts that are entirely uppercase or entirely lowercase, or that have had commas removed altogether.
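A minimal sketch of this kind of filtering, with made-up thresholds and a hypothetical `is_probably_machine_generated` helper (the paper describes the signals but not an exact implementation):

```python
import string

def is_probably_machine_generated(transcript: str) -> bool:
    """Heuristic flags for transcripts that look like ASR output rather than human writing.
    The rules and thresholds here are illustrative, not the paper's exact criteria."""
    letters = [c for c in transcript if c.isalpha()]
    if not letters:
        return True
    # All-uppercase or all-lowercase text is a common sign of machine transcription.
    if all(c.isupper() for c in letters) or all(c.islower() for c in letters):
        return True
    # Human transcripts of longer speech almost always contain some punctuation.
    if len(transcript.split()) > 20 and not any(c in string.punctuation for c in transcript):
        return True
    return False

pairs = [
    ("a.wav", "THIS IS AN ALL CAPS TRANSCRIPT WITHOUT ANY PUNCTUATION SO IT LOOKS MACHINE GENERATED"),
    ("b.wav", "Well, that's a properly punctuated sentence."),
]
kept = [(audio, text) for audio, text in pairs if not is_probably_machine_generated(text)]
print(kept)  # only ("b.wav", ...) survives
```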

2.2 Model

์ด ๋…ผ๋ฌธ์€ ๋ชจ๋ธ ๊ตฌ์กฐ์— ์ดˆ์ ์„ ๋งž์ถ”์ง€ ์•Š์•˜๋‹ค. ๋‹จ์ˆœํ•˜๊ฒŒ Transformer ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

2.3 Multitask Format

The system is designed so that the pretraining + zero-shot setup alone can handle a variety of audio tasks: special tokens specifying the task and other conditioning information are fed into the decoder together with its input.

(Figure from GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision)

  • <|nospeech|> : predict that there is no speech in the audio segment
  • <|transcribe|> or <|translate|> : specify which task the model should perform

As an aside, it was a pleasant surprise to see Korean in the figure above.
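For reference, this is roughly how those task tokens surface in the open-source `whisper` package linked above (`pip install -U openai-whisper`); the audio file name is just a placeholder:

```python
import whisper

model = whisper.load_model("base")  # other sizes: "tiny", "small", "medium", "large"

# Same audio, two tasks: the decoder is conditioned on a different task token.
transcript = model.transcribe("sample_korean.wav", language="ko", task="transcribe")
translation = model.transcribe("sample_korean.wav", language="ko", task="translate")

print(transcript["text"])   # Korean transcription
print(translation["text"])  # English translation
```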

2.4 Training Details

They experimented with a range of model sizes. Because the model is trained for only a few epochs, overfitting does not need special attention, and no data augmentation or regularization is used; robustness and generalization are expected to come from the scale and diversity of the data.

3. Experiments

3.1 Zero-shot Evaluation

Evaluation is done zero-shot on open-source datasets. Normally, a trained model is evaluated on data drawn from the same distribution as its training set; with Whisper, however, the goal was to check whether the model stays robust across datasets in many languages.

3.2 Evaluation Metrics

In automatic speech recognition (ASR), the traditional evaluation metric is WER (Word Error Rate), which is based on string edit distance. The problem with WER is that it also penalizes differences in the transcript that a human reader would not consider errors at all. This is a serious issue for a zero-shot model that does not mimic any particular dataset's formatting, so the authors compare WER after removing non-semantic differences.
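As a small illustration of both points (WER as word-level edit distance, and how non-semantic formatting differences inflate it), here is a self-contained sketch; the toy lowercase/strip-punctuation normalizer only stands in for the much more thorough normalizer the authors use:

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def normalize(text: str) -> str:
    """Toy normalizer: lowercase and drop punctuation (non-semantic differences only)."""
    return re.sub(r"[^\w\s]", "", text.lower())

ref = "Okay, let's meet at 5 PM."
hyp = "okay lets meet at 5 pm"
print(wer(ref, hyp))                        # 0.5, penalized only for case and punctuation
print(wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```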

3.3 English Speech Recognition

The authors criticize WER for being an evaluation metric skewed toward the dataset a model was trained on, and make the interesting point that humans and models are effectively evaluated on different things. Humans are not tied to a particular distribution; human generalization is measured out of distribution. A model, in contrast, is conditioned on the distribution of its training data, so its score reflects in-distribution generalization.

(Figure: WER across evaluation datasets; previous supervised models in blue, Whisper in purple, human transcribers in orange)

Unlike previous models (blue), Whisper (purple) closes much of the gap with humans (orange). Previous models show considerably higher WER on datasets other than the ones they were trained on.

3.4 Multi-lingual Speech Recognition

(Figure: multilingual WER comparison, including the VoxPopuli dataset)

On the VoxPopuli dataset, Whisper shows higher WER than the existing models. This is because Whisper runs zero-shot, whereas the existing models were fine-tuned on, and thus fit to, the VoxPopuli distribution. Put differently, when speech recognition is run on audio whose distribution differs from the fine-tuning data, real-world performance will inevitably fall short of the numbers validated on the fine-tuning dataset.

(Figure: per-language WER versus the number of hours of that language in the pretraining data)

The correlation between the amount of a given language's data in the pretraining set and that language's WER is very strong, at $R^2 = 0.84$.
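For intuition on how such a number comes about, one can compute the squared correlation between log pretraining hours and log WER; the per-language values below are made up for illustration, not taken from the paper:

```python
import numpy as np

# Made-up (hours of pretraining audio, WER %) pairs for a few hypothetical languages.
hours = np.array([10, 100, 1000, 10000, 40000])
wer = np.array([80.0, 45.0, 22.0, 11.0, 7.0])

r = np.corrcoef(np.log(hours), np.log(wer))[0, 1]  # Pearson correlation in log-log space
print(f"R^2 = {r ** 2:.2f}")  # the paper's per-language data yields the 0.84 quoted above
```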
