VALL-E

Microsoft · January 2023

active · Closed · audio

Why It Matters

Reframed speech synthesis as a language modeling problem, showing that the same autoregressive approach that powers LLMs can generate natural-sounding speech in a new speaker's voice from a 3-second sample.

Description

VALL-E is Microsoft's neural codec language model for text-to-speech. It treats speech synthesis as a language modeling problem: given the input text and a 3-second recording of the target speaker as a prompt, it generates discrete audio codec codes, which a codec decoder then converts to a waveform. Because the target speaker does not need to appear in the training data, this enables zero-shot voice cloning.
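To make the "speech synthesis as language modeling" framing concrete, here is a minimal sketch of the sampling loop it implies. It is not VALL-E's implementation: `toy_next_token_weights` is a hypothetical stand-in for a trained model, `CODEBOOK_SIZE` and `EOS` are assumed values, and the codec output is collapsed to a single token stream for brevity.

```python
import random
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024   # assumed number of codec codes, for illustration only
EOS = CODEBOOK_SIZE    # hypothetical end-of-sequence token id

# A "model" here is any function mapping (text tokens, audio codes so far)
# to one non-negative weight per possible next token (codes plus EOS).
NextTokenFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def toy_next_token_weights(text_tokens: Sequence[int],
                           audio_codes: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained decoder; returns pseudo-random weights."""
    seed = 7919 * len(audio_codes) + sum(text_tokens) + sum(audio_codes)
    rng = random.Random(seed)
    return [rng.random() for _ in range(CODEBOOK_SIZE + 1)]


def generate_codes(text_tokens: Sequence[int],
                   prompt_codes: Sequence[int],
                   next_token_weights: NextTokenFn,
                   max_new_tokens: int = 50,
                   seed: int = 0) -> List[int]:
    """Autoregressively sample codec codes conditioned on text and a voice prompt."""
    rng = random.Random(seed)
    # The voice sample, already encoded to codec codes, acts as a prefix
    # that the generated codes continue, just like a text prompt in an LLM.
    audio = list(prompt_codes)
    for _ in range(max_new_tokens):
        weights = next_token_weights(text_tokens, audio)
        token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if token == EOS:
            break
        audio.append(token)
    # Return only the newly generated codes; a codec decoder would turn
    # these into a waveform.
    return audio[len(prompt_codes):]


if __name__ == "__main__":
    codes = generate_codes([5, 9, 2], [100, 200, 300], toy_next_token_weights,
                           max_new_tokens=20, seed=1)
    print(len(codes), codes[:5])
```

The point of the sketch is the conditioning structure: the text and the encoded voice prompt play the role of an LLM prompt, and the output vocabulary is codec codes rather than words.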

Key Innovations

Text-to-Audio: Generating speech, music, or sound effects from text descriptions.
Zero-Shot: Performing tasks without any examples; the model generalizes from its training alone.
speech-synthesis
