VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech


Chenpeng Du*1, Yiwei Guo*1, Hankun Wang*1, Yifan Yang1, Zhikang Niu1, Shuai Wang2, Hui Zhang3, Xie Chen1, Kai Yu1

1 MoE Key Lab of Artificial Intelligence, AI Institute
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Shenzhen Research Institute of Big Data, Shenzhen, China
3 AISpeech, Beijing, China

Abstract


Recent TTS models with a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, which sometimes leads to hallucination issues such as mispronunciation, word skipping, and word repetition. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the decoder-only Transformer architecture. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations, with a relative reduction of 28.3% in word error rate.
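To make the idea of shifting relative positions concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each input phoneme gets an integer index relative to the phoneme currently being synthesized, that index 0 marks that current phoneme, and that each monotonic step moves the zero point one phoneme to the right. The function names `relative_positions` and `shift_schedule` are hypothetical; in a real model these indices would select learned embeddings, and the decision of when to shift is not specified here.

```python
from typing import List


def relative_positions(num_phonemes: int, current: int) -> List[int]:
    """Index of every phoneme relative to the one being synthesized.

    Assumption for illustration: index 0 marks the current phoneme,
    negative indices are already spoken, positive ones are still to come.
    """
    if not 0 <= current < num_phonemes:
        raise ValueError("current must index a phoneme in the sequence")
    return [i - current for i in range(num_phonemes)]


def shift_schedule(num_phonemes: int) -> List[List[int]]:
    """Relative indices after each left-to-right shift of the zero point."""
    return [relative_positions(num_phonemes, c) for c in range(num_phonemes)]


if __name__ == "__main__":
    # Four phonemes: the zero index moves monotonically from first to last.
    for row in shift_schedule(4):
        print(row)
```

Because the zero index can only move forward in this sketch, the position information itself rules out going back to an earlier phoneme or jumping past one, which is the kind of monotonic constraint the abstract refers to.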



Generation samples


Prompt: [1188_133604_000026_000001] I say, "whether color be gay or sad."
Zero-shot TTS: [1188_133604_000061_000006] It has no beauty whatsoever, no specialty of picturesqueness; and all its lines are cramped and poor.
Ground-truth | Encodec resynthesis | NAR resynthesis
Transduce and Speak | VALL-E | VALL-T
Prompt: [1995_1837_000020_000000] Up in the sick room Zora lay on the little white bed.
Zero-shot TTS: [1995_1836_000003_000002] At last the Cotton Combine was to all appearances an assured fact and he was slated for the Senate.
Ground-truth | Encodec resynthesis | NAR resynthesis
Transduce and Speak | VALL-E | VALL-T
Prompt: [2300_131720_000016_000004] The electrical work had to be done in forty eight hours!
Zero-shot TTS: [2300_131720_000006_000001] Every department of mechanics was stimulated and benefited to an extraordinary degree.
Ground-truth | Encodec resynthesis | NAR resynthesis
Transduce and Speak | VALL-E | VALL-T
Prompt: [4446_2271_000011_000001] "And have I done anything so fool as that, now?" he asked.
Zero-shot TTS: [4446_2271_000022_000010] I should torture myself-I couldn't help it." After that it was easy to forget, actually to forget.
Ground-truth | Encodec resynthesis | NAR resynthesis
Transduce and Speak | VALL-E (bad case: repeating) | VALL-T
Prompt: [8224_274384_000017_000000] The parliament and the Scots laid their proposals before the king.
Zero-shot TTS: [8224_274384_000005_000001] He was particularly attentive to the behavior of their preachers, on whom all depended.
Ground-truth | Encodec resynthesis | NAR resynthesis
Transduce and Speak | VALL-E (bad case: skipping) | VALL-T

* Main contributors.