VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

Chenpeng Du*1, Yiwei Guo*1, Hankun Wang*1, Yifan Yang1, Zhikang Niu1, Shuai Wang2, Hui Zhang3, Xie Chen1, Kai Yu1

1 MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Shenzhen Research Institute of Big Data, Shenzhen, China
3 AISpeech Ltd, Beijing, China


Recent TTS models built on a decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and support zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, which sometimes leads to hallucination issues such as mispronunciation, word skipping, and word repetition. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the decoder-only Transformer architecture. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and is more robust against hallucinations, with a relative reduction of 28.3% in word error rate. Furthermore, because alignment is controllable during decoding, VALL-T can use untranscribed speech prompts, even in unknown languages. It also enables the synthesis of lengthy speech by using an aligned context window.
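The idea of shifting relative positions can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes that the phoneme currently being spoken is assigned relative position 0 and that a special blank token moves the alignment to the next phoneme. The names `BLANK`, `relative_positions`, `decode`, and `toy_step` are hypothetical, and `toy_step` is a stand-in for a real decoder-only model.

```python
from typing import Callable, List

BLANK = "<blank>"  # hypothetical symbol: "finished this phoneme, shift to the next"


def relative_positions(num_phonemes: int, current: int) -> List[int]:
    """Position of each phoneme relative to the one currently being spoken."""
    return [i - current for i in range(num_phonemes)]


def decode(phonemes: List[str],
           step_fn: Callable[[List[str], List[int], List[str]], str],
           max_steps: int = 1000) -> List[str]:
    """Monotonic decoding loop: a blank output shifts the positions by one."""
    current = 0            # index of the phoneme being synthesized
    tokens: List[str] = []
    for _ in range(max_steps):
        if current >= len(phonemes):
            break          # every phoneme has been consumed
        positions = relative_positions(len(phonemes), current)
        token = step_fn(phonemes, positions, tokens)
        if token == BLANK:
            current += 1   # alignment can only move forward
        else:
            tokens.append(token)
    return tokens


def toy_step(phonemes: List[str], positions: List[int], history: List[str]) -> str:
    """Stand-in for a model: emits two tokens per phoneme, then a blank."""
    current = positions.index(0)
    emitted = sum(1 for t in history if t.startswith(f"{current}:"))
    if emitted < 2:
        return f"{current}:{phonemes[current]}"
    return BLANK


print(relative_positions(4, 1))
print(decode(["HH", "AH"], toy_step))
```

Because the alignment index only ever increases in this sketch, no phoneme can be skipped or revisited, which is the kind of constraint the abstract credits for the reduction in hallucinations.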

I. Zero-shot TTS

Prompt: [1188_133604_000026_000001] I say, "whether color be gay or sad."
Zero-shot TTS: [1188_133604_000061_000006] It has no beauty whatsoever, no specialty of picturesqueness; and all its lines are cramped and poor.
Samples: Ground-truth | Encodec resynthesis | NAR resynthesis | Transduce and Speak | VALL-E | VALL-T
Prompt: [1995_1837_000020_000000] Up in the sick room Zora lay on the little white bed.
Zero-shot TTS: [1995_1836_000003_000002] At last the Cotton Combine was to all appearances an assured fact and he was slated for the Senate.
Samples: Ground-truth | Encodec resynthesis | NAR resynthesis | Transduce and Speak | VALL-E | VALL-T
Prompt: [2300_131720_000016_000004] The electrical work had to be done in forty eight hours!
Zero-shot TTS: [2300_131720_000006_000001] Every department of mechanics was stimulated and benefited to an extraordinary degree.
Samples: Ground-truth | Encodec resynthesis | NAR resynthesis | Transduce and Speak | VALL-E | VALL-T
Prompt: [4446_2271_000011_000001] "And have I done anything so fool as that, now?" he asked.
Zero-shot TTS: [4446_2271_000022_000010] I should torture myself-I couldn't help it." After that it was easy to forget, actually to forget.
Samples: Ground-truth | Encodec resynthesis | NAR resynthesis | Transduce and Speak | VALL-E (Bad case, repeating) | VALL-T
Prompt: [8224_274384_000017_000000] The parliament and the Scots laid their proposals before the king.
Zero-shot TTS: [8224_274384_000005_000001] He was particularly attentive to the behavior of their preachers, on whom all depended.
Samples: Ground-truth | Encodec resynthesis | NAR resynthesis | Transduce and Speak | VALL-E (Bad case, skipping) | VALL-T

II. Zero-shot TTS with untranscribed speech prompts

Prompt: [5683_32866_000047_000003, no transcription available]
Zero-shot TTS: [5683_32866_000028_000000] 'It is very happy, for her at least, they are not,' said Rachel, and a long silence ensued.
Samples: Ground-truth | VALL-E, w/o pseudo prompt transcription | VALL-E, w/ pseudo prompt transcription | VALL-T, w/o pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [2830_3980_000018_000001, no transcription available]
Zero-shot TTS: [2830_3980_000018_000002] He reminds them of the time when he opposed Peter to his face and reproved the chief of the apostles.
Samples: Ground-truth | VALL-E, w/o pseudo prompt transcription | VALL-E, w/ pseudo prompt transcription | VALL-T, w/o pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [6829_68771_000046_000000, no transcription available]
Zero-shot TTS: [6829_68769_000051_000002] Then Rogers wouldn't do anything but lead her around, and wait upon her, and the place went to rack and ruin."
Samples: Ground-truth | VALL-E, w/o pseudo prompt transcription | VALL-E, w/ pseudo prompt transcription | VALL-T, w/o pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [7176_92135_000083_000005, no transcription available]
Zero-shot TTS: [7176_88083_000012_000005] The cat growled softly, picked up the prize in her jaws and trotted into the bushes to devour it.
Samples: Ground-truth | VALL-E, w/o pseudo prompt transcription | VALL-E, w/ pseudo prompt transcription | VALL-T, w/o pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription

III. Zero-shot TTS with untranscribed speech prompts in unknown languages

Prompt: [In German, no transcription available]
Zero-shot TTS: [237_134493_000007_000000] When he had been mowing the better part of an hour, he heard the rattle of a light cart on the road behind him.
Samples: VALL-E, w/ pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [In German, no transcription available]
Zero-shot TTS: [260_123286_000005_000001] He examines the horizon all round with his glass, and folds his arms with the air of an injured man.
Samples: VALL-E, w/ pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [In Spanish, no transcription available]
Zero-shot TTS: [4970_29095_000054_000000] Ruth did not reply directly; she complained that her mother didn't understand her.
Samples: VALL-E, w/ pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription
Prompt: [In Spanish, no transcription available]
Zero-shot TTS: [7729_102255_000028_000008] Their assumed character changed with their changing opportunities or necessities.
Samples: VALL-E, w/ pseudo prompt transcription | VALL-T, w/ pseudo prompt transcription

IV. Lengthy speech generation

Prompt: [6829_68771_000046_000000] A sudden wave of scarlet swept over Eliza's face.
Zero-shot TTS: The gardens and grounds were gaily decorated with Chinese and Japanese lanterns, streamers and Forbes banners. For the first time the maid seemed a little confused, and her gaze wandered from the face of her visitor. She sat down in a rocking chair, and clasping her hands in her lap, rocked slowly back and forth. Her manner was neither independent nor assertive, but rather one of well bred composure and calm reliance. Beth felt that she was intruding and knew that she ought to go; yet some fascination held her to the spot.
Samples: Ground-truth | VALL-E | VALL-T, w/o aligned context window | VALL-T, w/ aligned context window
Prompt: [1320_122617_000033_000000] "Cut his bands," said Hawkeye to David, who just then approached them.
Zero-shot TTS: "If you find a man there, he shall die a flea's death." --Merry Wives of Windsor. At last the scout spoke in English, and at once explained the embarrassment of their situation. A circle of a few hundred feet in circumference was drawn, and each of the party took a segment for his portion. "Such cunning is not without its deviltry," exclaimed Hawkeye, when he met the disappointed looks of his assistants. The whole party now proceeded, following the course of the rill, keeping anxious eyes on the regular impressions.
Samples: Ground-truth | VALL-E | VALL-T, w/o aligned context window | VALL-T, w/ aligned context window

* Main contributors.