UniCATS: A Unified Context-Aware Text-to-Speech Framework with
Contextual VQ-Diffusion and Vocoding


Chenpeng Du1, Yiwei Guo1, Feiyu Shen1, Zhijun Liu1, Zheng Liang1, Xie Chen1, Shuai Wang2, Hui Zhang3, Kai Yu1

1 MoE Key Lab of Artificial Intelligence, AI Institute
X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
2 Shenzhen Research Institute of Big Data, Shenzhen, China
3 AISpeech Ltd, Beijing, China

Abstract


The use of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional mel-spectrogram acoustic features in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models can generate speech only in a left-to-right direction, making them unsuitable for speech editing, where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, whose audio quality is limited by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components: an acoustic model, CTX-txt2vec, and a vocoder, CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. CTX-vec2wav then uses contextual vocoding to convert these semantic tokens into waveforms, taking the acoustic context into account. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing.
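The difference between continuation and editing described in the abstract can be pictured as a difference in where the known context sits around the tokens to be generated. The following is a minimal sketch of that layout only, not the authors' implementation: the sentinel `MASK`, the functions `build_layout` and `fill_masked`, and the integer token IDs are all hypothetical, and the constant predictor is a stand-in for a trained model such as CTX-txt2vec (no VQ-diffusion is implemented here).

```python
from typing import Callable, List, Sequence

MASK = -1  # hypothetical sentinel meaning "token still to be generated"


def build_layout(left_ctx: Sequence[int], n_new: int,
                 right_ctx: Sequence[int] = ()) -> List[int]:
    """Place n_new masked slots between known left and right context tokens."""
    return list(left_ctx) + [MASK] * n_new + list(right_ctx)


def fill_masked(layout: Sequence[int],
                predict: Callable[[int, Sequence[int]], int]) -> List[int]:
    """Replace only the masked slots; context tokens are copied unchanged."""
    return [predict(i, layout) if tok == MASK else tok
            for i, tok in enumerate(layout)]


# Continuation: only a preceding context (the speech prompt) is given.
continuation = build_layout(left_ctx=[11, 12, 13], n_new=2)

# Editing: both preceding and following contexts are given.
editing = build_layout(left_ctx=[11, 12], n_new=2, right_ctx=[19])

# Stand-in predictor that always returns token 0 (assumption for illustration).
filled = fill_masked(editing, lambda i, seq: 0)
```

In this picture, continuation is simply the special case of editing with an empty following context, which is one way to read the claim that a single framework covers both tasks.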



Speech Resynthesis from Semantic Tokens


[1089_134686_000026_000003] A gentle kick from the tall boy in the bench behind urged Stephen to ask a difficult question.
Audio samples: Ground-truth, Encodec, HifiGAN, AudioLM, CTX-vec2wav
[1580_141083_000015_000000] For an instant I imagined that Bannister had taken the unpardonable liberty of examining my papers.
Audio samples: Ground-truth, Encodec, HifiGAN, AudioLM, CTX-vec2wav
[2830_3980_000018_000002] He reminds them of the time when he opposed Peter to his face and reproved the chief of the apostles.
Audio samples: Ground-truth, Encodec, HifiGAN, AudioLM, CTX-vec2wav
[6829_68769_000005_000004] So this boy was doubly foolish in ruining himself to get sixty dollars to pay an unjust demand.
Audio samples: Ground-truth, Encodec, HifiGAN, AudioLM, CTX-vec2wav
[8224_274384_000005_000001] He was particularly attentive to the behavior of their preachers, on whom all depended.
Audio samples: Ground-truth, Encodec, HifiGAN, AudioLM, CTX-vec2wav

Speech Continuation for Zero-Shot Speaker Adaptation (Seen)


[Prompt: 696_92939_000016_000006] He never informed them that the death sentence had been imposed. [Continuation: 696_93314_000013_000007] It seemed to him that there were so many more of the same pattern from whom she might have chosen.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 1789_142896_000022_000005] Right after lunch you go to the boss's garage and wait for me. [Continuation: 1789_142896_000004_000002] You are formal only to the city editor, the managing editor, and the auditor.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 1806_143948_000039_000000] You have, then, limited your efforts to sacred song? [Continuation: 1806_143946_000010_000004] His language has the richness and sententious fullness of the Chinese.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 5239_31629_000061_000000] Fiery stream contained poisonous gases. [Continuation: 5239_32139_000017_000001] Think of a stone aqueduct reaching from the city of New York to the State of North Carolina!
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 7933_112597_000010_000001] But the bagpipe was the great favourite of the common people. [Continuation: 7933_113273_000042_000000] The fate of the others, whose wickedness has been a part of this story, was not so pleasant.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS

Speech Continuation for Zero-Shot Speaker Adaptation (Unseen)


[Prompt: 1995_1837_000020_000000] Up in the sick room Zora lay on the little white bed. [Continuation: 1995_1836_000003_000002] At last the Cotton Combine was to all appearances an assured fact and he was slated for the Senate.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 2830_3980_000018_000001] Humble man that he was, he will not now take a back seat. [Continuation: 2830_3980_000018_000000] Against these boasting, false apostles, Paul boldly defends his apostolic authority and ministry.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 4077_13754_000013_000005] Yet the two have often been confused in the popular mind. [Continuation: 4077_13754_000003_000005] He was unjustly charged with favoring secession; but the charge was soon disproved.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 6829_68771_000046_000000] A sudden wave of scarlet swept over Eliza's face. [Continuation: 6829_68769_000030_000000] Then he deliberately locked Kenneth and Beth in with the forger, and retreated along the passage.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS
[Prompt: 8230_279154_000004_000008] To deal with this problem, we must have a theory of memory. [Continuation: 8230_279154_000019_000000] The first of our vague but indubitable data is that there is knowledge of the past.
Audio samples: Prompt, Ground-truth, FastSpeech 2, VALL-E, UniCATS

Speech Editing (An example)


Original: In a few moments he heard the cherries dropping smartly into the pail, and he began to swing his scythe.
Edited: In a few moments he heard the rhythmic swishing sound of the wind, and he began to swing his scythe.
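To make the example concrete, the two sentences above can be split into a shared preceding context, the replaced words, and a shared following context. The sketch below uses a plain word-level prefix/suffix comparison; it is only an illustration of how the edited region relates to its surrounding context, and the page does not state how the edit region was actually specified. The function name `split_edit` is hypothetical.

```python
from typing import List, Tuple


def split_edit(original: str, edited: str) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Return (left context, removed words, inserted words, right context)."""
    a, b = original.split(), edited.split()
    shortest = min(len(a), len(b))

    # Length of the common word prefix.
    p = 0
    while p < shortest and a[p] == b[p]:
        p += 1

    # Length of the common word suffix, not overlapping the prefix.
    s = 0
    while s < shortest - p and a[-1 - s] == b[-1 - s]:
        s += 1

    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]


original = ("In a few moments he heard the cherries dropping smartly "
            "into the pail, and he began to swing his scythe.")
edited = ("In a few moments he heard the rhythmic swishing sound "
          "of the wind, and he began to swing his scythe.")

left, removed, inserted, right = split_edit(original, edited)
```

Here `left` and `right` correspond to the preceding and following contexts that stay fixed, while `inserted` holds the words whose speech has to be newly generated.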

Speech Editing (Short)


[1995_1826_000016_000010] Miss Taylor was soon starving for human companionship, for the lighter touches of life and some of its warmth and laughter.
Audio samples: Ground-truth, RetrieverTTS, UniCATS
[4970_29095_000056_000000] Ruth was glad to hear that Philip had made a push into the world, and she was sure that his talent and courage would make a way for him. She should pray for his success at any rate, and especially that the Indians, in St. Louis, would not take his scalp.
Audio samples: Ground-truth, RetrieverTTS, UniCATS
[7176_92135_000006_000007] "My dear Sir," I should reply (or Madam), "you have come to the right shop."
Audio samples: Ground-truth, RetrieverTTS, UniCATS

Speech Editing (Long)


[8463_294828_000014_000001] I wanted nothing more than to see my country again, my friends, my modest quarters by the Botanical Gardens, my dearly beloved collections!
Audio samples: Ground-truth, RetrieverTTS, UniCATS
[6829_68769_000009_000000] If the prosecution were withdrawn and the case settled with the victim of the forged check, then the young man would be allowed his freedom. But under the circumstances I doubt if such an arrangement could be made.
Audio samples: Ground-truth, RetrieverTTS, UniCATS
[8455_210777_000003_000000] I remained there alone for many hours, but I must acknowledge that before I left the chambers I had gradually brought myself to look at the matter in another light.
Audio samples: Ground-truth, RetrieverTTS, UniCATS


Note: The utterance lists for test sets A and B, along with their corresponding prompts, are available at: [Test set A] [Test set B]

Citation


@article{du2023unicats,
  title={UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding},
  author={Du, Chenpeng and Guo, Yiwei and Shen, Feiyu and Liu, Zhijun and Liang, Zheng and Wang, Shuai and Zhang, Hui and Chen, Xie and Yu, Kai},
  journal={arXiv preprint arXiv:2306.07547},
  year={2023}
}