Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis

Inoue, Sho; Zhou, Kun; Wang, Shuai; Li, Haizhou

doi:10.1109/ICASSP48485.2024.10445996

(Submitted on 15 May 2024)

Abstract: It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.

Comments:	This is accepted to IEEE ICASSP 2024
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
DOI:	10.1109/ICASSP48485.2024.10445996
Cite as:	arXiv:2405.09171 [cs.SD]
	(or arXiv:2405.09171v1 [cs.SD] for this version)

Submission history

From: Sho Inoue
[v1] Wed, 15 May 2024 08:21:56 GMT (1657kb,D)
