Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech
Pengfei Wu, Junjie Pan, Chenchang Xu, Junhui Zhang, Lin Wu, Xiang Yin, Zejun Ma
AI Lab, ByteDance
Abstract
In expressive speech synthesis, there are high requirements for emotion interpretation. However, it is time-consuming to acquire an emotional audio corpus for an arbitrary speaker, since it places high demands on the speaker's emotion-interpretation ability. In response to this problem, this paper proposes a cross-speaker emotion transfer method that can transfer emotions from a source speaker to target speakers. A set of emotion tokens is first defined to represent the various emotion categories. They are trained to be highly correlated with their corresponding emotions for controllable synthesis, via a cross-entropy loss and a semi-supervised training strategy. Meanwhile, to mitigate the degradation of timbre similarity caused by cross-speaker emotion transfer, speaker condition layer normalization (SCLN) is implemented to model speaker characteristics. Experimental results show that the proposed method outperforms the multi-reference-based baseline in terms of timbre similarity, stability, and emotion perception evaluations.
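For readers of this demo page, the snippet below is a minimal sketch of what a speaker condition layer normalization (SCLN) layer can look like: the normalization's scale and bias are predicted from the speaker embedding instead of being fixed learnable parameters. The class name, dimensions, and use of plain linear projections are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpeakerConditionLayerNorm(nn.Module):
    """Layer normalization whose scale and bias are predicted from a speaker embedding.

    Minimal sketch only: hidden size, speaker-embedding size, and the simple linear
    projections are assumptions, not the configuration used in the paper.
    """

    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        # Normalize without learnable affine parameters; scale/bias come from the speaker.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(speaker_dim, hidden_dim)
        self.to_beta = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        gamma = self.to_gamma(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(speaker_emb).unsqueeze(1)
        return gamma * self.norm(x) + beta
```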
Reference speech of source speaker and target speakers
source speaker: A male multi-emotion speech database annotated with 7 emotion categories.
target speaker-M: A male audiobook database without emotion annotations.
target speaker-F: A female audiobook database without emotion annotations; it is not used in the experiments in our paper.
| emotion | source speaker | target speaker-M | target speaker-F |
| --- | --- | --- | --- |
| neutral | (audio sample) | (audio sample) | (audio sample) |
| happy | (audio sample) | (audio sample) | (audio sample) |
| sad | (audio sample) | (audio sample) | (audio sample) |
| angry | (audio sample) | (audio sample) | (audio sample) |
| surprise | (audio sample) | (audio sample) | (audio sample) |
| scare | (audio sample) | (audio sample) | (audio sample) |
| hate | (audio sample) | (audio sample) | (audio sample) |
Synthesized samples
1. Samples for comparing with baseline and ablation studies
The first 4 columns correspond to Section 3.2 of our paper.
The 5th column corresponds to cross-gender emotion transfer demos generated by the proposed model. These demos do not appear in the paper's experiments because the model was trained after the paper was submitted.
baseline: A Parallel Tacotron-based multi-reference emotion transfer model using the paired-unpaired training strategy and an adversarial cycle-consistency scheme.
M1: An ablation model that removes the SCLN module from the decoder LConv blocks; speaker embeddings are instead added to the encoder outputs.
M2: An ablation model that removes the emotion classifier loss in the training stage; multi-head attention is used in this model (a sketch of the classifier loss is given after this list).
proposed: The proposed model described in our paper.
proposed-F: Cross-gender examples generated by the proposed model.
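As context for M2, the sketch below shows one way an emotion-token layer with a cross-entropy classifier loss could look: a bank of learnable emotion tokens is combined by attention-style weights, and the weights are supervised with emotion labels when they are available (semi-supervised). The reference features, dimensions, and masking of unlabeled utterances are assumptions, not the exact design in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionTokenLayer(nn.Module):
    """A bank of learnable emotion tokens combined via attention-style weights.

    Sketch only: the utterance-level reference features, token dimension, and the
    handling of unlabeled utterances are assumptions.
    """

    def __init__(self, num_emotions: int, token_dim: int, ref_dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_emotions, token_dim))
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_feat: torch.Tensor):
        # ref_feat: (batch, ref_dim) utterance-level reference features
        query = self.query_proj(ref_feat)           # (batch, token_dim)
        logits = query @ self.tokens.t()            # (batch, num_emotions)
        weights = F.softmax(logits, dim=-1)
        emotion_emb = weights @ self.tokens         # (batch, token_dim)
        return emotion_emb, logits

def emotion_classifier_loss(logits, emotion_labels, is_labeled):
    """Cross-entropy on the token weights, applied only to labeled utterances."""
    ce = F.cross_entropy(logits, emotion_labels, reduction="none")
    return (ce * is_labeled.float()).sum() / is_labeled.float().sum().clamp(min=1.0)
```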
| emotion | baseline | M1 | M2 | proposed | proposed-F |
| --- | --- | --- | --- | --- | --- |
| neutral | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| happy | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| sad | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| angry | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| surprise | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| scare | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| hate | (audio sample) | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
2. Examples of emotion intensity control
This part is not shown in the paper due to the page limit. The emotion intensity can be controlled by scaling the emotion embedding by a scalar factor; a minimal sketch is given below.
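The following sketch illustrates this intensity control, assuming the scaled emotion embedding is added to the encoder outputs; the conditioning point and function name are assumptions, not the paper's exact code.

```python
import torch

def apply_emotion(encoder_out: torch.Tensor, emotion_emb: torch.Tensor,
                  intensity: float = 1.0) -> torch.Tensor:
    """Scale the emotion embedding and add it to the encoder outputs.

    encoder_out: (batch, time, dim); emotion_emb: (batch, dim).
    Additive conditioning is assumed here for illustration.
    """
    scaled = intensity * emotion_emb          # e.g. intensity in {0.3, 0.5, 0.7, 1.0}
    return encoder_out + scaled.unsqueeze(1)  # broadcast along the time axis
```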
| emotion | 0.3 | 0.5 | 0.7 | 1.0 |
| --- | --- | --- | --- | --- |
| happy | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| sad | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| angry | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| surprise | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| scare | (audio sample) | (audio sample) | (audio sample) | (audio sample) |
| hate | (audio sample) | (audio sample) | (audio sample) | (audio sample) |