
Phoneme position prediction and robust neural codec language modeling for zero-shot TTS


1. Introduction

Large language models (LLMs) have demonstrated impressive in-context learning capabilities [1]. Inspired by these successes, recent studies [2-5] have extended LLMs to text-to-speech (TTS) by representing speech as discrete acoustic codes. These LLM-based models show remarkable progress in zero-shot TTS, generating high-quality personalized speech for unseen speakers from a brief audio prompt. However, they suffer from robustness issues that manifest in the generated speech as word skipping, repetition, and mispronunciation. To alleviate this challenge, we propose a simple yet effective method that enhances the alignment between acoustic codes and phonemes. As illustrated in Figure 1, our approach synchronously predicts both acoustic codes and their corresponding phoneme positions within the input sequence, and uses this combined information as the condition for subsequent predictions. Enforcing a monotonic progression of the positional indices ensures that every phoneme is covered by the attention mechanism, allowing each acoustic code to focus precisely on its target phoneme.

2. Method
Notably, while we implement our method on top of VALL-E [2], the design is compatible with any decoder-only AR TTS model.
2.1 Preliminary: VALL-E

Let the target text-speech pair be denoted as {x, y}, and the prompt text-speech pair from the same speaker as {x̃, ỹ}. A pre-trained neural codec model encodes the speech y and ỹ into discrete acoustic code matrices c^{T×N} = Codec(y) and c̃^{T′×N} = Codec(ỹ), respectively, where T and T′ are the numbers of time steps of the two matrices and N is the number of quantizers in the codec model. Concurrently, the texts x and x̃ are converted into phoneme sequences p = {p_1, p_2, …, p_L} and p̃, respectively. Note that text tokens or syllable sequences could also be used to represent the text.

VALL-E is a decoder-only TTS system that casts zero-shot TTS as a codec language modeling task. It incorporates two Transformers to predict the acoustic codes hierarchically. An AR transformer takes p as input and sequentially predicts the first quantizer's target acoustic codes c^1_{1:T}, maximizing the probability

  p(c^1_{1:T} | p; θ_AR) = ∏_{t=1}^{T} p(c^1_t | p, c^1_{<t}; θ_AR).

A NAR transformer then predicts the codes of the remaining quantizers.

2.2 Using phoneme position prediction to improve alignment

In our proposed method, we focus on enhancing the robustness of the AR transformer while leaving the NAR transformer unchanged. We incorporate phoneme position prediction into the AR model to strengthen the alignment between acoustic codes and phonemes. To achieve this, we first derive the alignment
d = {d_1, …, d_L} between phonemes and acoustic codes, where d_i is the duration of the i-th phoneme and ∑_{i=1}^{L} d_i = T. Subsequently, we construct the repeated phoneme position sequence pos_{1:T} by repeating each position i ∈ [1, L] d_i times. As shown in the upper section of Figure 2, our model is then trained with teacher forcing to synchronously predict both the acoustic codes c^1_{1:T} and their corresponding phoneme positions pos_{1:T}, conditioned on the phoneme sequence p. The model is optimized by maximizing the following probability:

  p(c^1_{1:T}, pos_{1:T} | p; θ) = ∏_{t=1}^{T} p(c^1_t, pos_t | p, c^1_{<t}, pos_{<t}; θ).
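As a concrete illustration, the repeated position sequence pos can be built from the durations d in a few lines (a minimal sketch, not the authors' implementation; the function name is ours):

```python
def build_position_sequence(durations):
    """Repeat each phoneme position i (1-indexed) d_i times,
    so the result has length T = sum(durations)."""
    pos = []
    for i, d in enumerate(durations, start=1):
        pos.extend([i] * d)  # position i appears for d_i acoustic frames
    return pos

# Three phonemes lasting 2, 1, and 3 frames give T = 6 position targets.
print(build_position_sequence([2, 1, 3]))  # [1, 1, 2, 3, 3, 3]
```

Because the positions repeat in order, the sequence is monotonically non-decreasing by construction, which is what enforces the monotonic attention progression.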
In practice, the model predicts c^1_t and pos_t with two separate heads; their embeddings are summed and fed to the model as input for the next prediction step.
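A single step of this two-head design might look as follows (an illustrative NumPy sketch under assumed toy dimensions; the real model is a Transformer, and all weight and function names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_codes, max_pos = 8, 16, 10             # assumed toy dimensions

W_code = rng.standard_normal((d_model, n_codes))  # acoustic-code head
W_pos = rng.standard_normal((d_model, max_pos))   # phoneme-position head
E_code = rng.standard_normal((n_codes, d_model))  # code embedding table
E_pos = rng.standard_normal((max_pos, d_model))   # position embedding table

def ar_step(h_t):
    """Predict c_t and pos_t from the hidden state h_t with two heads,
    then sum their embeddings to form the next step's input."""
    c_t = int(np.argmax(h_t @ W_code))       # argmax for illustration only
    pos_t = int(np.argmax(h_t @ W_pos))
    next_input = E_code[c_t] + E_pos[pos_t]  # summed embeddings
    return c_t, pos_t, next_input
```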
2.3 Zero-shot TTS inference
We implement zero-shot TTS through prompting during inference, following VALL-E but differing in two key aspects to accommodate our modifications to the AR model, as depicted in the lower portion of Figure 2. First, given an enrollment speech sample from an unseen speaker as the prompt, the phoneme durations must be included in addition to the phoneme sequence and acoustic code matrix. Second, we employ a distinct decoding strategy for each prediction target: greedy decoding for phoneme positions and nucleus sampling [6] with a fixed top-p of 0.98 for acoustic codes.
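The two decoding strategies can be sketched as follows (a simplified NumPy sketch; the top-p value mirrors the 0.98 above, everything else is illustrative):

```python
import numpy as np

def nucleus_sample(logits, top_p=0.98, rng=None):
    """Nucleus (top-p) sampling [6]: sample from the smallest set of
    tokens whose cumulative probability exceeds top_p."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep = order[:cutoff]                      # the nucleus
    p = probs[keep] / probs[keep].sum()        # renormalize within the nucleus
    return int(rng.choice(keep, p=p))

def decode_position(pos_logits):
    """Greedy decoding for phoneme positions."""
    return int(np.argmax(pos_logits))

# A sharply peaked distribution keeps only one token in the nucleus.
print(nucleus_sample(np.array([10.0, 0.0, 0.0])))  # 0
```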

3. Results

On the zero-shot TTS task, we evaluate the proposed method against several baselines.

3.1 Objective evaluations

We adopt the Seed-TTS [7] test-zh set for objective evaluation, using Character Error Rate (CER) and Speaker Embedding Cosine Similarity (SECS) to measure the robustness and speaker similarity of synthesized samples, respectively. The evaluation results are detailed in Table 1. All methods achieve comparable SECS scores for speaker similarity, within 1% of the ground truth (GT). Our method significantly improves robustness across all subsets, particularly by reducing deletion (Del) and insertion (Ins) errors on the hard subset.
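SECS itself is simply the cosine similarity between speaker embeddings extracted from the reference and synthesized speech (the embedding extractor is not specified here; the vectors below are illustrative):

```python
import numpy as np

def secs(emb_ref, emb_syn):
    """Speaker Embedding Cosine Similarity between two speaker embeddings."""
    a = np.asarray(emb_ref, dtype=float)
    b = np.asarray(emb_syn, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(secs([1.0, 0.0], [1.0, 0.0]))  # 1.0 for identical embeddings
```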

3.2 Subjective evaluations

We use Comparative Mean Opinion Score (CMOS) and Speaker Similarity Mean Opinion Score (SMOS) tests to assess the naturalness and speaker similarity of the synthesized speech, respectively. The evaluation results, detailed in Table 2, show that our system outperforms the other two VALL-E-based systems, confirming the effectiveness of our method.

3.3 Alignment analysis

To verify that the improved robustness of our model stems from better alignment between phonemes and acoustic codes, we visualized the attention weights of all attention heads across all layers and selected the head with the most distinct diagonal alignment pattern for detailed analysis on the Seed-TTS test-zh hard samples. We quantified each type of alignment error, including phoneme skipping, repetition, and one-to-many alignment; the results are shown in Table 3. Our approach completely eliminates all three types of alignment errors, achieving zero error occurrences. This indicates that the proposed phoneme position prediction mechanism effectively addresses alignment issues, thereby enhancing the robustness and accuracy of the TTS system.