
Duration Extraction

Forced Alignment

Recall that we had prepared a lexicon file that maps graphemes (words) to phonemes. These phonemes will then be aligned to segments of the corresponding audio, and the durations of those segments will be fed into models like LightSpeech. The end result of audio alignment looks something like the following:

[Figure: audio segments aligned to words and their phonemes]

Notice that chunks of audio are aligned to each word and its constituent phonemes. By the end of training, TextGrid files containing the alignment results will be generated. They look like the following:

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0 
xmax = 7.398163265306122 
tiers? <exists> 
size = 2 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "words" 
        xmin = 0 
        xmax = 7.398163265306122 
        intervals: size = 34 
        intervals [1]:
            xmin = 0 
            xmax = 0.04 
            text = "" 
        intervals [2]:
            xmin = 0.04 
            xmax = 0.26 
            text = "he" 
        intervals [3]:
            xmin = 0.26 
            xmax = 0.44 
            text = "is" 
        intervals [4]:
            xmin = 0.44 
            xmax = 0.48 
            text = "" 
        intervals [5]:
            xmin = 0.48 
            xmax = 0.91 
            text = "going" 
        intervals [6]:
            xmin = 0.91 
            xmax = 0.94 
            text = "" 
        intervals [7]:
            xmin = 0.94 
            xmax = 1.05 
            text = "to" 

Montreal Forced Aligner (MFA) is an algorithm and a library that helps train these kinds of acoustic models. The TextGrid files will then be parsed to obtain the phoneme durations, and it is these durations that LightSpeech learns to predict via an L2 loss.
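
To make the duration extraction concrete, here is a minimal sketch that reads a single TextGrid with the textgrid package (the same one used by the parser further below) and converts each phoneme interval into a frame count. The file path, sample rate, and hop size are placeholder values; substitute your own.

import textgrid

SAMPLE_RATE = 44100  # placeholder; match your TTS model's sample rate
HOP_SIZE = 512       # placeholder; match your TTS model's hop size

tg = textgrid.TextGrid.fromFile("./outputs/parsed/speaker_1/utterance_1.TextGrid")
words_tier, phones_tier = tg[0], tg[1]  # MFA writes a "words" and a "phones" tier

for interval in phones_tier.intervals:
    # Interval length in seconds, converted to mel-spectrogram frames --
    # this is the duration target that LightSpeech learns to predict.
    frames = round(interval.duration() * SAMPLE_RATE / HOP_SIZE)
    print(f"{interval.mark or 'SIL':>6} {interval.minTime:.2f}-{interval.maxTime:.2f}s -> {frames} frames")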

External Aligner-free Models

Recent developments aim to completely remove the use of external aligner tools like MFA, and for good reason. Learning token durations on the fly not only simplifies the training pipeline, but also improves speech quality and speeds up alignment convergence, especially on longer sequences of text during inference.

Most of the latest models that do their own alignment learning typically use a combination of Connectionist Temporal Classification (CTC) and Monotonic Alignment Search (MAS) -- for instance Glow-TTS, VITS, and JETS, among many others. As far as I know, these types of models have yet to be integrated into TensorFlowTTS, although there is an existing branch that attempts to implement AlignTTS.
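
To give a feel for the alignment search itself, below is a minimal NumPy sketch of Monotonic Alignment Search in the spirit of Glow-TTS. It is purely illustrative (not the TensorFlowTTS or Glow-TTS implementation), and the log-likelihood matrix here is just a random stand-in.

import numpy as np


def monotonic_alignment_search(log_probs: np.ndarray) -> np.ndarray:
    """Find the best monotonic alignment between text tokens and mel frames.

    log_probs: [n_tokens, n_frames] matrix of frame-level log-likelihoods.
    Assumes n_frames >= n_tokens so that a valid path exists.
    Returns the number of frames assigned to each token (the durations).
    """
    n_tokens, n_frames = log_probs.shape

    # Q[i, j] = best total log-likelihood of a monotonic path ending at (i, j).
    Q = np.full((n_tokens, n_frames), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, n_frames):
        for i in range(n_tokens):
            stay = Q[i, j - 1]                            # keep the current token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance to the next token
            Q[i, j] = log_probs[i, j] + max(stay, move)

    # Backtrack from the bottom-right corner, counting frames per token.
    durations = np.zeros(n_tokens, dtype=np.int32)
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return durations


# Toy example: align 3 tokens to 8 frames of random scores.
rng = np.random.default_rng(0)
print(monotonic_alignment_search(rng.normal(size=(3, 8))))  # durations sum to 8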

Installing Montreal Forced Aligner

To begin, install Montreal Forced Aligner. It is much easier to install via Conda Forge. In the same Conda environment, you can run these commands:

conda config --add channels conda-forge
conda install montreal-forced-aligner

If you're not using Conda (e.g. upgrading from a non-Conda version, or installing from source), you can follow the official guide here.

To confirm that the installation is successful, you can run

mfa version

which will return the version you've installed (in my case, it's 2.0.5).

Training an MFA Aligner

With MFA installed, training an aligner model is as simple as running the command

mfa train {YOUR_DATASET} {LEXICON} {OUTPUT_ACOUSTIC_MODEL} {TEXTGRID_OUTPUT_DIR} --punctuation ""

Example

Using the same sample dataset and a lexicon file lexicon.txt, the command to run will be similar to the following

mfa train ./en-bookbot ./lexicon.txt ./outputs/en_bookbot_acoustic_model.zip ./outputs/parsed --punctuation ""

Parsing TextGrid Files

We can ignore the acoustic model generated by MFA, although you can always keep it for other purposes. What's more important for us are the resultant TextGrid files located in your {TEXTGRID_OUTPUT_DIR}. You can then either use the original parser script, or a modified version of the script that I usually use.

TxtGridParser
import os
from dataclasses import dataclass
from tqdm.auto import tqdm
import textgrid
import numpy as np
import re


@dataclass
class TxtGridParser:
    sample_rate: int
    multi_speaker: bool
    txt_grid_path: str
    hop_size: int
    output_durations_path: str
    dataset_path: str
    training_file: str = "train.txt"
    phones_mapper = {"sil": "SIL", "": "SIL"}
    sil_phones = set(phones_mapper.keys())
    punctuations = [";", "?", "!", ".", ",", ":"]

    def parse(self):
        speakers = (
            [
                i
                for i in os.listdir(self.txt_grid_path)
                if os.path.isdir(os.path.join(self.txt_grid_path, i))
            ]
            if self.multi_speaker
            else []
        )

        data = []

        if speakers:
            for speaker in speakers:
                file_list = os.listdir(os.path.join(self.txt_grid_path, speaker))
                self.parse_text_grid(file_list, data, speaker)
        else:
            file_list = os.listdir(self.txt_grid_path)
            self.parse_text_grid(file_list, data, "")

        with open(
            os.path.join(self.dataset_path, self.training_file), "w", encoding="utf-8"
        ) as f:
            f.writelines(data)

    def parse_text_grid(self, file_list: list, data: list, speaker_name: str):
        for f_name in tqdm(file_list):
            text_grid = textgrid.TextGrid.fromFile(
                os.path.join(self.txt_grid_path, speaker_name, f_name)
            )
            pha = text_grid[1]  # the "phones" tier; tier 0 holds the word-level alignment
            durations = []
            phs = []
            flags = []  # marks which intervals are silences or punctuation
            for interval in pha.intervals:
                mark = interval.mark

                # flag silences/punctuation so their durations can be merged later
                flags.append(mark in self.sil_phones or mark in self.punctuations)

                if mark in self.sil_phones:
                    mark = self.phones_mapper[mark]

                # interval length in seconds, converted to mel-spectrogram frames
                dur = interval.duration() * (self.sample_rate / self.hop_size)
                durations.append(round(dur))
                phs.append(mark)

            # merge each run of consecutive silence/punctuation intervals into one duration
            new_durations = []
            for idx, (flag, dur) in enumerate(zip(flags, durations)):
                if len(new_durations) == 0 or (flag and not flags[idx - 1]):
                    new_durations.append(dur)
                elif flag:
                    new_durations[-1] += dur
                else:
                    new_durations.append(dur)

            # collapse each SIL/punctuation run in the phoneme string down to its first
            # punctuation symbol, mirroring the duration merging above
            full_ph = " ".join(phs)
            new_ph = full_ph
            matches = re.finditer(r"( ?SIL)* ?([,!\?\.;:] ?){1,} ?(SIL ?)*", full_ph)
            for match in matches:
                substring = full_ph[match.start() : match.end()]
                new_ph = new_ph.replace(
                    substring, f" {substring.replace('SIL', '').strip()[0]} ", 1
                ).strip()

            assert len(new_ph.split()) == len(new_durations)  # safety check
            base_name = f_name.split(".TextGrid")[0]
            np.save(
                os.path.join(self.output_durations_path, f"{base_name}-durations.npy"),
                np.array(new_durations).astype(np.int32),
                allow_pickle=False,
            )
            # one training line per utterance: {speaker}/{utterance}|{phonemes}|{speaker}
            data.append(f"{speaker_name}/{base_name}|{new_ph}|{speaker_name}\n")

Then to use the modified script above, you just have to specify the arguments to the parser.

args = {
    "dataset_path": "./en-bookbot",
    "txt_grid_path" : "./outputs/parsed", # (1)
    "output_durations_path": "./en-bookbot/durations",
    "sample_rate" : 44100, # (2)
    "hop_size" : 512, # (3)
    "multi_speaker" : True, # (4)
    "training_file": "train.txt"
}

txt_grid_parser = TxtGridParser(**args)
txt_grid_parser.parse()
  1. Replace this with whatever your {TEXTGRID_OUTPUT_DIR} was.
  2. Set this to the desired sample rate of your text-to-speech model.
  3. Set this to the desired hop size of your text-to-speech model.
  4. Multi-speaker or not, you can keep this as True.

With the duration files located in durations/ and train.txt, we can finally train our own text-to-speech model!
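
Before kicking off training, a quick sanity check helps. The sketch below (assuming the same paths as the example arguments above) loads the first line of train.txt and its corresponding durations file, then confirms that the phoneme and duration counts match, just like the parser's assert does.

import numpy as np

with open("./en-bookbot/train.txt", encoding="utf-8") as f:
    first_line = f.readline().strip()

# Each line has the form {speaker}/{utterance}|{phonemes}|{speaker}.
utt_id, phonemes, speaker = first_line.split("|")
base_name = utt_id.split("/")[-1]
durations = np.load(f"./en-bookbot/durations/{base_name}-durations.npy")

# The parser guarantees these two lengths are equal.
print(len(phonemes.split()), "phonemes,", len(durations), "durations,", durations.sum(), "total frames")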