Duration Extraction
Forced Alignment
Recall that we had prepared a lexicon file that maps graphemes (words) to phonemes. These phonemes are then aligned to segments of the corresponding audio, and the durations of those segments are what we feed into models like LightSpeech. The end result of audio alignment looks something like the following:
Notice that chunks of audio are aligned to each word and its constituent phonemes. By the end of training, `TextGrid` files containing the alignment results will be generated. They look like the following:
```
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 7.398163265306122
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 7.398163265306122
        intervals: size = 34
        intervals [1]:
            xmin = 0
            xmax = 0.04
            text = ""
        intervals [2]:
            xmin = 0.04
            xmax = 0.26
            text = "he"
        intervals [3]:
            xmin = 0.26
            xmax = 0.44
            text = "is"
        intervals [4]:
            xmin = 0.44
            xmax = 0.48
            text = ""
        intervals [5]:
            xmin = 0.48
            xmax = 0.91
            text = "going"
        intervals [6]:
            xmin = 0.91
            xmax = 0.94
            text = ""
        intervals [7]:
            xmin = 0.94
            xmax = 1.05
            text = "to"
```
Montreal Forced Aligner (MFA) is an algorithm and a library that helps train these kinds of acoustic models. The `TextGrid` files are then parsed to obtain the phoneme durations, and it is these durations that LightSpeech learns via an L2 loss.
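The duration math itself is simple: an interval of `xmax - xmin` seconds spans `(xmax - xmin) * sample_rate / hop_size` spectrogram frames. A minimal sketch of this conversion (the sample rate and hop size values here are purely illustrative):

```python
# Convert a TextGrid interval (in seconds) into a frame count,
# which is the unit LightSpeech-style duration predictors train on.
def interval_to_frames(xmin: float, xmax: float, sample_rate: int, hop_size: int) -> int:
    """Number of spectrogram frames covered by the interval [xmin, xmax]."""
    return round((xmax - xmin) * sample_rate / hop_size)

# The word "is" in the sample TextGrid above spans 0.26 s to 0.44 s:
print(interval_to_frames(0.26, 0.44, 22050, 256))  # 16 frames
```

This is exactly the per-interval computation the parser script below performs before saving the durations to disk.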
External Aligner-free Models
Recent developments aim to completely remove the use of external aligner tools like MFA, and for good reason. Doing so not only simplifies the training pipeline by learning token durations on the fly, but also improves speech quality and speeds up alignment convergence, especially on longer sequences of text during inference.
Most of the latest models that do their own alignment learning use a combination of Connectionist Temporal Classification (CTC) and Monotonic Alignment Search (MAS) -- for instance Glow-TTS, VITS, and JETS, among many others. As far as I know, these types of models have yet to be integrated into TensorFlowTTS, although there is an existing branch that attempts to implement AlignTTS.
Installing Montreal Forced Aligner
To begin, install Montreal Forced Aligner. It is much easier to install via Conda Forge. In the same Conda environment, you can run the following commands.
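A sketch of the install step, assuming the `montreal-forced-aligner` package on the conda-forge channel (check the official MFA docs for the current package name):

```shell
# Install MFA from the conda-forge channel into the active environment
conda install -c conda-forge montreal-forced-aligner
```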
If you're not using Conda (e.g. upgrading from a non-Conda version, or installing from source), you can follow the official guide here.
To confirm that the installation was successful, you can run the version command, which will return the version you've installed (in my case, it's `2.0.5`).
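Assuming MFA's standard CLI, the check looks like:

```shell
# Print the installed MFA version
mfa version
```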
Training an MFA Aligner
With MFA installed, training an aligner model is as simple as running the `mfa train` command.
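The general shape of the invocation, with placeholder paths (mirroring how MFA 2.x takes the corpus, the pronunciation dictionary, and the output paths; consult `mfa train --help` for the exact signature of your version):

```shell
# CORPUS_DIR:          directory of audio files and their transcripts
# DICTIONARY_PATH:     the lexicon file mapping words to phonemes
# ACOUSTIC_MODEL_PATH: where the trained acoustic model (.zip) is saved
# TEXTGRID_OUTPUT_DIR: where the aligned TextGrid files are written
mfa train CORPUS_DIR DICTIONARY_PATH ACOUSTIC_MODEL_PATH TEXTGRID_OUTPUT_DIR
```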
Example
Using the same sample dataset and a lexicon file `lexicon.txt`, the command to run will be similar to the following:

```shell
mfa train ./en-bookbot ./lexicon.txt ./outputs/en_bookbot_acoustic_model.zip ./outputs/parsed --punctuation ""
```
Parsing TextGrid Files
We can ignore the acoustic model generated by MFA, although you can always keep it for other purposes. What's more important for us are the resultant `TextGrid` files located in your `{TEXTGRID_OUTPUT_DIR}`. You could then either use the original parser script or, as I usually do, a modified version of it.
TxtGridParser

```python
import os
import re
from dataclasses import dataclass

import numpy as np
import textgrid
from tqdm.auto import tqdm


@dataclass
class TxtGridParser:
    sample_rate: int
    multi_speaker: bool
    txt_grid_path: str
    hop_size: int
    output_durations_path: str
    dataset_path: str
    training_file: str = "train.txt"
    phones_mapper = {"sil": "SIL", "": "SIL"}
    sil_phones = set(phones_mapper.keys())
    punctuations = [";", "?", "!", ".", ",", ":"]

    def parse(self):
        speakers = (
            [
                i
                for i in os.listdir(self.txt_grid_path)
                if os.path.isdir(os.path.join(self.txt_grid_path, i))
            ]
            if self.multi_speaker
            else []
        )
        data = []

        if speakers:
            for speaker in speakers:
                file_list = os.listdir(os.path.join(self.txt_grid_path, speaker))
                self.parse_text_grid(file_list, data, speaker)
        else:
            file_list = os.listdir(self.txt_grid_path)
            self.parse_text_grid(file_list, data, "")

        with open(
            os.path.join(self.dataset_path, self.training_file), "w", encoding="utf-8"
        ) as f:
            f.writelines(data)

    def parse_text_grid(self, file_list: list, data: list, speaker_name: str):
        for f_name in tqdm(file_list):
            text_grid = textgrid.TextGrid.fromFile(
                os.path.join(self.txt_grid_path, speaker_name, f_name)
            )
            pha = text_grid[1]  # the phones tier
            durations = []
            phs = []
            flags = []

            for interval in pha.intervals:
                mark = interval.mark
                # flag silences and punctuation marks for later merging
                flags.append(mark in self.sil_phones or mark in self.punctuations)
                if mark in self.sil_phones:
                    mark = self.phones_mapper[mark]
                # convert the interval duration (seconds) into frames
                dur = interval.duration() * (self.sample_rate / self.hop_size)
                durations.append(round(dur))
                phs.append(mark)

            # merge runs of consecutive flagged (silence/punctuation)
            # intervals into a single duration
            new_durations = []
            for idx, (flag, dur) in enumerate(zip(flags, durations)):
                if len(new_durations) == 0 or (flag and not flags[idx - 1]):
                    new_durations.append(dur)
                elif flag:
                    new_durations[-1] += dur
                else:
                    new_durations.append(dur)

            # collapse "SIL , SIL"-like runs in the phoneme string into a
            # single punctuation token, matching the merged durations
            full_ph = " ".join(phs)
            new_ph = full_ph
            matches = re.finditer(r"( ?SIL)* ?([,!\?\.;:] ?){1,} ?(SIL ?)*", full_ph)
            for match in matches:
                substring = full_ph[match.start() : match.end()]
                new_ph = new_ph.replace(
                    substring, f" {substring.replace('SIL', '').strip()[0]} ", 1
                ).strip()

            assert len(new_ph.split()) == len(new_durations)  # safety check

            base_name = f_name.split(".TextGrid")[0]
            np.save(
                os.path.join(self.output_durations_path, f"{base_name}-durations.npy"),
                np.array(new_durations).astype(np.int32),
                allow_pickle=False,
            )
            data.append(f"{speaker_name}/{base_name}|{new_ph}|{speaker_name}\n")
```
Then, to use the modified script above, you just have to specify the arguments to the parser.
```python
args = {
    "dataset_path": "./en-bookbot",
    "txt_grid_path": "./outputs/parsed",  # (1)
    "output_durations_path": "./en-bookbot/durations",
    "sample_rate": 44100,  # (2)
    "hop_size": 512,  # (3)
    "multi_speaker": True,  # (4)
    "training_file": "train.txt",
}

txt_grid_parser = TxtGridParser(**args)
txt_grid_parser.parse()
```
1. Replace this with whatever your `{TEXTGRID_OUTPUT_DIR}` was.
2. Set this to the desired sample rate of your text-to-speech model.
3. Set this to the desired hop size of your text-to-speech model.
4. Multi-speaker or not, you can keep this as `True`.
With the duration files located in `durations/` and `train.txt` generated, we can finally train our own text-to-speech model!
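Before launching training, it can be worth sanity-checking that the parser's outputs line up. The snippet below is a minimal sketch using an illustrative `train.txt` line and duration list (the real values come from the files the parser just wrote):

```python
# Each train.txt line written by TxtGridParser has the shape:
#   "<speaker>/<base_name>|<phoneme string>|<speaker>"
line = "speaker1/utt_0001|HH IY SIL IH Z|speaker1\n"  # illustrative line

utt_path, phoneme_str, speaker = line.strip().split("|")
phonemes = phoneme_str.split()

# A matching per-phoneme duration list in frames, standing in for the
# "<base_name>-durations.npy" array the parser saved; values illustrative.
durations = [5, 12, 3, 9, 14]

# The invariant the parser asserts: exactly one duration per phoneme
assert len(durations) == len(phonemes)
print(f"{speaker}: {len(phonemes)} phonemes, {sum(durations)} frames")
```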