Introduction

This guide explains how to train a LightSpeech acoustic model and a Multi-band MelGAN vocoder model (with a HiFi-GAN discriminator). Specifically, we will train a 44.1 kHz model with a hop size of 512.
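To make these numbers concrete, here is a quick check (plain Python, illustrative only) of the frame rate and frame duration implied by 44.1 kHz audio with a hop of 512 samples:

```python
# Frame rate implied by the sample rate and hop size (illustrative arithmetic).
sample_rate = 44100   # Hz
hop_size = 512        # audio samples between consecutive mel frames

frames_per_second = sample_rate / hop_size
frame_duration_ms = 1000 * hop_size / sample_rate

print(f"{frames_per_second:.2f} frames/s")   # ~86.13 mel frames per second
print(f"{frame_duration_ms:.2f} ms/frame")   # ~11.61 ms of audio per mel frame
```

In other words, the acoustic model predicts roughly 86 mel frames for each second of audio, and the vocoder must expand each frame back into 512 waveform samples.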

This guide assumes you are training an IPA-based model. For now, the tutorial supports only English and Indonesian, because these are the only two languages whose corresponding IPA-based grapheme-to-phoneme processors have been added to the custom fork.

For English we use gruut, and for Indonesian we use g2p_id. To add support for another language, you would need a grapheme-to-phoneme converter for that language and would have to add it as a processor in TensorFlowTTS. We will cover that in a separate tutorial in the future.
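The real converters (gruut, g2p_id) handle out-of-vocabulary words, stress, and language-specific rules, but the core idea of a grapheme-to-phoneme processor can be sketched as a lookup from words to IPA phoneme sequences. This is a toy sketch only; the lexicon entries below are illustrative and not taken from either library:

```python
# Toy grapheme-to-phoneme lookup. Real converters such as gruut or g2p_id
# also handle unknown words, stress marks, and per-language pronunciation rules.
toy_lexicon = {
    "hello": ["h", "ə", "l", "ˈoʊ"],
    "world": ["w", "ˈɝ", "l", "d"],
}

def toy_g2p(text):
    """Map each known word to its IPA phonemes; flag unknown words."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(toy_lexicon.get(word, ["<unk>"]))
    return phonemes

print(toy_g2p("Hello world"))
# ['h', 'ə', 'l', 'ˈoʊ', 'w', 'ˈɝ', 'l', 'd']
```

The acoustic model is then trained on these phoneme ID sequences rather than raw characters, which is why each supported language needs its own processor.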

LightSpeech

LightSpeech follows the same architecture as FastSpeech2, but with an optimized model configuration obtained via Neural Architecture Search (NAS). In our case, we do not perform NAS ourselves; we reuse the best configuration found in the original paper.

[Figure: FastSpeech2 architecture]

Multi-band MelGAN

Multi-band MelGAN is an improvement upon MelGAN that performs both waveform generation and waveform discrimination on a multi-band (subband) basis.
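A rough sketch of why the multi-band approach is cheaper: with N subbands, the generator predicts N downsampled signals instead of one full-rate waveform, so each output stream runs at 1/N of the audio sample rate. The numbers below assume the 4-band PQMF setup commonly used with Multi-band MelGAN, applied to our 44.1 kHz / hop 512 configuration:

```python
# Illustrative subband arithmetic for an assumed 4-band MB-MelGAN setup.
sample_rate = 44100
hop_size = 512
num_bands = 4  # MB-MelGAN commonly splits the waveform into 4 PQMF subbands

# Each subband signal is decimated by the number of bands.
subband_rate = sample_rate // num_bands   # effective sample rate per subband
subband_hop = hop_size // num_bands       # subband samples generated per mel frame

print(subband_rate)  # 11025
print(subband_hop)   # 128
```

The generator therefore only has to upsample each mel frame by a factor of 128 per band instead of 512, and a synthesis filter bank (PQMF) recombines the four subband signals into the final full-rate waveform.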

[Figure: Multi-band MelGAN architecture]

HiFi-GAN Discriminator

Further, instead of using the original MelGAN discriminator, we can use the discriminators presented in the HiFi-GAN paper.

[Figure: HiFi-GAN discriminators]

Specifically, we use the multi-period discriminator (MPD), shown on the right.
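The MPD's key trick is how it views the waveform: for each period p, it reshapes the 1D signal into a 2D grid of width p, so samples that are p steps apart line up in columns and periodic structure becomes visible to 2D convolutions. The periods [2, 3, 5, 7, 11] follow the HiFi-GAN paper; the zero-padding scheme below is a simplified sketch, not the exact implementation:

```python
# Toy illustration of the MPD's input reshaping: pad the waveform to a
# multiple of `period`, then split it into rows of that width. In the real
# discriminator each grid is then processed by a stack of 2D convolutions.
def reshape_for_period(waveform, period):
    """Zero-pad to a multiple of `period`, then split into rows of that width."""
    remainder = len(waveform) % period
    if remainder:
        waveform = waveform + [0.0] * (period - remainder)
    return [waveform[i:i + period] for i in range(0, len(waveform), period)]

samples = [float(i) for i in range(10)]
for p in (2, 3, 5, 7, 11):  # periods used in the HiFi-GAN paper
    grid = reshape_for_period(samples, p)
    print(f"period {p}: {len(grid)} rows x {len(grid[0])} cols")
```

Because the periods are prime, the different sub-discriminators see largely non-overlapping periodic patterns in the same waveform.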