Inference
Everything here can be followed along in Google Colab!
Load Models and Processor
Running inference with TensorFlowTTS models is quite straightforward once you know the expected inputs and outputs of each model. But first, we have to load pre-trained model weights. If you've followed the additional step of pushing model weights to the HuggingFace Hub, you can simply load the weights stored there! This also includes the processor that comes hand-in-hand with the text2mel model.
To be able to load private models, you must first log into HuggingFace Hub.
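A minimal non-interactive sketch of that login step, assuming your token lives in a (hypothetical) HF_TOKEN environment variable; in a notebook, huggingface_hub's notebook_login() gives you an interactive token prompt instead:

```python
import os

from huggingface_hub import login

# Assumption: the access token is stored in an HF_TOKEN environment variable.
# In a notebook you can call notebook_login() for an interactive prompt instead.
token = os.environ.get("HF_TOKEN")
if token is not None:
    login(token=token)
```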
```
Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful
```
You can then load private models by specifying use_auth_token=True.
```python
from tensorflow_tts.inference import TFAutoModel, AutoProcessor
import tensorflow as tf

vocoder = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-en", use_auth_token=True)
text2mel = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-en", use_auth_token=True)
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-en", use_auth_token=True)
processor.mode = "eval"  # change processor from train to eval mode
```
Tokenization
Then, we'll need to tokenize the raw text into its corresponding input IDs, which we can achieve by simply calling the text_to_sequence method of our processor.
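The call itself is a one-liner. To illustrate what text_to_sequence does under the hood, here is a toy version with a made-up symbol table (the real table comes from the pre-trained processor): each symbol in the (phonemized) text is looked up in a symbol-to-ID mapping.

```python
# Toy symbol table -- the real one is loaded with the pre-trained processor.
symbol_to_id = {"@pad": 0, "h": 1, "i": 2, "@eos": 3}

def toy_text_to_sequence(text: str) -> list:
    # Map each known symbol to its integer ID and append an end-of-sequence token.
    return [symbol_to_id[s] for s in text if s in symbol_to_id] + [symbol_to_id["@eos"]]

input_ids = toy_text_to_sequence("hi")
print(input_ids)  # [1, 2, 3]
```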
LightSpeech Inference
To perform inference of LightSpeech models, you will need to specify 5 different inputs:
- Input IDs
- Speaker ID
- Speed Ratio
- Pitch Ratio
- Energy Ratio
We already have our input IDs -- we just need to specify which speaker index we'd like to use and, optionally, the other ratios. For now, we'll only parameterize the speaker ID and set the rest to the default value of 1.0.
Keep in mind that TensorFlow models expect inputs to be batched. This is why we apply tf.expand_dims to our input IDs (making them a batch of size 1), and why the other inputs are lists rather than raw scalar values.
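To see what this batching does to the shapes, here is a NumPy sketch (np.expand_dims behaves like tf.expand_dims in this regard; the ID values are made up):

```python
import numpy as np

input_ids = np.array([12, 7, 42, 3])    # shape (4,): a single, unbatched sequence
batched = np.expand_dims(input_ids, 0)  # shape (1, 4): a batch of size 1

print(input_ids.shape, batched.shape)  # (4,) (1, 4)
```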
Moreover, LightSpeech (at least our implementation) returns 3 things:
- Mel-Spectrogram
- Duration Predictions
- Pitch (F0) Prediction
We'll only be keeping the first (index 0) and ignore the rest.
Note
This is where understanding our model's outputs becomes important: we need to know at which index our desired output sits. For instance, FastSpeech returns an additional mel-spectrogram (often called mel_after, "after" meaning after the initial mel-spectrogram prediction has additionally been passed through a Tacotron PostNet module), while LightSpeech has only one mel-spectrogram output, located at index 0.
```python
speaker_id = 0

mel_spectrogram, *_ = text2mel.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([speaker_id], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
```
Multi-Band MelGAN Inference
After that, we can take the output mel-spectrogram generated by LightSpeech and use that as input to our MB-MelGAN model.
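The vocoder turns the batched mel-spectrogram into a batched waveform, so the result still carries batch and channel axes that need to be squeezed away before saving. The NumPy sketch below (with made-up frame and hop sizes -- the real values depend on your vocoder's config) only illustrates that shape bookkeeping:

```python
import numpy as np

# Made-up sizes for illustration: 100 mel frames, hop size of 256 samples.
n_frames, hop_size = 100, 256

# Stand-in for the vocoder's output: shape [batch, samples, channels].
fake_vocoder_output = np.zeros((1, n_frames * hop_size, 1), dtype=np.float32)

# Drop the batch and channel axes to get a 1-D waveform for saving/playback.
audio = fake_vocoder_output[0, :, 0]
print(audio.shape)  # (25600,)
```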
Save Synthesized Audio as File
Finally, we can save the predicted audio waveform to a file using SoundFile. We just need to specify a few parameters: the output file name, the audio tensor, the sample rate, and the subtype.
And with that, we have just synthesized audio from pure text! If you're following along in a Jupyter Notebook, you can play the saved audio via IPython's Audio widget.