Inference
Everything here can be followed along in Google Colab!
Load Models and Processor
Running inference with TensorFlowTTS models is quite straightforward once you know the expected inputs and outputs of each model. But first, we have to load pre-trained model weights. If you've followed the additional step of pushing model weights to the HuggingFace Hub, you can simply load the weights stored there! This also includes the processor that comes hand-in-hand with the text2mel model.
To be able to load private models, you must first log into HuggingFace Hub.
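A minimal non-interactive sketch of that login step, assuming your token lives in a (hypothetical) HF_TOKEN environment variable; in a notebook, huggingface_hub's notebook_login() gives you an interactive token prompt instead:

```python
import os

from huggingface_hub import login

# Assumption: the access token is stored in an HF_TOKEN environment variable.
# In a notebook you can call notebook_login() for an interactive prompt instead.
token = os.environ.get("HF_TOKEN")
if token is not None:
    login(token=token)
```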
```
Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful
```
You can then load private models by specifying use_auth_token=True.
```python
from tensorflow_tts.inference import TFAutoModel, AutoProcessor
import tensorflow as tf

vocoder = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-en", use_auth_token=True)
text2mel = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-en", use_auth_token=True)
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-en", use_auth_token=True)
processor.mode = "eval"  # change processor from train to eval mode
```
Tokenization
Then, we'll need to tokenize the raw text into its corresponding input IDs, which we can achieve by simply calling the text_to_sequence method of our processor.
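The call itself is a one-liner. To illustrate what text_to_sequence does under the hood, here is a toy version with a made-up symbol table (the real table comes from the pre-trained processor): each symbol in the (phonemized) text is looked up in a symbol-to-ID mapping.

```python
# Toy symbol table -- the real one is loaded with the pre-trained processor.
symbol_to_id = {"@pad": 0, "h": 1, "i": 2, "@eos": 3}

def toy_text_to_sequence(text: str) -> list:
    # Map each known symbol to its integer ID and append an end-of-sequence token.
    return [symbol_to_id[s] for s in text if s in symbol_to_id] + [symbol_to_id["@eos"]]

input_ids = toy_text_to_sequence("hi")
print(input_ids)  # [1, 2, 3]
```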
LightSpeech Inference
To perform inference of LightSpeech models, you will need to specify 5 different inputs:
- Input IDs
- Speaker ID
- Speed Ratio
- Pitch Ratio
- Energy Ratio
We already have our input IDs -- we just need to specify which speaker index we'd like to use and, optionally, the other ratios. For now, we'll only parameterize the speaker ID and set the rest to the default value of 1.0.
Keep in mind that TensorFlow models expect inputs to be batched. This is why we apply tf.expand_dims to our input IDs (making them a batch of size 1), and why the other inputs are lists rather than raw scalar values.
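To see what this batching does to the shapes, here is a NumPy sketch (np.expand_dims behaves like tf.expand_dims in this regard; the ID values are made up):

```python
import numpy as np

input_ids = np.array([12, 7, 42, 3])    # shape (4,): a single, unbatched sequence
batched = np.expand_dims(input_ids, 0)  # shape (1, 4): a batch of size 1

print(input_ids.shape, batched.shape)  # (4,) (1, 4)
```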
Moreover, LightSpeech (at least our implementation) returns 3 things:
- Mel-Spectrogram
- Duration Predictions
- Pitch (F0) Prediction
We'll only be keeping the first (index 0) and ignore the rest.
Note
This is where understanding our model's outputs becomes important: we need to know at which index our desired output sits. For instance, FastSpeech returns an additional mel-spectrogram (often called mel_after, "after" meaning after the initial mel-spectrogram prediction has additionally been passed through a Tacotron PostNet module), while LightSpeech has only one mel-spectrogram output, located at index 0.
```python
speaker_id = 0

mel_spectrogram, *_ = text2mel.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([speaker_id], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)
```
Multi-Band MelGAN Inference
After that, we can take the output mel-spectrogram generated by LightSpeech and use that as input to our MB-MelGAN model.
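The vocoder turns the batched mel-spectrogram into a batched waveform, so the result still carries batch and channel axes that need to be squeezed away before saving. The NumPy sketch below (with made-up frame and hop sizes -- the real values depend on your vocoder's config) only illustrates that shape bookkeeping:

```python
import numpy as np

# Made-up sizes for illustration: 100 mel frames, hop size of 256 samples.
n_frames, hop_size = 100, 256

# Stand-in for the vocoder's output: shape [batch, samples, channels].
fake_vocoder_output = np.zeros((1, n_frames * hop_size, 1), dtype=np.float32)

# Drop the batch and channel axes to get a 1-D waveform for saving/playback.
audio = fake_vocoder_output[0, :, 0]
print(audio.shape)  # (25600,)
```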
Save Synthesized Audio as File
Finally, we can save the predicted audio waveform to a file using SoundFile. We just need to specify a few parameters: the output file name, the audio tensor, the sample rate, and the subtype.
And with that, we have just synthesized audio from pure text! If you're following along in a Jupyter Notebook, you can play the saved audio via IPython's Audio widget.