Kaldi 2.0 Indonesian ASR

Indonesian speech/phoneme recognizer powered by Kaldi 2.0 (lhotse, icefall, sherpa). Trained on open-source speech data. Deployable on desktop (via Python/C++), web apps, iOS, and Android.

All models released here are trained with icefall (which runs on PyTorch) and converted for deployment via sherpa-ncnn. Icefall is Kaldi 2.0 / Next-Gen Kaldi; it unifies k2 (finite state automata, FSA) and lhotse (audio data loading).

Through this repository, we aim to document and release our open source models for the public's use.

Training Datasets

As of the time of writing, we use the following datasets to train our models:

- Common Voice
- FLEURS
- LibriVox Indonesia

Notably, these datasets only contain text annotations and no phoneme annotations. We used g2p ID to phonemize those text annotations, as illustrated in the sketch below.
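For illustration, the phonemization step looks roughly like the following minimal sketch. The g2p_id import path and the G2p class are assumptions based on g2p ID's documentation; consult the project for the exact API and output format.

# Hypothetical sketch of the phonemization step; the `g2p_id` module
# and `G2p` class names are assumed, check the g2p ID project for
# the exact API.
from g2p_id import G2p

g2p = G2p()
# returns per-word phoneme sequences, which serve as training targets
# in place of the missing phoneme annotations
print(g2p("Selamat pagi, apa kabar?"))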

Moreover, LibriVox Indonesia's original annotations are written in the old Indonesian Republican Spelling System (Edjaan Repoeblik). We pre-converted them to EYD (Ejaan yang Disempurnakan) via Doeloe before phonemizing them.
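Purely to illustrate the kind of mapping the 1972 spelling reform introduced (this is not Doeloe's actual API), here is a hypothetical single-pass substitution over the best-known grapheme changes:

import re

# Old (Republican) -> EYD grapheme mappings from the 1972 spelling reform.
# A single-pass regex is used so that "dj" -> "j" is not afterwards
# re-mapped by "j" -> "y"; longer keys are matched first.
OLD_TO_EYD = {
    "oe": "u", "dj": "j", "tj": "c", "nj": "ny",
    "sj": "sy", "ch": "kh", "j": "y",
}
PATTERN = re.compile("|".join(sorted(OLD_TO_EYD, key=len, reverse=True)))

def to_eyd(text: str) -> str:
    return PATTERN.sub(lambda m: OLD_TO_EYD[m.group(0)], text)

print(to_eyd("djoedjoer"))  # -> "jujur"

Real-world conversion is subtler than this (loanwords and proper nouns often keep their old spellings), which is why we rely on Doeloe rather than naive substitution.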

Available Models

Pruned Stateless Zipformer RNN-T Streaming ID (Phonemes)

| Model Format | Link |
| --- | --- |
| Icefall | Pruned Stateless Zipformer RNN-T Streaming ID |
| Sherpa NCNN | Sherpa-ncnn Pruned Stateless Zipformer RNN-T Streaming ID |
| Sherpa ONNX | TBA |

Results (Phoneme Error Rate, PER)

| Decoding | LibriVox | FLEURS | Common Voice |
| --- | --- | --- | --- |
| Greedy Search | 4.87% | 11.45% | 14.97% |
| Modified Beam Search | 4.71% | 11.25% | 14.31% |
| Fast Beam Search | 4.85% | 12.55% | 14.89% |

Usage

There are various ways to export and deploy these models for production. Sherpa, Kaldi 2.0's main deployment framework, has counterparts for running on the NCNN and ONNX engines (sherpa-ncnn and sherpa-onnx). You can also use these models directly via icefall, but that requires a working PyTorch installation and is not optimized for production.

Below, we link to Sherpa's thorough documentation for each deployment target, followed by Python usage examples (which require the sherpa-ncnn Python package) for recognizing a file and for real-time recognition with a microphone.

| Inference Framework | Platform | Language | Link |
| --- | --- | --- | --- |
| Sherpa | Desktop | C++ | Guide |
| Sherpa NCNN | Desktop | Python | Guide |
| Sherpa NCNN | Android | Kotlin | Guide |
| Sherpa NCNN | iOS | Swift | Guide |
| Sherpa ONNX | Desktop | Python | Guide |
| Sherpa ONNX | Android | Kotlin | Guide |
| Sherpa ONNX | iOS | Swift | Guide |

Example: Recognize a File (Python - Sherpa NCNN)

The following code is adapted from this example. View this example running in our live demo!

import wave
import numpy as np
import sherpa_ncnn

path = "./sherpa-ncnn-pruned-transducer-stateless7-streaming-id"

def main():
    recognizer = sherpa_ncnn.Recognizer(
        tokens=f"{path}/tokens.txt",
        encoder_param=f"{path}/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin=f"{path}/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param=f"{path}/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin=f"{path}/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param=f"{path}/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin=f"{path}/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )

    filename = "path/to/your/audio.wav"  # replace with your own WAV file
    with wave.open(filename) as f:
        assert f.getframerate() == recognizer.sample_rate, (
            f.getframerate(),
            recognizer.sample_rate,
        )
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        # convert 16-bit PCM to float32 in [-1, 1), as the recognizer expects
        samples_float32 = samples_int16.astype(np.float32) / 32768

    recognizer.accept_waveform(recognizer.sample_rate, samples_float32)

    # feed ~0.5 s of trailing silence to flush the final frames
    # through the streaming model before reading the result
    tail_paddings = np.zeros(int(recognizer.sample_rate * 0.5), dtype=np.float32)
    recognizer.accept_waveform(recognizer.sample_rate, tail_paddings)

    recognizer.input_finished()
    print(recognizer.text)


if __name__ == "__main__":
    main()
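Note that the WAV file must match what the asserts above check: the model's expected sample rate (recognizer.sample_rate), mono, 16-bit PCM. If your audio differs, resample or convert it first, for example with ffmpeg or sox.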

Example: Real-time Recognition with a Microphone (Python - Sherpa NCNN)

The following code is adapted from this example. View this example running in our live demo!

import sys
import sounddevice as sd
import sherpa_ncnn

path = "./sherpa-ncnn-pruned-transducer-stateless7-streaming-id"

def create_recognizer():
    recognizer = sherpa_ncnn.Recognizer(
        tokens=f"{path}/tokens.txt",
        encoder_param=f"{path}/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin=f"{path}/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param=f"{path}/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin=f"{path}/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param=f"{path}/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin=f"{path}/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)  # flatten (n, 1) -> (n,)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print(result)


if __name__ == "__main__":
    devices = sd.query_devices()
    if len(devices) == 0:
        print("No microphone devices found")
        sys.exit(0)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')
    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
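The loop prints the running hypothesis whenever it changes; stop it with Ctrl + C. sounddevice is a Python binding to PortAudio, so if no input device is found, check that PortAudio is installed and that your system's default input device is configured.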

License

Our models and inference code are released under the Apache-2.0 license. Common Voice and LibriVox Indonesia are released into the public domain (CC0). FLEURS is licensed under Creative Commons Attribution (CC-BY).

References

@inproceedings{commonvoice:2020,
  author    = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title     = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages     = {4211--4215},
  year      = {2020}
}

@article{fleurs2022arxiv,
  author  = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  title   = {FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech},
  journal = {arXiv preprint arXiv:2205.12446},
  url     = {https://arxiv.org/abs/2205.12446},
  year    = {2022}
}