KALDI ASR (Automatic Speech Recognition)


Over the last few years, many Automatic Speech Recognition (ASR) frameworks have been developed for research purposes, and various open-source ASR toolkits are now available to research laboratories. A probably non-exhaustive list includes HTK [1], SONIC [2], [3], SPHINX [4], [5], RWTH [6], JULIUS [7], KALDI [8], the more recent ASR framework SIMON [9], and the relatively new system BAVIECA [10]. Deep Neural Networks (DNNs) are the latest hot topic in speech recognition. Since around 2010 many papers have been published in this area, and some of the largest companies (e.g. Google, Microsoft) are starting to use DNNs in their production systems. Indeed, new systems such as KALDI [8] have demonstrated that DNN techniques [11] can be easily incorporated to improve recognition performance in almost all recognition tasks.

Main Features

As stated on its official web site (https://kaldi-asr.org/), the KALDI ASR environment is worth considering mainly for the following reasons:

  • it’s “easy to use” (once you learn the basics, and assuming you understand the underlying science);
  • it’s “easy to extend and modify”;
  • it’s “redistributable”: unrestrictive license, community project;
  • if your work is functional or interesting, the KALDI team is open to including it and your example scripts in its central repository, which can bring more citations as others build on it.

In particular, KALDI is similar in aims and scope to HTK: the goal is modern, flexible code, written in C++, that is easy to modify and extend. The main reasons to use KALDI rather than other toolkits include the following features:

  • code-level integration with Finite State Transducers (FSTs);
    • compiling against the OpenFst toolkit (using it as a library);
  • extensive linear algebra support;
    • including a matrix library that wraps standard BLAS and LAPACK routines;
  • extensible design;
    • providing algorithms in the most generic form possible; for instance, decoders are templated on an object that provides a score indexed by a (frame, fst-input-symbol) tuple, which means that the decoder can work from any suitable source of scores, such as a neural net;
  • open license;
    • the code is licensed under Apache 2.0, which is one of the least restrictive licenses available;
  • complete recipes;
    • making complete recipes available for building speech recognition systems that work from widely available databases such as those provided by ELRA or the Linguistic Data Consortium (LDC).
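
The "extensible design" point above can be illustrated with a small sketch. KALDI itself implements this idea in C++; the Python below is only an analogy, and every name in it (ScoreSource, MatrixScores, FunctionScores, viterbi_decode) is hypothetical rather than part of KALDI's API. It assumes a toy graph in which every arc consumes exactly one frame, and shows a decoder that only ever asks its score source for the score of a (frame, input-symbol) pair, so any object offering that lookup can drive it.

```python
from typing import Callable, Dict, List, Protocol, Sequence, Set, Tuple


class ScoreSource(Protocol):
    """Hypothetical interface: anything that scores a (frame, symbol) pair."""

    def num_frames(self) -> int: ...

    def log_likelihood(self, frame: int, symbol: int) -> float: ...


class MatrixScores:
    """Scores read from a precomputed frames-by-symbols table."""

    def __init__(self, table: Sequence[Sequence[float]]) -> None:
        self.table = table

    def num_frames(self) -> int:
        return len(self.table)

    def log_likelihood(self, frame: int, symbol: int) -> float:
        return self.table[frame][symbol]


class FunctionScores:
    """Scores computed on demand, e.g. by a neural net (here any callable)."""

    def __init__(self, frames: int, fn: Callable[[int, int], float]) -> None:
        self.frames = frames
        self.fn = fn

    def num_frames(self) -> int:
        return self.frames

    def log_likelihood(self, frame: int, symbol: int) -> float:
        return self.fn(frame, symbol)


# An arc is (source state, destination state, input symbol, graph log-prob).
Arc = Tuple[int, int, int, float]


def viterbi_decode(arcs: Sequence[Arc], start: int, finals: Set[int],
                   scores: ScoreSource) -> Tuple[float, List[int]]:
    """Return the best log-score and its input-symbol sequence.

    Simplifying assumption: every arc consumes exactly one frame.
    """
    best: Dict[int, Tuple[float, List[int]]] = {start: (0.0, [])}
    for t in range(scores.num_frames()):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for src, dst, sym, graph_logp in arcs:
            if src not in best:
                continue
            cand = best[src][0] + graph_logp + scores.log_likelihood(t, sym)
            if dst not in nxt or cand > nxt[dst][0]:
                nxt[dst] = (cand, best[src][1] + [sym])
        best = nxt
    reachable = [best[s] for s in finals if s in best]
    if not reachable:
        raise ValueError("no path reaches a final state")
    return max(reachable, key=lambda item: item[0])


# Toy left-to-right graph: stay in state 0 on symbol 0, move to or stay in state 1 on symbol 1.
toy_arcs: List[Arc] = [(0, 0, 0, 0.0), (0, 1, 1, 0.0), (1, 1, 1, 0.0)]
toy_table = [[-0.1, -2.0], [-0.2, -1.5], [-3.0, -0.1]]

score_a, path_a = viterbi_decode(toy_arcs, 0, {1}, MatrixScores(toy_table))
score_b, path_b = viterbi_decode(
    toy_arcs, 0, {1}, FunctionScores(3, lambda t, s: toy_table[t][s]))
```

Both calls return the same best path, [0, 0, 1], because the decoder never looks at how the scores are produced. Swapping a Gaussian mixture model for a neural net then only requires a new score source, not a new decoder.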

It should be noted that releasing complete recipes is an important goal of KALDI. Since the code is publicly available under a license that permits modification and re-release, people are encouraged to release their own code, along with their script directories, in a format similar to KALDI’s own example scripts.

DNN (Deep Neural Network) in KALDI

Most of the text in this section is taken from “Deep Neural Networks in Kaldi”
(https://kaldi-asr.org/doc/dnn.html) with permission from the Author (Daniel Povey).

An active area of research such as Deep Neural Networks (DNNs) is difficult for a toolkit like KALDI to support well, because the state of the art changes constantly: code changes are required to keep up, and architectural decisions may need to be rethought.

KALDI currently contains two parallel implementations for DNN training. Both train deep neural networks whose last (output) layer is a softmax layer with an output dimension equal to the number of context-dependent states in the system (typically several thousand). The neural net is trained to predict the posterior probability of each context-dependent state. During decoding, the output probabilities are divided by the prior probability of each state to form a “pseudo-likelihood” that is used in place of the state emission probabilities in the HMM.
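
The division by the prior follows from Bayes' rule: p(x|s) = P(s|x) p(x) / P(s), and since p(x) is the same for every state within a frame it can be dropped during decoding. A minimal sketch of this step for a single frame is given below, working in the log domain where the division becomes a subtraction. The function names and the three-state example values are illustrative assumptions, not KALDI code.

```python
import math
from typing import List, Sequence


def log_softmax(activations: Sequence[float]) -> List[float]:
    """Log of the softmax output, computed stably by subtracting the maximum."""
    m = max(activations)
    log_sum = m + math.log(sum(math.exp(a - m) for a in activations))
    return [a - log_sum for a in activations]


def pseudo_log_likelihoods(log_posteriors: Sequence[float],
                           log_priors: Sequence[float]) -> List[float]:
    """log P(s|x) - log P(s) for each state s (equals log p(x|s) up to a constant)."""
    if len(log_posteriors) != len(log_priors):
        raise ValueError("posteriors and priors must cover the same states")
    return [lp - pr for lp, pr in zip(log_posteriors, log_priors)]


# Illustrative values only: three states, with state 0 far more frequent in training.
priors = [0.7, 0.2, 0.1]
activations = [2.0, 1.0, 1.0]  # output-layer activations for one frame

log_post = log_softmax(activations)
log_pseudo = pseudo_log_likelihoods(log_post, [math.log(p) for p in priors])

best_by_posterior = max(range(3), key=lambda s: log_post[s])
best_by_pseudo = max(range(3), key=lambda s: log_pseudo[s])
```

In this example state 0 has the highest posterior, but state 2 has the highest pseudo-likelihood once its small prior is divided out. This is the correction the HMM needs, since the decoding graph already accounts for how often each state occurs.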

The first implementation is described in [12, 13]. It supports Restricted Boltzmann Machine (RBM) pre-training [14, 15, 16], stochastic gradient descent training using NVIDIA Graphics Processing Units (GPUs), and discriminative training such as boosted MMI [17] and state-level minimum Bayes risk (sMBR) [18, 19]. The second implementation of DNNs in KALDI [20, 21, 22] was originally written to support parallel training on multiple CPUs. It has since been extended to support parallel GPU-based training, but it does not support discriminative training.

The first implementation is located in the code subdirectories nnet/ and nnetbin/ (see the code at: https://sourceforge.net/p/kaldi/code/HEAD/tree/trunk/src/), and is primarily maintained by Karel Vesely. The second is located in the code subdirectories nnet2/ and nnet2bin/, and is primarily maintained by Daniel Povey (this code was originally based on an earlier version of Karel’s code, but it has been extensively rewritten). Neither codebase is more “official” than the other, and both are still being developed in parallel.

(source: https://www3.pd.istc.cnr.it/piero/ASR/default_kaldi.htm)