- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training and Eval Process](#training-process)
- [Export MindIR](#convert-process)
- [Convert](#convert)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [DeepSpeech2 Description](#contents)
DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a diverse variety of speech, including noisy environments, accents, and different languages.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. "Deep Speech 2: End-to-end speech recognition in English and Mandarin."
# [Model Architecture](#contents)
The current reproduced model consists of:

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (hidden size is 1024)
- one projection layer (size is the number of characters plus 1 for the CTC blank symbol: 29)
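As a rough sanity check on the convolutional shapes above, the standard convolution output-length formula can be applied along the frequency axis. Note that the 161-bin spectrogram input and the padding values below are illustrative assumptions, not taken from this repository's code:

```python
def conv_out_len(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

# Assumed input: 161 frequency bins (16 kHz audio, 20 ms window -> 320-point FFT).
freq = conv_out_len(161, kernel=41, stride=2, padding=20)   # after conv layer 1
freq = conv_out_len(freq, kernel=41, stride=2, padding=20)  # after conv layer 2
print(freq)  # frequency bins per channel entering the first LSTM layer
```

Along the time axis, the strides [2, 2] and [2, 1] mean the time dimension is downsampled only by the first layer's stride of 2.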
# [Dataset](#contents)
Note that you can run the scripts based on the dataset mentioned in original paper or widely used in relevant domain/network architecture. In the following sections, we will introduce how to run the scripts using the related dataset below.
Parameters for training and evaluation can be set in the file `config.py`.
```text
config for training.
epochs                  number of training epochs, default is 70
```
```text
config for dataloader.
train_manifest train manifest path, default is 'data/libri_train_manifest.csv'
val_manifest dev manifest path, default is 'data/libri_val_manifest.csv'
batch_size batch size for training, default is 8
labels_path tokens json path for model output, default is "./labels.json"
sample_rate sample rate for the data/model features, default is 16000
window_size window size for spectrogram generation (seconds), default is 0.02
window_stride window stride for spectrogram generation (seconds), default is 0.01
window window type for spectrogram generation, default is 'hamming'
speed_volume_perturb    use random tempo and gain perturbations, default is False, not used in current model
spec_augment            use simple spectral augmentation on mel spectrograms, default is False, not used in current model
noise_dir               directory from which to inject noise into audio; if left at the default '', noise injection is not used, not used in current model
noise_prob              probability of noise being added per sample, default is 0.4, not used in current model
noise_min               minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
noise_max               maximum noise level to sample from, at most 1.0, default is 0.5, not used in current model
```
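To make the windowing parameters concrete, the sketch below (plain Python, independent of the training code) shows how `sample_rate`, `window_size`, and `window_stride` translate into STFT window/hop lengths and frame counts; the 1-second clip and the non-centered framing convention are assumptions for illustration:

```python
sample_rate = 16000   # Hz
window_size = 0.02    # seconds per analysis window
window_stride = 0.01  # seconds between successive windows

win_length = int(sample_rate * window_size)    # samples per window
hop_length = int(sample_rate * window_stride)  # samples between frames

# Frames in a 1-second clip (simple non-centered STFT convention):
n_samples = sample_rate
n_frames = 1 + (n_samples - win_length) // hop_length
print(win_length, hop_length, n_frames)
```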
```text
config for model.
rnn_type type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
hidden_size hidden size of RNN Layer, default is 1024
hidden_layers number of RNN layers, default is 5
lookahead_context look ahead context, default is 20, not used in current model
```
```text
config for optimizer.
learning_rate initial learning rate, default is 3e-4
learning_anneal annealing applied to learning rate after each epoch, default is 1.1
weight_decay weight decay, default is 1e-5
momentum momentum, default is 0.9
eps Adam eps, default is 1e-8
betas Adam betas, default is (0.9, 0.999)
loss_scale loss scale, default is 1024
```
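The `learning_anneal` factor divides the learning rate after every epoch, yielding an exponentially decaying schedule. A minimal sketch, assuming per-epoch division as the description suggests:

```python
learning_rate = 3e-4   # initial learning rate
learning_anneal = 1.1  # divisor applied after each epoch
epochs = 70

schedule = []
lr = learning_rate
for _ in range(epochs):
    schedule.append(lr)
    lr /= learning_anneal  # anneal after the epoch completes

print(schedule[0], schedule[-1])  # first and last epoch learning rates
```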
```text
config for checkpoint.
ckpt_file_name_prefix ckpt_file_name_prefix, default is 'DeepSpeech'
ckpt_path path to save ckpt, default is 'checkpoints'
keep_checkpoint_max     maximum number of checkpoints to keep; older checkpoints are deleted, default is 10
```
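`keep_checkpoint_max` implements a simple rotation policy: once the limit is reached, saving a new checkpoint removes the oldest one. The plain-Python illustration below shows the policy only; it is not MindSpore's actual `ModelCheckpoint` implementation, and the file names are illustrative:

```python
from collections import deque

keep_checkpoint_max = 10
kept = deque()  # checkpoint file names, oldest first

def save_checkpoint(epoch, prefix="DeepSpeech"):
    """Record a new checkpoint name, evicting the oldest past the limit."""
    name = f"{prefix}-{epoch}.ckpt"
    kept.append(name)
    if len(kept) > keep_checkpoint_max:
        kept.popleft()  # the oldest checkpoint would be deleted from disk here
    return name

for epoch in range(1, 16):
    save_checkpoint(epoch)
print(list(kept))  # only the 10 most recent checkpoints remain
```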
# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch), which automatically download the dataset and process it. After the process, the