Contents
- WaveNet Description
- Model Architecture
- Dataset
- Environment Requirements
- Script Description
- Model Description
- ModelZoo Homepage
WaveNet Description
WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones. We support training and evaluation on both GPU and CPU.
Paper: van den Oord A, Dieleman S, Zen H, et al. WaveNet: A Generative Model for Raw Audio
Model Architecture
The current model consists of a pre-convolution layer, followed by several residual blocks, each using gated activation units together with residual and skip connections. Finally, post-convolution layers are added to predict the output distribution.
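As a rough illustration of one residual block (a NumPy sketch for intuition only, not the repository's implementation in wavenet_vocoder/modules.py; plain matrix products stand in for the dilated convolutions):

# Sketch of a single WaveNet residual block with a gated activation unit.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def residual_block(x, w_filter, w_gate, w_res, w_skip):
    # x: (channels, time). The gated activation multiplies a tanh "filter"
    # branch with a sigmoid "gate" branch, elementwise.
    z = np.tanh(w_filter @ x) * sigmoid(w_gate @ x)
    residual = w_res @ z + x  # residual connection feeding the next block
    skip = w_skip @ z         # skip connection collected by the post-network
    return residual, skip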
Dataset
In the following sections, we will introduce how to run the scripts using the dataset described below.
Dataset used: The LJ Speech Dataset
- Dataset size: 2.6 GB
- Data format: audio clips (13,100) and transcriptions
- The dataset structure is as follows:
.
└── LJSpeech-1.1
    ├── wavs          // audio clip files
    └── metadata.csv  // transcripts
Environment Requirements
- Hardware (GPU/CPU)
  - Prepare the hardware environment with a GPU or CPU processor.
- Framework
  - MindSpore
- For more information, please check the official MindSpore tutorials and Python API documentation.
Script Description
Script and Sample Code
Note that some of the scripts described below are not included in our code. These scripts should first be downloaded from r9y9 and added to this project.
.
├── audio
└──wavenet
├── scripts
│ ├──run_distribute_train_gpu.sh // launch distributed training with gpu platform(8p)
│ ├──run_eval_cpu.sh // launch evaluation with cpu platform
│ ├──run_eval_gpu.sh // launch evaluation with gpu platform
│ ├──run_standalone_train_cpu.sh // launch standalone training with cpu platform
│ └──run_standalone_train_gpu.sh // launch standalone training with gpu platform(1p)
├──datasets // Note the datasets folder should be downloaded from the above link
├──egs // Note the egs folder should be downloaded from the above link
├──utils // Note the utils folder should be downloaded from the above link
├── audio.py // Audio utils. Note this script should be downloaded from the above link
├── compute-meanvar-stats.py // Compute mean-variance normalization stats. Note this script should be downloaded from the above link
├── evaluate.py // Evaluation
├── export.py // Convert MindSpore model to AIR model
├── hparams.py // Hyper-parameter configuration. Note this script should be downloaded from the above link
├── mksubset.py // Make subset of dataset. Note this script should be downloaded from the above link
├── preprocess.py // Preprocess dataset. Note this script should be downloaded from the above link
├── preprocess_normalize.py // Perform meanvar normalization to preprocessed features. Note this script should be downloaded from the above link
├── README.md // Descriptions about WaveNet
├── train.py // Training scripts
├── train_pytorch.py // Note this script should be downloaded from the above link; its original name in that project is train.py
├── src
│ ├──__init__.py
│ ├──dataset.py // Generate dataloader and data processing entry
│ ├──callback.py // Callbacks to monitor the training
│ ├──lr_generator.py // Learning rate generator
│ └──loss.py // Loss function definition
└── wavenet_vocoder
├──__init__.py
├──conv.py // Extended 1D convolution
├──mixture.py // Loss function for training and sample function for testing
├──modules.py // Modules for Wavenet construction
├──upsample.py // Upsample layer definition
├──util.py // Utils. Note this script should be downloaded from the above link
├──wavenet.py // WaveNet networks
└──tfcompat // Note this folder should be downloaded from the above link
├──__init__.py
└──hparam.py // Param management tools
Script Parameters
Training
usage: train.py [--data_path DATA_PATH] [--preset PRESET]
[--checkpoint_dir CHECKPOINT_DIR] [--checkpoint CHECKPOINT]
[--speaker_id SPEAKER_ID] [--platform PLATFORM]
[--is_distributed IS_DISTRIBUTED]
options:
--data_path dataset path
--preset path of preset parameters (json)
--checkpoint_dir directory for saving model checkpoints, default is "./checkpoints"
--checkpoint pre-trained ckpt path
--speaker_id specific speaker of data in case of multi-speaker datasets, not used currently
--platform specify platform to be used, default is "GPU"
--is_distributed whether distributed training or not
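For example, a standalone GPU run with the gaussian preset might look like this (the paths and checkpoint directory are placeholders):

python train.py --data_path=/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/ \
                --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json \
                --checkpoint_dir=./checkpoints \
                --platform=GPU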
Evaluation
usage: evaluate.py [--data_path DATA_PATH] [--preset PRESET]
[--pretrain_ckpt PRETRAIN_CKPT] [--is_numpy]
[--output_path OUTPUT_PATH] [--speaker_id SPEAKER_ID]
[--platform PLATFORM]
options:
--data_path dataset path
--preset path of preset parameters (json)
--pretrain_ckpt pre-trained ckpt path
--is_numpy whether to use NumPy for inference or not
--output_path path to save synthesized audio
--speaker_id specific speaker of data in case of multi-speaker datasets, not used currently
--platform specify platform to be used, default is "GPU"
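For example, NumPy-based inference on GPU might look like this (paths and checkpoint name are placeholders):

python evaluate.py --data_path=/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/ \
                   --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json \
                   --pretrain_ckpt=./checkpoints/wavenet.ckpt \
                   --is_numpy \
                   --output_path=./synthesized \
                   --platform=GPU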
More parameters for training and evaluation can be set in the file hparams.py.
Training Process
Before your first training run, the dependency scripts should be downloaded and placed in the correct directories as described in Script and Sample Code. After that, the raw data should be pre-processed using the scripts in egs. The directory structure of egs is as follows:
.
├── egs
├──gaussian
│ ├──conf
│ │ ├──gaussian_wavenet.json
│ │ └──gaussian_wavenet_demo.json
│ └──run.sh
├──mol
│ ├──conf
│ │ ├──mol_wavenet.json
│ │ └──mol_wavenet_demo.json
│ └──run.sh
├──mulaw256
│ ├──conf
│ │ ├──mulaw_wavenet.json
│ │ └──mulaw_wavenet_demo.json
│ └──run.sh
└──README.md
In this project, three different losses are implemented to train the network:
- mulaw256: categorical output distribution. The input is an 8-bit mu-law quantized waveform.
- mol: discretized mixture of logistics (MoL) loss. The input is 16-bit raw audio.
- gaussian: mixture of Gaussians loss. The input is 16-bit raw audio.
The three folders gaussian, mol, and mulaw256 are used to generate the corresponding training data. For example, to generate the training data for the mixture of Gaussians loss, first modify run.sh at line 28, changing conf/gaussian_wavenet_demo.json to conf/gaussian_wavenet.json (this edit can also be scripted; see the sketch below). We use the default parameters in gaussian_wavenet.json. With this setting, the data is generated to suit the mixture of Gaussians loss, and some parameters in hparams.py are overridden by those in gaussian_wavenet.json. You can also define your own hyper-parameter JSON here. Note that if you want to change the value of a parameter, you may need to modify it in gaussian_wavenet.json rather than hparams.py, since gaussian_wavenet.json may override hparams.py. After the modification, run the following commands for data generation:
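If you prefer to script the line-28 edit, a minimal sketch (assuming the stock run.sh and the default demo preset name; run from inside egs/gaussian) is:

# Swap the demo preset for the full preset in run.sh; equivalent to editing line 28 by hand.
sed -i 's|conf/gaussian_wavenet_demo.json|conf/gaussian_wavenet.json|' run.sh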
bash run.sh --stage 0 --stop-stage 0 --db-root /path_to_dataset/LJSpeech-1.1/wavs
bash run.sh --stage 1 --stop-stage 1
After the processing, the directory of gaussian will be as follows:
.
└── gaussian
    ├──conf
    ├──data
    ├──exp
    └──dump
       └──lj
          └──logmelspectrogram
             ├──org
             └──norm
                ├──train_no_dev
                ├──dev
                └──eval
The train_no_dev folder contains the final training data. The process for mol and mulaw256 is the same. Once the training data is prepared, you can run the following commands to train the network:
Standalone training
GPU:
sh ./scripts/run_standalone_train_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]
CPU:
sh ./scripts/run_standalone_train_cpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]
Distributed training (8p)
sh ./scripts/run_distribute_train_gpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]
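As a concrete illustration (device id, paths, and checkpoint directory are placeholders; assumes the gaussian preset), a standalone GPU launch might be:

sh ./scripts/run_standalone_train_gpu.sh 0 /path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/ /path_to_egs/egs/gaussian/conf/gaussian_wavenet.json ./checkpoints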
Evaluation Process
WaveNet involves an auto-regressive generation process that currently cannot be run in Graph mode (i.e., the auto-regression cannot be placed into construct). Therefore, we implement the process in a plain function and provide two ways to realize it: using NumPy or using MindSpore ops. One can set is_numpy to determine which mode is used. We recommend NumPy, since it is much faster than using MindSpore ops: the auto-regressive process only calls simple operations such as MatMul and BiasAdd, and unlike Graph mode there is a fixed per-operator cost at each step, which leads to lower speed. For more information, please refer to this link.
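To make the cost model concrete, the loop has roughly the following shape (an illustrative Python sketch, not the repository's code; step_fn stands for the per-sample network evaluation):

# Illustrative sketch of sample-by-sample auto-regressive synthesis.
import numpy as np

def synthesize(step_fn, init_sample, num_samples):
    # step_fn(prev_sample) -> next_sample. Each call issues only a few tiny
    # MatMul/BiasAdd-style operations, so a fixed per-operator dispatch cost
    # paid on every step dominates unless the ops run as plain NumPy.
    samples = [init_sample]
    for _ in range(num_samples - 1):
        samples.append(step_fn(samples[-1]))
    return np.asarray(samples)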
Evaluation
GPU (using numpy):
sh ./scripts/run_eval_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] is_numpy [path_to_save_audio]
GPU (using MindSpore ops):
sh ./scripts/run_eval_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] [path_to_save_audio]
CPU:
sh ./scripts/run_eval_cpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] [is_numpy] [path_to_save_audio]
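For example, NumPy inference on GPU device 0 might be launched as follows (paths and checkpoint name are placeholders):

sh ./scripts/run_eval_gpu.sh 0 /path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/ /path_to_egs/egs/gaussian/conf/gaussian_wavenet.json ./checkpoints/wavenet.ckpt is_numpy ./synthesized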
Convert Process
GPU:
python export.py --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json --checkpoint_dir=path_to_dump_hparams --pretrain_ckpt=path_to_load_ckpt
CPU:
python export.py --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json --checkpoint_dir=path_to_dump_hparams --pretrain_ckpt=path_to_load_ckpt --platform=CPU
Model Description
Performance
Training Performance on GPU
Parameters | WaveNet |
---|---|
Resource | NV SMX2 V100-32G |
Uploaded Date | 01/14/2021 (month/day/year) |
MindSpore Version | 1.0.0 |
Dataset | LJSpeech-1.1 |
Training Parameters | 1p, epoch=600 (max), steps=1635 * epoch, batch_size=8, lr=1e-3 |
Optimizer | Adam |
Loss Function | SoftmaxCrossEntropyWithLogits / discretized_mix_logistic / mix_gaussian |
Loss | around 2.0 (mulaw256) / around 4.5 (mol) / around -6.0 (gaussian) |
Speed | 1p: 1.467 s/step |
Total time | 1p (mol/gaussian): around 4 days; 2p (mulaw256): around 1 week |
Checkpoint | 59.79M / 54.87M / 54.83M (.ckpt file) |
Scripts | WaveNet script |
Inference Performance on GPU
Audio samples will be demonstrated online soon.
ModelZoo Homepage
Please check the official homepage.