- [Description of Random Situation](#Description-of-Random-Situation)
- [ModelZoo Homepage](#ModelZoo-Homepage)
## Description
This is an example of training BERT with the MLPerf v0.7 dataset using the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore. With fewer iterations, THOR can finish BERT-Large training in 14 minutes to a masked LM accuracy of 71.3% using 8 Ascend 910 chips, which is much faster than SGD with Momentum.
## Model Architecture
The architecture of BERT contains 3 embedding layers, which are used to look up token embeddings, position embeddings and segmentation embeddings. On top of these, BERT consists of a stack of Transformer encoder blocks. Finally, BERT is trained on two tasks: Masked Language Model and Next Sentence Prediction.
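To make the embedding step concrete, here is a minimal NumPy sketch of how the three embeddings are summed for one sequence; the table sizes and token IDs are illustrative assumptions, not the values used by bert_thor.

```python
import numpy as np

# Illustrative sizes only (not the bert_thor defaults).
vocab_size, max_seq_len, num_segments, hidden = 30522, 128, 2, 768
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, hidden))      # token embeddings
position_table = rng.normal(size=(max_seq_len, hidden))  # position embeddings
segment_table = rng.normal(size=(num_segments, hidden))  # segmentation embeddings

def embed(token_ids, segment_ids):
    """Sum the three BERT embeddings for a single sequence."""
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions] + segment_table[segment_ids]

x = embed(np.array([101, 7592, 2088, 102]), np.array([0, 0, 0, 0]))
print(x.shape)  # (4, 768); x then feeds the stack of Transformer encoder blocks
```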
## Dataset
Dataset used: MLPerf v0.7 dataset for BERT
- Dataset size: 9,600,000 samples
    - Train: 9,600,000 samples
    - Test: first 10,000 consecutive samples of the training set
- Data format: tfrecord
- Download and preprocess datasets
    - Note: Data will be processed using the scripts in [pretraining data creation](https://github.com/mlperf/training/tree/master/language_model/tensorflow/bert); that page walks users through producing the data files step by step. A loading sketch follows this list.
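As a rough illustration of what the preprocessed files look like to MindSpore, the sketch below reads the tfrecord output with `mindspore.dataset.TFRecordDataset`; the file paths and the exact column list are assumptions (the real input pipeline lives in `src/dataset.py`).

```python
import mindspore.dataset as ds

# Placeholder paths; point these at your preprocessed MLPerf BERT tfrecord
# shards and the schema file produced by the pretraining data creation scripts.
data_files = ["/path/to/part-00000-of-00500.tfrecord"]
schema_file = "/path/to/schema.json"

# Column names follow the standard BERT pretraining record layout; adjust them
# if your preprocessing step used different keys.
columns = ["input_ids", "input_mask", "segment_ids",
           "next_sentence_labels", "masked_lm_positions",
           "masked_lm_ids", "masked_lm_weights"]

dataset = ds.TFRecordDataset(data_files, schema_file, columns_list=columns,
                             shuffle=ds.Shuffle.GLOBAL)
dataset = dataset.batch(32, drop_remainder=True)
print(dataset.get_dataset_size())
```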
## Features
Classical first-order optimization algorithms such as SGD involve little computation per step, but they converge slowly and require many iterations. Second-order optimization algorithms use the second-order derivatives of the target function to accelerate convergence; they can reach the optimum of the model faster and in fewer iterations. However, second-order optimizers are rarely used in deep neural network training because of their high computational cost: the main cost lies in inverting the second-order information matrix (Hessian matrix, Fisher information matrix (FIM), etc.), whose time complexity is about O(n^3). Building on the existing natural gradient algorithm, we developed the second-order optimizer THOR in MindSpore, which approximates and trims the FIM to reduce the computational complexity of the matrix inversion. With eight Ascend 910 chips, THOR can complete BERT-Large training in 14 minutes.
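The sketch below illustrates the damped natural-gradient update that motivates THOR on a toy dense problem; it is only a NumPy illustration of why inverting the information matrix is the O(n^3) bottleneck, not THOR's per-layer factorized approximation.

```python
import numpy as np

n = 512                                  # toy parameter count
rng = np.random.default_rng(0)
theta = rng.normal(size=n)               # model parameters
grad = rng.normal(size=n)                # first-order gradient g
A = rng.normal(size=(n, n))
fisher = A @ A.T / n                     # stand-in for the FIM (symmetric positive definite)

lr, damping = 0.1, 1e-2
# Second-order step: theta <- theta - lr * (F + damping * I)^(-1) * g.
# Solving this n x n linear system costs O(n^3), which is exactly the cost
# THOR avoids by approximating and trimming the FIM per layer.
step = np.linalg.solve(fisher + damping * np.eye(n), grad)
theta = theta - lr * step
print(np.linalg.norm(step))
```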
## Environment Requirements
- Hardware (Ascend)
    - Prepare the hardware environment with an Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
After installing MindSpore via the official website, you can start training and evaluation as follows:
- Running on Ascend
```shell
# run distributed training example
sh scripts/run_distribute_pretrain.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_DIR] [SCHEMA_DIR] [RANK_TABLE_FILE]
# run evaluation example
python pretrain_eval.py
```
> For distributed training, a hccl configuration file with JSON format needs to be created in advance. About the configuration file, you can refer to the [HCCL_TOOL](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
## Script Description
### Script and Sample Code
```shell
├── model_zoo
    ├── official
        ├── nlp
            ├── bert_thor
                ├── README.md                        # description of bert_thor
                ├── scripts
                    ├── run_distribute_pretrain.sh   # launch distributed training for Ascend
                    └── run_standalone_pretrain.sh   # launch single training for Ascend
                ├── src
                    ├── bert_for_pre_training.py     # Bert for pretraining
                    ├── bert_model.py                # Bert model
                    ├── bert_net_config.py           # network config setting
                    ├── config.py                    # config setting used in dataset.py
                    ├── dataset.py                   # data operations used in run_pretrain.py
                    ├── dataset_helper.py            # dataset helper for minddata dataset
                    ├── evaluation_config.py         # config settings, will be used in finetune.py
                    ├── fused_layer_norm.py          # fused layernorm
                    ├── grad_reducer_thor.py         # grad_reducer_thor
                    ├── lr_generator.py              # learning rate generator
                    ├── model_thor.py                # Model
                    ├── thor_for_bert.py             # thor_for_bert
                    ├── thor_for_bert_arg.py         # thor_for_bert_arg
                    ├── thor_layer.py                # thor_layer
                    └── utils.py                     # utils
                ├── pretrain_eval.py                 # infer script
                └── run_pretrain.py                  # train script
```
### Script Parameters
Parameters for both training and inference can be set in config.py.
```shell
"device_target": 'Ascend', # device where the code will be implemented
"distribute": "false", # Run distribute
"epoch_size": "1", # Epoch size
"save_checkpoint_steps",: 1000, # Save checkpoint steps
"save_checkpoint_num": 1, # Save checkpoint numbers, default is 1
```
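For orientation, here is a hedged sketch of how parameters like these might be grouped in a Python config object; the `easydict` layout and the shown subset of keys are assumptions, so refer to the actual `src/config.py` for the authoritative list.

```python
from easydict import EasyDict as edict

# A partial, illustrative mirror of the parameters listed above; the real
# config.py may use different names, defaults and additional fields.
cfg = edict({
    "device_target": "Ascend",      # device where the code will run
    "distribute": "false",          # whether to run distributed training
    "epoch_size": "1",              # epoch size
    "save_checkpoint_steps": 1000,  # save a checkpoint every N steps
    "save_checkpoint_num": 1,       # number of checkpoints to keep
})

print(cfg.device_target, cfg.save_checkpoint_steps)
```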
### Training Process
#### Ascend 910
```shell
sh run_distribute_pretrain.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_DIR] [SCHEMA_DIR] [RANK_TABLE_FILE]
```
We need five parameters for this script:
- `DEVICE_NUM`: the device number for distributed training.
- `EPOCH_SIZE`: epoch size used in the model.
- `DATA_DIR`: data path; it is better to use an absolute path.
- `SCHEMA_DIR`: schema path; it is better to use an absolute path.
- `RANK_TABLE_FILE`: rank table file in JSON format.
Training results will be stored in the current path, in a folder whose name begins with the file name defined by the user. Under this folder you can find the checkpoint files, together with results like the following in the log.
```shell
...
epoch: 1, step: 1, outputs are [5.0842705], total_time_span is 795.4807660579681, step_time_span is 795.4807660579681
epoch: 1, step: 100, outputs are [4.4550357], total_time_span is 579.6836116313934, step_time_span is 5.855390016478721
epoch: 3, step: 2500, outputs are [1.265375], total_time_span is 26.374578714370
...
```
### Evaluation Process
Before running the command below, please check the checkpoint path used for evaluation and set it to the absolute full path, e.g., "username/bert_thor/LOG0/checkpoint_bert-3_1000.ckpt".
#### Ascend 910
```shell
python pretrain_eval.py
```
We need two parameters in evaluation_config.py for this script; a checkpoint-loading sketch follows the list.
- `DATA_FILE`: the evaluation dataset file.
- `FINETUNE_CKPT`: the absolute path of the checkpoint file.
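As a hedged illustration of how `FINETUNE_CKPT` is typically consumed in MindSpore, the sketch below restores a checkpoint into a network with `load_checkpoint`/`load_param_into_net`; the paths and the `restore` helper are placeholders, not the actual code in `pretrain_eval.py`.

```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Placeholder values mirroring the two evaluation_config.py entries.
DATA_FILE = "/absolute/path/to/eval.tfrecord"
FINETUNE_CKPT = "/absolute/path/to/checkpoint_bert-3_1000.ckpt"

def restore(net):
    """Load the pretrained weights referenced by FINETUNE_CKPT into `net`."""
    param_dict = load_checkpoint(FINETUNE_CKPT)   # read the .ckpt file
    load_param_into_net(net, param_dict)          # copy parameters into the network cell
    return net
```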
Inference results will be stored in the example path; you can find results like the following in the log.