- [Description of Random Situation](#description-of-random-situation)
@ -50,8 +50,9 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
# [Environment Requirements](#contents)
- Hardware(Ascend)
- Prepare hardware environment with Ascend processor. If you want to try Ascend , please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Hardware(Ascend/GPU)
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- For more information, please check the resources below:
@ -67,6 +68,8 @@ For FP16 operators, if the input data type is FP32, the backend of MindSpore wil
├─run_distribute_train_gpu.sh # launch distributed training with gpu platform(8p)
├─run_eval_gpu.sh # launch evaluating with gpu platform
├─run_standalone_train_ascend.sh # launch standalone training with ascend platform(1p)
├─run_distribute_train_ascend.sh # launch distributed training with ascend platform(8p)
└─run_eval_ascend.sh # launch evaluating with ascend platform
@ -125,6 +128,13 @@ sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
> This binds processor cores according to `device_num` and the total number of processors. If you do not want this behavior, remove the `taskset` operations in `scripts/run_distribute_train.sh`.
- GPU:
# distribute training example(8p)
sh scripts/run_distribute_train_gpu.sh DATA_PATH
### Launch
@ -135,11 +145,16 @@ sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
sh scripts/run_distribute_train_ascend.sh RANK_TABLE_FILE DATA_PATH DATA_DIR
# standalone training
sh scripts/run_standalone_train_ascend.sh DEVICE_ID DATA_DIR
# distribute training example(8p)
sh scripts/run_distribute_train_gpu.sh DATA_PATH
### Result
Training result will be stored in the example path. Checkpoints will be stored at `ckpt_path` by default, and training log will be redirected to `./log.txt` like following.
Training results are stored in the example path. Checkpoints are stored at `ckpt_path` by default, and the training log is redirected to `./log.txt`, as in the following examples.
- Ascend
epoch: 1 step: 1251, loss is 5.4833196
@ -150,6 +165,17 @@ epoch: 3 step: 1251, loss is 3.6242008
Epoch time: 288507.506, per step time: 230.622
epoch: 1 step: 1251, loss is 6.49775
Epoch time: 1487493.604, per step time: 1189.044
epoch: 2 step: 1251, loss is 5.6884665
Epoch time: 1421838.433, per step time: 1136.561
epoch: 3 step: 1251, loss is 5.5168786
Epoch time: 1423009.501, per step time: 1137.498
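
The log lines above follow a regular two-line pattern, so they can be checked mechanically. The following is a minimal sketch, not part of the repository: `parse_log` and the record field names are hypothetical, and it assumes that the logged `step` value is the number of steps in the epoch, so that the per step time should equal the epoch time divided by that value.

```python
import re

# Patterns matching the two kinds of log lines shown above.
STEP_RE = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([-+0-9.eE]+)")
TIME_RE = re.compile(r"Epoch time:\s*([0-9.]+),\s*per step time:\s*([0-9.]+)")


def parse_log(text):
    """Return one record per epoch (hypothetical helper, not a MindSpore API)."""
    records = []
    current = None
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            current = {
                "epoch": int(match.group(1)),
                "steps": int(match.group(2)),
                "loss": float(match.group(3)),
            }
            records.append(current)
            continue
        match = TIME_RE.search(line)
        if match and current is not None:
            # Attach the timing line to the epoch line that preceded it.
            current["epoch_time"] = float(match.group(1))
            current["per_step_time"] = float(match.group(2))
            current = None
    return records


# Sample copied from the log excerpt above.
SAMPLE = """\
epoch: 1 step: 1251, loss is 6.49775
Epoch time: 1487493.604, per step time: 1189.044
epoch: 2 step: 1251, loss is 5.6884665
Epoch time: 1421838.433, per step time: 1136.561
epoch: 3 step: 1251, loss is 5.5168786
Epoch time: 1423009.501, per step time: 1137.498
"""

records = parse_log(SAMPLE)
for rec in records:
    # Assumption: per step time = epoch time / number of steps.
    derived = rec["epoch_time"] / rec["steps"]
    print(rec["epoch"], rec["loss"], round(derived, 3), rec["per_step_time"])
```

For this excerpt the derived per step time agrees with the logged value to within rounding, which is a quick way to confirm that a log was parsed correctly before plotting loss or throughput.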
## [Eval process](#contents)
### Usage
@ -162,6 +188,12 @@ You can start training using python or shell scripts. The usage of shell scripts
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
sh scripts/run_eval_gpu.sh DATA_DIR CHECKPOINT_PATH
### Launch
@ -169,57 +201,67 @@ You can start training using python or shell scripts. The usage of shell scripts
sh scripts/run_eval_ascend.sh DEVICE_ID DATA_DIR CHECKPOINT_PATH
sh scripts/run_eval_gpu.sh DATA_DIR CHECKPOINT_PATH
> checkpoint can be produced in training process.
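
The two evaluation scripts above take different positional arguments, which is easy to get wrong when switching platforms. The sketch below only assembles the command line in the argument order shown above and does not run anything. `build_eval_command`, `EVAL_SCRIPTS`, and the example paths are hypothetical and not part of the repository.

```python
# Script path and positional argument order for each platform,
# taken from the usage lines above.
EVAL_SCRIPTS = {
    "ascend": ("scripts/run_eval_ascend.sh", ("device_id", "data_dir", "checkpoint_path")),
    "gpu": ("scripts/run_eval_gpu.sh", ("data_dir", "checkpoint_path")),
}


def build_eval_command(platform, **args):
    """Return the argv list for an evaluation run (hypothetical helper)."""
    if platform not in EVAL_SCRIPTS:
        raise ValueError(f"unknown platform: {platform!r}")
    script, names = EVAL_SCRIPTS[platform]
    missing = [name for name in names if name not in args]
    if missing:
        raise ValueError(f"missing arguments for {platform}: {missing}")
    # Keep the positional order expected by the shell script.
    return ["sh", script] + [str(args[name]) for name in names]


# Example with placeholder paths (assumptions, not real locations).
gpu_cmd = build_eval_command(
    "gpu", data_dir="/path/to/val", checkpoint_path="/path/to/model.ckpt"
)
ascend_cmd = build_eval_command(
    "ascend", device_id=0, data_dir="/path/to/val", checkpoint_path="/path/to/model.ckpt"
)
print(" ".join(gpu_cmd))
print(" ".join(ascend_cmd))
```

The returned list can be passed to `subprocess.run` from the repository root if launching from Python is preferred over calling the shell scripts directly.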
### Result
Evaluation result will be stored in the example path, you can find result like the following in `eval.log`.
Evaluation results are stored in the example path. You can find results like the following in `eval.log`.