# VGG16 Example

## Description

This example is for VGG16 model training and evaluation.

## Requirements

- Install [MindSpore](https://www.mindspore.cn/install/en).

- Download the CIFAR-10 binary version dataset.

> Unzip the CIFAR-10 dataset to any path you want. The folder structure should be as follows:
> ```
> .
> ├── cifar-10-batches-bin  # train dataset
> └── cifar-10-verify-bin   # infer dataset
> ```
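
Before starting a long training run, it can help to confirm that the dataset path has the layout shown above. The following is a minimal sketch, not part of the example scripts; the helper name `missing_dataset_dirs` is hypothetical, and only the two folder names are taken from this README.

```python
from pathlib import Path

# Sub-folder names taken from the layout shown in this README.
EXPECTED_DIRS = ("cifar-10-batches-bin", "cifar-10-verify-bin")


def missing_dataset_dirs(data_path):
    """Return the expected sub-folders that do not exist under data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dataset_dirs("your_data_path")` returns an empty list when both folders are present.
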

## Running the Example

### Training

```
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```
The command above runs in the background; you can view the results in the file `out.train.log`.

After training, checkpoint files are saved under the script folder by default.

The loss values appear as follows:
```
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```
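
To track the loss over time without reading the log by hand, the matching lines can be parsed into numbers. This is a rough sketch that assumes every loss line follows the `epoch: N step: N, loss is X` pattern of the sample output above; the name `parse_losses` is hypothetical, and real logs may contain variations this pattern does not cover.

```python
import re

# Assumed format, based on the sample output: "epoch: 1 step: 781, loss is 2.093086".
LOSS_LINE = re.compile(
    r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def parse_losses(lines):
    """Return (epoch, step, loss) tuples for every line that matches LOSS_LINE."""
    records = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            epoch, step, loss = match.groups()
            records.append((int(epoch), int(step), float(loss)))
    return records
```

For example, `parse_losses(open("out.train.log"))` yields one tuple per logged loss line; lines that do not match are skipped.
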

### Evaluation

```
python eval.py --data_path=your_data_path --device_id=6 --checkpoint_path=./train_vgg_cifar10-70-781.ckpt > out.eval.log 2>&1 &
```
The command above runs in the background; you can view the results in the file `out.eval.log`.

The accuracy appears as follows:
```
# grep "result: " out.eval.log
result: {'acc': 0.92}
```
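
The `result:` line looks like a printed Python dictionary, so it can be read back with the standard library. The sketch below assumes the line has exactly the shape shown above; `parse_result` is a hypothetical helper, not part of `eval.py`.

```python
import ast


def parse_result(line):
    """Return the metrics dict from a line like "result: {'acc': 0.92}", or None."""
    prefix = "result: "
    index = line.find(prefix)
    if index == -1:
        return None
    # literal_eval only evaluates Python literals, so it does not execute log content.
    return ast.literal_eval(line[index + len(prefix):].strip())
```

If the text after `result: ` is not a valid Python literal, `ast.literal_eval` raises an exception, which is a reasonable signal that the log format differs from the assumption.
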

### Distributed Training
```
sh run_distribute_train.sh rank_table.json your_data_path
```
The shell script above runs distributed training in the background; you can view the results in the files `train_parallel[X]/log`.

The loss values appear as follows:
```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```
> For details about `rank_table.json`, refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
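
When `grep` searches several files, it prefixes each match with the file name, as in the distributed training output above. The sketch below uses that prefix to keep the most recent loss for each `train_parallel[X]` folder. It assumes the line format of the sample output, and the name `last_loss_per_dir` is hypothetical.

```python
import re

# Assumed format: "train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308".
PARALLEL_LINE = re.compile(
    r"train_parallel(\d+)/log:epoch:\s*(\d+)\s+step:\s*(\d+),"
    r"\s*loss is\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def last_loss_per_dir(lines):
    """Map each train_parallel index to the (epoch, step, loss) of its last match."""
    latest = {}
    for line in lines:
        match = PARALLEL_LINE.search(line)
        if match:
            index, epoch, step, loss = match.groups()
            # Later lines overwrite earlier ones, so the last match per folder wins.
            latest[int(index)] = (int(epoch), int(step), float(loss))
    return latest
```

Feeding it the output of `grep "loss is " train_parallel*/log` gives a quick view of whether all parallel processes have reached the same epoch.
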

## Usage

### Training
```
usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--device_id DEVICE_ID]

parameters/options:
  --device_target       the training backend type, default is Ascend.
  --data_path           the storage path of dataset
  --device_id           the device used to train the model.
```
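
For readers who want to see how such options map onto `argparse`, here is a hypothetical reconstruction of a parser that accepts the three training options. It is not the actual contents of `train.py`: the option names and the `Ascend` default come from the usage text above, while the argument types and the other defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the three options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--device_target", type=str, default="Ascend",
                        help="the training backend type, default is Ascend.")
    # Assumption: no default is documented for the dataset path, so None is used.
    parser.add_argument("--data_path", type=str, default=None,
                        help="the storage path of dataset")
    # Assumption: device_id is an integer index; the usage text gives no type or default.
    parser.add_argument("--device_id", type=int, default=None,
                        help="the device used to train the model.")
    return parser
```

With this sketch, `build_parser().parse_args(["--data_path=your_data_path", "--device_id=6"])` mirrors the training command shown earlier in this README.
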

### Evaluation

```
usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--device_id DEVICE_ID][--checkpoint_path CKPT_PATH]

parameters/options:
  --device_target       the evaluation backend type, default is Ascend.
  --data_path           the storage path of dataset
  --device_id           the device used to evaluate the model.
  --checkpoint_path     the checkpoint file path used to evaluate the model.
```

### Distributed Training

```
Usage: sh script/run_distribute_train.sh [MINDSPORE_HCCL_CONFIG_PATH] [DATA_PATH]

parameters/options:
  MINDSPORE_HCCL_CONFIG_PATH   HCCL configuration file path.
  DATA_PATH                    the storage path of dataset.
```