ResNet-50-THOR Example

Description

This is an example of training ResNet-50 V1.5 on the ImageNet2012 dataset with the second-order optimizer THOR. THOR is a novel approximate second-order optimization method in MindSpore. With fewer iterations, THOR can train ResNet-50 V1.5 to a top-1 accuracy of 75.9% in 72 minutes on 8 Ascend 910 chips, which is much faster than SGD with Momentum.

Model Architecture

The overall network architecture of ResNet-50 is shown below: link

Dataset

Dataset used: ImageNet2012

  • Dataset size: 224*224 color images in 1000 classes

    • Train: 1,281,167 images
    • Test: 50,000 images
  • Data format: JPEG

    • Note: Data will be processed in dataset.py (a pipeline sketch follows the folder structure below)
  • Download the dataset ImageNet2012

Unzip the ImageNet2012 dataset to any path you want; the folder structure should include the train and eval datasets as follows:

    ├── ilsvrc                  # train dataset
    └── ilsvrc_eval             # infer dataset
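
As noted above, preprocessing happens in dataset.py. The sketch below shows the general shape of such a MindData pipeline. It is not the repository's dataset.py: the operator choices and normalization constants are assumptions following common ImageNet settings, and API names follow MindSpore 1.x.

import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.common.dtype as mstype

def create_dataset_sketch(dataset_path, do_train, batch_size=32):
    # Read ImageNet from the folder-per-class layout shown above.
    data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_train)
    if do_train:
        trans = [C.RandomCropDecodeResize(224), C.RandomHorizontalFlip(prob=0.5)]
    else:
        trans = [C.Decode(), C.Resize(256), C.CenterCrop(224)]
    trans += [C.Normalize(mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375]),
              C.HWC2CHW()]
    data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
    data_set = data_set.map(operations=C2.TypeCast(mstype.int32), input_columns="label")
    return data_set.batch(batch_size, drop_remainder=True)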

Features

Classical first-order optimization algorithms such as SGD are cheap per iteration, but they converge slowly and require many iterations. Second-order optimization algorithms use the second-order derivatives of the objective function to accelerate convergence, reaching the optimum in fewer iterations. However, second-order methods are rarely applied to deep neural network training because of their high computational cost: the dominant cost is inverting the second-order information matrix (Hessian matrix, Fisher information matrix, etc.), whose time complexity is about O(n^3). Building on the existing natural gradient algorithm, we developed THOR, a practical second-order optimizer in MindSpore, which reduces the cost of the matrix inverse by approximating and trimming the Fisher information matrix. With eight Ascend 910 chips, THOR can complete ResNet50-v1.5-ImageNet training in 72 minutes.
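
To make the cost concrete, here is a minimal NumPy sketch of a single damped natural-gradient step (illustrative only, not THOR itself); the explicit inverse is the O(n^3) operation that THOR avoids, and the lr/damping defaults merely echo the lr_init and damping_init entries listed under Script Parameters below.

import numpy as np

# One damped natural-gradient step: theta <- theta - lr * (F + damping*I)^{-1} @ grad.
# The dense inverse below is the O(n^3) bottleneck; THOR replaces it with a
# factored approximation of F that is refreshed only every `frequency` steps.
def damped_natural_gradient_step(theta, grad, fisher, lr=0.045, damping=0.03):
    n = theta.size
    precond = np.linalg.inv(fisher + damping * np.eye(n))  # O(n^3)
    return theta - lr * precond @ grad

# Toy usage with a random symmetric positive semi-definite Fisher estimate.
rng = np.random.default_rng(0)
n = 8
a = rng.normal(size=(n, n))
fisher = a @ a.T / n
theta, grad = rng.normal(size=n), rng.normal(size=n)
theta = damped_natural_gradient_step(theta, grad, fisher)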

Environment Requirements

Quick Start

After installing MindSpore via the official website, you can start training and evaluation as follows:

  • Running on Ascend
# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]

# run evaluation example
sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]

For distributed training, an HCCL configuration file in JSON format needs to be created in advance. For details about the configuration file, you can refer to the HCCL_TOOL. A minimal sketch of such a rank table follows the GPU example below.

  • Running on GPU
# run distributed training example
sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]

# run evaluation example
sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
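
For reference, a rank table for a single server with 8 devices generally follows the layout sketched below. This is an illustration only: the field names follow the HCCL "version 1.0" format as commonly documented, and the server_id and device_ip values are placeholders, so generate the real file with the HCCL tool rather than copying this.

import json

# Placeholder single-server, 8-device rank table; server_id and device_ip
# are fake values and must come from your machine / the HCCL tool output.
rank_table = {
    "version": "1.0",
    "server_count": "1",
    "server_list": [{
        "server_id": "10.0.0.1",
        "device": [
            {"device_id": str(i), "device_ip": f"192.168.100.{i + 1}", "rank_id": str(i)}
            for i in range(8)
        ],
        "host_nic_ip": "reserve",
    }],
    "status": "completed",
}

with open("rank_table_8pcs.json", "w") as f:
    json.dump(rank_table, f, indent=4)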

Script Description

Script Code Structure

└── resnet_thor
    ├── README.md                                 # descriptions about resnet_thor
    ├── scripts
    │    ├── run_distribute_train.sh               # launch distributed training for Ascend
    │    ├── run_eval.sh                           # launch inference for Ascend
    │    ├── run_distribute_train_gpu.sh           # launch distributed training for GPU
    │    └── run_eval_gpu.sh                       # launch inference for GPU
    ├── src
    │    ├── crossentropy.py                       # CrossEntropy loss function
    │    ├── config.py                             # parameter configuration
    │    ├── dataset_helper.py                     # dataset help for minddata dataset
    │    ├── grad_reducer_thor.py                  # grad reducer for thor
    │    ├── model_thor.py                         # model for train
    │    ├── resnet_thor.py                        # resnet50_thor backbone
    │    ├── thor.py                               # thor optimizer
    │    ├── thor_layer.py                         # thor layer
    │    └── dataset.py                            # data preprocessing
    ├── eval.py                                   # infer script
    ├── train.py                                  # train script
    ├── export.py                                 # export checkpoint file into air file
    └── mindspore_hub_conf.py                     # config file for mindspore hub repository

Script Parameters

Parameters for both training and inference can be set in config.py.

  • Parameters for Ascend 910
"class_num": 1001,                # dataset class number
"batch_size": 32,                 # batch size of input tensor(only supports 32)
"loss_scale": 128,                # loss scale
"momentum": 0.9,                  # momentum of THOR optimizer
"weight_decay": 5e-4,             # weight decay
"epoch_size": 45,                 # only valid for taining, which is always 1 for inference
"save_checkpoint": True,          # whether save checkpoint or not
"save_checkpoint_epochs": 1,      # the epoch interval between two checkpoints. By default, the checkpoint will be saved every epoch
"keep_checkpoint_max": 15,        # only keep the last keep_checkpoint_max checkpoint
"save_checkpoint_path": "./",     # path to save checkpoint relative to the executed path
"use_label_smooth": True,         # label smooth
"label_smooth_factor": 0.1,       # label smooth factor
"lr_init": 0.045,                 # learning rate init value
"lr_decay": 6,                    # learning rate decay rate value
"lr_end_epoch": 70,               # learning rate end epoch value
"damping_init": 0.03,             # damping init value for Fisher information matrix
"damping_decay": 0.87,            # damping decay rate
"frequency": 834,                 # the step interval to update second-order information matrix(should be divisor of the steps of per epoch)
  • Parameters for GPU
"class_num": 1001,                # dataset class number
"batch_size": 32,                 # batch size of input tensor
"loss_scale": 128,                # loss scale
"momentum": 0.9,                  # momentum of THOR optimizer
"weight_decay": 5e-4,             # weight decay
"epoch_size": 40,                 # only valid for taining, which is always 1 for inference
"save_checkpoint": True,          # whether save checkpoint or not
"save_checkpoint_epochs": 1,      # the epoch interval between two checkpoints. By default, the checkpoint will be saved every epoch
"keep_checkpoint_max": 15,        # only keep the last keep_checkpoint_max checkpoint
"save_checkpoint_path": "./",     # path to save checkpoint relative to the executed path
"use_label_smooth": True,         # label smooth
"label_smooth_factor": 0.1,       # label smooth factor
"lr_init": 0.05672,               # learning rate init value
"lr_decay": 4.9687,               # learning rate decay rate value
"lr_end_epoch": 50,               # learning rate end epoch value
"damping_init": 0.02345,          # damping init value for Fisher information matrix
"damping_decay": 0.5467,          # damping decay rate
"frequency": 834,                 # the step interval to update second-order information matrix(should be divisor of the steps of per epoch)

Due to operator limitations, the batch size currently only supports 32 on Ascend. In addition, the update frequency of the second-order information matrix must be set to a divisor of the number of steps per epoch (for example, 834 is a divisor of 5004). In short, our algorithm is not very flexible in setting these parameters because of limitations of the framework and operators, but we will address these problems in future versions. The sketch below illustrates the divisor constraint and the decay-style hyper-parameters.
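
The snippet below makes the divisor constraint concrete and sketches one plausible shape for the decaying hyper-parameters. It is illustrative only, not code from this repository: the exact learning-rate and damping schedules live in train.py, so the polynomial and exponential forms here are assumptions.

# Steps per epoch for ImageNet2012 with batch_size=32 on 8 devices.
IMAGES, BATCH_SIZE, DEVICES = 1281167, 32, 8
steps_per_epoch = IMAGES // (BATCH_SIZE * DEVICES)   # 5004
frequency = 834
assert steps_per_epoch % frequency == 0              # 5004 = 6 * 834
print(steps_per_epoch // frequency, "matrix updates per epoch")

# Assumed decay shapes; the real schedules are defined in train.py.
def lr_at_epoch(epoch, lr_init=0.045, lr_decay=6, lr_end_epoch=70):
    return lr_init * max(0.0, 1.0 - epoch / lr_end_epoch) ** lr_decay

def damping_at_epoch(epoch, damping_init=0.03, damping_decay=0.87):
    return damping_init * damping_decay ** epoch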

Training Process

Ascend 910

  sh run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [DEVICE_NUM]

This script requires three parameters:

  • RANK_TABLE_FILE: the path of rank_table.json
  • DATASET_PATH: the path of the train dataset.
  • DEVICE_NUM: the device number for distributed train.

Training results will be stored in the current path, in a folder whose name begins with "train_parallel". There you can find checkpoint files, together with results like the following in the log.

...
epoch: 1 step: 5004, loss is 4.4182425
epoch: 2 step: 5004, loss is 3.740064
epoch: 3 step: 5004, loss is 4.0546017
epoch: 4 step: 5004, loss is 3.7598825
epoch: 5 step: 5004, loss is 3.3744206
......
epoch: 40 step: 5004, loss is 1.6907625
epoch: 41 step: 5004, loss is 1.8217756
epoch: 42 step: 5004, loss is 1.6453942
...

GPU

sh run_distribute_train_gpu.sh [DATASET_PATH] [DEVICE_NUM]

Training results will be stored in the current path, in a folder whose name begins with "train_parallel". There you can find checkpoint files, together with results like the following in the log.

...
epoch: 1 step: 5004, loss is 4.2546034
epoch: 2 step: 5004, loss is 4.0819564
epoch: 3 step: 5004, loss is 3.7005644
epoch: 4 step: 5004, loss is 3.2668946
epoch: 5 step: 5004, loss is 3.023509
......
epoch: 36 step: 5004, loss is 1.645802
...

Evaluation Process

Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g., "username/resnet_thor/train_parallel0/resnet-42_5004.ckpt".

Ascend 910

  sh run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH]

This script requires two parameters:

  • DATASET_PATH: the path of the evaluation dataset.
  • CHECKPOINT_PATH: the absolute path of the checkpoint file.

The checkpoint can be produced during the training process.
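
For orientation, restoring a trained checkpoint in MindSpore generally looks like the minimal sketch below. The Dense layer is a stand-in for illustration; the actual eval.py constructs the resnet50_thor network from src/ before loading weights.

import mindspore.nn as nn
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Stand-in network for illustration only; eval.py builds resnet50_thor instead.
net = nn.Dense(2048, 1001)
param_dict = load_checkpoint("train_parallel0/resnet-42_5004.ckpt")  # produced by training
load_param_into_net(net, param_dict)
net.set_train(False)  # switch to inference mode before evaluating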

Inference results will be stored in the example path, in a folder named "eval". There you can find results like the following in the log.

  result: {'top_5_accuracy': 0.9295574583866837, 'top_1_accuracy': 0.761443661971831} ckpt=train_parallel0/resnet-42_5004.ckpt

GPU

  sh run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]

Inference results will be stored in the example path, in a folder named "eval". There you can find results like the following in the log.

  result: {'top_5_accuracy': 0.9287972151088348, 'top_1_accuracy': 0.7597031049935979} ckpt=train_parallel/resnet-36_5004.ckpt

Model Description

Evaluation Performance

| Parameters | Ascend 910 | GPU |
| ---------- | ---------- | --- |
| Model Version | ResNet50-v1.5 | ResNet50-v1.5 |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; memory 755G | GPU (Tesla V100 SXM2); CPU 2.1GHz, 24 cores; memory 128G |
| Uploaded Date | 06/01/2020 (month/day/year) | 09/23/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| Training Parameters | epoch=45, steps per epoch=5004, batch_size=32 | epoch=40, steps per epoch=5004, batch_size=32 |
| Optimizer | THOR | THOR |
| Loss Function | Softmax Cross Entropy | Softmax Cross Entropy |
| Outputs | probability | probability |
| Loss | 1.6453942 | 1.645802 |
| Speed | 20.4 ms/step (8pcs) | 76 ms/step (8pcs) |
| Total time (to 75.9% top-1 accuracy) | 72 mins | 229 mins |
| Parameters (M) | 25.5 | 25.5 |
| Checkpoint for Fine tuning | 491M (.ckpt file) | 380M (.ckpt file) |
| Scripts | Link | Link |

Inference Performance

| Parameters | Ascend 910 | GPU |
| ---------- | ---------- | --- |
| Model Version | ResNet50-v1.5 | ResNet50-v1.5 |
| Resource | Ascend 910 | GPU |
| Uploaded Date | 06/01/2020 (month/day/year) | 09/23/2020 (month/day/year) |
| MindSpore Version | 0.3.0-alpha | 1.0.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| batch_size | 32 | 32 |
| Outputs | probability | probability |
| Accuracy | 76.14% | 75.97% |
| Model for inference | 98M (.air file) | |

Description of Random Situation

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
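
A minimal sketch of how such seeds are typically pinned in MindSpore scripts follows; the exact call sites are in train.py and dataset.py of this directory.

from mindspore.common import set_seed
import mindspore.dataset as ds

set_seed(1)             # global seed: weight initialization and parallel ops
ds.config.set_seed(1)   # MindData seed: shuffle order and random augmentations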

ModelZoo HomePage

Please check the official homepage.