update efficientnet scripts & nasnet cn readme

pull/8880/head
panfengfeng 4 years ago
parent 5e3b135130
commit 148fc597f6

@@ -1,24 +1,66 @@
# Contents

- [EfficientNet-B0 Description](#efficientnet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)

[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.

# [Model architecture](#contents)

The overall network architecture of EfficientNet-B0 is shown below:

[Link](https://arxiv.org/abs/1905.11946)
# [Dataset](#contents)

Dataset used: [ImageNet](http://www.image-net.org/)

- Dataset size: ~125G, about 1.28 million colorful images in 1000 classes
    - Train: 120G, 1,281,167 images
    - Test: 5G, 50,000 images
- Data format: RGB images.
    - Note: Data will be processed in src/dataset.py (a minimal loading sketch follows below)
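The loading pipeline below is a minimal sketch only: it assumes the standard ImageFolder layout (`DATA_DIR/<class_name>/*.JPEG`) and MindSpore's `ImageFolderDataset`; the actual augmentation chain lives in src/dataset.py and differs in detail.

```python
# Hedged sketch of reading the ImageNet train split; the function name and the
# exact transform list are illustrative, not the repository's implementation.
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C

def build_train_dataset(data_dir, batch_size=128, workers=8):
    dataset = ds.ImageFolderDataset(data_dir, num_parallel_workers=workers, shuffle=True)
    transforms = [
        C.RandomCropDecodeResize(224),        # decode JPEG, random crop, resize to 224x224
        C.RandomHorizontalFlip(prob=0.5),
        C.Normalize(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5]),
        C.HWC2CHW(),                          # channels-first layout for the network
    ]
    dataset = dataset.map(operations=transforms, input_columns="image",
                          num_parallel_workers=workers)
    return dataset.batch(batch_size, drop_remainder=True)
```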
# [Environment Requirements](#contents)
- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script description](#contents)
## [Script and sample code](#contents)
```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with gpu platform(1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with gpu platform(8p)
    └─run_eval_for_gpu.sh                # launch evaluating with gpu platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                            # Customized loss function
    ├─transform_utils.py                 # random augment utils
    ├─transform.py                       # random augment class
  ├─eval.py                              # eval net
  └─train.py                             # train net
```
## [Script Parameters](#contents)

Parameters for both training and evaluating can be set in config.py; a sketch of the learning-rate schedule implied by these entries follows the listing below.
```
'random_seed': 1,                # fix random seed
'model': 'efficientnet_b0',      # model name
'drop': 0.2,                     # dropout rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128,               # batch size
'decay_epochs': 2.4,             # epoch interval to decay LR
'warmup_epochs': 5,              # epochs to warmup LR
'decay_rate': 0.97,              # LR decay rate
'weight_decay': 1e-5,            # weight decay
'epochs': 600,                   # number of epochs to train
'workers': 8,                    # number of data processing processes
'amp_level': 'O0',               # amp level
'opt': 'rmsprop',                # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0,         # resume start epoch
```
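The `warmup_epochs`, `decay_epochs` and `decay_rate` entries describe a linear warmup followed by a stepwise exponential decay. A simplified stand-in for the schedule is sketched below; the repository's own `get_lr` may differ in signature and details.

```python
# Hedged sketch of the LR schedule implied by config.py.
import numpy as np

def get_lr_sketch(base_lr, total_epochs, steps_per_epoch,
                  warmup_epochs=5, decay_epochs=2.4, decay_rate=0.97):
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    lr_each_step = []
    for step in range(int(total_epochs * steps_per_epoch)):
        if step < warmup_steps:
            # linear warmup from 0 to base_lr
            lr = base_lr * (step + 1) / warmup_steps
        else:
            # decay by decay_rate every decay_epochs epochs (staircase)
            epoch = step / steps_per_epoch
            lr = base_lr * decay_rate ** int((epoch - warmup_epochs) / decay_epochs)
        lr_each_step.append(lr)
    return np.array(lr_each_step, dtype=np.float32)
```

train.py wraps the resulting per-step array in a `Tensor` and hands it to the RMSProp optimizer.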
## [Training Process](#contents)

#### Usage

```
GPU:
# distributed training example(8p)
sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIABLE_DEVICES(0,1,2,3,4,5,6,7) DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```
#### Launch

```bash
# distributed training example(8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
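mpirun starts one training process per GPU; inside train.py the `--distributed` flag then sets up NCCL communication roughly as outlined below (a simplified sketch under that assumption, not the verbatim script).

```python
# Rough outline of the distributed setup behind the --distributed flag.
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')
init('nccl')                                  # one process per GPU, launched by mpirun
rank_id = get_rank()
group_size = get_group_size()
context.set_auto_parallel_context(device_num=group_size,
                                  parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```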
#### Result
You can find checkpoint files together with results in the log.
## [Evaluation Process](#contents)

### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DATASET_PATH CHECKPOINT_PATH
```

@@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```
> Checkpoints are produced during the training process.
#### Result

The evaluation result will be stored in the scripts path. Under this, you can find results like the following in the log.
```
acc=76.96%(TOP1)
```
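The Top-1 number above is produced by eval.py; its flow is roughly the following (a simplified sketch — the `src.*` module paths and the 0.1 smoothing value are assumptions, not verified against the repository).

```python
# Hedged outline of eval.py: restore the checkpoint, build the val dataset,
# and report loss / Top-1 / Top-5 with MindSpore's Model.eval.
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.efficientnet import efficientnet_b0          # assumed module path
from src.dataset import create_dataset_val            # assumed module path
from src.loss import LabelSmoothingCrossEntropy       # assumed module path

net = efficientnet_b0(num_classes=1000)
load_param_into_net(net, load_checkpoint('./checkpoint/efficientnet_b0-600_1251.ckpt'))
net.set_train(False)

dataset = create_dataset_val(128, '/dataset/eval', workers=8, distributed=False)
loss = LabelSmoothingCrossEntropy(smooth_factor=0.1)  # assumed value of cfg.smoothing
metrics = {'Loss': nn.Loss(),
           'Top1-Acc': nn.Top1CategoricalAccuracy(),
           'Top5-Acc': nn.Top5CategoricalAccuracy()}
print(Model(net, loss_fn=loss, metrics=metrics).eval(dataset))
```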
# [Model description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | efficientnet_b0 |
| -------------------------- | ------------------------- |
| Resource | NV SMX2 V100-32G |
| uploaded Date | 10/26/2020 |
| MindSpore Version | 1.0.0 |
| Dataset | ImageNet |
| Training Parameters | src/config.py |
| Optimizer | rmsprop |
| Loss Function | LabelSmoothingCrossEntropy |
| Loss | 1.8886 |
| Accuracy | 76.96%(TOP1) |
| Total time                 | 132 h (8p)                 |
| Checkpoint for Fine tuning | 64 M(.ckpt file) |
### Inference Performance
| Parameters | |
| -------------------------- | ------------------------- |
| Resource | NV SMX2 V100-32G |
| uploaded Date | 10/26/2020 |
| MindSpore Version | 1.0.0 |
| Dataset                    | ImageNet, 50,000 images    |
| batch_size | 128 |
| outputs | probability |
| Accuracy | acc=76.96%(TOP1) |
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

@@ -49,7 +49,7 @@ if __name__ == '__main__':
     ckpt = load_checkpoint(args_opt.checkpoint)
     load_param_into_net(net, ckpt)
     net.set_train(False)
-    val_data_url = os.path.join(args_opt.data_path, 'val')
+    val_data_url = args_opt.data_path
     dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
     loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
     eval_metrics = {'Loss': nn.Loss(),

@@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
    exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
    exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi

@@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
    exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
    exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &

@@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi

@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                                 input_columns=["image", "label"],
                                 num_parallel_workers=2,
                                 drop_remainder=True)
-    ds_train = ds_train.repeat(1)
     return ds_train

@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=False):
     dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
     dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
     dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
-    dataset = dataset.repeat(1)
     return dataset

@@ -17,7 +17,6 @@ import argparse
 import math
 import os
 import random
-import time
 import numpy as np
 import mindspore

@@ -115,8 +114,6 @@ def main():
     if args.GPU:
         context.set_context(device_target='GPU')
-    is_master = not args.distributed or (rank_id == 0)
     net = efficientnet_b0(num_classes=cfg.num_classes,
                           drop_rate=cfg.drop,
                           drop_connect_rate=cfg.drop_connect,

@@ -124,18 +121,7 @@ def main():
                           bn_tf=cfg.bn_tf,
                           )
-    cur_time = args.cur_time
-    output_base = './output'
-    exp_name = '-'.join([
-        cur_time,
-        cfg.model,
-        str(224)
-    ])
-    time.sleep(rank_id)
-    output_dir = get_outdir(output_base, exp_name)
-    train_data_url = os.path.join(args.data_path, 'train')
+    train_data_url = args.data_path
     train_dataset = create_dataset(
         cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
     batches_per_epoch = train_dataset.get_dataset_size()

@@ -152,7 +138,7 @@ def main():
     config_ck = CheckpointConfig(
         save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
     ckpoint_cb = ModelCheckpoint(
-        prefix=cfg.model, directory=output_dir, config=config_ck)
+        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
     callbacks += [ckpoint_cb]
     lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,

@@ -180,7 +166,7 @@ def main():
         amp_level=cfg.amp_level
     )
-    callbacks = callbacks if is_master else []
+    # callbacks = callbacks if is_master else []
     if args.resume:
         real_epoch = cfg.epochs - cfg.resume_start_epoch

@@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->
## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.
## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training on the GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training on the GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation on the GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
    ├─loss.py                            # customized cross-entropy loss function
    ├─lr_generator.py                    # learning rate generator
    ├─nasnet_a_mobile.py                 # network definition
  ├─eval.py                              # evaluate the network
  ├─export.py                            # convert the checkpoint
  └─train.py                             # train the network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py; a sketch of how the loss-related entries are combined follows the listing below.

```
'random_seed': 1,                # fix random seed
'rank': 0,                       # local rank for distributed training
'group_size': 1,                 # group size for distributed training
'work_nums': 8,                  # number of data-reading workers
'epoch_size': 500,               # total number of epochs
'keep_checkpoint_max': 100,      # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',    # path where checkpoints are saved
'is_save_on_master': 1,          # distributed parameter: save checkpoints on rank 0 only
'batch_size': 32,                # input batch size
'num_classes': 1000,             # number of dataset classes
'label_smooth_factor': 0.1,      # label smoothing factor
'aux_factor': 0.4,               # loss factor of the auxiliary logits
'lr_init': 0.04,                 # initial learning rate
'lr_decay_rate': 0.97,           # learning rate decay rate
'num_epoch_per_decay': 2.4,      # number of epochs per decay
'weight_decay': 0.00004,         # weight decay
'momentum': 0.9,                 # momentum
'opt_eps': 1.0,                  # epsilon
'rmsprop_decay': 0.9,            # rmsprop decay
'loss_scale': 1,                 # loss scale
```
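The `label_smooth_factor` and `aux_factor` entries describe how the training loss combines the main and auxiliary classifier heads of NASNet-A-Mobile. A minimal sketch of that combination (assumed to mirror src/loss.py and train.py, not the verbatim implementation):

```python
# Hedged sketch: label-smoothed cross entropy on the main logits plus
# aux_factor times the same loss on the auxiliary logits.
import mindspore.nn as nn
import mindspore.ops.operations as P
from mindspore import Tensor
from mindspore.common import dtype as mstype

class LabelSmoothingCE(nn.Cell):
    def __init__(self, num_classes=1000, smooth_factor=0.1):
        super().__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
        self.num_classes = num_classes

    def construct(self, logits, label):
        one_hot_label = self.onehot(label, self.num_classes, self.on_value, self.off_value)
        return self.ce(logits, one_hot_label)

class NASNetLossSketch(nn.Cell):
    def __init__(self, aux_factor=0.4):
        super().__init__()
        self.ce = LabelSmoothingCE()
        self.aux_factor = aux_factor

    def construct(self, logits, aux_logits, label):
        return self.ce(logits, label) + self.aux_factor * self.ce(aux_logits, label)
```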
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example for GPU (8p)
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find checkpoint files together with results in the log.

### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```bash
# Evaluation with a checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints are produced during the training process.

#### Result

The evaluation result is saved in the scripts path. In the log under that path, you can find results like the following: