From 148fc597f694429f0fa05b0da54337c1aafe7612 Mon Sep 17 00:00:00 2001
From: panfengfeng
Date: Thu, 19 Nov 2020 16:30:16 +0800
Subject: [PATCH] update efficientnet scripts & nasnet cn readme

---
 model_zoo/official/cv/efficientnet/README.md  | 153 +++++++++++++-----
 model_zoo/official/cv/efficientnet/eval.py    |   2 +-
 .../scripts/run_distribute_train_for_gpu.sh   |  63 ++++++--
 .../efficientnet/scripts/run_eval_for_gpu.sh  |  35 +++-
 .../scripts/run_standalone_train_for_gpu.sh   |  43 +++--
 .../official/cv/efficientnet/src/dataset.py   |   2 -
 model_zoo/official/cv/efficientnet/train.py   |  20 +--
 model_zoo/official/cv/nasnet/README_CN.md     | 130 +++++++++++++++
 8 files changed, 359 insertions(+), 89 deletions(-)
 create mode 100644 model_zoo/official/cv/nasnet/README_CN.md

diff --git a/model_zoo/official/cv/efficientnet/README.md b/model_zoo/official/cv/efficientnet/README.md
index 24ffa9b6c8..f7d001f2a4 100644
--- a/model_zoo/official/cv/efficientnet/README.md
+++ b/model_zoo/official/cv/efficientnet/README.md
@@ -1,24 +1,66 @@
-# EfficientNet-B0 Example
+# Contents
 
-## Description
+- [EfficientNet-B0 Description](#efficientnet-b0-description)
+- [Model Architecture](#model-architecture)
+- [Dataset](#dataset)
+- [Environment Requirements](#environment-requirements)
+- [Script Description](#script-description)
+    - [Script and Sample Code](#script-and-sample-code)
+    - [Script Parameters](#script-parameters)
+    - [Training Process](#training-process)
+    - [Evaluation Process](#evaluation-process)
+- [Model Description](#model-description)
+    - [Performance](#performance)
+        - [Training Performance](#training-performance)
+        - [Inference Performance](#inference-performance)
+- [ModelZoo Homepage](#modelzoo-homepage)
 
-This is an example of training EfficientNet-B0 in MindSpore.
+# [EfficientNet-B0 Description](#contents)
 
-## Requirements
-- Install [Mindspore](http://www.mindspore.cn/install/en).
-- Download the dataset.
+EfficientNet-B0 is the baseline network of the EfficientNet family, which scales network depth, width, and input resolution with a single compound coefficient.
 
+[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.
 
-## Structure
+# [Model Architecture](#contents)
 
-```shell
+The overall network architecture of EfficientNet-B0 is shown below:
+
+[Link](https://arxiv.org/abs/1905.11946)
+
+# [Dataset](#contents)
+
+Dataset used: [ImageNet](http://www.image-net.org/)
+
+- Dataset size: ~125G, about 1.28 million color images in 1000 classes
+    - Train: 120G, about 1.28 million images
+    - Test: 5G, 50,000 images
+- Data format: RGB images
+    - Note: Data will be processed in src/dataset.py
+
+# [Environment Requirements](#contents)
+
+- Hardware (GPU)
+    - Prepare hardware environment with a GPU processor.
+- Framework
+    - [MindSpore](https://www.mindspore.cn/install/en)
+- For more information, please check the resources below:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
+
+# [Script Description](#contents)
+
+## [Script and Sample Code](#contents)
+
+```text
 .
-└─nasnet
+└─efficientnet
   ├─README.md
-  ├─scripts
-    ├─run_standalone_train_for_gpu.sh # launch standalone training with gpu platform(1p)
-    ├─run_distribute_train_for_gpu.sh # launch distributed training with gpu platform(8p)
-    └─run_eval_for_gpu.sh # launch evaluating with gpu platform
+  ├─scripts
+    ├─run_standalone_train_for_gpu.sh # launch standalone training with gpu platform(1p)
+    ├─run_distribute_train_for_gpu.sh # launch distributed training with gpu platform(8p)
+    └─run_eval_for_gpu.sh             # launch evaluation with gpu platform
   ├─src
     ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
     ├─loss.py                         # customized loss function
     ├─transform_utils.py              # random augment utils
     ├─transform.py                    # random augment class
-  ├─eval.py                           # eval net
-  └─train.py                          # train net
+├─eval.py                             # eval net
+└─train.py                            # train net
 ```
 
-## Parameter Configuration
+## [Script Parameters](#contents)
 
-Parameters for both training and evaluating can be set in config.py
+Parameters for both training and evaluating can be set in config.py.
 
-```
+```text
 'random_seed': 1,                # fix random seed
 'model': 'efficientnet_b0',      # model name
 'drop': 0.2,                     # dropout rate
 'drop_connect': 0.2,             # drop connect rate
 'opt_eps': 0.001,                # optimizer epsilon
 'lr': 0.064,                     # learning rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
 'batch_size': 128,               # batch size
 'decay_epochs': 2.4,             # epoch interval to decay LR
 'warmup_epochs': 5,              # epochs to warmup LR
-'decay_rate': 0.97, # LR decay rate
+'decay_rate': 0.97,              # LR decay rate
 'weight_decay': 1e-5,            # weight decay
-'epochs': 600, # number of epochs to train
+'epochs': 600,                   # number of epochs to train
 'workers': 8,                    # number of parallel data workers
 'amp_level': 'O0',               # amp level
 'opt': 'rmsprop',                # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
 'resume_start_epoch': 0,         # resume start epoch
 ```
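+
+With these values the schedule warms the learning rate up over `warmup_epochs` and then decays it by `decay_rate` every `decay_epochs` epochs. The sketch below only illustrates that rule; the `get_lr` helper called from train.py is the authoritative implementation, the function name here is hypothetical, and the 1251 steps per epoch is inferred from the checkpoint name used later in this document:
+
+```python
+def sketch_lr_schedule(base_lr=0.064, total_epochs=600, steps_per_epoch=1251,
+                       warmup_epochs=5, decay_epochs=2.4, decay_rate=0.97):
+    """Illustrative per-step LR list: linear warmup, then staircase decay."""
+    lrs = []
+    for step in range(total_epochs * steps_per_epoch):
+        epoch = step / steps_per_epoch
+        if epoch < warmup_epochs:
+            lr = base_lr * epoch / warmup_epochs  # linear warmup from 0
+        else:
+            # decay in discrete steps of decay_epochs after warmup ends
+            lr = base_lr * decay_rate ** int((epoch - warmup_epochs) / decay_epochs)
+        lrs.append(lr)
+    return lrs
+```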
 
-## Running the example
-
-### Train
+## [Training Process](#contents)
 
-#### Usage
+### Usage
 
 ```
-# distribute training example(8p)
-sh run_distribute_train_for_gpu.sh DATA_DIR
-# standalone training
-sh run_standalone_train_for_gpu.sh DATA_DIR DEVICE_ID
+GPU:
+    # distributed training example(8p)
+    sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH]
+    # standalone training
+    sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH]
 ```
 
-#### Launch
+### Launch
 
 ```bash
 # distributed training example(8p) for GPU
-sh scripts/run_distribute_train_for_gpu.sh /dataset
+cd scripts
+sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
 # standalone training example for GPU
-sh scripts/run_standalone_train_for_gpu.sh /dataset 0
+cd scripts
+sh run_standalone_train_for_gpu.sh 0 /dataset/train
 ```
 
-#### Result
-
 You can find checkpoint files together with results in the log.
 
-### Evaluation
+## [Evaluation Process](#contents)
 
-#### Usage
+### Usage
 
 ```
 # Evaluation
-sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
+sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]
 ```
 
-#### Launch
+### Launch
 
 ```bash
 # Evaluation with checkpoint
-sh scripts/run_eval_for_gpu.sh /dataset 0 ./checkpoint/efficientnet_b0-600_1251.ckpt
+cd scripts
+sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
 ```
 
-> checkpoint can be produced in training process.
-
-#### Result
+### Result
 
-Evaluation result will be stored in the scripts path. Under this, you can find result like the followings in log.
+The evaluation result will be stored in the scripts path, where you can find results like the following in the log:
+
+```text
+acc=76.96%(TOP1)
+```
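+
+Programmatically, evaluation reduces to restoring the checkpoint and scoring the validation set. The sketch below condenses the flow of eval.py from this patch; the config import name and the checkpoint/dataset paths are placeholders, and the module paths are assumed to match the src/ layout:
+
+```python
+import mindspore.nn as nn
+from mindspore.train.model import Model
+from mindspore.train.serialization import load_checkpoint, load_param_into_net
+
+from src.config import efficientnet_b0_config_gpu as cfg  # placeholder import name
+from src.dataset import create_dataset_val
+from src.efficientnet import efficientnet_b0
+from src.loss import LabelSmoothingCrossEntropy
+
+# rebuild the network and restore trained weights
+net = efficientnet_b0(num_classes=cfg.num_classes, drop_rate=cfg.drop,
+                      drop_connect_rate=cfg.drop_connect, bn_tf=cfg.bn_tf)
+load_param_into_net(net, load_checkpoint('./checkpoint/efficientnet_b0-600_1251.ckpt'))
+net.set_train(False)
+
+# score the validation split with the same label-smoothing loss used in training
+dataset = create_dataset_val(cfg.batch_size, '/dataset/eval',
+                             workers=cfg.workers, distributed=False)
+loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
+model = Model(net, loss_fn=loss,
+              metrics={'Loss': nn.Loss(), 'Top1-Acc': nn.Top1CategoricalAccuracy()})
+print(model.eval(dataset))
+```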
+
+# [Model Description](#contents)
+
+## [Performance](#contents)
+
+### Training Performance
+
+| Parameters                 | efficientnet_b0            |
+| -------------------------- | -------------------------- |
+| Resource                   | NV SMX2 V100-32G           |
+| Uploaded Date              | 10/26/2020                 |
+| MindSpore Version          | 1.0.0                      |
+| Dataset                    | ImageNet                   |
+| Training Parameters        | src/config.py              |
+| Optimizer                  | rmsprop                    |
+| Loss Function              | LabelSmoothingCrossEntropy |
+| Loss                       | 1.8886                     |
+| Accuracy                   | 76.96%(TOP1)               |
+| Total time                 | 132 h (8 pcs)              |
+| Checkpoint for Fine tuning | 64 M(.ckpt file)           |
+
+### Inference Performance
+
+| Parameters        |                         |
+| ----------------- | ----------------------- |
+| Resource          | NV SMX2 V100-32G        |
+| Uploaded Date     | 10/26/2020              |
+| MindSpore Version | 1.0.0                   |
+| Dataset           | ImageNet, 50,000 images |
+| batch_size        | 128                     |
+| outputs           | probability             |
+| Accuracy          | acc=76.96%(TOP1)        |
+
+# [ModelZoo Homepage](#contents)
+
+Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
diff --git a/model_zoo/official/cv/efficientnet/eval.py b/model_zoo/official/cv/efficientnet/eval.py
index 098db060ba..75158fd15b 100644
--- a/model_zoo/official/cv/efficientnet/eval.py
+++ b/model_zoo/official/cv/efficientnet/eval.py
@@ -49,7 +49,7 @@ if __name__ == '__main__':
     ckpt = load_checkpoint(args_opt.checkpoint)
     load_param_into_net(net, ckpt)
     net.set_train(False)
-    val_data_url = os.path.join(args_opt.data_path, 'val')
+    val_data_url = args_opt.data_path
     dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
     loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
     eval_metrics = {'Loss': nn.Loss(),
diff --git a/model_zoo/official/cv/efficientnet/scripts/run_distribute_train_for_gpu.sh b/model_zoo/official/cv/efficientnet/scripts/run_distribute_train_for_gpu.sh
index c9165841a8..13371549d5 100644
--- a/model_zoo/official/cv/efficientnet/scripts/run_distribute_train_for_gpu.sh
+++ b/model_zoo/official/cv/efficientnet/scripts/run_distribute_train_for_gpu.sh
@@ -13,20 +13,57 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ============================================================================
-DATA_DIR=$1
+if [ $# != 3 ] && [ $# != 4 ]
+then
+    echo "Usage:
+          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+         "
+    exit 1
+fi
 
-current_exec_path=$(pwd)
-echo ${current_exec_path}
+if [ $1 -lt 1 ] || [ $1 -gt 8 ]
+then
+    echo "error: DEVICE_NUM=$1 is not in (1-8)"
+    exit 1
+fi
 
-curtime=`date '+%Y%m%d-%H%M%S'`
-RANK_SIZE=8
+# check that the dataset path is a directory
+if [ ! -d $3 ]
+then
+    echo "error: DATASET_PATH=$3 is not a directory"
+    exit 1
+fi
 
-rm ${current_exec_path}/device_parallel/ -rf
-mkdir ${current_exec_path}/device_parallel
-echo ${curtime} > ${current_exec_path}/device_parallel/starttime
+export DEVICE_NUM=$1
+export RANK_SIZE=$1
+
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+export PYTHONPATH=${BASEPATH}:$PYTHONPATH
+if [ -d "../train" ];
+then
+    rm -rf ../train
+fi
+mkdir ../train
+cd ../train || exit
+
+export CUDA_VISIBLE_DEVICES="$2"
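+
+# mpirun starts DEVICE_NUM copies of train.py, one per GPU listed in
+# CUDA_VISIBLE_DEVICES; the --distributed flag makes each process join the
+# data-parallel group.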
+
+if [ $# == 3 ]
+then
+    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
+        python ${BASEPATH}/../train.py \
+            --GPU \
+            --distributed \
+            --data_path $3 > train.log 2>&1 &
+fi
+
+if [ $# == 4 ]
+then
+    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
+        python ${BASEPATH}/../train.py \
+            --GPU \
+            --distributed \
+            --data_path $3 \
+            --resume $4 > train.log 2>&1 &
+fi
-
-mpirun --allow-run-as-root -n $RANK_SIZE python ${current_exec_path}/train.py \
-    --GPU \
-    --distributed \
-    --data_path ${DATA_DIR} \
-    --cur_time ${curtime} > ${current_exec_path}/device_parallel/efficientnet_b0.log 2>&1 &
diff --git a/model_zoo/official/cv/efficientnet/scripts/run_eval_for_gpu.sh b/model_zoo/official/cv/efficientnet/scripts/run_eval_for_gpu.sh
index 32ef1273bf..9e9c462467 100644
--- a/model_zoo/official/cv/efficientnet/scripts/run_eval_for_gpu.sh
+++ b/model_zoo/official/cv/efficientnet/scripts/run_eval_for_gpu.sh
@@ -13,15 +13,34 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ============================================================================
-DATA_DIR=$1
-DEVICE_ID=$2
-PATH_CHECKPOINT=$3
+if [ $# != 2 ]
+then
+    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
+    exit 1
+fi
 
-current_exec_path=$(pwd)
-echo ${current_exec_path}
+# check that the dataset path is a directory
+if [ ! -d $1 ]
+then
+    echo "error: DATASET_PATH=$1 is not a directory"
+    exit 1
+fi
 
-curtime=`date '+%Y%m%d-%H%M%S'`
+# check that the checkpoint is a file
+if [ ! -f $2 ]
+then
+    echo "error: CHECKPOINT_PATH=$2 is not a file"
+    exit 1
+fi
 
-echo ${curtime} > ${current_exec_path}/eval_starttime
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+export PYTHONPATH=${BASEPATH}:$PYTHONPATH
 
-CUDA_VISIBLE_DEVICES=${DEVICE_ID} python ./eval.py --platform 'GPU' --data_path ${DATA_DIR} --checkpoint ${PATH_CHECKPOINT} > ${current_exec_path}/eval.log 2>&1 &
+if [ -d "../eval" ];
+then
+    rm -rf ../eval
+fi
+mkdir ../eval
+cd ../eval || exit
+
+python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &
diff --git a/model_zoo/official/cv/efficientnet/scripts/run_standalone_train_for_gpu.sh b/model_zoo/official/cv/efficientnet/scripts/run_standalone_train_for_gpu.sh
index ad3d6bdfa8..780a5b5d16 100644
--- a/model_zoo/official/cv/efficientnet/scripts/run_standalone_train_for_gpu.sh
+++ b/model_zoo/official/cv/efficientnet/scripts/run_standalone_train_for_gpu.sh
@@ -13,19 +13,38 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ============================================================================
-DATA_DIR=$1
-DEVICE_ID=$2
+if [ $# != 2 ] && [ $# != 3 ]
+then
+    echo "Usage:
+          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
+         "
+    exit 1
+fi
 
-current_exec_path=$(pwd)
-echo ${current_exec_path}
+# check that the dataset path is a directory
+if [ ! -d $2 ]
+then
+    echo "error: DATASET_PATH=$2 is not a directory"
+    exit 1
+fi
 
-curtime=`date '+%Y%m%d-%H%M%S'`
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+export PYTHONPATH=${BASEPATH}:$PYTHONPATH
+if [ -d "../train" ];
+then
+    rm -rf ../train
+fi
+mkdir ../train
+cd ../train || exit
 
-rm ${current_exec_path}/device_${DEVICE_ID}/ -rf
-mkdir ${current_exec_path}/device_${DEVICE_ID}
-echo ${curtime} > ${current_exec_path}/device_${DEVICE_ID}/starttime
+export CUDA_VISIBLE_DEVICES=$1
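+
+# CUDA_VISIBLE_DEVICES restricts the job to the requested GPU, which
+# train.py then sees as device 0.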
 
-CUDA_VISIBLE_DEVICES=${DEVICE_ID} python ${current_exec_path}/train.py \
-    --GPU \
-    --data_path ${DATA_DIR} \
-    --cur_time ${curtime} > ${current_exec_path}/device_${DEVICE_ID}/efficientnet_b0.log 2>&1 &
+if [ $# == 2 ]
+then
+    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
+fi
+
+if [ $# == 3 ]
+then
+    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
+fi
diff --git a/model_zoo/official/cv/efficientnet/src/dataset.py b/model_zoo/official/cv/efficientnet/src/dataset.py
index e30dd87bc4..e4971c313f 100644
--- a/model_zoo/official/cv/efficientnet/src/dataset.py
+++ b/model_zoo/official/cv/efficientnet/src/dataset.py
@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                                   input_columns=["image", "label"],
                                   num_parallel_workers=2,
                                   drop_remainder=True)
-    ds_train = ds_train.repeat(1)
     return ds_train
 
 
@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=F
     dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
     dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
     dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
-    dataset = dataset.repeat(1)
     return dataset
diff --git a/model_zoo/official/cv/efficientnet/train.py b/model_zoo/official/cv/efficientnet/train.py
index 9e3acd2fde..d8c68d3c85 100644
--- a/model_zoo/official/cv/efficientnet/train.py
+++ b/model_zoo/official/cv/efficientnet/train.py
@@ -17,7 +17,6 @@ import argparse
 import math
 import os
 import random
-import time
 
 import numpy as np
 import mindspore
@@ -115,8 +114,6 @@ def main():
     if args.GPU:
         context.set_context(device_target='GPU')
 
-    is_master = not args.distributed or (rank_id == 0)
-
     net = efficientnet_b0(num_classes=cfg.num_classes,
                           drop_rate=cfg.drop,
                           drop_connect_rate=cfg.drop_connect,
@@ -124,18 +121,7 @@ def main():
                           bn_tf=cfg.bn_tf,
                           )
 
-    cur_time = args.cur_time
-    output_base = './output'
-
-    exp_name = '-'.join([
-        cur_time,
-        cfg.model,
-        str(224)
-    ])
-    time.sleep(rank_id)
-    output_dir = get_outdir(output_base, exp_name)
-
-    train_data_url = os.path.join(args.data_path, 'train')
+    train_data_url = args.data_path
     train_dataset = create_dataset(
         cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
     batches_per_epoch = train_dataset.get_dataset_size()
@@ -152,7 +138,7 @@ def main():
     config_ck = CheckpointConfig(
         save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
     ckpoint_cb = ModelCheckpoint(
-        prefix=cfg.model, directory=output_dir, config=config_ck)
+        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
     callbacks += [ckpoint_cb]
 
     lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
@@ -180,7 +166,7 @@ def main():
                   amp_level=cfg.amp_level
                   )
 
-    callbacks = callbacks if is_master else []
     if args.resume:
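+        # when resuming, only the remaining (epochs - resume_start_epoch) epochs are run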
         real_epoch = cfg.epochs - cfg.resume_start_epoch
diff --git a/model_zoo/official/cv/nasnet/README_CN.md b/model_zoo/official/cv/nasnet/README_CN.md
new file mode 100644
index 0000000000..d15627f7d3
--- /dev/null
+++ b/model_zoo/official/cv/nasnet/README_CN.md
@@ -0,0 +1,130 @@
+# NASNet Example
+
+- [NASNet Example](#nasnet-example)
+    - [Overview](#overview)
+    - [Requirements](#requirements)
+    - [Structure](#structure)
+    - [Parameter Configuration](#parameter-configuration)
+    - [Running the Example](#running-the-example)
+        - [Training](#training)
+            - [Usage](#usage)
+            - [Launch](#launch)
+            - [Result](#result)
+        - [Evaluation](#evaluation)
+            - [Usage](#usage-1)
+            - [Launch](#launch-1)
+            - [Result](#result-1)
+
+## Overview
+
+This is an example of training NASNet-A-Mobile in MindSpore.
+
+## Requirements
+
+- Install [MindSpore](http://www.mindspore.cn/install/en).
+- Download the dataset.
+
+## Structure
+
+```shell
+.
+└─nasnet
+  ├─README.md
+  ├─scripts
+    ├─run_standalone_train_for_gpu.sh # launch standalone training with gpu platform(1p)
+    ├─run_distribute_train_for_gpu.sh # launch distributed training with gpu platform(8p)
+    └─run_eval_for_gpu.sh             # launch evaluation with gpu platform
+  ├─src
+    ├─config.py                       # parameter configuration
+    ├─dataset.py                      # data preprocessing
+    ├─loss.py                         # customized cross-entropy loss function
+    ├─lr_generator.py                 # learning rate generator
+    ├─nasnet_a_mobile.py              # network definition
+  ├─eval.py                           # evaluate the network
+  ├─export.py                         # convert checkpoints
+  └─train.py                          # train the network
+```
+
+## Parameter Configuration
+
+Parameters for both training and evaluation can be set in config.py.
+
+```text
+'random_seed': 1,             # fix random seed
+'rank': 0,                    # rank of a distributed training process
+'group_size': 1,              # group size of distributed training
+'work_nums': 8,               # number of data-loading workers
+'epoch_size': 500,            # total number of epochs
+'keep_checkpoint_max': 100,   # maximum number of checkpoints to keep
+'ckpt_path': './checkpoint/', # checkpoint save path
+'is_save_on_master': 1,       # save checkpoints on rank 0 only (distributed parameter)
+'batch_size': 32,             # input batch size
+'num_classes': 1000,          # number of dataset classes
+'label_smooth_factor': 0.1,   # label smoothing factor
+'aux_factor': 0.4,            # loss factor of the auxiliary logits head
+'lr_init': 0.04,              # initial learning rate
+'lr_decay_rate': 0.97,        # learning rate decay rate
+'num_epoch_per_decay': 2.4,   # number of epochs per decay
+'weight_decay': 0.00004,      # weight decay
+'momentum': 0.9,              # momentum
+'opt_eps': 1.0,               # optimizer epsilon
+'rmsprop_decay': 0.9,         # rmsprop decay
+'loss_scale': 1,              # loss scale
+```
+
+## Running the Example
+
+### Training
+
+#### Usage
+
+```bash
+# distributed training example(8p)
+sh run_distribute_train_for_gpu.sh DATA_DIR
+# standalone training
+sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
+```
+
+#### Launch
+
+```bash
+# distributed training example(8p) for GPU
+sh scripts/run_distribute_train_for_gpu.sh /dataset/train
+# standalone training example for GPU
+sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
+```
+
+#### Result
+
+You can find checkpoint files together with results in the log.
+
+### Evaluation
+
+#### Usage
+
+```bash
+# Evaluation
+sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
+```
+
+#### Launch
+
+```bash
+# Evaluation with a checkpoint
+sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
+```
+
+> Checkpoints are generated during the training process.
+
+#### Result
+
+The evaluation result is saved in the scripts path; the top-1 accuracy can be found in the log there.
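+
+As a reference for how `label_smooth_factor` and `aux_factor` interact, the training loss applies label-smoothed cross-entropy to both the main and the auxiliary classifier outputs and down-weights the auxiliary term. The snippet below is only an illustrative sketch of that combination; src/loss.py and train.py are the authoritative implementation, and the class name and two-logits calling convention here are assumed:
+
+```python
+import mindspore.nn as nn
+import mindspore.ops.operations as P
+from mindspore import Tensor
+from mindspore.common import dtype as mstype
+from mindspore.ops import functional as F
+
+
+class AuxCrossEntropySketch(nn.Cell):
+    """Label-smoothed CE on main and auxiliary logits (illustrative only)."""
+
+    def __init__(self, smooth_factor=0.1, num_classes=1000, aux_factor=0.4):
+        super(AuxCrossEntropySketch, self).__init__()
+        # smoothed one-hot targets: 1 - smooth_factor on the true class,
+        # smooth_factor spread uniformly over the remaining classes
+        self.onehot = P.OneHot()
+        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
+        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
+        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction='mean')
+        self.aux_factor = aux_factor
+
+    def construct(self, logits, aux_logits, label):
+        smoothed = self.onehot(label, F.shape(logits)[1], self.on_value, self.off_value)
+        loss = self.ce(logits, smoothed)
+        # auxiliary head contributes a down-weighted regularizing term
+        loss = loss + self.aux_factor * self.ce(aux_logits, smoothed)
+        return loss
+```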