update efficientnet scripts & nasnet cn readme

pull/8880/head
panfengfeng 4 years ago
parent 5e3b135130
commit 148fc597f6

@@ -1,24 +1,66 @@
# Contents

- [EfficientNet-B0 Description](#efficientnet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)

[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.

# [Model architecture](#contents)

The overall network architecture of EfficientNet-B0 is shown below:

[Link](https://arxiv.org/abs/1905.11946)
# [Dataset](#contents)

Dataset used: [ImageNet](http://www.image-net.org/)

- Dataset size: ~125G, about 1.28 million colorful images in 1000 classes
    - Train: 120G, 1,281,167 images
    - Test: 5G, 50,000 images
- Data format: RGB images.
    - Note: Data will be processed in src/dataset.py (a minimal loading sketch follows below)
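The loading pipeline below is a minimal sketch only: it assumes the standard ImageFolder layout (`DATA_DIR/<class_name>/*.JPEG`) and MindSpore's `ImageFolderDataset`; the actual augmentation chain lives in src/dataset.py and differs in detail.

```python
# Hedged sketch of reading the ImageNet train split; the function name and the
# exact transform list are illustrative, not the repository's implementation.
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C

def build_train_dataset(data_dir, batch_size=128, workers=8):
    dataset = ds.ImageFolderDataset(data_dir, num_parallel_workers=workers, shuffle=True)
    transforms = [
        C.RandomCropDecodeResize(224),        # decode JPEG, random crop, resize to 224x224
        C.RandomHorizontalFlip(prob=0.5),
        C.Normalize(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5]),
        C.HWC2CHW(),                          # channels-first layout for the network
    ]
    dataset = dataset.map(operations=transforms, input_columns="image",
                          num_parallel_workers=workers)
    return dataset.batch(batch_size, drop_remainder=True)
```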
# [Environment Requirements](#contents)
- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script description](#contents)
## [Script and sample code](#contents)
```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training with gpu platform(1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training with gpu platform(8p)
    └─run_eval_for_gpu.sh                # launch evaluating with gpu platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                            # Customized loss function
    ├─transform_utils.py                 # random augment utils
    ├─transform.py                       # random augment class
  ├─eval.py                              # eval net
  └─train.py                             # train net
```
## [Script Parameters](#contents)

Parameters for both training and evaluating can be set in config.py; a sketch of the learning-rate schedule implied by these entries follows the listing below.
```
'random_seed': 1,                # fix random seed
'model': 'efficientnet_b0',      # model name
'drop': 0.2,                     # dropout rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128,               # batch size
'decay_epochs': 2.4,             # epoch interval to decay LR
'warmup_epochs': 5,              # epochs to warmup LR
'decay_rate': 0.97,              # LR decay rate
'weight_decay': 1e-5,            # weight decay
'epochs': 600,                   # number of epochs to train
'workers': 8,                    # number of data processing processes
'amp_level': 'O0',               # amp level
'opt': 'rmsprop',                # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0,         # resume start epoch
```
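The `warmup_epochs`, `decay_epochs` and `decay_rate` entries describe a linear warmup followed by a stepwise exponential decay. A simplified stand-in for the schedule is sketched below; the repository's own `get_lr` may differ in signature and details.

```python
# Hedged sketch of the LR schedule implied by config.py.
import numpy as np

def get_lr_sketch(base_lr, total_epochs, steps_per_epoch,
                  warmup_epochs=5, decay_epochs=2.4, decay_rate=0.97):
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    lr_each_step = []
    for step in range(int(total_epochs * steps_per_epoch)):
        if step < warmup_steps:
            # linear warmup from 0 to base_lr
            lr = base_lr * (step + 1) / warmup_steps
        else:
            # decay by decay_rate every decay_epochs epochs (staircase)
            epoch = step / steps_per_epoch
            lr = base_lr * decay_rate ** int((epoch - warmup_epochs) / decay_epochs)
        lr_each_step.append(lr)
    return np.array(lr_each_step, dtype=np.float32)
```

train.py wraps the resulting per-step array in a `Tensor` and hands it to the RMSProp optimizer.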
## [Training Process](#contents)

#### Usage

```
GPU:
# distributed training example(8p)
sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIABLE_DEVICES(0,1,2,3,4,5,6,7) DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```
#### Launch

```bash
# distributed training example(8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
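mpirun starts one training process per GPU; inside train.py the `--distributed` flag then sets up NCCL communication roughly as outlined below (a simplified sketch under that assumption, not the verbatim script).

```python
# Rough outline of the distributed setup behind the --distributed flag.
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')
init('nccl')                                  # one process per GPU, launched by mpirun
rank_id = get_rank()
group_size = get_group_size()
context.set_auto_parallel_context(device_num=group_size,
                                  parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True)
```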
#### Result
You can find checkpoint files together with results in the log.
## [Evaluation Process](#contents)

### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DATASET_PATH CHECKPOINT_PATH
```

@@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```
> Checkpoints are produced during the training process.
#### Result

The evaluation result will be stored in the scripts path. Under this, you can find results like the following in the log.
```
acc=76.96%(TOP1)
```
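The Top-1 number above is produced by eval.py; its flow is roughly the following (a simplified sketch — the `src.*` module paths and the 0.1 smoothing value are assumptions, not verified against the repository).

```python
# Hedged outline of eval.py: restore the checkpoint, build the val dataset,
# and report loss / Top-1 / Top-5 with MindSpore's Model.eval.
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.efficientnet import efficientnet_b0          # assumed module path
from src.dataset import create_dataset_val            # assumed module path
from src.loss import LabelSmoothingCrossEntropy       # assumed module path

net = efficientnet_b0(num_classes=1000)
load_param_into_net(net, load_checkpoint('./checkpoint/efficientnet_b0-600_1251.ckpt'))
net.set_train(False)

dataset = create_dataset_val(128, '/dataset/eval', workers=8, distributed=False)
loss = LabelSmoothingCrossEntropy(smooth_factor=0.1)  # assumed value of cfg.smoothing
metrics = {'Loss': nn.Loss(),
           'Top1-Acc': nn.Top1CategoricalAccuracy(),
           'Top5-Acc': nn.Top5CategoricalAccuracy()}
print(Model(net, loss_fn=loss, metrics=metrics).eval(dataset))
```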
# [Model description](#contents)
## [Performance](#contents)
### Training Performance
| Parameters | efficientnet_b0 |
| -------------------------- | ------------------------- |
| Resource | NV SMX2 V100-32G |
| uploaded Date | 10/26/2020 |
| MindSpore Version | 1.0.0 |
| Dataset | ImageNet |
| Training Parameters | src/config.py |
| Optimizer | rmsprop |
| Loss Function | LabelSmoothingCrossEntropy |
| Loss | 1.8886 |
| Accuracy | 76.96%(TOP1) |
| Total time                 | 132 h (8p)                 |
| Checkpoint for Fine tuning | 64 M(.ckpt file) |
### Inference Performance
| Parameters | |
| -------------------------- | ------------------------- |
| Resource | NV SMX2 V100-32G |
| uploaded Date | 10/26/2020 |
| MindSpore Version | 1.0.0 |
| Dataset                    | ImageNet, 50,000 images    |
| batch_size | 128 |
| outputs | probability |
| Accuracy | acc=76.96%(TOP1) |
# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

@@ -49,7 +49,7 @@ if __name__ == '__main__':
     ckpt = load_checkpoint(args_opt.checkpoint)
     load_param_into_net(net, ckpt)
     net.set_train(False)
-    val_data_url = os.path.join(args_opt.data_path, 'val')
+    val_data_url = args_opt.data_path
     dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
     loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
     eval_metrics = {'Loss': nn.Loss(),

@@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
    exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
    exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi

@@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
    exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
    exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &

@@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
    exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
    exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH

if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi

@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                                 input_columns=["image", "label"],
                                 num_parallel_workers=2,
                                 drop_remainder=True)
-    ds_train = ds_train.repeat(1)
     return ds_train

@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=False):
     dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
     dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
     dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
-    dataset = dataset.repeat(1)
     return dataset

@@ -17,7 +17,6 @@ import argparse
 import math
 import os
 import random
-import time
 import numpy as np
 import mindspore

@@ -115,8 +114,6 @@ def main():
     if args.GPU:
         context.set_context(device_target='GPU')
-    is_master = not args.distributed or (rank_id == 0)
     net = efficientnet_b0(num_classes=cfg.num_classes,
                           drop_rate=cfg.drop,
                           drop_connect_rate=cfg.drop_connect,

@@ -124,18 +121,7 @@ def main():
                           bn_tf=cfg.bn_tf,
                           )
-    cur_time = args.cur_time
-    output_base = './output'
-    exp_name = '-'.join([
-        cur_time,
-        cfg.model,
-        str(224)
-    ])
-    time.sleep(rank_id)
-    output_dir = get_outdir(output_base, exp_name)
-    train_data_url = os.path.join(args.data_path, 'train')
+    train_data_url = args.data_path
     train_dataset = create_dataset(
         cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
     batches_per_epoch = train_dataset.get_dataset_size()

@@ -152,7 +138,7 @@ def main():
     config_ck = CheckpointConfig(
         save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
     ckpoint_cb = ModelCheckpoint(
-        prefix=cfg.model, directory=output_dir, config=config_ck)
+        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
     callbacks += [ckpoint_cb]
     lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,

@@ -180,7 +166,7 @@ def main():
         amp_level=cfg.amp_level
     )
-    callbacks = callbacks if is_master else []
+    # callbacks = callbacks if is_master else []
     if args.resume:
         real_epoch = cfg.epochs - cfg.resume_start_epoch

@@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->
## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.
## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh    # launch standalone training on the GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh    # launch distributed training on the GPU platform (8p)
    └─run_eval_for_gpu.sh                # launch evaluation on the GPU platform
  ├─src
    ├─config.py                          # parameter configuration
    ├─dataset.py                         # data preprocessing
    ├─loss.py                            # customized cross-entropy loss function
    ├─lr_generator.py                    # learning rate generator
    ├─nasnet_a_mobile.py                 # network definition
  ├─eval.py                              # evaluate the network
  ├─export.py                            # convert the checkpoint
  └─train.py                             # train the network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py; a sketch of how the loss-related entries are combined follows the listing below.

```
'random_seed': 1,                # fix random seed
'rank': 0,                       # local rank for distributed training
'group_size': 1,                 # group size for distributed training
'work_nums': 8,                  # number of data-reading workers
'epoch_size': 500,               # total number of epochs
'keep_checkpoint_max': 100,      # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',    # path where checkpoints are saved
'is_save_on_master': 1,          # distributed parameter: save checkpoints on rank 0 only
'batch_size': 32,                # input batch size
'num_classes': 1000,             # number of dataset classes
'label_smooth_factor': 0.1,      # label smoothing factor
'aux_factor': 0.4,               # loss factor of the auxiliary logits
'lr_init': 0.04,                 # initial learning rate
'lr_decay_rate': 0.97,           # learning rate decay rate
'num_epoch_per_decay': 2.4,      # number of epochs per decay
'weight_decay': 0.00004,         # weight decay
'momentum': 0.9,                 # momentum
'opt_eps': 1.0,                  # epsilon
'rmsprop_decay': 0.9,            # rmsprop decay
'loss_scale': 1,                 # loss scale
```
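The `label_smooth_factor` and `aux_factor` entries describe how the training loss combines the main and auxiliary classifier heads of NASNet-A-Mobile. A minimal sketch of that combination (assumed to mirror src/loss.py and train.py, not the verbatim implementation):

```python
# Hedged sketch: label-smoothed cross entropy on the main logits plus
# aux_factor times the same loss on the auxiliary logits.
import mindspore.nn as nn
import mindspore.ops.operations as P
from mindspore import Tensor
from mindspore.common import dtype as mstype

class LabelSmoothingCE(nn.Cell):
    def __init__(self, num_classes=1000, smooth_factor=0.1):
        super().__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")
        self.num_classes = num_classes

    def construct(self, logits, label):
        one_hot_label = self.onehot(label, self.num_classes, self.on_value, self.off_value)
        return self.ce(logits, one_hot_label)

class NASNetLossSketch(nn.Cell):
    def __init__(self, aux_factor=0.4):
        super().__init__()
        self.ce = LabelSmoothingCE()
        self.aux_factor = aux_factor

    def construct(self, logits, aux_logits, label):
        return self.ce(logits, label) + self.aux_factor * self.ce(aux_logits, label)
```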
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```bash
# distributed training example for GPU (8p)
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find checkpoint files together with results in the log.

### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```bash
# Evaluation with a checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints are produced during the training process.

#### Result

The evaluation result is saved in the scripts path. In the log under that path, you can find results like the following: