maskrcnn and fasterrcnn Dockerfile

4 years ago · df2540b9c1
parent f31dfa129a
commit df2540b9c1
6 changed files with 75 additions and 6 deletions
--- a/model_zoo/official/cv/faster_rcnn/Dockerfile
+++ b/model_zoo/official/cv/faster_rcnn/Dockerfile
@@ -1,5 +1,6 @@
-ARG FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
+ARG FROM_IMAGE_NAME
 FROM ${FROM_IMAGE_NAME}

+RUN apt install libgl1-mesa-glx -y
 COPY requirements.txt .
 RUN pip3.7 install -r requirements.txt
--- a/model_zoo/official/cv/faster_rcnn/README.md
+++ b/model_zoo/official/cv/faster_rcnn/README.md
@@ -5,6 +5,7 @@
 - [Dataset](#dataset)
 - [Environment Requirements](#environment-requirements)
 - [Quick Start](#quick-start)
+- [Run in docker](#Run-in-docker)
 - [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Training Process](#training-process)
--- a/model_zoo/official/cv/faster_rcnn/README_CN.md
+++ b/model_zoo/official/cv/faster_rcnn/README_CN.md
@@ -6,6 +6,7 @@
 - [Dataset](#数据集)
 - [Environment Requirements](#环境要求)
 - [Quick Start](#快速入门)
+- [Run in docker](#在docker上运行)
 - [Script Description](#脚本说明)
    - [Script and Sample Code](#脚本及样例代码)
    - [Training Process](#训练过程)
--- a/model_zoo/official/cv/maskrcnn/Dockerfile
+++ b/model_zoo/official/cv/maskrcnn/Dockerfile
@@ -1,5 +1,6 @@
-ARG FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
+ARG FROM_IMAGE_NAME
 FROM ${FROM_IMAGE_NAME}

+RUN apt install libgl1-mesa-glx -y
 COPY requirements.txt .
 RUN pip3.7 install -r requirements.txt
--- a/model_zoo/official/cv/maskrcnn/README.md
+++ b/model_zoo/official/cv/maskrcnn/README.md
@@ -5,6 +5,7 @@
 - [Dataset](#dataset)
 - [Environment Requirements](#environment-requirements)
 - [Quick Start](#quick-start)
+- [Run in docker](#Run-in-docker)
 - [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
@@ -397,7 +398,7 @@ bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]

 - Notes
 1. hccl.json, which is specified by RANK_TABLE_FILE, is needed when you run a distributed task. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
-2. As for PRETRAINED_MODEL, it should be a trained ResNet50 checkpoint. If it is not set, the model is trained from scratch. To load a ready-made pretrained FasterRcnn checkpoint, change the train.py script as follows.
+2. As for PRETRAINED_MODEL, it should be a trained ResNet50 checkpoint. If it is not set, the model is trained from scratch. To load a ready-made pretrained MaskRcnn checkpoint, change the train.py script as follows.

 ```python
 # Comment out the following code
--- a/model_zoo/official/cv/maskrcnn/README_CN.md
+++ b/model_zoo/official/cv/maskrcnn/README_CN.md
@@ -58,6 +58,8 @@ MaskRCNN is a two-stage object detection network; as an extension of FasterRCNN,
    - Set up the hardware environment with Ascend processors. To try Ascend processors, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Resources are granted once the application is approved.
 - Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
+- Get the base image
+    - [Ascend Hub](ascend.huawei.com/ascendhub/#/home)
 - For more information, see the following resources:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/zh-CN/master/index.html)
@@ -134,6 +136,39 @@ pip install mmcv=0.2.14
   1. AIR_PATH is the model exported with the export script on the 910.
   2. ANN_FILE_PATH is the annotation file used for inference.

+# Run in docker
+
+1. Build the image
+
+```shell
+# Build the image
+docker build -t maskrcnn:20.1.0 . --build-arg FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
+```
+
+2. Start the container instance
+
+```shell
+# Start the container instance
+bash scripts/docker_start.sh maskrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
+```
+
+3. Train
+
+```shell
+# Standalone training
+bash run_standalone_train.sh [PRETRAINED_CKPT]
+
+# Distributed training
+bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_CKPT]
+```
+
+4. Evaluate
+
+```shell
+# Evaluation
+bash run_eval.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
+```
+
 # Script Description

 ## Script and Sample Code
@@ -358,9 +393,38 @@ sh run_standalone_train.sh [PRETRAINED_MODEL]
 sh run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
 ```

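-> Running a distributed task requires the hccl.json file specified by RANK_TABLE_FILE. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
-> If PRETRAINED_MODEL is not set, the model is trained from scratch. No pretrained model is available yet; stay tuned.
-> This operation involves processor core binding, which requires setting `device_num` and the total number of processors. If you do not need this, remove `taskset` from `scripts/run_distribute_train.sh`.
+- Notes
+
+1. Running a distributed task requires the hccl.json file specified by RANK_TABLE_FILE. You can generate it with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
+2. PRETRAINED_MODEL should be a trained ResNet50 checkpoint. If this parameter is not set, the network is trained from scratch. To load a trained MaskRcnn checkpoint, modify train.py as follows: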
+
+```python
+# Comment out the following code
+#   load_path = args_opt.pre_trained
+#    if load_path != "":
+#        param_dict = load_checkpoint(load_path)
+#        for item in list(param_dict.keys()):
+#            if not item.startswith('backbone'):
+#                param_dict.pop(item)
+#        load_param_into_net(net, param_dict)
+
+# Add the following code after the optimizer definition, since the MaskRcnn checkpoint includes optimizer parameters:
+    lr = Tensor(dynamic_lr(config, rank_size=device_num, start_steps=config.pretrain_epoch_size * dataset_size),
+                mstype.float32)
+    opt = Momentum(params=net.trainable_params(), learning_rate=lr, momentum=config.momentum,
+                   weight_decay=config.weight_decay, loss_scale=config.loss_scale)
+
+    if load_path != "":
+        param_dict = load_checkpoint(load_path)
+        if config.pretrain_epoch_size == 0:
+            for item in list(param_dict.keys()):
+                if item in ("global_step", "learning_rate") or "rcnn.cls" in item or "rcnn.mask" in item:
+                    param_dict.pop(item)
+        load_param_into_net(net, param_dict)
+        load_param_into_net(opt, param_dict)
+```
+
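+3. This operation involves processor core binding, which requires setting `device_num` and the total number of processors. If you do not need this, remove `taskset` from `scripts/run_distribute_train.sh`.

 ### Training Result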