!10449 Add docker files in fasterrcnn and maskrcnn

From: @ttudu
Reviewed-by: 
Signed-off-by:
pull/10449/MERGE
mindspore-ci-bot 4 years ago committed by Gitee
commit b0794bb5f6

@ -0,0 +1,5 @@
ARG FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
FROM ${FROM_IMAGE_NAME}
COPY requirements.txt .
RUN pip3.7 install -r requirements.txt

@ -46,6 +46,12 @@ Dataset used: [COCO2017](<https://cocodataset.org/>)
# Environment Requirements
- Hardware (Ascend)
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Docker base image
- [Ascend Hub](ascend.huawei.com/ascendhub/#/home)
- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download the dataset COCO2017.
@ -104,6 +110,39 @@ sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
# Run in docker
1. Build docker images
```shell
# build docker
docker build -t fasterrcnn:20.1.0 . --build-arg FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
```
2. Create a container layer over the created image and start it
```shell
# start docker
bash scripts/docker_start.sh fasterrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
```
3. Train
```shell
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
4. Eval
```shell
# eval
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
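The container start in step 2 can be sketched in Python to show what such a start script typically passes to `docker run`. This is a minimal illustration, not the contents of `scripts/docker_start.sh`: it assumes the script exposes the Ascend device nodes (`/dev/davinci*` plus the manager nodes) and bind-mounts the driver, data, and model directories at the same paths inside the container. The function name `build_docker_run` and the `device_count` parameter are hypothetical, and the log mounts are omitted for brevity.

```python
import shlex


def build_docker_run(image, data_dir, model_dir, device_count=8):
    """Assemble a `docker run` argument list for an Ascend container (sketch)."""
    cmd = ["docker", "run", "-it", "--ipc=host"]
    # One --device flag per NPU: /dev/davinci0 .. /dev/davinci{device_count - 1}.
    cmd += [f"--device=/dev/davinci{i}" for i in range(device_count)]
    # Management device nodes shared by all NPUs.
    for dev in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        cmd.append(f"--device=/dev/{dev}")
    # Bind-mount each directory at the same path inside the container.
    for path in ("/usr/local/Ascend/driver", data_dir, model_dir):
        cmd += ["-v", f"{path}:{path}"]
    cmd += [image, "/bin/bash"]
    return cmd


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_docker_run("fasterrcnn:20.1.0", "/data/coco", "/models")))
```

Mounting `[DATA_DIR]` and `[MODEL_DIR]` at identical paths means the paths used on the host can be passed unchanged to the training and evaluation scripts inside the container.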
# Script Description
## Script and Sample Code
@ -150,9 +189,36 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
> Rank_table.json, which is specified by RANK_TABLE_FILE, is needed when you run a distributed task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
> As for PRETRAINED_MODEL, it should be a ResNet50 checkpoint trained on ImageNet2012. Ready-made pretrained_models are not available now. Stay tuned.
> The original dataset path needs to be set in config.py; you can select "coco_root" or "image_dir".
Notes:
1. Rank_table.json, which is specified by RANK_TABLE_FILE, is needed when you run a distributed task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
2. As for PRETRAINED_MODEL, it should be a trained ResNet50 checkpoint. If you need to load a ready-made pretrained FasterRcnn checkpoint, you can change the train.py script as follows.
```python
# Comment out the following code
# load_path = args_opt.pre_trained
# if load_path != "":
#     param_dict = load_checkpoint(load_path)
#     for item in list(param_dict.keys()):
#         if not item.startswith('backbone'):
#             param_dict.pop(item)
#     load_param_into_net(net, param_dict)
# Add the following codes after optimizer definition since the FasterRcnn checkpoint includes optimizer parameters
lr = Tensor(dynamic_lr(config, rank_size=device_num), mstype.float32)
opt = SGD(params=net.trainable_params(), learning_rate=lr, momentum=config.momentum,
          weight_decay=config.weight_decay, loss_scale=config.loss_scale)
if load_path != "":
    param_dict = load_checkpoint(load_path)
    for item in list(param_dict.keys()):
        if item in ("global_step", "learning_rate") or "rcnn.reg_scores" in item or "rcnn.cls_scores" in item:
            param_dict.pop(item)
    load_param_into_net(opt, param_dict)
    load_param_into_net(net, param_dict)
```
3. The original dataset path needs to be set in config.py; you can select "coco_root" or "image_dir".
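The key filtering described in note 2 can be tried without MindSpore. The sketch below is a minimal illustration that assumes the loaded checkpoint behaves like a plain dictionary keyed by parameter name. The helper `filter_checkpoint_keys` and the sample parameter names are hypothetical and are not part of train.py.

```python
def filter_checkpoint_keys(param_dict,
                           exact=("global_step", "learning_rate"),
                           substrings=("rcnn.reg_scores", "rcnn.cls_scores")):
    """Return a copy of param_dict without bookkeeping entries and head parameters."""
    kept = {}
    for name, value in param_dict.items():
        # Drop optimizer bookkeeping entries and anything belonging to the
        # classification/regression heads, including their optimizer state.
        if name in exact or any(s in name for s in substrings):
            continue
        kept[name] = value
    return kept


# Hypothetical parameter names, used only to exercise the filter.
sample = {
    "global_step": 0,
    "learning_rate": 0.02,
    "backbone.conv1.weight": "w0",
    "rcnn.cls_scores.weight": "w1",
    "rcnn.reg_scores.bias": "b1",
    "moments.backbone.conv1.weight": "m0",
}
filtered = filter_checkpoint_keys(sample)
print(sorted(filtered))
```

Unlike the snippet in note 2, this version builds a new dictionary instead of popping keys in place, which keeps the original mapping available for inspection.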
### Result

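@ -47,6 +47,12 @@ Faster R-CNN is a two-stage object detection network; it uses an RPN, which can
# Environment Requirements
- Hardware (Ascend)
- Set up the hardware environment with Ascend processors. To try Ascend processors, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com; resources become available once the application is approved.
- Get the base image
- [Ascend Hub](ascend.huawei.com/ascendhub/#/home)
- Install [MindSpore](https://www.mindspore.cn/install).
- Download the COCO 2017 dataset.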
@ -107,6 +113,39 @@ sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
# Run in Docker
1. Build the image
```shell
# build the image
docker build -t fasterrcnn:20.1.0 . --build-arg FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
```
2. Start a container instance
```shell
# start a container instance
bash scripts/docker_start.sh fasterrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
```
3. Train
```shell
# standalone training
sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
# distributed training
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
4. Evaluate
```shell
# evaluation
sh run_eval_ascend.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
# Script Description
## Script and Sample Code
@ -153,9 +192,36 @@ sh run_standalone_train_ascend.sh [PRETRAINED_MODEL]
sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
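> The rank_table.json specified by RANK_TABLE_FILE is required when running a distributed task. You can generate this file with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
> PRETRAINED_MODEL should be a ResNet-50 checkpoint trained on ImageNet 2012. Ready-made pretrained_models are not available yet. Stay tuned.
> config.py contains the original dataset path; you can choose "coco_root" or "image_dir".
Notes:
1. The rank_table.json specified by RANK_TABLE_FILE is required when running a distributed task. You can generate this file with [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
2. PRETRAINED_MODEL should be a trained ResNet-50 checkpoint. To load a trained FasterRcnn checkpoint, modify train.py as follows: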
```python
# Comment out the following code
# load_path = args_opt.pre_trained
# if load_path != "":
#     param_dict = load_checkpoint(load_path)
#     for item in list(param_dict.keys()):
#         if not item.startswith('backbone'):
#             param_dict.pop(item)
#     load_param_into_net(net, param_dict)
# Loading a trained FasterRcnn checkpoint requires loading both the network parameters and the optimizer into the model, so the following code can be added after the optimizer is defined
lr = Tensor(dynamic_lr(config, rank_size=device_num), mstype.float32)
opt = SGD(params=net.trainable_params(), learning_rate=lr, momentum=config.momentum,
          weight_decay=config.weight_decay, loss_scale=config.loss_scale)
if load_path != "":
    param_dict = load_checkpoint(load_path)
    for item in list(param_dict.keys()):
        if item in ("global_step", "learning_rate") or "rcnn.reg_scores" in item or "rcnn.cls_scores" in item:
            param_dict.pop(item)
    load_param_into_net(opt, param_dict)
    load_param_into_net(net, param_dict)
```
3. config.py contains the original dataset path; you can choose "coco_root" or "image_dir".
### Result

@ -0,0 +1,3 @@
Cython
pycocotools
mmcv==0.2.14

@ -0,0 +1,28 @@
#!/bin/bash
docker_image=$1
data_dir=$2
model_dir=$3
docker run -it --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons \
-v ${data_dir}:${data_dir} \
-v ${model_dir}:${model_dir} \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog/ \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog ${docker_image} \
/bin/bash

@ -0,0 +1,5 @@
ARG FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
FROM ${FROM_IMAGE_NAME}
COPY requirements.txt .
RUN pip3.7 install -r requirements.txt

@ -55,6 +55,8 @@ Note that you can run the scripts based on the dataset mentioned in original pap
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
- [MindSpore](https://gitee.com/mindspore/mindspore)
- Docker base image
- [Ascend Hub](ascend.huawei.com/ascendhub/#/home)
- For more information, please check the resources below:
- [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
@ -120,6 +122,39 @@ pip install mmcv=0.2.14
Note:
1. VALIDATION_JSON_FILE is a label json file for evaluation.
# Run in docker
1. Build docker images
```shell
# build docker
docker build -t maskrcnn:20.1.0 . --build-arg FROM_IMAGE_NAME=ascend-mindspore-arm:20.1.0
```
2. Create a container layer over the created image and start it
```shell
# start docker
bash scripts/docker_start.sh maskrcnn:20.1.0 [DATA_DIR] [MODEL_DIR]
```
3. Train
```shell
# standalone training
bash run_standalone_train.sh [PRETRAINED_CKPT]
# distributed training
bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_CKPT]
```
4. Eval
```shell
# Evaluation
bash run_eval.sh [VALIDATION_JSON_FILE] [CHECKPOINT_PATH]
```
# [Script Description](#contents)
## [Script and Sample Code](#contents)
@ -336,9 +371,37 @@ bash run_standalone_train.sh [PRETRAINED_MODEL]
bash run_distribute_train.sh [RANK_TABLE_FILE] [PRETRAINED_MODEL]
```
> hccl.json, which is specified by RANK_TABLE_FILE, is needed when you run a distributed task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
> As for PRETRAINED_MODEL, if not set, the model will be trained from the very beginning. Ready-made pretrained_models are not available now. Stay tuned.
> The scripts bind processor cores according to `device_num` and the total number of processors. If you do not want this, remove the `taskset` operations in `scripts/run_distribute_train.sh`.
- Notes
1. hccl.json, which is specified by RANK_TABLE_FILE, is needed when you run a distributed task. You can generate it by using the [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
2. As for PRETRAINED_MODEL, it should be a trained ResNet50 checkpoint. If not set, the model will be trained from the very beginning. If you need to load a ready-made pretrained FasterRcnn checkpoint, you can change the train.py script as follows.
```python
# Comment out the following code
# load_path = args_opt.pre_trained
# if load_path != "":
#     param_dict = load_checkpoint(load_path)
#     for item in list(param_dict.keys()):
#         if not item.startswith('backbone'):
#             param_dict.pop(item)
#     load_param_into_net(net, param_dict)
# Add the following codes after optimizer definition since the FasterRcnn checkpoint includes optimizer parameters
lr = Tensor(dynamic_lr(config, rank_size=device_num, start_steps=config.pretrain_epoch_size * dataset_size),
            mstype.float32)
opt = Momentum(params=net.trainable_params(), learning_rate=lr, momentum=config.momentum,
               weight_decay=config.weight_decay, loss_scale=config.loss_scale)
if load_path != "":
    param_dict = load_checkpoint(load_path)
    if config.pretrain_epoch_size == 0:
        for item in list(param_dict.keys()):
            if item in ("global_step", "learning_rate") or "rcnn.cls" in item or "rcnn.mask" in item:
                param_dict.pop(item)
    load_param_into_net(net, param_dict)
    load_param_into_net(opt, param_dict)
```
3. The scripts bind processor cores according to `device_num` and the total number of processors. If you do not want this, remove the `taskset` operations in `scripts/run_distribute_train.sh`.
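As a rough illustration of the core binding mentioned in note 3, the sketch below splits the available cores evenly across devices and prints a `taskset -c` range for each rank. The even split and the helper name `core_ranges` are assumptions made for illustration; the actual arithmetic lives in `scripts/run_distribute_train.sh` and may differ.

```python
def core_ranges(total_cores, device_num):
    """Split total_cores into device_num contiguous, equally sized core ranges."""
    if device_num <= 0 or total_cores < device_num:
        raise ValueError("need at least one core per device")
    per_device = total_cores // device_num
    ranges = []
    for rank in range(device_num):
        start = rank * per_device
        end = start + per_device - 1  # inclusive upper bound, as taskset expects
        ranges.append((start, end))
    return ranges


# Example: a host with 96 cores and 8 devices gets 12 cores per rank.
for rank, (start, end) in enumerate(core_ranges(total_cores=96, device_num=8)):
    print(f"rank {rank}: taskset -c {start}-{end}")
```

When the core count is not a multiple of `device_num`, this sketch leaves the remaining cores unbound rather than giving some ranks a larger share.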
### [Training Result](#content)

@ -0,0 +1,3 @@
Cython
pycocotools
mmcv==0.2.14

@ -0,0 +1,28 @@
#!/bin/bash
docker_image=$1
data_dir=$2
model_dir=$3
docker run -it --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons \
-v ${data_dir}:${data_dir} \
-v ${model_dir}:${model_dir} \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog/ \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog ${docker_image} \
/bin/bash