# Run distributed pretrain
## Description

The number of D chips is allocated automatically based on the `device_num` set in the HCCL config file; you do not need to specify it.
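For reference, a two-device HCCL rank table produced by `hccl_tools.py` looks roughly like the sketch below. Treat the device IDs and IPs shown here as placeholders, and always generate the real file with `hccl_tools.py`:

```json
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "x.x.x.x",
            "device": [
                {"device_id": "5", "device_ip": "x.x.x.x", "rank_id": "0"},
                {"device_id": "6", "device_ip": "x.x.x.x", "rank_id": "1"}
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```

The number of entries under `device` is what determines the rank size, which is why `device_num` does not need to be passed to the launcher explicitly.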
## How to use

For example, to run distributed training of the BERT model on D chips, run the following command in the `/bert/` directory:

```
python model_zoo/utils/ascend_distributed_launcher/run_distributed.py --run_script_dir ./run_pretrain.py --hyper_parameter_config_dir model_zoo/utils/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /path/dataset/ --hccl_config_dir model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
```
Output:

```
hccl_config_dir: model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
the number of logical core: 192
avg_core_per_rank: 96
rank_size: 2

start training for rank 0, device 5:
rank_id: 0
device_id: 5
core nums: 0-95
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG5/log.txt

start training for rank 1, device 6:
rank_id: 1
device_id: 6
core nums: 96-191
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG6/log.txt
```
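The per-rank values above (CPU core ranges, rank and device IDs, per-device log directories) are derived by the launcher from the HCCL config file. The following is only a minimal sketch of that behavior under assumed names and paths, not the actual contents of `run_distributed.py`:

```python
# Illustrative sketch only: approximates the per-rank launch behavior suggested
# by the output above; see run_distributed.py for the actual implementation.
import json
import multiprocessing
import os


def launch(run_script, hccl_config, data_dir, extra_args=""):
    with open(hccl_config) as f:
        devices = json.load(f)["server_list"][0]["device"]

    rank_size = len(devices)                 # e.g. 2
    cores = multiprocessing.cpu_count()      # e.g. 192 logical cores
    avg_core_per_rank = cores // rank_size   # e.g. 96

    os.environ["RANK_SIZE"] = str(rank_size)
    os.environ["RANK_TABLE_FILE"] = os.path.abspath(hccl_config)

    for dev in devices:
        rank_id = int(dev["rank_id"])
        device_id = int(dev["device_id"])
        start = rank_id * avg_core_per_rank  # e.g. cores 0-95 and 96-191
        end = start + avg_core_per_rank - 1
        log_dir = "./LOG{}".format(device_id)
        os.makedirs(log_dir, exist_ok=True)

        os.environ["RANK_ID"] = str(rank_id)
        os.environ["DEVICE_ID"] = str(device_id)

        # Bind each training process to its own slice of CPU cores and
        # redirect its output to a per-device log file.
        cmd = ("taskset -c {}-{} nohup python {} --data_dir={} {} "
               "> {}/log.txt 2>&1 &").format(start, end, run_script,
                                             data_dir, extra_args, log_dir)
        os.system(cmd)
```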
## Note

- `hccl_2p_56_x.x.x.x.json` can be generated with `hccl_tools.py`.
- You should customize the hyper parameters in `hyper_parameter_config.ini` (an illustrative sketch follows this list). Note that these two hyper parameters must not be configured there: `device_id` and `device_num`.
- For other models, you should customize the `run_script` option and the corresponding `hyper_parameter_config.ini`.
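As a rough illustration of the second note, a customized `hyper_parameter_config.ini` could look like the sketch below; the section name and keys are assumptions and must match the arguments your `run_script` actually accepts:

```ini
; Illustrative sketch only: keys here are forwarded to the training script
; as command-line arguments. Adapt them to what your run_script accepts.
; device_id and device_num must NOT be set here; the launcher assigns them per rank.
[config]
epoch_size=8
```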