# Cluster Training Using OpenMPI

## Prepare an OpenMPI cluster

Run the following commands to start a 3-node MPI cluster and one "head" node.

```bash
cd paddle/scripts/cluster_train_v2/openmpi/docker_cluster
kubectl create -f head.yaml
kubectl create -f mpi-nodes.yaml
```

You can then log in to every OpenMPI node over ssh without entering a password.

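Passwordless login only works once the pods are actually up. The following is a minimal sketch for checking that, assuming the default `kubectl get po` column layout in which STATUS is the third column. `all_pods_running` is a hypothetical helper name, not part of PaddlePaddle or kubectl.

```bash
# Reads `kubectl get po` output on stdin and succeeds only if at least one
# pod is listed and every listed pod has STATUS "Running".
# Assumption: STATUS is the third whitespace-separated column.
all_pods_running() {
  awk 'NR > 1 { seen = 1; if ($3 != "Running") bad = 1 }
       END { exit ((seen && !bad) ? 0 : 1) }'
}

# Usage (needs kubectl and a reachable cluster):
#   kubectl get po | all_pods_running && echo "cluster is ready"
```

The helper only parses text, so it can be tried on saved `kubectl` output before being pointed at a live cluster.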
## Launching Cluster Job

Follow these steps to launch a PaddlePaddle training job on the OpenMPI cluster:

```bash
# find out the node IP addresses
kubectl get po -o wide
# generate a "machines" file containing the node IP addresses
kubectl get po -o wide | grep nodes | awk '{print $6}' > machines
# copy the necessary files onto the "head" node
# (note: -i normally takes a private key; a .pub path only works when the
# matching private key is loaded in ssh-agent)
scp -i ssh/id_rsa.mpi.pub machines prepare.py train.py start_mpi_train.sh tutorial@[headIP]:~
# log in to the head node using ssh
ssh -i ssh/id_rsa.mpi.pub tutorial@[headIP]
# --------------- in head node ---------------
# prepare training data
python prepare.py
# copy the dict file, scripts, and machines file to every MPI node
cat machines | xargs -I {} scp word_dict.pickle train.py start_mpi_train.sh machines {}:/home/tutorial
# create a directory for storing log files
mpirun -hostfile machines -n 3 mkdir /home/tutorial/logs
# copy one shard of the training data to each node
scp train.txt-00000 test.txt-00000 [node1IP]:/home/tutorial
scp train.txt-00001 test.txt-00001 [node2IP]:/home/tutorial
scp train.txt-00002 test.txt-00002 [node3IP]:/home/tutorial
# start the job
mpirun -hostfile machines -n 3 /home/tutorial/start_mpi_train.sh
```
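The three per-node `scp` commands follow a regular pattern, so they can be generated from the `machines` file instead of typed by hand. The sketch below assumes that line N of the hostfile (counting from 0) should receive shard N, which is one possible mapping and not something the steps above require. `print_shard_copy_commands` is a hypothetical helper name.

```bash
# Prints one scp command per host listed in the hostfile (default: ./machines).
# Assumption: the host on line N (0-based) receives train.txt-0000N and
# test.txt-0000N. Nothing is copied; the commands are only printed.
print_shard_copy_commands() {
  hostfile="${1:-machines}"
  index=0
  while IFS= read -r host || [ -n "$host" ]; do
    [ -z "$host" ] && continue          # skip blank lines
    suffix=$(printf '%05d' "$index")    # 0 -> 00000, 1 -> 00001, ...
    echo "scp train.txt-$suffix test.txt-$suffix $host:/home/tutorial"
    index=$((index + 1))
  done < "$hostfile"
}

# Usage on the head node: review the output first, then run it.
#   print_shard_copy_commands machines
#   print_shard_copy_commands machines | sh
```

Printing the commands instead of running them directly keeps the step easy to inspect, which is useful when the number of nodes differs from the three used here.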