# Distributed Training with NCCL2 and RDMA
When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 to run such training jobs and
achieve the best performance.
## Prepare Hardware with RDMA and Multiple GPUs

I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card. The base environment is:
In general, the steps include:

1. Use docker to run tests and make sure GPUs and RDMA can work inside
   the container.
I'll omit the section "Install GPU drivers" because we can find it easily
somewhere else.
### Install RDMA drivers
In my case, I've got two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4", and I updated the kernel to version 4.4 so that docker can
work with the latest overlay2 filesystem.
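
Before installing the drivers, it helps to confirm what the machine actually has. A minimal check might look like this (the grep pattern matches my Mellanox card; adjust it for your hardware, and note that `ibstat` only works after the driver stack is installed):

```shell
# Confirm the RDMA NIC is visible on the PCI bus
# (pattern matches Mellanox ConnectX-4; adjust for your card).
lspci | grep -i mellanox

# Confirm the running kernel is the updated 4.4 series,
# which docker's overlay2 storage driver needs.
uname -r

# After the driver install, list the InfiniBand/RoCE ports
# (ibstat ships with the infiniband-diags package).
ibstat
```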
***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh, because we may need to re-configure the
network device.***

1. Do the same thing on the other node.
1. Use `ping` to test if the two nodes have a typical ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to test that the network connection is
   ready and has the desired bandwidth.
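
As a sketch, the tests above can be run as follows. The IP addresses and the device name `mlx5_0` are placeholders for my setup; `udaddy` comes from `librdmacm-utils` and `ib_write_bw` from the `perftest` package:

```shell
# On node A (hypothetical address 192.168.1.1), start the server side:
udaddy                        # RDMA connection manager test, server mode
ib_write_bw -d mlx5_0         # RDMA write bandwidth test, server mode

# On node B, first check plain ICMP reachability, then point
# each RDMA tool at node A:
ping -c 4 192.168.1.1
udaddy -s 192.168.1.1
ib_write_bw -d mlx5_0 192.168.1.1
```

If `ib_write_bw` reports a bandwidth close to the card's rated 100Gb, the RDMA link is ready.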
### Prepare Docker Image to Run RDMA Programs
package in it.
1. Start a docker container and mount GPU driver libs into it (you can
   skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the docker image (see the section below),
   also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`,
   or just use privileged mode `--privileged`.
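
Putting the steps above together, a container start command might look like this sketch. The image name is hypothetical (any CUDA base image with `rdma-core` installed), and the device paths are assumptions — check `/dev/infiniband` and `/dev/nvidia*` on your hosts:

```shell
# Explicitly pass GPU and RDMA devices into the container
# (device paths vary by driver version and card count).
docker run -it --rm \
  --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 \
  --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm \
  -v /usr/local/nvidia:/usr/local/nvidia:ro \
  my-rdma-cuda:latest /bin/bash

# Or, less fine-grained: privileged mode exposes all host devices.
docker run -it --rm --privileged my-rdma-cuda:latest /bin/bash
```

Privileged mode is simpler but gives the container full access to the host, so the explicit `--device` list is preferable on shared machines.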