You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Paddle/doc/fluid/howto/cluster/nccl2_rdma_training.md

4.0 KiB

Distributed Training with NCCL2 and RDMA

When doing distributed multi-GPU training, network bandwith often becomes the bottle neck. We introduce a way to use NCCL2 to do such training job to achieve best performace.

Prepare Hardwares with RDMA and Multiple GPUs

I'm using two Linux servers each of them is installed with 8 GPUs and one 100Gb RDMA card. Base environment is:

  • OS: CentOS 7.4
  • RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
  • Kernel version: 4.4.88-1.el7.elrepo.x86_64
  • Docker version: 1.12.6
  • Docker storage driver: overlay2
  • IP addresses: 192.168.16.30,192.168.16.34

In general, the steps including:

  1. Install GPU drivers
  2. Install RDMA drivers
  3. Install "InfiniBand Support"
  4. Use docker to run tests and make sure GPUs and RDMA can work inside the container.

I'll ommit section "Install GPU drivers" because we can find it easily somewhere else.

Install RDMA drivers

For my case, I've got two machines with device "Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was "CentOS 7.4" and I updated the kernel to version 4.4 so that docker can work with latest overlay2 filesystem.

NOTE: before you start, make sure you have a way to get a console of the server other than ssh because we may need to re-configure the network device.

  1. Go to http://www.mellanox.com/page/products_dyn?product_family=26, download MLNX_OFED software in the bottom of the page, and upload it onto the server.
  2. Run ./mlnxofedinstall --add-kernel-support in the software package.
  3. Run /etc/init.d/openibd restart to make everything work, note that this operation may cause the network goes down if you are using this RDMA device as default network device and use ssh to login the server.
  4. Re-configure the network interface, for example: ifconfig eth2 192.168.16.30/20 up, then add routes if needed: ip route add default via 192.168.16.1 dev eth2.
  5. Do the same thing on the other node.
  6. Use ping to test if the two nodes have typical ICMP connection.
  7. Use either udaddy or ib_write_bw to test the network connection is ready and have the desired bandwith.

Prepare Docker Image to Run RDMA Programs

  1. Build a docker image using cuda base image like: nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 and install paddlepaddle whl package in it.
  2. Start a docker container and mount GPU driver libs into it (you can skip this step if you are using nvidia-docker).
  3. Mount RDMA dirvers and libs into the docker image (see below section), also udaddy and ib_write_bw if needed.
  4. Mount GPU devices and RDMA devices into the container using --device or just use privileged mode --privileged.
  5. Start the container using host network mode: --net=host

RDMA Library Files Needed

Usually, MLNX_OFED install latest supported libs under /usr/lib64/mlnx_ofed/valgrind. Other libs also needed to run RDMA programs is listed below. These libs must be mounted into the docker container.

  • Libs under /usr/lib64/mlnx_ofed/valgrind
    • libibcm.so
    • libibverbs.so
    • libmlx4.so
    • libmlx5.so
    • libmlx5-rdmav2.so
    • librdmacm.so
  • Other libs:
    • libnl-3.so.200
    • libnl-route-3.so.200
    • libnuma.so.1

Start to Run the Training Job

Setting NCCL environment variables to turn NCCL switches on and off:

Env Name Description
NCCL_SOCKET_IFNAME The RDMA device, e.g. eth2
NCCL_P2P_DISABLE Set to 1 to disable P2P transfer between GPUs
NCCL_IB_DISABLE Set to 1 to disable using RDMA
NCCL_IB_CUDA_SUPPORT Set to 1 to enable GPU Direct if supported
NCCL_DEBUG Set debug level: VERSION, WARN, INFO

My two servers are: 192.168.16.30,192.168.16.34, On node 1, Run :

PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py

On node 2, Run:

PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py