# Distributed Training with NCCL2

We design a pattern that enables training with `ParallelExecutor`,
using [NCCL2](https://developer.nvidia.com/nccl) as its collective
communication library.

In `ParallelExecutor` we can use `AllReduce` or `Reduce` and `Broadcast`
to do multi-GPU training. If we initialize the NCCL2 communicators as
ranks in a distributed environment, we can simply run the
`ParallelExecutor` as a distributed program! The only thing that differs
from the single-node version is that we need to broadcast the NCCL
unique ID to all the nodes and initialize the communicators using that
ID, so that the NCCL2 communicators recognize each other as ranks of the
same group.
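
This is the standard NCCL2 bootstrap flow: rank 0 generates a
`ncclUniqueId`, every rank obtains a copy of it, and all ranks call
`ncclCommInitRank` with that same ID. A minimal sketch of the flow
(`broadcast_unique_id` is a hypothetical transport helper; in this
design the `gen_nccl_id` op described below plays that role):

```cpp
#include <nccl.h>

// Hypothetical transport that delivers rank 0's ID to every other
// node (TCP, shared file system, ...); in this design the gen_nccl_id
// op fills this role. Error handling is omitted for brevity.
void broadcast_unique_id(ncclUniqueId* id, int rank);

// Assumes the caller has already selected a GPU with cudaSetDevice().
ncclComm_t init_distributed_comm(int rank, int nranks) {
  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);  // only rank 0 generates the unique ID
  }
  broadcast_unique_id(&id, rank);  // now every rank holds the same ID
  ncclComm_t comm;
  // All ranks pass the same ID and nranks; together they form one
  // communicator group and can run collectives such as AllReduce.
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}
```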

To achieve this feature, we introduce a new operator, the `gen_nccl_id`
op, so we are ***not*** bound to running NCCL2 with MPI; we can run it
on whatever platform you like.

It has two running modes (see the sketch after the list):

1. Generate and broadcast mode, which should be used on trainer 0;
1. Listen and fetch mode, which should be used on trainers other than 0.
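
A simplified sketch of what the two modes do, assuming a plain TCP
transport over POSIX sockets (the endpoints, port, and function names
are illustrative; the actual op is implemented as a fluid operator):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <nccl.h>
#include <string>
#include <vector>

// Mode 1, trainer 0: generate the ID and push its raw bytes to every
// other trainer's endpoint. Error handling is omitted for brevity.
ncclUniqueId generate_and_broadcast(const std::vector<std::string>& peer_ips,
                                    int port) {
  ncclUniqueId id;
  ncclGetUniqueId(&id);
  for (const std::string& ip : peer_ips) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip.c_str(), &addr.sin_addr);
    connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    send(fd, &id, sizeof(id), 0);  // ncclUniqueId is an opaque byte blob
    close(fd);
  }
  return id;
}

// Mode 2, other trainers: listen on the agreed port and read the ID
// bytes sent by trainer 0.
ncclUniqueId listen_and_fetch(int port) {
  int srv = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = INADDR_ANY;
  addr.sin_port = htons(port);
  bind(srv, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(srv, 1);
  int conn = accept(srv, nullptr, nullptr);
  ncclUniqueId id;
  recv(conn, &id, sizeof(id), MSG_WAITALL);
  close(conn);
  close(srv);
  return id;
}
```

Either way, the result is the same `ncclUniqueId` on every trainer,
which the op then writes into the scope as described next.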

In both modes, the op saves the NCCL ID into the current scope as a
persistable variable. We can then insert this op at the end of the
fluid "startup program", so that all workers can get the same ID to
initialize their NCCL communicator objects.

<img src="src/ncc2_design.png">

The above figure shows the general process of training with NCCL2 in
distributed mode. Each trainer has as many communicators as it has
GPUs, but the ranks must follow the global rank numbering: here we
have 8 GPUs in total, so `nranks==8`, and the ranks should be 0 ~ 3 on
trainer 0 and 4 ~ 7 on trainer 1.
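
The rank mapping in the figure can be expressed as a short sketch
(assuming 2 trainers with 4 GPUs each; `init_local_comms` is an
illustrative helper, not fluid's actual implementation). NCCL requires
the group-call pattern when a single thread creates several
communicators:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Create one communicator per local GPU; each communicator gets a
// *global* rank so both trainers form a single nranks-wide group.
std::vector<ncclComm_t> init_local_comms(int trainer_id, int gpus_per_trainer,
                                         int nranks, ncclUniqueId id) {
  std::vector<ncclComm_t> comms(gpus_per_trainer);
  ncclGroupStart();  // required when one thread initializes many comms
  for (int gpu = 0; gpu < gpus_per_trainer; ++gpu) {
    cudaSetDevice(gpu);
    // Global rank: 0 ~ 3 on trainer 0, 4 ~ 7 on trainer 1.
    int global_rank = trainer_id * gpus_per_trainer + gpu;
    ncclCommInitRank(&comms[gpu], nranks, id, global_rank);
  }
  ncclGroupEnd();
  return comms;
}
```

Trainer 0 would call `init_local_comms(0, 4, 8, id)` and trainer 1
`init_local_comms(1, 4, 8, id)`, matching the figure above.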