From 549c74a9378593d81222238f393b3831cb7f55f1 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Mon, 12 Feb 2018 17:44:55 -0800
Subject: [PATCH 1/5] Create parallel_do.md

---
 doc/design/parallel_do.md | 83 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 doc/design/parallel_do.md

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
new file mode 100644
index 0000000000..c41af8c413
--- /dev/null
+++ b/doc/design/parallel_do.md
@@ -0,0 +1,83 @@
+# Design Doc: Parallel_Do in PaddlePaddle
+
+In PaddlePaddle, we use the parallel_do primitive to represent multithreaded data-parallel processing.
+
+## Design overview
+
+The definition of a parallel_do op looks like the following:
+
+```c++
+AddInput(kInputs, "Inputs needed to be split onto different devices").AsDuplicable();
+AddInput(kParameters, "Parameters are duplicated over different devices")
+    .AsDuplicable();
+AddInput(kPlaces, "Devices used for parallel processing");
+AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
+AddOutput(kParallelScopes,
+          "Container for all local variables in forward pass.");
+AddAttr(kParallelBlock,
+        "List of operators to be executed in parallel");
+```
+
+A vanilla implementation of parallel_do can be sketched as the following (`|` means single thread and
+`||||` means multiple threads):
+
+```
+In the forward pass
+  | Split input onto different devices
+  | Copy parameters onto different devices
+  |||| Compute forward pass in parallel
+  | Merge output from different devices
+
+In the backward pass
+  | Split output@grad onto different devices
+  |||| Compute backward pass in parallel
+  | Accumulate param@grad from different devices to the first device
+  | Merge input@grad from different devices
+```
+
+This implementation allows us to write a mixed-device program like this:
+
+```python
+# get embedding feature on CPU
+feature = some_cpu_only_op(data)
+
+# parallel processing on multiple GPUs
+pd = ParallelDo(gpu_places)
+with pd.do():
+    read_input(feature)
+    prediction = my_net(feature)
+    write_output(activation)
+prediction = pd()
+loss = cross_entropy(prediction, label)
+```
+
+## Performance Improvement
+
+There are several places where we can make this parallel_do faster.
+
+### forward: split input onto different devices
+
+If the input of the parallel_do is independent of any prior operators, we can avoid this step by
+prefetching the input onto different devices in a separate background thread. The Python code
+looks like this:
+```python
+pd = ParallelDo(gpu_places)
+with pd.do():
+    feature = pre_fetch(gpu_places)
+    prediction = my_net(feature)
+    write_output(prediction)
+```
+
+### forward: copy parameters onto different devices
+
+We can avoid this step by making each device have a copy of the parameters. This requires:
+
+1. `fluid.default_startup_program()` to be run on all devices
+1. In the backward pass, allreduce param@grad across different devices; this requires
+   1. `backward.py` to add `allreduce` operators at parallel_do_grad
+   1. `allreduce` operators need to be called in async mode to achieve maximum throughput
+1. apply gradient related ops (e.g. clipping, normalization, decay, sgd) on different devices in parallel
+
+By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device"
+
+

From ad2dfef4168b31048b777ca5eb328a0368bd74ba Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Mon, 12 Feb 2018 17:56:28 -0800
Subject: [PATCH 2/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 76 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index c41af8c413..576d30329b 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -41,6 +41,7 @@ This implementation allows us to write a mixed-device program like this:
 # get embedding feature on CPU
 feature = some_cpu_only_op(data)
 
+gpu_places = get_place(use_gpu=True)
 # parallel processing on multiple GPUs
 pd = ParallelDo(gpu_places)
 with pd.do():
@@ -51,6 +52,38 @@ prediction = pd()
 loss = cross_entropy(prediction, label)
 ```
 
+The corresponding ProgramDesc looks like the following
+
+```
+# start_program will be run by executor(CPUPlace); all w1, w2 will be allocated on CPU
+start_program
+{
+  vars: w1, w2
+  ops: init(w1), init(w2)
+}
+
+main_program
+{
+block0 {
+  vars: data, places, w1, w2
+  ops: data, get_place, parallel_do(block1),
+       parallel_do_grad(block2),
+       sgd(w2, w2_grad),
+       sgd(w1, w1_grad)
+}
+block1 {
+  vars: data, h1, h2, loss
+  ops: fc, fc, softmax
+}
+block2 {
+  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
+  ops: softmax_grad,
+       fc_grad,
+       fc_grad
+}
+}
+```
+
 ## Performance Improvement
 
 There are several places where we can make this parallel_do faster.
@@ -78,6 +111,47 @@ We can avoid this step by making each device have a copy of the parameters. This
    1. `allreduce` operators need to be called in async mode to achieve maximum throughput
 1. apply gradient related ops (e.g. clipping, normalization, decay, sgd) on different devices in parallel
 
-By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device"
+By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device".
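+
+As a rough, non-normative sketch of how `backward.py` might append the `allreduce` operators mentioned
+above (the helper name and the op-description fields below are assumptions for illustration only, not
+the actual fluid implementation), the insertion could look like this:
+
+```python
+# Sketch: after the backward ops of the parallel_do_grad block are generated,
+# append one allreduce per parameter gradient so that every device ends up with
+# the same param@grad. Op descriptions are modeled as plain dicts here.
+def append_allreduce_ops(grad_block_ops, param_grad_names, places, scopes):
+    """grad_block_ops: a list of op descriptions (illustrative, not real OpDesc)."""
+    for grad_name in param_grad_names:
+        grad_block_ops.append({
+            "type": "allreduce",
+            "inputs": {"X": [grad_name], "Places": places, "Scopes": scopes},
+            "outputs": {"Out": [grad_name]},
+            # async mode so communication overlaps with the remaining backward ops
+            "attrs": {"sync_mode": False},
+        })
+    return grad_block_ops
+
+# usage sketch: append_allreduce_ops(block2_ops, ["w1@GRAD", "w2@GRAD"], places, scopes)
+```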
+
+And the ProgramDesc looks like the following
+
+```
+# w1, w2 will be allocated on all GPUs
+start_program
+{
+block0 {
+  parallel_do(block1)
+}
+block1 {
+  vars: w1, w2
+  ops: init(w1), init(w2)
+}
+}
+
+main_program
+{
+block0 {
+  vars: data, places, w1, w2
+  ops: data, get_place, parallel_do(block1),
+       parallel_do_grad(block2),   # append_backward
+       parallel_do(block3)         # append_optimization
+
+}
+block1 {
+  vars: data, h1, h2, loss
+  ops: fc, fc, softmax
+}
+block2 {
+  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
+  ops: softmax_grad,
+       fc_grad, allreduce(places, scopes, w1_grad),
+       fc_grad, allreduce(places, scopes, w2_grad)
+}
+block3 {
+  vars: lr
+  ops: sgd(w2, w2_grad),
+       sgd(w1, w1_grad)
+}
+}
+```

From c62ef22da37924d5cc08635b490c86910c474fa6 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Tue, 13 Feb 2018 13:50:22 -0800
Subject: [PATCH 3/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index 576d30329b..5b90b50792 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -72,10 +72,12 @@ block0 {
        sgd(w1, w1_grad)
 }
 block1 {
+  parent_block: 0
   vars: data, h1, h2, loss
   ops: fc, fc, softmax
 }
 block2 {
+  parent_block: 1
   vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
   ops: softmax_grad,
        fc_grad,
        fc_grad
 }
@@ -122,6 +124,7 @@ block0 {
   parallel_do(block1)
 }
 block1 {
+  parent_block: 0
   vars: w1, w2
   ops: init(w1), init(w2)
 }
@@ -137,16 +140,19 @@ block0 {
 
 }
 block1 {
+  parent_block: 0
   vars: data, h1, h2, loss
   ops: fc, fc, softmax
 }
 block2 {
+  parent_block: 1
   vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
   ops: softmax_grad,
        fc_grad, allreduce(places, scopes, w1_grad),
        fc_grad, allreduce(places, scopes, w2_grad)
 }
 block3 {
+  parent_block: 0
   vars: lr
   ops: sgd(w2, w2_grad),
        sgd(w1, w1_grad)
 }

From 16a8def1cda1a46ed772d5e37e5f679822f55436 Mon Sep 17 00:00:00 2001
From: Yang Yang
Date: Wed, 14 Feb 2018 01:50:39 +0000
Subject: [PATCH 4/5] fix style

---
 doc/design/parallel_do.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index 5b90b50792..d51b1014d4 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -159,5 +159,3 @@ block3 {
 }
 }
 ```
-
-

From 8b24bd4fe851a793521f88630f56e9d4afcf8263 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Wed, 14 Feb 2018 11:56:46 -0800
Subject: [PATCH 5/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index d51b1014d4..221af6b6a4 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -13,7 +13,7 @@ AddInput(kParameters, "Parameters are duplicated over different devices")
 AddInput(kPlaces, "Devices used for parallel processing");
 AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
 AddOutput(kParallelScopes,
-          "Container for all local variables in forward pass.");
+          "Scopes for all local variables in forward pass. One scope for each device");
 AddAttr(kParallelBlock,
         "List of operators to be executed in parallel");
 ```
@@ -33,6 +33,7 @@ In the backward pass
   |||| Compute backward pass in parallel
   | Accumulate param@grad from different devices to the first device
   | Merge input@grad from different devices
+  | Copy param@grad to the place of parallel_do_op
 ```
 
 This implementation allows us to write a mixed-device program like this:
@@ -47,7 +48,7 @@ pd = ParallelDo(gpu_places)
 with pd.do():
     read_input(feature)
     prediction = my_net(feature)
-    write_output(activation)
+    write_output(prediction)
 prediction = pd()
 loss = cross_entropy(prediction, label)
 ```
 
 The corresponding ProgramDesc looks like the following
@@ -98,7 +99,7 @@ looks like this:
 ```python
 pd = ParallelDo(gpu_places)
 with pd.do():
-    feature = pre_fetch(gpu_places)
+    feature = get_data_from_prefetch_queue(gpu_places)
     prediction = my_net(feature)
     write_output(prediction)
 ```
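
For reference, the prefetch queue used in the last hunk can be sketched, framework-agnostically, as a
background thread that splits each mini-batch and feeds per-device queues; every name and signature
below is illustrative only and not a PaddlePaddle API:

```python
# Minimal sketch of the prefetch idea: a background thread splits each batch
# across devices and pushes the shards into per-device queues, so the compute
# threads never wait on the split/copy step.
import queue
import threading

def start_prefetcher(batch_reader, num_devices, capacity=4):
    queues = [queue.Queue(maxsize=capacity) for _ in range(num_devices)]

    def worker():
        for batch in batch_reader():
            # split the batch into one shard per device (round-robin here)
            shards = [batch[i::num_devices] for i in range(num_devices)]
            for q, shard in zip(queues, shards):
                q.put(shard)            # blocks when the queue is full
        for q in queues:
            q.put(None)                 # poison pill: signals end of data

    threading.Thread(target=worker, daemon=True).start()
    return queues                        # device i reads from queues[i]

# usage sketch:
# queues = start_prefetcher(my_reader, num_devices=4)
# shard = queues[device_id].get()        # analogous to get_data_from_prefetch_queue
```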