From 549c74a9378593d81222238f393b3831cb7f55f1 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Mon, 12 Feb 2018 17:44:55 -0800
Subject: [PATCH 1/5] Create parallel_do.md

---
 doc/design/parallel_do.md | 83 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 doc/design/parallel_do.md

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
new file mode 100644
index 0000000000..c41af8c413
--- /dev/null
+++ b/doc/design/parallel_do.md
@@ -0,0 +1,83 @@
+# Design Doc: Parallel_Do in PaddlePaddle
+
+In PaddlePaddle, we use the parallel_do primitive to represent multithreaded data-parallel processing.
+
+## Design overview
+
+The definition of a parallel_do op looks like the following:
+
+```c++
+AddInput(kInputs, "Inputs needed to be split onto different devices").AsDuplicable();
+AddInput(kParameters, "Parameters are duplicated over different devices")
+    .AsDuplicable();
+AddInput(kPlaces, "Devices used for parallel processing");
+AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
+AddOutput(kParallelScopes,
+          "Container for all local variables in forward pass.");
+AddAttr(kParallelBlock,
+        "List of operators to be executed in parallel");
+```
+
+A vanilla implementation of parallel_do can be sketched as the following (`|` means single thread and
+`||||` means multiple threads):
+
+```
+In the forward pass
+  | Split input onto different devices
+  | Copy parameters onto different devices
+  |||| Compute forward pass in parallel
+  | Merge output from different devices
+
+In the backward pass
+  | Split output@grad onto different devices
+  |||| Compute backward pass in parallel
+  | Accumulate param@grad from different devices to the first device
+  | Merge input@grad from different devices
+```
+
+This implementation allows us to write a mixed-device program like this:
+
+```python
+# get embedding feature on CPU
+feature = some_cpu_only_op(data)
+
+# parallel processing on multiple GPUs
+pd = ParallelDo(gpu_places)
+with pd.do():
+    read_input(feature)
+    prediction = my_net(feature)
+    write_output(activation)
+prediction = pd()
+loss = cross_entropy(prediction, label)
+```
+
+## Performance Improvement
+
+There are several places where we can make this parallel_do faster.
+
+### forward: split input onto different devices
+
+If the input of the parallel_do is independent of any prior operators, we can avoid this step by
+prefetching the input onto different devices in a separate background thread. The Python code
+looks like this:
+```python
+pd = ParallelDo(gpu_places)
+with pd.do():
+    feature = pre_fetch(gpu_places)
+    prediction = my_net(feature)
+    write_output(prediction)
+```
+
+### forward: copy parameters onto different devices
+
+We can avoid this step by making each device have a copy of the parameters. This requires:
+
+1. `fluid.default_startup_program()` to be run on all devices
+1. In the backward pass, allreduce param@grad across different devices; this requires
+   1. `backward.py` to add `allreduce` operators at parallel_do_grad
+   1. `allreduce` operators need to be called in async mode to achieve maximum throughput
+1. apply gradient related ops (e.g. clipping, normalization, decay, sgd) on different devices in parallel
+
+By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device"
+
+

From ad2dfef4168b31048b777ca5eb328a0368bd74ba Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Mon, 12 Feb 2018 17:56:28 -0800
Subject: [PATCH 2/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 76 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index c41af8c413..576d30329b 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -41,6 +41,7 @@ This implementation allows us to write a mixed-device program like this:
 # get embedding feature on CPU
 feature = some_cpu_only_op(data)
 
+gpu_places = get_place(use_gpu=True)
 # parallel processing on multiple GPUs
 pd = ParallelDo(gpu_places)
 with pd.do():
@@ -51,6 +52,38 @@ prediction = pd()
 loss = cross_entropy(prediction, label)
 ```
 
+The corresponding ProgramDesc looks like the following
+
+```
+# start_program will be run by executor(CPUPlace); all w1, w2 will be allocated on CPU
+start_program
+{
+  vars: w1, w2
+  ops: init(w1), init(w2)
+}
+
+main_program
+{
+block0 {
+  vars: data, places, w1, w2
+  ops: data, get_place, parallel_do(block1),
+       parallel_do_grad(block2),
+       sgd(w2, w2_grad),
+       sgd(w1, w1_grad)
+}
+block1 {
+  vars: data, h1, h2, loss
+  ops: fc, fc, softmax
+}
+block2 {
+  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
+  ops: softmax_grad,
+       fc_grad,
+       fc_grad
+}
+}
+```
+
 ## Performance Improvement
 
 There are several places where we can make this parallel_do faster.
@@ -78,6 +111,47 @@ We can avoid this step by making each device have a copy of the parameters. This
    1. `allreduce` operators need to be called in async mode to achieve maximum throughput
 1. apply gradient related ops (e.g. clipping, normalization, decay, sgd) on different devices in parallel
 
-By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device"
+By doing so, we also avoided "backward: accumulate param@grad from different devices to the first device".
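+
+As a rough, non-normative sketch of how `backward.py` might append the `allreduce` operators mentioned
+above (the helper name and the op-description fields below are assumptions for illustration only, not
+the actual fluid implementation), the insertion could look like this:
+
+```python
+# Sketch: after the backward ops of the parallel_do_grad block are generated,
+# append one allreduce per parameter gradient so that every device ends up with
+# the same param@grad. Op descriptions are modeled as plain dicts here.
+def append_allreduce_ops(grad_block_ops, param_grad_names, places, scopes):
+    """grad_block_ops: a list of op descriptions (illustrative, not real OpDesc)."""
+    for grad_name in param_grad_names:
+        grad_block_ops.append({
+            "type": "allreduce",
+            "inputs": {"X": [grad_name], "Places": places, "Scopes": scopes},
+            "outputs": {"Out": [grad_name]},
+            # async mode so communication overlaps with the remaining backward ops
+            "attrs": {"sync_mode": False},
+        })
+    return grad_block_ops
+
+# usage sketch: append_allreduce_ops(block2_ops, ["w1@GRAD", "w2@GRAD"], places, scopes)
+```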
+
+And the ProgramDesc looks like the following
+
+```
+# w1, w2 will be allocated on all GPUs
+start_program
+{
+block0 {
+  parallel_do(block1)
+}
+block1 {
+  vars: w1, w2
+  ops: init(w1), init(w2)
+}
+}
+
+main_program
+{
+block0 {
+  vars: data, places, w1, w2
+  ops: data, get_place, parallel_do(block1),
+       parallel_do_grad(block2),   # append_backward
+       parallel_do(block3)         # append_optimization
+
+}
+block1 {
+  vars: data, h1, h2, loss
+  ops: fc, fc, softmax
+}
+block2 {
+  vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
+  ops: softmax_grad,
+       fc_grad, allreduce(places, scopes, w1_grad),
+       fc_grad, allreduce(places, scopes, w2_grad)
+}
+block3 {
+  vars: lr
+  ops: sgd(w2, w2_grad),
+       sgd(w1, w1_grad)
+}
+}
+```

From c62ef22da37924d5cc08635b490c86910c474fa6 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Tue, 13 Feb 2018 13:50:22 -0800
Subject: [PATCH 3/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index 576d30329b..5b90b50792 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -72,10 +72,12 @@ block0 {
        sgd(w1, w1_grad)
 }
 block1 {
+  parent_block: 0
   vars: data, h1, h2, loss
   ops: fc, fc, softmax
 }
 block2 {
+  parent_block: 1
   vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
   ops: softmax_grad,
        fc_grad,
        fc_grad
 }
@@ -122,6 +124,7 @@ block0 {
   parallel_do(block1)
 }
 block1 {
+  parent_block: 0
   vars: w1, w2
   ops: init(w1), init(w2)
 }
@@ -137,16 +140,19 @@ block0 {
 
 }
 block1 {
+  parent_block: 0
   vars: data, h1, h2, loss
   ops: fc, fc, softmax
 }
 block2 {
+  parent_block: 1
   vars: data_grad, h1_grad, h2_grad, loss_grad, w1_grad, w2_grad
   ops: softmax_grad,
        fc_grad, allreduce(places, scopes, w1_grad),
        fc_grad, allreduce(places, scopes, w2_grad)
 }
 block3 {
+  parent_block: 0
   vars: lr
   ops: sgd(w2, w2_grad),
        sgd(w1, w1_grad)
 }

From 16a8def1cda1a46ed772d5e37e5f679822f55436 Mon Sep 17 00:00:00 2001
From: Yang Yang
Date: Wed, 14 Feb 2018 01:50:39 +0000
Subject: [PATCH 4/5] fix style

---
 doc/design/parallel_do.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index 5b90b50792..d51b1014d4 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -159,5 +159,3 @@ block3 {
 }
 }
 ```
-
-

From 8b24bd4fe851a793521f88630f56e9d4afcf8263 Mon Sep 17 00:00:00 2001
From: "Yang Yang(Tony)"
Date: Wed, 14 Feb 2018 11:56:46 -0800
Subject: [PATCH 5/5] Update parallel_do.md

---
 doc/design/parallel_do.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/doc/design/parallel_do.md b/doc/design/parallel_do.md
index d51b1014d4..221af6b6a4 100644
--- a/doc/design/parallel_do.md
+++ b/doc/design/parallel_do.md
@@ -13,7 +13,7 @@ AddInput(kParameters, "Parameters are duplicated over different devices")
 AddInput(kPlaces, "Devices used for parallel processing");
 AddOutput(kOutputs, "Outputs needed to be merged from different devices").AsDuplicable();
 AddOutput(kParallelScopes,
-          "Container for all local variables in forward pass.");
+          "Scopes for all local variables in forward pass. One scope for each device");
 AddAttr(kParallelBlock,
         "List of operators to be executed in parallel");
 ```
@@ -33,6 +33,7 @@ In the backward pass
   |||| Compute backward pass in parallel
   | Accumulate param@grad from different devices to the first device
   | Merge input@grad from different devices
+  | Copy param@grad to the place of parallel_do_op
 ```
 
 This implementation allows us to write a mixed-device program like this:
@@ -47,7 +48,7 @@ pd = ParallelDo(gpu_places)
 with pd.do():
     read_input(feature)
     prediction = my_net(feature)
-    write_output(activation)
+    write_output(prediction)
 prediction = pd()
 loss = cross_entropy(prediction, label)
 ```
 
 The corresponding ProgramDesc looks like the following
@@ -98,7 +99,7 @@ looks like this:
 ```python
 pd = ParallelDo(gpu_places)
 with pd.do():
-    feature = pre_fetch(gpu_places)
+    feature = get_data_from_prefetch_queue(gpu_places)
     prediction = my_net(feature)
     write_output(prediction)
 ```
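
For reference, the prefetch queue used in the last hunk can be sketched, framework-agnostically, as a
background thread that splits each mini-batch and feeds per-device queues; every name and signature
below is illustrative only and not a PaddlePaddle API:

```python
# Minimal sketch of the prefetch idea: a background thread splits each batch
# across devices and pushes the shards into per-device queues, so the compute
# threads never wait on the split/copy step.
import queue
import threading

def start_prefetcher(batch_reader, num_devices, capacity=4):
    queues = [queue.Queue(maxsize=capacity) for _ in range(num_devices)]

    def worker():
        for batch in batch_reader():
            # split the batch into one shard per device (round-robin here)
            shards = [batch[i::num_devices] for i in range(num_devices)]
            for q, shard in zip(queues, shards):
                q.put(shard)            # blocks when the queue is full
        for q in queues:
            q.put(None)                 # poison pill: signals end of data

    threading.Thread(target=worker, daemon=True).start()
    return queues                        # device i reads from queues[i]

# usage sketch:
# queues = start_prefetcher(my_reader, num_devices=4)
# shard = queues[device_id].get()        # analogous to get_data_from_prefetch_queue
```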