# Design Doc: Fully Static Graph

## Abstract

We propose the *fully static graph* rule: training and inference must be fully specified by the static graph. This means training and inference should be able to run solely on the C++ core (no Python involved); everything should be implemented as an OP.

The user can still use Python to achieve the same result for convenience when experimenting locally, but distributed training will not support Python.

## Background

There are two paradigms for expressing the computation graph: dynamic and static. The dynamic paradigm constructs the graph on the fly: every time `eval` is called, a new graph is created. The static paradigm constructs the graph first, and then calls `eval`; no new graph is created each time `eval` is called.
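
To make the contrast concrete, here is a minimal pseudocode sketch; the names (`model`, `pd.placeholder`, `pd.eval`) are hypothetical and only stand in for whatever the host API provides:

```Python
# Pseudo code. Dynamic paradigm: the graph is rebuilt on every call.
def train_step(batch):
    y = model(batch)  # OPs are recorded while this Python code runs
    return y.eval()   # a fresh graph is built, run once, then discarded

# Pseudo code. Static paradigm: declare once, evaluate many times.
x = pd.placeholder()  # symbolic input
y = model(x)          # the whole graph is constructed up front
for batch in batches:
    pd.eval(y, feed={x: batch})  # the same graph is reused on every call
```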

The dynamic graph has the advantage of being flexible but is highly dependent on the host language (most commonly Python). The static graph is not as flexible, but more optimization can be done since the graph is known before the computation happens. PaddlePaddle uses the static graph approach because we focus on production deployment and cluster training, where efficiency is key.

This design doc addresses an important question for the static graph approach: should the training logic be fully specified by the static graph?

For example, it is common to control the graph evaluation from Python:

```Python
for i in range(10000):
    paddle.eval(train_op)
```

In the above example, the training logic is not fully specified by the graph: Python still takes control of the training logic.

## Fully Static Graph

The training logic should be fully specified by the graph (though we still support controlling the graph evaluation from Python), because Python adds complications for distributed training:

- The distributed training engine needs to place the computation graph onto different nodes, and add communication OPs for data that crosses node boundaries. Both are very hard to do if the training logic is not fully specified by the graph.

- For fault recovery, every piece of runtime state needs to be saved, but state held in Python code (such as the training loop index and the data reader position) cannot be saved (see the sketch after this list).

- Allowing arbitrary Python code to execute on Paddle Cloud makes training data safety very hard, if not impossible, to control.
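
To make the fault-recovery point concrete, here is a minimal pseudocode sketch; `pd.for_loop` mirrors the example later in this doc, while `pd.save_checkpoint` is a hypothetical OP, not a settled API:

```Python
# Pseudo code. State kept in the graph is visible to the runtime: the loop
# counter is a graph variable, so a (hypothetical) checkpoint OP could
# serialize it, and training could resume from the saved iteration.
i = pd.Variable(0)
with pd.for_loop(lambda i: i < 10000, [i]) as loop:
    # ... training OPs ...
    pd.save_checkpoint("ckpt")  # hypothetical: saves all graph state, including i
    pd.add(i, 1)

# Pseudo code. State kept in Python is invisible to the runtime: `i` lives
# only in the Python interpreter, so no OP can observe or restore it after
# a failure.
for i in range(10000):
    paddle.eval(train_op)
```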

### Benefits

- A clear separation between graph declaration (currently done in Python) and graph execution. This makes it easier for us to add a new language binding (or to invent our own deep learning graph specification language); see the sketch after this list.

- Local or distributed graph execution is easier to optimize.

- Much easier to ensure training data safety on Paddle Cloud.
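
As a rough sketch of the first benefit, graph declaration could produce a serialized, language-neutral artifact that the C++ core executes on its own; `pd.Graph`, `Graph.serialize`, and `pd.core.Executor` are hypothetical names used only for illustration:

```Python
# Pseudo code. Declaration: any language binding only needs to build the
# graph and emit a serialized description of it.
graph = pd.Graph()
with graph:
    loop = build_training_loop()  # e.g. the for_loop example in the next section

spec = graph.serialize()  # language-neutral graph specification

# Execution: the C++ core consumes the specification directly, with no
# Python involved, whether running locally or on Paddle Cloud.
executor = pd.core.Executor(spec)
executor.run()
```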

### Example

To give a concrete example, the for loop is essential for training: on every iteration, a new mini-batch is fed into the training system. Under the fully static graph rule, we **must** implement the for loop as an OP:

```Python
# pseudo code, we need to discuss the for loop interface
i = pd.Variable(0)
# model parameters; initialization is omitted in this pseudo code
w = pd.Variable()
b = pd.Variable()
optimizer = paddle.op.Adam()
# specify the input file as the argument, or
# leave blank and specify using config when running on Paddle Cloud
input = paddle.op.recordIO("/home/data/input.recordio")
q_x, q_y = input[0], input[1]

def cond(i):
    return i < 10000

with pd.for_loop(cond, [i]) as loop:
    # Dequeue a new example each iteration.
    x = q_x.dequeue()
    y = q_y.dequeue()
    # Compute the loss from the freshly dequeued example.
    loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
    optimizer.minimize(loss)
    pd.add(i, 1)

# or paddle.save_target(loop, "job.bin") and
# submit the saved file to Paddle Cloud.
paddle.eval(loop)
```

The above code can run both locally and on Paddle Cloud.

For the user's convenience, they can still drive the training loop from Python:

```Python
optimizer = paddle.op.Adam()
input = paddle.op.recordIO("/home/data/input.recordio")
q_x, q_y = input[0], input[1]
# model parameters; initialization is omitted in this pseudo code
w = pd.Variable()
b = pd.Variable()
x = q_x.dequeue()
y = q_y.dequeue()
loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
train_op = optimizer.minimize(loss)

for i in range(10000):
    paddle.eval(train_op)
```

The above code can only run locally: because the training loop lives in Python rather than in the graph, the runtime cannot distribute it, checkpoint it, or run it without Python.