# Design Doc: Large Model

## Abstract

We propose an approach to support very large parameters. For an embedding
layer, the parameter may be too large to fit in one trainer's memory. In this
approach, a Trainer would prefetch the sliced parameters it needs from
different Parameter Server instances according to the input `Ids`, run
forward and backward, and then send the gradients to the Parameter Servers,
which execute the optimize program.
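
To make this flow concrete, here is a toy, single-server sketch of one trainer
step; the NumPy table standing in for the Parameter Server, the stand-in loss,
and the plain SGD update are all illustrative assumptions (the multi-server
split is described in the Design section below).

```python
import numpy as np

# Toy stand-in for the flow above: one in-process "Parameter Server" holding
# an embedding table, and one trainer step that prefetches only the rows
# named by the mini-batch `Ids`.
table = np.random.rand(1000, 8).astype(np.float32)   # parameter held remotely

def train_step(ids, lr=0.1):
    unique_ids = sorted(set(ids))

    # 1. Prefetch: fetch only the rows this mini-batch needs.
    rows = table[unique_ids]

    # 2. Forward/backward on the trainer. With a sum over the fetched rows as
    #    a stand-in loss, the gradient of each fetched row is all ones.
    loss = rows.sum()
    sparse_grad = {row_id: np.ones(8, dtype=np.float32) for row_id in unique_ids}

    # 3. Send the sparse gradient back; the server's optimize program updates
    #    only the touched rows.
    for row_id, grad in sparse_grad.items():
        table[row_id] -= lr * grad
    return loss

train_step([3, 14, 14, 159])
```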

## Design

Fluid large model distributed training uses the
[Distributed Transpiler](./parameter_server.md#distributed-transpiler) to split
a large parameter into multiple sliced parameters stored on Parameter Servers,
and the Trainer would prefetch them through an `RPC` interface.
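
Conceptually, the transpiler's output for one large parameter is a set of named
slices plus a placement map telling the Trainer which Parameter Server owns
which rows. The sketch below assumes an even row-range split; the real
transpiler's naming and placement policy may differ, and `plan_slices` is a
hypothetical helper.

```python
# Conceptual sketch of the slice layout for one large parameter: which named
# slice lives on which Parameter Server, and which global rows it owns.
# The even row-range split is an assumption for illustration.

def plan_slices(param_name, num_rows, pserver_endpoints):
    num_slices = len(pserver_endpoints)
    rows_per_slice = (num_rows + num_slices - 1) // num_slices  # ceiling division
    layout = []
    for i, endpoint in enumerate(pserver_endpoints):
        start = i * rows_per_slice
        end = min(start + rows_per_slice, num_rows)
        layout.append({
            "slice_name": "%s_%d" % (param_name, i),  # e.g. weight_0
            "endpoint": endpoint,                     # server that stores it
            "row_range": (start, end),                # global rows it owns
        })
    return layout

# A 10M-row embedding weight spread over three Parameter Servers.
for entry in plan_slices("weight", 10000000, ["ps0:6174", "ps1:6174", "ps2:6174"]):
    print(entry)
```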

### Split Large Parameter

<img src="src/split_parameter.png" width="400" />

**Distributed Transpiler** would split the large parameter (weight) into
several sliced parameters (weight_0, weight_1, weight_2), as shown in the
figure above.
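
As a small made-up example of the split in the figure, a 9 x 4 weight cut
row-wise into three equal slices looks like this (the even split is an
assumption for illustration):

```python
import numpy as np

# A toy 9 x 4 "large" weight split row-wise into three sliced parameters,
# mirroring weight_0 / weight_1 / weight_2 in the figure above.
weight = np.arange(36, dtype=np.float32).reshape(9, 4)
weight_0, weight_1, weight_2 = np.split(weight, 3, axis=0)

print(weight_0.shape)                     # (3, 4): rows 0-2 of the original
assert (weight_1[0] == weight[3]).all()   # global row 3 is local row 0 of weight_1
```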

### Prefetch Parameters from Parameter Servers

<img src="src/prefetch_parameters.png" width="400" />

- The `PrefetchRpc` operator would send the row indices to the corresponding
  Parameter Servers, and then receive the returned `SelectedRows`.
- Unlike normal Fluid distributed training, we only prefetch the rows needed
  by the current mini-batch `Ids` instead of the whole parameter (see the
  sketch below).
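
The toy sketch below mimics the shape of this exchange: the unique row indices
are grouped by an assumed range-based owner, each group is answered with a
`SelectedRows`-like pair of (row ids, values), and the trainer reassembles the
replies. The sharding rule and the `prefetch` helper are illustrative, not the
actual operator implementation.

```python
import numpy as np

# Toy stand-in for the prefetch exchange: three "servers" each hold one
# row-range slice of a 9 x 4 weight.
ROWS_PER_SLICE = 3
weight = np.arange(36, dtype=np.float32).reshape(9, 4)
shards = {i: weight[i * ROWS_PER_SLICE:(i + 1) * ROWS_PER_SLICE] for i in range(3)}

def prefetch(ids):
    unique_ids = sorted(set(ids))

    # Group the global row ids by the server that owns them (one RPC each).
    request = {}
    for row in unique_ids:
        request.setdefault(row // ROWS_PER_SLICE, []).append(row)

    # Each reply is SelectedRows-like: the requested row ids plus their values.
    replies = [(rows, shards[server][[r % ROWS_PER_SLICE for r in rows]])
               for server, rows in request.items()]

    # Reassemble into a lookup the forward pass can index by global row id.
    return {row: value for rows, values in replies for row, value in zip(rows, values)}

fetched = prefetch([7, 1, 1, 4])
assert (fetched[4] == weight[4]).all()
```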

## TODO

- Async Update

To avoid slow nodes, asynchronous update is important for distributed training;
we need a design doc for it and will implement it in the future.