# Design Doc: Prefetching Parameters from Parameter Server

## Abstract

We propose an approach to prefetch parameters from the Parameter Server during distributed training, so that Fluid can train a model whose parameters are too large to be stored in one trainer's memory.

## Background

For an embedding layer, the trainable parameter may be very large and may not fit in one trainer's memory. In Fluid distributed training, the [Distributed Transpiler](./parameter_server.md#distributed-transpiler) splits every large parameter into a number of small parameters stored on the Parameter Servers, so we can prefetch the needed rows of a parameter from the corresponding Parameter Server according to the input `Ids`.
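
To make the mapping concrete, here is a minimal sketch of how an input id could be routed to the Parameter Server that stores its row slice; the even row-slicing and the helper name `shard_for_id` are assumptions for illustration, not Fluid's actual API.

```python
# Hypothetical helper: map a global row id to the Parameter Server shard
# that stores it, assuming the parameter is split evenly by rows.
def shard_for_id(row_id, total_rows, num_pservers):
    """Return (pserver_index, local_row) for a row-sliced parameter."""
    rows_per_shard = (total_rows + num_pservers - 1) // num_pservers
    return row_id // rows_per_shard, row_id % rows_per_shard

# Example: a 10000-row embedding table split across 4 Parameter Servers.
print(shard_for_id(7321, total_rows=10000, num_pservers=4))  # (2, 2321)
```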

## Design

This is a feature of Fluid distributed training; you may want to read [Distributed Architecture](./distributed_architecture.md) and [Parameter Server](./parameter_server.md) before reading the following content.

### Partitioned Parameter

<img src="src/split_parameter.png" width="400" />
|
<img src="src/split_parameter.png" width="400" />
|
||||||
|
|
||||||
- **Distributed Transpiler** would split the large parameter (weight) into several partitioned parameters (weight_0, weight_1, weight_2), as the figure above shows.
- We could use `round-robin` to distribute the partitioned parameters across the Parameter Servers; see the sketch after this list.
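
A minimal sketch of round-robin placement, assuming partitioned parameters are assigned to Parameter Server endpoints in turn; the function name and endpoint values are illustrative, not Fluid's actual transpiler code.

```python
# Hypothetical sketch: assign partitioned parameters (weight_0, weight_1, ...)
# to Parameter Server endpoints in round-robin order.
def round_robin(partitioned_params, pserver_endpoints):
    placement = {}
    for i, name in enumerate(partitioned_params):
        placement[name] = pserver_endpoints[i % len(pserver_endpoints)]
    return placement

print(round_robin(["weight_0", "weight_1", "weight_2"],
                  ["127.0.0.1:6170", "127.0.0.1:6171"]))
# {'weight_0': '127.0.0.1:6170', 'weight_1': '127.0.0.1:6171',
#  'weight_2': '127.0.0.1:6170'}
```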

### Prefetching Parameters

<img src="src/prefetch_parameters.png" width="400" />
|
<img src="src/prefetch_parameters.png" width="400" />
|
||||||
|
|
||||||
- The `prefetch_rpc` operator prefetches the parameter rows from the different Parameter Servers according to the input `Ids`; we use [SelectedRows](../../../design/selected_rows.md) as the type of the received variable.
- The `merge_selected_rows` operator merges the received parameters into one `SelectedRows` variable; a sketch of this step follows the list.
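
As a rough illustration of the merge step, here is a minimal sketch that represents each server's reply as a (rows, values) pair and concatenates them; the actual `SelectedRows` type and operator are implemented in C++, so this Python is only conceptual.

```python
# Conceptual sketch: each Parameter Server replies with a SelectedRows-like
# result, i.e. the global row indices it served plus the row values.
def merge_selected_rows(partial_results):
    """partial_results: list of (rows, values) pairs, one per Parameter Server."""
    merged_rows, merged_values = [], []
    for rows, values in partial_results:
        merged_rows.extend(rows)
        merged_values.extend(values)
    return merged_rows, merged_values

# Example: two servers each return two prefetched embedding rows.
server_0 = ([3, 9], [[0.1, 0.2], [0.3, 0.4]])
server_1 = ([4, 7], [[0.5, 0.6], [0.7, 0.8]])
print(merge_selected_rows([server_0, server_1]))
# ([3, 9, 4, 7], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
```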

## TODO

- `prefetch_rpc` operator to send the row indices and receive `SelectedRows` variables.
- `lookup_table` needs to support the `SelectedRows` variable type as its input `Weight`; see the sketch below.
- Async update: to avoid slow nodes, asynchronous update is important for distributed training; we need a design doc for it and will implement it in the future.
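
To illustrate the `lookup_table` TODO item, here is a minimal sketch of a lookup over a SelectedRows-style weight that holds only the prefetched rows; the helper name `lookup` and the plain-list representation are assumptions for illustration.

```python
# Hypothetical sketch: a lookup over a SelectedRows-style weight. A global id
# is first mapped to its position in `rows`, then the embedding vector is
# read from `values`.
def lookup(ids, rows, values):
    row_position = {row: pos for pos, row in enumerate(rows)}
    return [values[row_position[i]] for i in ids]

rows = [3, 4, 7, 9]                      # prefetched global row indices
values = [[0.1], [0.5], [0.7], [0.3]]    # embedding vectors for those rows
print(lookup([9, 3], rows, values))      # [[0.3], [0.1]]
```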