commit b4d710ce12
@ -0,0 +1,70 @@

# Design Doc: Computations as a Graph

A primary goal of the refactorization of PaddlePaddle is a more flexible representation of deep learning computation, in particular, a graph of operators and variables, instead of sequences of layers as before.

This document explains the construction of a graph in three steps:

- construct the forward part
- construct the backward part
- construct the optimization part

## The Construction of a Graph

Let us take the problem of image classification as a simple example. The application program that trains the model looks like:

```python
x = layer.data("images")
l = layer.data("label")
y = layer.fc(x)
cost = layer.mse(y, l)
optimize(cost)
train(cost, reader=mnist.train())
```

### Forward Part

The first four lines of the above program build the forward part of the graph.

![](images/graph_construction_example_forward_only.png)

In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b, and the initialization operators.
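
A hedged Python sketch of what `y = layer.fc(x)` could append to the current `BlockDesc`; the `current_block`, `new_var`, `append_op`, keyword-style `OpDesc`, and `uniform_random` names here are illustrative assumptions, not the actual API:

```python
def fc(x):
    block = current_block()                # hypothetical helper returning the BlockDesc
    W = block.new_var("fc.W")              # parameter variables W and b
    b = block.new_var("fc.b")
    y = block.new_var("fc.out")            # output variable
    block.append_op(OpDesc(type="uniform_random", outputs=[W]))  # init op for W
    block.append_op(OpDesc(type="uniform_random", outputs=[b]))  # init op for b
    block.append_op(OpDesc(type="fc", inputs=[x, W, b], outputs=[y]))
    return y
```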

Initialization operators are a kind of "run-once" operator -- the `Run` method increments a class data member counter so that it runs at most once. This way, a parameter is not initialized repeatedly, say, in every minibatch.
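
A minimal sketch of this run-once behavior in Python (the real operators are C++; `scope` and `random_init` are illustrative placeholders):

```python
class InitOperator(object):
    def __init__(self, param_name):
        self.param_name = param_name
        self.run_count = 0                 # the counter mentioned above

    def run(self, scope):
        if self.run_count > 0:             # already ran once; skip in later minibatches
            return
        self.run_count += 1
        scope[self.param_name] = random_init()   # placeholder for the real initialization
```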

In this example, all operators are created as `OpDesc` protobuf messages, and all variables are `VarDesc`. These protobuf messages are saved in a `BlockDesc` protobuf message.

### Backward Part

The fifth line `optimize(cost)` calls two functions, `ConstructBackwardGraph` and `ConstructOptimizationGraph`.

`ConstructBackwardGraph` traverses the forward graph in the `BlockDesc` protobuf message and builds the backward part.

![](images/graph_construction_example_forward_backward.png)

According to the chain rule of gradient computation, `ConstructBackwardGraph` would (see the sketch after this list):

1. create a gradient operator G for each operator F,
1. make all inputs, outputs, and the gradients of the outputs of F inputs of G,
1. create gradients for all inputs of F, except for those that don't have gradients, like x and l, and
1. make all these gradients outputs of G.
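
A simplified Python sketch of this procedure, assuming hypothetical helpers: each forward `OpDesc` exposes `type`, `inputs`, and `outputs` as lists of names, and `grad_name(v)` returns the gradient variable name for `v`:

```python
def construct_backward_graph(block_desc, no_grad_vars=("x", "l")):
    grad_ops = []
    for fwd_op in reversed(block_desc.ops):            # walk the forward ops backwards
        grad_op = OpDesc(type=fwd_op.type + "_grad")   # the gradient operator G for F
        # Inputs of G: the inputs, outputs, and output gradients of F.
        grad_op.inputs = (list(fwd_op.inputs) + list(fwd_op.outputs) +
                          [grad_name(o) for o in fwd_op.outputs])
        # Outputs of G: gradients of F's inputs, except variables without gradients.
        grad_op.outputs = [grad_name(i) for i in fwd_op.inputs
                           if i not in no_grad_vars]
        grad_ops.append(grad_op)
    block_desc.ops.extend(grad_ops)                    # backward ops follow the forward ops
```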

### Optimization Part

For each parameter, like W and b created by `layer.fc` and marked as double circles in the graphs above, `ConstructOptimizationGraph` creates an optimization operator to apply its gradient. This results in the complete graph:

![](images/graph_construction_example_all.png)
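
A minimal sketch of `ConstructOptimizationGraph` under the same illustrative assumptions as above, with SGD as the assumed optimizer and `params(block_desc)` as a hypothetical helper that returns the parameter variables such as W and b:

```python
def construct_optimization_graph(block_desc, learning_rate=0.01):
    for param in params(block_desc):                   # e.g. ["W", "b"]
        sgd_op = OpDesc(type="sgd")
        sgd_op.inputs = [param, grad_name(param)]      # the parameter and its gradient
        sgd_op.outputs = [param]                       # the parameter is updated in place
        sgd_op.attrs = {"learning_rate": learning_rate}
        block_desc.ops.append(sgd_op)                  # optimization ops come last
```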

## Block and Graph

The words block and graph are interchangeable in the design of PaddlePaddle. A [Block](https://github.com/PaddlePaddle/Paddle/pull/3708) is a metaphor for the code and local variables in a pair of curly braces in programming languages, where operators are like statements or instructions. A graph of operators and variables is a representation of the block.

A Block keeps operators in an array `BlockDesc::ops`

```protobuf
message BlockDesc {
  repeated OpDesc ops = 1;
  repeated VarDesc vars = 2;
}
```

in the order that they appear in user programs, like the Python program at the beginning of this article. We can imagine that in `ops`, we have some forward operators, followed by some gradient operators, and then some optimization operators.
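
For the example program above, the resulting ordering might look like the following hedged illustration (the operator names are made up for readability):

```python
ops = [
    "feed_x", "feed_l", "init_W", "init_b", "fc", "mse",   # forward part
    "mse_grad", "fc_grad",                                 # backward part
    "sgd_W", "sgd_b",                                      # optimization part
]
```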
@ -0,0 +1,11 @@

cat ./graph_construction_example.dot | \
    sed 's/color=red/color=red, style=invis/g' | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_only.png

cat ./graph_construction_example.dot | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_backward.png

cat ./graph_construction_example.dot | \
    dot -Tpng > graph_construction_example_all.png
@ -0,0 +1,69 @@

digraph ImageClassificationGraph {
  ///////// The forward part /////////
  FeedX [label="Feed", color=blue, shape=box];
  FeedY [label="Feed", color=blue, shape=box];
  InitW [label="Init", color=blue, shape=diamond];
  Initb [label="Init", color=blue, shape=diamond];
  FC [label="FC", color=blue, shape=box];
  MSE [label="MSE", color=blue, shape=box];

  x [label="x", color=blue, shape=oval];
  l [label="l", color=blue, shape=oval];
  y [label="y", color=blue, shape=oval];
  W [label="W", color=blue, shape=doublecircle];
  b [label="b", color=blue, shape=doublecircle];
  cost [label="cost", color=blue, shape=oval];

  FeedX -> x -> FC -> y -> MSE -> cost [color=blue];
  FeedY -> l [color=blue];
  InitW -> W [color=blue];
  Initb -> b [color=blue];
  W -> FC [color=blue];
  b -> FC [color=blue];
  l -> MSE [color=blue];

  ////////// The backward part /////////
  MSE_Grad [label="MSE_grad", color=red, shape=box];
  FC_Grad [label="FC_grad", color=red, shape=box];

  d_cost [label="d cost", color=red, shape=oval];
  d_y [label="d y", color=red, shape=oval];
  d_b [label="d b", color=red, shape=oval];
  d_W [label="d W", color=red, shape=oval];

  cost -> MSE_Grad [color=red];
  d_cost -> MSE_Grad [color=red];
  x -> MSE_Grad [color=red];
  l -> MSE_Grad [color=red];
  y -> MSE_Grad -> d_y [color=red];

  x -> FC_Grad [color=red];
  y -> FC_Grad [color=red];
  d_y -> FC_Grad [color=red];
  W -> FC_Grad -> d_W [color=red];
  b -> FC_Grad -> d_b [color=red];

  ////////// The optimization part //////////

  OPT_W [label="SGD", color=green, shape=box];
  OPT_b [label="SGD", color=green, shape=box];

  W -> OPT_W [color=green];
  b -> OPT_b [color=green];
  d_W -> OPT_W -> W [color=green];
  d_b -> OPT_b -> b [color=green];

  ////////// Groupings //////////

  subgraph clusterMSE {
    style=invis;
    MSE;
    MSE_Grad;
  }

  subgraph clusterFC {
    style=invis;
    FC;
    FC_Grad;
  }
}
@ -0,0 +1,106 @@

# Design Doc: Operation Graph Based Parameter Server

## Abstract

We propose an approach to implement the parameter server. In this approach, there is no fundamental difference between the trainer and the parameter server: they both run subgraphs, but subgraphs of different purposes.

## Background

The previous implementations of the parameter server do not run a subgraph. Parameter initialization, optimizer computation, network communication, and checkpointing are implemented twice, on both the trainer and the parameter server.

It would be great if we could write the code once and use it on both the trainer and the parameter server: this reduces code duplication and improves extensibility. Given that after the current refactoring we represent everything as a computing graph on the trainer, representing everything as a computing graph on the parameter server becomes a natural extension.

## Design

### Graph Converter

The *graph converter* converts the user-defined operation (OP) graph into subgraphs to be scheduled on different nodes with the following steps:

1. OP placement: the OPs will be placed on different nodes according to a heuristic that minimizes the estimated total computation time. Currently we use a simple heuristic that puts parameter variables on parameter server workers and everything else on trainer workers.

1. Add communication OPs to enable the communication between nodes.

   We will need these OPs: *Send*, *Recv*, *Enqueue*, *Dequeue*.

Below is an example of converting the user-defined graph to the subgraphs for the trainer and the parameter server:

<img src="src/local-graph.png" width="300"/>

After converting:

<img src="src/dist-graph.png" width="700"/>

1. The parameter variable W and its optimizer subgraph are placed on the parameter server.
1. Operators are added to the subgraphs (see the sketch after this list).
   - *Send* sends data to the connected *Recv* operator. The scheduler on the receiving node will only schedule the *Recv* operator to run when the *Send* operator has run (the *Send* OP will mark the *Recv* OP runnable automatically).
   - *Enqueue* enqueues the input variable; it can block until space becomes available in the queue.
   - *Dequeue* outputs a configurable number of tensors from the queue. It will block until the queue has the required number of tensors.
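
A hedged sketch of the conversion under the simple placement heuristic above; the `is_optimizer_op` predicate, the `graph.parameters` list, and the OP constructors are illustrative assumptions, not the actual converter API:

```python
def convert(graph):
    trainer_ops, pserver_ops = [], []

    # Step 1, OP placement: parameters and their optimizer subgraph go to the
    # parameter server, everything else stays on the trainer.
    for op in graph.ops:
        (pserver_ops if is_optimizer_op(op) else trainer_ops).append(op)

    # Step 2, communication OPs: the trainer sends gradients; the parameter
    # server queues them, runs the optimizer, and sends updated parameters back.
    for param in graph.parameters:                       # e.g. ["W"]
        grad = param + "@GRAD"
        trainer_ops.append(SendOp(inputs=[grad]))
        pserver_ops.insert(0, RecvOp(outputs=[grad]))
        pserver_ops.insert(1, EnqueueOp(inputs=[grad]))  # may block if the queue is full
        pserver_ops.insert(2, DequeueOp(outputs=[grad])) # blocks until enough tensors arrive
        pserver_ops.append(SendOp(inputs=[param]))
        trainer_ops.append(RecvOp(outputs=[param]))
    return trainer_ops, pserver_ops
```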

### Benefits

- Model parallelism becomes easier to implement: it is an extension to the trainer - parameter server approach. We already have the communication OPs, but need to extend the graph converter's placement functionality.

- A user-defined optimizer is easier to add: the user can now express it as a subgraph.

- No more duplicated logic inside the trainer and the parameter server, as mentioned in the background section.

### Challenges

- It might be hard for the graph converter to cut a general graph (without any hint for which subgraph is the optimizer). We may need to label which subgraph inside the OP graph is the optimizer.

- It's important to balance the parameter shards across multiple parameter servers. If a single parameter is very big (some word-embedding, fully connected, or softmax layers), we need to automatically partition the single parameter onto different parameter servers when possible (only element-wise optimizers depend on the parameter variable).

### Discussion

- In the "Async SGD" figure, the "W" variable on the parameter server could be read and written concurrently; what is our locking strategy? E.g., does each variable have a lock cpp method to be invoked by every OP, or do we have a lock OP?

- Can the Enqueue OP be implemented under our current tensor design (putting the input tensor into the queue tensor)?

- The *Dequeue* OP will have a variable number of outputs (depending on the `min_count` attribute); does our current design support it? (A similar question applies to the *Add* OP.)

### References

[1] [TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf)
@ -0,0 +1,124 @@

## Background

PaddlePaddle divides the description of a neural network computation graph into two stages: compile time and runtime.

PaddlePaddle uses proto messages to describe the compile-time graph because:

1. The computation graph should be able to be saved to a file.
1. In distributed training, the graph will be serialized and sent to multiple workers.

The computation graph is constructed from Data Nodes and Operation Nodes. The concepts that represent them are shown in the table below.

| |compile time|runtime|
|---|---|---|
|Data|VarDesc(proto)|Variable(cpp)|
|Operation|OpDesc(proto)|Operator(cpp)|

## Definition of VarDesc

A VarDesc should have a name and a value; in PaddlePaddle, the value will always be a tensor. Since we use LoDTensor most of the time, we add a LoDTensorDesc to represent it.

```proto
message VarDesc {
    required string name = 1;
    optional LoDTensorDesc lod_tensor = 2;
}
```

## Definition of LoDTensorDesc

```proto
enum DataType {
    BOOL = 0;
    INT16 = 1;
    INT32 = 2;
    INT64 = 3;
    FP16 = 4;
    FP32 = 5;
    FP64 = 6;
}

message LoDTensorDesc {
    required DataType data_type = 1;
    repeated int32 dims = 2; // [UNK, 640, 480] is saved as [-1, 640, 480]
    optional int32 lod_level = 3 [default=0];
}
```
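
As a hedged illustration, the `image` Variable used later in this document (a batch of 640x480 FP32 images with an unknown batch size) could be described with the Python wrappers that appear below:

```python
# Hypothetical construction of the compile-time description for "image".
image_tensor = LoDTensorDesc(data_type=FP32, dims=[-1, 640, 480], lod_level=0)
image_desc = VarDesc(name="image", lod_tensor=image_tensor)
```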

## Definition of Variable in Python

In the Python API, a layer takes Variables as input and returns Variables as output. There should be a class `Variable` in Python to help create and manage Variables.

```python
image = Variable(dims=[-1, 640, 480])
# fc1 and fc2 are both Variable
fc1 = layer.fc(input=image, output_size=10)
fc2 = layer.fc(input=fc1, output_size=20)
```

### What Should Class `Variable` Have

1. `name`. A name of string type is used to mark the value of the Variable.
1. `initializer`. Since our Tensor does not have a value, we will always use some Operator to fill it at runtime. So we should have an initialize method to help add the init operator.
1. `operator`. A Variable should record which operator produces it. The reason is:
   - We use `pd.eval(targets=[var1, var2])` to run the related ops to get the values of var1 and var2. `var.op` is used to trace the dependencies of the current variable (a sketch follows this list).
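
A hedged sketch of how `pd.eval` could use the recorded `var.op` to collect the operators it needs to run, assuming each operator also records the Variables it takes as inputs:

```python
def collect_ops(targets):
    ops, visited = [], set()

    def visit(var):
        if var in visited or var.op is None:   # data/feed variables have no producer
            return
        visited.add(var)
        for inp in var.op.inputs:              # assumed: the op records its input Variables
            visit(inp)
        ops.append(var.op)                     # post-order: producers run before consumers

    for target in targets:
        visit(target)
    return ops
```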

In PaddlePaddle, we use a Block to describe the computation graph, so in the code we will use Block rather than Graph.

```python
import VarDesc
import LoDTensorDesc
import framework

def AddInitialOperator(variable, initializer):
    # add an initialize Operator to the block to init this Variable
    pass

class Variable(object):
    def __init__(self, name, dims, type, initializer):
        self._block = get_default_block()
        self._name = name
        self.op = None

        tensor_desc = LoDTensorDesc(data_type=type, dims=dims)
        _var_desc = VarDesc(name=name, lod_tensor=tensor_desc)
        self._var = framework.CreateVar(_var_desc)
        self._block.add_var(self)

        # add the initial op according to the initializer
        if initializer is not None:
            AddInitialOperator(self, initializer)

    def dims(self):
        return self._var.dims()

    def data_type(self):
        return self._var.data_type()

    def to_proto(self):
        pass
```

Then we can use this Variable to create a fc layer in Python.

```python
import paddle as pd

def flatten_size(X, num_flatten_dims):
    # product of the last num_flatten_dims dimensions of X
    prod = 1
    for i in xrange(num_flatten_dims):
        prod = prod * X.dims[-i - 1]
    return prod

def layer.fc(X, output_size, num_flatten_dims):
    W = Variable(pd.random_uniform(), type=FP32, dims=[flatten_size(X, num_flatten_dims), output_size])
    b = Variable(pd.random_uniform(), type=FP32, dims=[output_size])
    out = Variable(type=FP32)
    y = operator.fc(X, W, b, output=out)  # the fc op writes its output into out
    pd.InferShape(y)
    return out

x = Variable(dims=[-1, 640, 480])
y = layer.fc(x, output_size=100)
z = layer.fc(y, output_size=200)

pd.eval(targets=[z], ...)
print(z)
```