Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dropout.update-doc-pybind
commit 963a4f3c4e

@@ -0,0 +1,99 @@

# Design Doc: Functions, Operators, and Layers

In a DL system, we can compose one or more fine-grained operators into a coarse-grained one. For example, the FC layer can be composed of a multiplication operator and an add operator.

Historically, some fine-grained operations are known as operators, and some coarse-level ones are known as layers. But we need a well-defined separation.

In general, operators are very fine-grained operations, e.g., mul and add. In the implementation, we can write them as C++ functions:

```c++
template <typename T> T add(T x, T y) { return x + y; }
template <typename T> T mul(T x, T y) { return x * y; }
```

Then we can wrap them into operators, which are C++ classes that can be created from Python bindings by name. A C macro can do this. For example, the macro invocation

```c++
MAKE_FUNCTION_OPERATOR(mul);
```

generates

```c++
template <typename T> class mulOp : public OperatorBase {...};
REGISTER_OP(mulOp<float32>, "mul");
```

so that in Python we can create operator mul by:

```python
X1 = Var()
X2 = Var()
Y = Var()
paddle.cpp.create_operator("mul", input=[X1, X2], output=Y)
```

At the same time, we can compose a coarse-level C++ operator class by composing the functions `mul` and `add`:

```c++
template <typename T>
class FCOp : public OperatorBase {
 public:
  void Run(...) {
    add(mul(Input<T>("X"), Input<T>("W")), Input<T>("b"));
  }
};
REGISTER_OP(FCOp, "fc");
```

We need to support such composition in Python as well. To do so, we need a higher-level Python wrapping of operator creation than `paddle.cpp.create_operator`. This higher-level operator API should be compatible with the layer API.

Let's explain using an example. Suppose that we are going to compose the FC layer using `mul` and `add` in Python; we'd like to have Python functions `mul` and `add` defined in module `operator`:

```python
def operator.mul(X1, X2):
    O = Var()
    paddle.cpp.create_operator("mul", input=[X1, X2], output=O)
    return O

def operator.add(X1, X2):
    O = Var()
    paddle.cpp.create_operator("add", input=[X1, X2], output=O)
    return O
```
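
These per-operator wrappers are not meant to be written by hand. As a minimal sketch (not the actual PaddlePaddle implementation), assuming a hypothetical `paddle.cpp.get_all_op_names()` that returns the names registered via `REGISTER_OP`, the wrappers in `paddle.operator` could be generated roughly like this:

```python
import paddle

def _make_wrapper(op_name):
    """Build a thin Python wrapper that creates one C++ operator by name."""
    def wrapper(*inputs):
        O = Var()  # mirrors the Var() used in the snippets above
        paddle.cpp.create_operator(op_name, input=list(inputs), output=O)
        return O
    wrapper.__name__ = op_name
    return wrapper

# Attach one wrapper per registered operator to paddle.operator, so that
# paddle.operator.mul, paddle.operator.add, etc. exist without hand-written code.
for name in paddle.cpp.get_all_op_names():  # hypothetical listing API
    setattr(paddle.operator, name, _make_wrapper(name))
```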

Wrapper functions like `operator.mul` and `operator.add` above are generated automatically. Given them, users can define

```python
def layer.fc(X):
    W = Var()
    b = Var()
    return operator.add(operator.mul(X, W), b)
```

If we didn't have `operator.mul` and `operator.add`, the definition of `layer.fc` would be complicated:

```python
def layer.fc(X):
    W = Var()
    b = Var()
    O1 = Var()
    paddle.cpp.create_operator("mul", input=[X, W], output=O1)
    O2 = Var()
    paddle.cpp.create_operator("add", input=[O1, b], output=O2)
    return O2
```

We'd like to have Python bindings to operators in package `paddle.operator`, and Python compositions of operators in package `paddle.layer`. So we have the following concepts in the above illustrative example:

```
| C++ functions/functors | mul          | add          |             |          |
| C++ operator class     | mulOp        | addOp        | FCOp        |          |
| Python binding         | operator.mul | operator.add | operator.fc |          |
| Python function        |              |              |             | layer.fc |
```

This is how we differentiate layers and operators in PaddlePaddle:

- those defined in C++ that have a lightweight Python wrapper in module `operators` are operators; whereas
- those that don't have a C++ implementation but are Python compositions of C++ operators are known as layers.

@@ -0,0 +1,51 @@

# Design Doc: Computations as Graphs

A primary goal of the refactorization of PaddlePaddle is a more flexible representation of deep learning computation, in particular, a graph of operators and variables, instead of sequences of layers as before.

This document explains the construction of a graph as three steps:

- construct the forward part
- construct the backward part
- construct the optimization part

Let us take the problem of image classification as a simple example. The application program that trains the model looks like:

```python
x = layer.data("images")
l = layer.data("label")
y = layer.fc(x)
cost = layer.mse(y, l)
optimize(cost)
train(cost, reader=mnist.train())
```

### Forward Part

The first four lines of the above program build the forward part of the graph.



In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b.

In this example, all operators are created as `OpDesc` protobuf messages, and all variables are `VarDesc`. These protobuf messages are saved in a `BlockDesc` protobuf message.
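
For intuition only, the forward block of this example might record something along the following lines; the field names below are illustrative and do not match the actual `OpDesc`/`VarDesc`/`BlockDesc` schema:

```python
# Illustrative sketch of the information the forward BlockDesc would hold.
# Field names are made up for readability, not the real protobuf schema.
forward_block = {
    "vars": ["x", "l", "W", "b", "y", "cost"],            # one VarDesc per variable
    "ops": [                                              # one OpDesc per operator, in order
        {"type": "feed", "inputs": [],              "outputs": ["x"]},
        {"type": "feed", "inputs": [],              "outputs": ["l"]},
        {"type": "fc",   "inputs": ["x", "W", "b"], "outputs": ["y"]},
        {"type": "mse",  "inputs": ["y", "l"],      "outputs": ["cost"]},
    ],
}
```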

### Backward Part

The fifth line `optimize(cost)` calls two functions, `ConstructBackwardGraph` and `ConstructOptimizationGraph`.

`ConstructBackwardGraph` traverses the forward graph in the `BlockDesc` protobuf message and builds the backward part.



According to the chain rule of gradient computation, `ConstructBackwardGraph` would carry out the following steps (sketched in code after the list):

1. create a gradient operator G for each operator F,
1. make all inputs, outputs, and outputs' gradients of F into inputs of G,
1. create gradients for all inputs of F, except for those that don't have gradients, like x and l, and
1. make all these gradients outputs of G.
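
A minimal Python sketch of these four steps, operating on the illustrative block structure shown earlier (the `grad_name` helper and the `NO_GRADIENT` set are hypothetical, and the top-level gradient such as d_cost is assumed to be seeded separately):

```python
NO_GRADIENT = {"x", "l"}          # data variables have no gradient

def grad_name(var):
    return "d_" + var             # e.g. "W" -> "d_W"

def construct_backward_graph(block):
    for op in reversed(list(block["ops"])):      # walk the forward operators in reverse
        # steps 3 and 4: gradients of F's inputs, skipping variables without gradients
        grad_outputs = [grad_name(v) for v in op["inputs"] if v not in NO_GRADIENT]
        if not grad_outputs:                     # e.g. the feed operators
            continue
        block["ops"].append({
            "type": op["type"] + "_grad",        # step 1: a gradient operator G for F
            # step 2: inputs, outputs, and output gradients of F become inputs of G
            "inputs": op["inputs"] + op["outputs"]
                      + [grad_name(v) for v in op["outputs"]],
            "outputs": grad_outputs,
        })
        block["vars"].extend(grad_outputs)

construct_backward_graph(forward_block)          # appends mse_grad and fc_grad OpDescs
```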

### Optimization Part

For each parameter, like W and b created by `layer.fc` and marked as double circles in the above graphs, `ConstructOptimizationGraph` creates an optimization operator to apply its gradient. This results in the complete graph:



@@ -0,0 +1,59 @@

IfOp should have only one branch. An IfOp takes a `cond` variable whose value must be a vector of N boolean elements. Its return value has M (M <= N) instances, each corresponding to a true element in `cond`.

```python
import paddle as pd

x = var()
y = var()
cond = var()

b = pd.create_ifop(inputs=[x], output_num=1)
with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```
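
To make the selection semantics concrete, here is a small NumPy sketch (not the PaddlePaddle API): with N = 4 instances and two true elements in `cond`, the true block sees, and the IfOp returns, M = 2 instances.

```python
import numpy as np

# NumPy sketch of the IfOp gathering semantics, not the PaddlePaddle API.
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])                  # a minibatch of N = 4 instances
cond = np.array([True, False, True, False])

true_input = x[cond]                        # the M = 2 instances fed to the true block
out = true_input * 10.0                     # stand-in for the true block's computation
print(out.shape)                            # (2, 2): one row per true element of cond
```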

If we want the output to still have N instances, we can use IfElseOp with a default value, whose minibatch size must be N:

```python
import paddle as pd

x = var()
y = var()
cond = var()
default_value = var()
b = pd.create_ifelseop(inputs=[x], output_num=1)
with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

with b.false_block():
    x = b.inputs(0)
    z = layer.fc(x)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```
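
The following NumPy sketch (again, not the PaddlePaddle API) shows how the two branches' results could be scattered back so that the output keeps all N instances:

```python
import numpy as np

# NumPy sketch of IfElseOp scattering, not the PaddlePaddle API: each branch
# computes on its own subset of instances, and the results are merged back
# into an output with all N instances.
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])                  # N = 4 instances
cond = np.array([True, False, True, False])

true_out = x[cond] * 10.0                   # stand-in for the true block
false_out = x[~cond] * -1.0                 # stand-in for the false block

out = np.empty_like(x)                      # the merged output has N instances
out[cond] = true_out
out[~cond] = false_out
```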

If only true_block is set in an IfElseOp, we can give a default value for the false branch:

```python
import paddle as pd

x = var()
y = var()
cond = var()
default_value = var()
b = pd.create_ifelseop(inputs=[x], output_num=1, default_value=default_value)

with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```

where `default_value` is a list of vars used for the instances where `cond` is false.

@@ -0,0 +1,11 @@

cat ./graph_construction_example.dot | \
    sed 's/color=red/color=red, style=invis/g' | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_only.png

cat ./graph_construction_example.dot | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_backward.png

cat ./graph_construction_example.dot | \
    dot -Tpng > graph_construction_example_all.png

@@ -0,0 +1,65 @@

digraph ImageClassificationGraph {
  ///////// The forward part /////////
  FeedX [label="Feed", color=blue, shape=box];
  FeedY [label="Feed", color=blue, shape=box];
  FC [label="FC", color=blue, shape=box];
  MSE [label="MSE", color=blue, shape=box];

  x [label="x", color=blue, shape=oval];
  l [label="l", color=blue, shape=oval];
  y [label="y", color=blue, shape=oval];
  W [label="W", color=blue, shape=doublecircle];
  b [label="b", color=blue, shape=doublecircle];
  cost [label="cost", color=blue, shape=oval];

  FeedX -> x -> FC -> y -> MSE -> cost [color=blue];
  FeedY -> l [color=blue];
  W -> FC [color=blue];
  b -> FC [color=blue];
  l -> MSE [color=blue];

  ////////// The backward part /////////
  MSE_Grad [label="MSE_grad", color=red, shape=box];
  FC_Grad [label="FC_grad", color=red, shape=box];

  d_cost [label="d cost", color=red, shape=oval];
  d_y [label="d y", color=red, shape=oval];
  d_b [label="d b", color=red, shape=oval];
  d_W [label="d W", color=red, shape=oval];

  cost -> MSE_Grad [color=red];
  d_cost -> MSE_Grad [color=red];
  x -> MSE_Grad [color=red];
  l -> MSE_Grad [color=red];
  y -> MSE_Grad -> d_y [color=red];

  x -> FC_Grad [color=red];
  y -> FC_Grad [color=red];
  d_y -> FC_Grad [color=red];
  W -> FC_Grad -> d_W [color=red];
  b -> FC_Grad -> d_b [color=red];

  ////////// The optimization part //////////

  OPT_W [label="SGD", color=green, shape=box];
  OPT_b [label="SGD", color=green, shape=box];

  W -> OPT_W [color=green];
  b -> OPT_b [color=green];
  d_W -> OPT_W -> W [color=green];
  d_b -> OPT_b -> b [color=green];

  ////////// Groupings //////////

  subgraph clusterMSE {
    style=invis;
    MSE;
    MSE_Grad;
  }

  subgraph clusterFC {
    style=invis;
    FC;
    FC_Grad;
  }
}