commit e9ac7df941
@ -0,0 +1,15 @@
#!/bin/bash
set -e

readonly VERSION="3.8"

version=$(clang-format -version)

if ! [[ $version == *"$VERSION"* ]]; then
    echo "clang-format version check failed."
    echo "a version containing '$VERSION' is needed, but got '$version'"
    echo "please install the right version and make a soft link to it in a directory on your '\$PATH'"
    exit 1
fi

clang-format "$@"
@ -1,14 +0,0 @@
ABOUT
=======

PaddlePaddle is an easy-to-use, efficient, flexible and scalable deep learning platform,
which was originally developed by Baidu scientists and engineers for the purpose of applying deep learning to many products at Baidu.

PaddlePaddle is now open source but far from complete, and it is intended to be built upon, improved, scaled, and extended.
We hope to build an active open source community, both by providing feedback and by actively contributing to the source code.


Credits
--------

We owe many thanks to `all contributors and developers <https://github.com/PaddlePaddle/Paddle/graphs/contributors>`_ of PaddlePaddle!
@ -0,0 +1,101 @@
# Analysis of large model distributed training in Paddle

***NOTE: These are only some notes on how we implemented this scheme in V1, not a new design.***

## What is it

We often encounter cases where the (sparse) embedding layer parameters are so large that we cannot store them in the trainer's memory during training. So we need to put them on several parameter servers, and fetch them row by row instead of fetching all of the parameters.

## How to use

Specify command-line arguments like `--loadsave_parameters_in_pserver=true --ports_num_for_sparse=1 --use_old_updater=1` when starting the paddle trainer, and also add something like `--ports_num_for_sparse=1 --pserver_num_threads=5` when starting the pserver processes.

Accordingly, configure your embedding layers like:

```python
SPARSE_REMOTE=True

w1 = data_layer(name="w1", size=dict_size)
emb1 = embedding_layer(input=w1, size=32, param_attr=ParameterAttribute(sparse_update=SPARSE_REMOTE))
w2 = data_layer(name="w2", size=dict_size)
emb2 = embedding_layer(input=w2, size=32, param_attr=ParameterAttribute(sparse_update=SPARSE_REMOTE))
...
```

## Implementation details

```c++
enum MatType {
  MAT_NORMAL,
  MAT_NORMAL_SHARED,
  MAT_VALUE_SHARED,
  MAT_SPARSE_ROW_IDS,
  MAT_SPARSE_ROW_AUTO_GROW,
  MAT_CACHE_ROW,
  MAT_SPARSE_ROW,
  MAT_SPARSE_ROW_PREFETCH,
  MAT_SPARSE_ROW_PREFETCH_FULL_SIZE,
};
```

`MAT_SPARSE_ROW_PREFETCH` is what we use when configured to fetch only rows of the matrix when training.

In `trainer_internal.cpp:L93 trainOneBatch`:

```c++
  if (config_->getOptConfig().use_sparse_remote_updater()) {
    REGISTER_TIMER("prefetch");
    gradientMachine_->prefetch(inArgs);
    parameterUpdater_->getParametersRemote();
  }
```

When doing the actual network forward and backward passes, at the beginning of each batch, the trainer will try to download one row of data from the pserver.

In `trainer/RemoteParameterUpdater.cpp`: `parameterUpdater_->getParametersRemote();`:

```c++
if (fullSize) {
  ...
} else {
  getParams = [&] {
    parameterClient_->getParameterSparse(
        /* recvParameterType= */ PARAMETER_VALUE, sendBackParameterType);
  };
  applyL1 = [](Parameter& para, real decayRate) {
    para.getMat(PARAMETER_VALUE)->applyL1(/*lr=*/1.0f, decayRate);
  };
}
```

Calling `parameterClient_->getParameterSparse` will make a remote call to the pserver's `getParameterSparse`:

```c++
void ParameterServer2::getParameterSparse(const SendParameterRequest& request,
                                          std::vector<Buffer>& inputBuffers,
                                          SendParameterResponse* response,
                                          std::vector<Buffer>* outputBuffers) {
  (void)inputBuffers;
  auto& buffer = *readWriteBuffer_;
  size_t numReals = 0;
  for (const auto& block : request.blocks()) {
    numReals += getParameterConfig(block).dims(1);
  }
  buffer.resize(numReals);

  VLOG(3) << "pserver: getParameterSparse, numReals=" << numReals;

  ReadLockGuard guard(parameterMutex_);
  size_t offset = 0;
  for (const auto& block : request.blocks()) {
    size_t width = getParameterConfig(block).dims(1);
    Buffer buf = {buffer.data() + offset, width};
    int type = request.send_back_parameter_type();
    sendBackParameterSparse(block, type, response, &buf, width, outputBuffers);
    offset += width;
  }
}
```

`getParameterConfig(block).dims(1)` returns the width of the current "parameter block" (a shard of the parameter object), and the `getParameterSparse` remote call then returns only one row of data to the client.
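
To make the row-by-row fetch concrete, here is a minimal, purely illustrative Python sketch. It is not the Paddle API: `prefetch_rows` and the fake pserver callback are hypothetical names. The idea is simply that the client collects the row ids appearing in the current minibatch and asks the server for only those rows.

```python
import numpy as np

def prefetch_rows(batch_word_ids, fetch_rows_from_pserver):
    """Conceptual client-side prefetch: fetch only the embedding rows that
    appear in the current minibatch instead of the full table.
    `fetch_rows_from_pserver` stands in for the remote call and is assumed
    to return a {row_id: row_vector} dict for the requested ids."""
    needed_ids = sorted(set(batch_word_ids))      # unique row ids in this batch
    rows = fetch_rows_from_pserver(needed_ids)    # remote call, rows only
    return {i: rows[i] for i in needed_ids}

# usage with a fake "server" holding a (dict_size x 32) embedding table
table = np.random.rand(10000, 32)
fake_pserver = lambda ids: {i: table[i] for i in ids}
local_rows = prefetch_rows([3, 7, 3, 42], fake_pserver)
print(sorted(local_rows.keys()))   # [3, 7, 42]
```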
Binary image file changed (55 KiB before, 49 KiB after); not shown.
@ -0,0 +1,100 @@
# Design Doc: Functions, Operators, and Layers

In a DL system, we can compose one or more fine-grained operators into a coarse-grained one. For example, the FC layer can be composed of a multiplication operator and an add operator.

Historically, some fine-grained operations are known as operators, and some coarse-level ones are known as layers. But we need a well-defined separation.

In general, operators are very fine-grained operations, e.g., mul and add. In the implementation, we can write them as C++ functions:

```c++
template <typename T> T add(T x, T y) { return x + y; }
template <typename T> T mul(T x, T y) { return x * y; }
```

Then we can wrap them into operators, which are C++ classes and can be created from Python bindings by name. A C macro can do this. For example, the following macro invocation

```c++
MAKE_FUNCTION_OPERATOR(mul);
```

generates

```c++
template <typename T> class mulOp : public OperatorBase {...};
REGISTER_OP(mulOp<float32>, "mul");
```

so that in Python we can create the operator `mul` by:

```python
X1 = Var()
X2 = Var()
Y = Var()
paddle.cpp.create_operator("mul", input=[X1, X2], output=Y)
```

At the same time, we can compose a coarse-level C++ operator class from the functions `mul` and `add`:

```c++
template <typename T>
class FCOp : public OperatorBase {
 public:
  void Run(...) {
    add(mul(Input<T>("X"), Input<T>("W")), Input<T>("b"));
  }
};
REGISTER_OP(FCOp, "fc");
```

We need to support such composition in Python as well. To do so, we need a higher-level Python wrapping of operator creation than `paddle.cpp.create_operator`. This higher-level operator API should be compatible with the layer API.

Let's explain using an example. Suppose that we are going to compose the FC layer using `mul` and `add` in Python; we'd like to have Python functions `mul` and `add` defined in module `operator`:

```python
def operator.mul(X1, X2):
    O = Var()
    paddle.cpp.create_operator("mul", input=[X1, X2], output=O)
    return O

def operator.add(X1, X2):
    O = Var()
    paddle.cpp.create_operator("add", input=[X1, X2], output=O)
    return O
```

The above code snippets are automatically generated. Given them, users can define

```python
def layer.fc(X):
    W = Var()
    b = Var()
    return operator.add(operator.mul(X, W), b)
```

If we didn't have `operator.mul` and `operator.add`, the definition of `layer.fc` would be complicated:

```python
def layer.fc(X):
    W = Var()
    b = Var()
    O1 = Var()
    paddle.cpp.create_operator("mul", input=[X, W], output=O1)
    O2 = Var()
    paddle.cpp.create_operator("add", input=[O1, b], output=O2)
    return O2
```

We'd like to have Python bindings to operators in package `paddle.operator`, and Python compositions of operators in package `paddle.layer`. So we have the following concepts in the above illustrative example:

| C++ functions/functors | mul          | add          |             |          |
|------------------------|--------------|--------------|-------------|----------|
| C++ operator class     | mulOp        | addOp        | FCOp        |          |
| Python binding         | operator.mul | operator.add | operator.fc |          |
| Python function        |              |              |             | layer.fc |

This is how we differentiate layers and operators in PaddlePaddle:

- those defined in C++ and having a lightweight Python wrapper in module `operator` are operators; whereas
- those that don't have a C++ implementation but are implemented in Python by composing C++ operators are known as layers.
@ -0,0 +1,70 @@
# Design Doc: Computations as a Graph

A primary goal of the refactorization of PaddlePaddle is a more flexible representation of deep learning computation, in particular, a graph of operators and variables, instead of sequences of layers as before.

This document explains the construction of such a graph in three steps:

- construct the forward part
- construct the backward part
- construct the optimization part

## The Construction of a Graph

Let us take the problem of image classification as a simple example. The application program that trains the model looks like:

```python
x = layer.data("images")
l = layer.data("label")
y = layer.fc(x)
cost = layer.mse(y, l)
optimize(cost)
train(cost, reader=mnist.train())
```

### Forward Part

The first four lines of the above program build the forward part of the graph.

![](images/graph_construction_example_forward_only.png)

In particular, the first line `x = layer.data("images")` creates variable x and a Feed operator that copies a column from the minibatch to x. `y = layer.fc(x)` creates not only the FC operator and output variable y, but also two parameters, W and b, and their initialization operators.

Initialization operators are a kind of "run-once" operator -- the `Run` method increments a class data member counter so that it runs at most once. By doing so, a parameter wouldn't be initialized repeatedly, say, in every minibatch.
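
As an illustration only (not the actual C++ operator class), the run-once behavior can be sketched in Python like this; `InitOp`, `Run`, and the `init_fn` callback are hypothetical names:

```python
class InitOp:
    """Sketch of a "run-once" initialization operator: Run() does real work
    only the first time it is called, so a parameter is not re-initialized
    on every minibatch."""
    def __init__(self, init_fn):
        self.init_fn = init_fn
        self.run_count = 0            # the counter mentioned above

    def Run(self, params):
        if self.run_count == 0:       # run at most once
            self.init_fn(params)
        self.run_count += 1

# usage: the second Run() is a no-op
init_W = InitOp(lambda p: p.update({"W": [0.0] * 4}))
params = {}
init_W.Run(params)   # initializes W
init_W.Run(params)   # does nothing
```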
In this example, all operators are created as `OpDesc` protobuf messages, and all variables are `VarDesc` messages. These protobuf messages are saved in a `BlockDesc` protobuf message.

### Backward Part

The fifth line `optimize(cost)` calls two functions, `ConstructBackwardGraph` and `ConstructOptimizationGraph`.

`ConstructBackwardGraph` traverses the forward graph in the `BlockDesc` protobuf message and builds the backward part.

![](images/graph_construction_example_forward_backward.png)

According to the chain rule of gradient computation, `ConstructBackwardGraph` would (a minimal sketch follows this list):

1. create a gradient operator G for each operator F,
1. make all inputs, outputs, and output gradients of F inputs of G,
1. create gradients for all inputs of F, except for those that don't have gradients, like x and l, and
1. make all these gradients outputs of G.
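
Here is a minimal Python sketch of these four rules. It assumes hypothetical dict-based op descriptions and an `@GRAD` naming convention for gradient variables; none of these names are the actual Paddle API.

```python
def construct_backward_graph(forward_ops, has_gradient):
    """For each forward op F, create a gradient op G whose inputs are F's
    inputs, outputs, and output gradients, and whose outputs are the
    gradients of F's inputs (skipping inputs without gradients)."""
    backward_ops = []
    for f in reversed(forward_ops):   # chain rule: walk the forward ops backwards
        grad_inputs = f["inputs"] + f["outputs"] + [o + "@GRAD" for o in f["outputs"]]
        grad_outputs = [i + "@GRAD" for i in f["inputs"] if has_gradient(i)]
        backward_ops.append({"type": f["type"] + "_grad",
                             "inputs": grad_inputs,
                             "outputs": grad_outputs})
    return backward_ops

# usage on the example graph: fc followed by mse
forward = [{"type": "fc",  "inputs": ["x", "W", "b"], "outputs": ["y"]},
           {"type": "mse", "inputs": ["y", "l"],      "outputs": ["cost"]}]
print(construct_backward_graph(forward, has_gradient=lambda v: v not in ("x", "l")))
```

Running it on the example yields an `mse_grad` op followed by an `fc_grad` op, matching the red boxes in the figure above.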
### Optimization Part

For each parameter, like W and b created by `layer.fc` and marked as double circles in the above graphs, `ConstructOptimizationGraph` creates an optimization operator to apply its gradient. This results in the complete graph:

![](images/graph_construction_example_all.png)
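
In the same illustrative style, and assuming a plain SGD update, `ConstructOptimizationGraph` could be sketched as creating one optimization op per parameter; again the dict layout and the `@GRAD` suffix are assumptions, not the actual API.

```python
def construct_optimization_graph(parameters, learning_rate=0.01):
    """Create one optimization operator per parameter: each SGD op reads the
    parameter and its gradient and writes the updated parameter back."""
    return [{"type": "sgd",
             "inputs": [p, p + "@GRAD"],
             "outputs": [p],
             "attrs": {"learning_rate": learning_rate}}
            for p in parameters]

# one SGD op for W and one for b, matching the green boxes in the figure
print(construct_optimization_graph(["W", "b"]))
```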
## Block and Graph

The words block and graph are interchangeable in the design of PaddlePaddle. A [Block](https://github.com/PaddlePaddle/Paddle/pull/3708) is a metaphor for the code and local variables in a pair of curly braces in programming languages, where operators are like statements or instructions. A graph of operators and variables is a representation of the block.

A Block keeps operators in an array `BlockDesc::ops`

```protobuf
message BlockDesc {
  repeated OpDesc ops = 1;
  repeated VarDesc vars = 2;
}
```

in the order in which they appear in user programs, like the Python program at the beginning of this article. We can imagine that in `ops` we have some forward operators, followed by some gradient operators, and then some optimization operators.
@ -0,0 +1,59 @@
IfOp should have only one branch. An IfOp operator takes a `cond` variable whose value must be a vector of N boolean elements. Its return value has M (M<=N) instances, each corresponding to a true element in `cond`.

```python
import paddle as pd

x = var()
y = var()
cond = var()

b = pd.create_ifop(inputs=[x], output_num=1)
with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```

If we want the output to still have N instances, we can use IfElseOp, which also produces outputs for the false elements, either with a `false_block` or with a default value whose minibatch size must be N:

```python
import paddle as pd

x = var()
y = var()
cond = var()
b = pd.create_ifelseop(inputs=[x], output_num=1)
with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

with b.false_block():
    x = b.inputs(0)
    z = layer.fc(x)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```

If only true_block is set in an IfElseOp, we can have a default value for false, as:

```python
import paddle as pd

x = var()
y = var()
cond = var()
default_value = var()
b = pd.create_ifelseop(inputs=[x], output_num=1, default_value=default_value)

with b.true_block():
    x = b.inputs(0)
    z = operator.add(x, y)
    b.set_output(0, operator.softmax(z))

out = b(cond)
```

where `default_value` is a list of variables used for the instances where `cond` is False.
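
To illustrate the semantics of the default value, here is a small numpy sketch (not the Paddle API; the function name and the shapes are assumptions) of how the N output instances could be assembled from the M true-branch results and the default rows:

```python
import numpy as np

def assemble_ifelse_output(cond, true_branch_out, default_value):
    """cond: (N,) boolean vector.
    true_branch_out: (M, D) rows produced for the True instances (M = cond.sum()).
    default_value: (N, D) rows used wherever cond is False.
    Returns an (N, D) output, matching the minibatch size N."""
    out = default_value.copy()
    out[cond] = true_branch_out      # scatter the true-branch rows back
    return out

cond = np.array([True, False, True, False])
true_out = np.ones((2, 3))           # M = 2 instances from the true block
default = np.zeros((4, 3))           # N = 4 default instances
print(assemble_ifelse_output(cond, true_out, default))
```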
@ -0,0 +1,11 @@
# Forward-only graph: hide both the backward (red) and optimization (green) parts.
cat ./graph_construction_example.dot | \
    sed 's/color=red/color=red, style=invis/g' | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_only.png

# Forward + backward graph: hide only the optimization (green) part.
cat ./graph_construction_example.dot | \
    sed 's/color=green/color=green, style=invis/g' | \
    dot -Tpng > graph_construction_example_forward_backward.png

# Complete graph with all three parts visible.
cat ./graph_construction_example.dot | \
    dot -Tpng > graph_construction_example_all.png
@ -0,0 +1,69 @@
digraph ImageClassificationGraph {
  ///////// The forward part /////////
  FeedX [label="Feed", color=blue, shape=box];
  FeedY [label="Feed", color=blue, shape=box];
  InitW [label="Init", color=blue, shape=diamond];
  Initb [label="Init", color=blue, shape=diamond];
  FC [label="FC", color=blue, shape=box];
  MSE [label="MSE", color=blue, shape=box];

  x [label="x", color=blue, shape=oval];
  l [label="l", color=blue, shape=oval];
  y [label="y", color=blue, shape=oval];
  W [label="W", color=blue, shape=doublecircle];
  b [label="b", color=blue, shape=doublecircle];
  cost [label="cost", color=blue, shape=oval];

  FeedX -> x -> FC -> y -> MSE -> cost [color=blue];
  FeedY -> l [color=blue];
  InitW -> W [color=blue];
  Initb -> b [color=blue];
  W -> FC [color=blue];
  b -> FC [color=blue];
  l -> MSE [color=blue];

  ////////// The backward part /////////
  MSE_Grad [label="MSE_grad", color=red, shape=box];
  FC_Grad [label="FC_grad", color=red, shape=box];

  d_cost [label="d cost", color=red, shape=oval];
  d_y [label="d y", color=red, shape=oval];
  d_b [label="d b", color=red, shape=oval];
  d_W [label="d W", color=red, shape=oval];

  cost -> MSE_Grad [color=red];
  d_cost -> MSE_Grad [color=red];
  x -> MSE_Grad [color=red];
  l -> MSE_Grad [color=red];
  y -> MSE_Grad -> d_y [color=red];

  x -> FC_Grad [color=red];
  y -> FC_Grad [color=red];
  d_y -> FC_Grad [color=red];
  W -> FC_Grad -> d_W [color=red];
  b -> FC_Grad -> d_b [color=red];

  ////////// The optimization part //////////

  OPT_W [label="SGD", color=green, shape=box];
  OPT_b [label="SGD", color=green, shape=box];

  W -> OPT_W [color=green];
  b -> OPT_b [color=green];
  d_W -> OPT_W -> W [color=green];
  d_b -> OPT_b -> b [color=green];

  ////////// Groupings //////////

  subgraph clusterMSE {
    style=invis;
    MSE;
    MSE_Grad;
  }

  subgraph clusterFC {
    style=invis;
    FC;
    FC_Grad;
  }
}