## Optimizer Design

### The Problem

A PaddlePaddle program, or a block, is a sequence of operators operating on variables. A training program needs to do three kinds of work:

1. the forward pass, which computes intermediate results and the cost(s),
1. the backward pass, which derives gradients from intermediate results and costs, and
1. the optimization pass, which updates model parameters to optimize the cost(s).

These passes rely on three kinds of operators:

1. forward operators,
1. gradient operators, and
1. optimization operators.

It's true that users should be able to create all these operators manually by calling some low-level API, but it would be much more convenient if they could describe only the forward pass and let PaddlePaddle create the backward and optimization operators automatically.

In this design, we propose a high-level API that automatically derives the optimization pass and operators from the forward pass.

### High-level Python API to describe the training process

1. Users write code to describe the network:

	```python
	images = layer.data("images")
	labels = layer.data("labels")
	w1 = pd.var("w1")
	b1 = pd.var("b1")
	hidden = layer.fc(images, w=w1, b=b1)
	cost = layer.mse(hidden, labels)
	```

	The above code snippet will create forward operators in [Block](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/block.md).

2. Users create a certain kind of Optimizer with some arguments.

	```python
	optimizer = AdagradOptimizer(learning_rate=0.001)
	```

3. Users use the optimizer to `minimize` a certain `cost` by updating parameters in `parameter_list`.

	```python
	opt_op_list = optimizer.minimize(cost, parameter_list=[w1, b1])
	```

	The above code snippet will create gradient and optimization operators in Block. The return value of `minimize()` is a list of optimization operators that will be run by the session.

4. Users use Session/Executor to run this `opt_op_list` as the target to do training. (A combined sketch of all four steps follows this list.)

	```python
	sess.run(target=opt_op_list, ...)
	```
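
Putting the four steps together, a minimal training sketch might look like the following. It merely recombines the snippets above; the names `layer`, `pd`, `AdagradOptimizer`, `sess`, and the number of passes are assumed from the earlier examples rather than a definitive API.

```python
# A minimal sketch that recombines the snippets from steps 1-4 above.
# The objects used here (layer, pd, AdagradOptimizer, sess) are assumed
# from the earlier examples, not a definitive API.

# Step 1: describe the forward pass.
images = layer.data("images")
labels = layer.data("labels")
w1 = pd.var("w1")
b1 = pd.var("b1")
hidden = layer.fc(images, w=w1, b=b1)
cost = layer.mse(hidden, labels)

# Step 2: create an optimizer.
optimizer = AdagradOptimizer(learning_rate=0.001)

# Step 3: derive gradient and optimization operators from the forward pass.
opt_op_list = optimizer.minimize(cost, parameter_list=[w1, b1])

# Step 4: run the optimization operators as the training target.
# Feeding of input data is omitted for brevity.
num_passes = 10  # hypothetical number of training passes
for _ in range(num_passes):
    sess.run(target=opt_op_list)
```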

#### Optimizer Python interface:

```python
class Optimizer(object):
    """Optimizer Base class.

    """

    def __init__(self):
        pass

    def create_backward_pass(self, loss, parameter_list=None):
        """
        Create and add gradient operators in BlockDesc to compute the
        gradients of `loss` for the parameters in `parameter_list`.

        Args:
          loss: a variable generated by the cost function.
          parameter_list: parameters whose gradients need to be computed and
            that will be updated to optimize the loss.

        Returns:
          list of (parameters, gradients) pairs.
        """
        return None

    def create_optimization_pass(self, parameters_and_grads):
        """Add optimization operators to update gradients to variables.

        Args:
          parameters_and_grads: a list of (variable, gradient) pairs to update.

        Returns:
          optimization_op_list: a list of optimization operators that will
            update the parameters using the gradients.
        """
        return None

    def minimize(self, loss, parameter_list):
        """Add operations to minimize `loss` by updating `parameter_list`.

        This method combines the interfaces `create_backward_pass()` and
        `create_optimization_pass()` into one.
        """
        params_grads = self.create_backward_pass(loss, parameter_list)
        update_ops = self.create_optimization_pass(params_grads)
        return update_ops

```

Users can inherit from the `Optimizer` base class above to create their own optimizers with special logic, such as `AdagradOptimizer`.
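
For illustration, such a subclass might look roughly like the sketch below. It only shows how the interface above could be specialized; the operator type `"adagrad"`, the helper `append_op`, and the attribute names are assumptions made for this example, not the actual API.

```python
class AdagradOptimizer(Optimizer):
    """An illustrative Adagrad optimizer built on the base class above.

    NOTE: the operator type "adagrad" and the helper `append_op` are
    hypothetical names used only to sketch where a subclass would add
    its own logic; they are not the actual PaddlePaddle API.
    """

    def __init__(self, learning_rate=0.001, epsilon=1.0e-6):
        super(AdagradOptimizer, self).__init__()
        self.learning_rate = learning_rate
        self.epsilon = epsilon

    def create_optimization_pass(self, parameters_and_grads):
        optimization_op_list = []
        for param, grad in parameters_and_grads:
            # One optimization operator per parameter: it reads the parameter
            # and its gradient and writes the updated parameter back.
            op = append_op(
                type="adagrad",
                inputs={"Param": param, "Grad": grad},
                outputs={"ParamOut": param},
                attrs={"learning_rate": self.learning_rate,
                       "epsilon": self.epsilon})
            optimization_op_list.append(op)
        return optimization_op_list
```
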
# Design Doc: Selected Rows

`SelectedRows` is a kind of sparse tensor data type, which is designed to support `embedding` operators. The gradient of an embedding table is a sparse tensor: only a few of its rows are non-zero. It is straightforward to represent such a sparse tensor by the following data structure:

```cpp
class SelectedRows {
 private:
  vector<int> rows_;  // indices of the non-zero rows
  Tensor value_;      // values of the non-zero rows
  int height_;        // first dimension of the full (dense) tensor
};
```

The field `height_` is the first dimension of the `SelectedRows`. `rows_` holds the indices of the rows that are non-zero. The `value_` field is an N-dim tensor with shape `[rows_.size() /* NUM_ROWS */, ...]`, which supplies the values of each non-zero row. The full dimension of the `SelectedRows` is `[height_] + value_.shape[1:]`.

Suppose that a `SelectedRows`-typed variable `x` has many rows, but only two of them have values -- row 73 is `[1, 2]` and row 84 is `[3, 4]`. The `SelectedRows` representation would be:

```
x = SelectedRows {
  rows = [73, 84],
  value = [[1, 2], [3, 4]]
}
```

## SelectedRows in Protobuf

`SelectedRows` is a kind of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described at compile time, since the `rows_` and `value_` depend on the training data. So we use `TensorDesc` to unify `data_type` and `dims`. A `LodTensorDesc` contains a `TensorDesc` and a `lod_level`. The description of a `SelectedRows` is simply a tensor description.

```proto
message TensorDesc {
  required DataType data_type = 1;
  repeated int64 dims = 2; // [UNK, 640, 480] is saved as [-1, 640, 480]
}

message LodTensorDesc {
  required TensorDesc tensor = 1;
  optional int32 lod_level = 2;
}

message VarDesc {
  required string name = 1;
  enum VarType {
    LOD_TENSOR = 0;
    SELECTED_ROWS = 1;
  }
  required VarType type = 2;
  optional LodTensorDesc lod_desc = 3;
  optional TensorDesc selected_rows_desc = 4;
  optional bool persistable = 5 [ default = false ];
}
```

## InferShape for Selected Rows

Just like `LoD` information, the `InferShape` method will infer the output tensor type as well. The operator should decide whether its output is a `SelectedRows` or a dense tensor.

For example, the gradient operator of `TableLookup` will always generate a `SelectedRows`. Its `InferShape` method should look like the following:

```cpp
void TableLookupGrad::InferShape(context) {
  ...
  context.SetDataType("Embedding.Grad", kSelectedRows);
}
```

## Sparse Operators

There are several operators that should be written to support `SelectedRows`. They are:

1. Operators which generate a `SelectedRows` gradient, e.g., the gradient of `TableLookupOp`.
2. Optimization operators which support a `SelectedRows` gradient, e.g., `SGD` or `AdaGrad` for `SelectedRows`. However, there should be only one `SGD` operator; `OpWithKernel::Run` should select a suitable kernel for either a dense tensor or a `SelectedRows` input.