diff --git a/doc/design/images/feed_forward.png b/doc/design/images/feed_forward.png
new file mode 100644
index 0000000000..d312371a04
Binary files /dev/null and b/doc/design/images/feed_forward.png differ
diff --git a/doc/design/images/feed_forward_regularized.png b/doc/design/images/feed_forward_regularized.png
new file mode 100644
index 0000000000..677e99bfd9
Binary files /dev/null and b/doc/design/images/feed_forward_regularized.png differ
diff --git a/doc/design/images/l1_regularization.png b/doc/design/images/l1_regularization.png
new file mode 100644
index 0000000000..e1b9c7a44f
Binary files /dev/null and b/doc/design/images/l1_regularization.png differ
diff --git a/doc/design/images/l2_regularization.png b/doc/design/images/l2_regularization.png
new file mode 100644
index 0000000000..d5c2fcbc2c
Binary files /dev/null and b/doc/design/images/l2_regularization.png differ
diff --git a/doc/design/images/loss_equation.png b/doc/design/images/loss_equation.png
new file mode 100644
index 0000000000..14212ec8d3
Binary files /dev/null and b/doc/design/images/loss_equation.png differ
diff --git a/doc/design/regularization.md b/doc/design/regularization.md
new file mode 100644
index 0000000000..703a9fbdd4
--- /dev/null
+++ b/doc/design/regularization.md
@@ -0,0 +1,103 @@
+# Regularization in PaddlePaddle
+
+## Introduction to Regularization
+A central problem in machine learning is how to design an algorithm that will perform well not just on the training data, but also on new data. Many strategies are used by machine learning practitioners to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as **regularization**. 
+
+### Parameter Norm Penalties
+The most common regularization approaches in deep learning limit the capacity of a model by adding a parameter norm penalty to the objective function `J`. This is given as follows:
+
+<img src="./images/loss_equation.png" align="center"/><br/>
+
+The hyperparameter `alpha` weights the contribution of the norm penalty term, `omega`, relative to the standard objective function `J`.
+
+The most commonly used norm penalties are the L2 norm penalty and the L1 norm penalty. These are given as follows:
+
+##### L2 Regularization
+<img src="./images/l2_regularization.png" align="center"/><br/>
+
+##### L1 Regularization
+<img src="./images/l1_regularization.png" align="center"/><br/>
+
+A much more detailed mathematical background of regularization can be found [here](http://www.deeplearningbook.org/contents/regularization.html).
+
+
+## How to do Regularization in PaddlePaddle
+
+A survey of existing frameworks such as TensorFlow, PyTorch, and Caffe shows that there are two common approaches to regularization:
+
+1. Making regularization a part of the optimizer using an attribute like `weight_decay` that is used to control the scale of the L2 Penalty. This approach is used in PyTorch as follows:
+	```python
+	opt =  torch.optim.SGD(params, lr=0.2, weight_decay=0.2)
+	```
+    At every optimization step, this code adds the gradient of the L2 norm of the params to the gradient of the params with respect to the loss function. This can be seen in the following code snippet:
+    ```python
+    if weight_decay != 0:
+    	d_p.add_(weight_decay, p.data)
+    ```
+    This is a very restrictive way of doing regularization and does not give users enough flexibility.
+    
+    **Advantages**:
+    -  It is easy for us to implement.
+    -  The backward pass executes faster. However, advanced users can also do this manually.
+
+	**Disadvantages**:
+    - Not flexible enough for other regularizations such as L1/L0 regularization.
+    - Does not allow different regularization coefficients for different parameters. For example, in most models, only the weight matrices are regularized and the bias vectors are unregularized.
+    - Tightly couples the optimizer and regularization implementations.
+
+
+2. Adding regularization ops to the graph through the Python API. This approach is used by TensorFlow and Caffe. Using this approach, we manually add regularization ops to the graph and then add the regularization loss to the final loss function before sending it to the optimizer.
+
+	**Advantages**:
+    - Gives the users of Paddle greater flexibility. Using this approach, users can apply different regularization to different parameters and can also choose parameters that are excluded from regularization.
+    - Makes it easy for the users to customize and extend the framework. 
+
+	**Disadvantages**:
+    - Implementation requires comprehensive design and time. 
+
+## Proposal for Regularization in PaddlePaddle
+
+### Low-Level implementation
+
+In the new design, we propose to create new operations for regularization. For now, we can add two ops that correspond to the most frequently used regularizations:
+- L2_regularization_op
+- L1_regularization_op
+
+These ops can be like any other ops, with their own CPU/GPU implementations using either Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement their kernels using Eigen, following the abstraction pattern implemented for [Activation Ops](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/accuracy_op.h). This abstraction pattern can make it very easy to implement new regularization schemes other than L1 and L2 norm penalties.
+
+The idea of building ops for regularization is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way these ops are added to the computation graph will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in the Python API.
+
+### Computation Graph
+
+Below is an example of a simple feed-forward neural network.
+
+<img src="./images/feed_forward.png" align="center"/><br/>
+
+The Python API will modify this computation graph to add regularization operators. The modified computation graph will look as follows:
+
+<img src="./images/feed_forward_regularized.png" align="center"/><br/>
+   
+### Python API implementation for Regularization
+
+Using the low-level ops, `L2_regularization_op` and `L1_regularization_op`, any user can add regularization to their computation graphs. However, this requires many lines of code, so we should design Python APIs that support regularization. An example of such an API can be seen in [Keras](https://keras.io/regularizers/). As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since regularization is a property of parameters, it makes sense to create these in the layer functions.
+
+#### Creation of Regularization ops
+There are two possibilities for creating the regularization ops:
+1. We create these ops immediately while building the computation graph. 
+2. We add these ops in a lazy manner, just before the backward pass, similar to the way the optimization ops are added.
+
+The proposal is to add these ops in a lazy manner just before the backward pass. 
+
+#### Storage of Regularization attributes
+
+Since we want to create the regularization ops in a lazy manner, the regularization attributes (type of regularization and weight of regularization penalty) can be stored as attributes of the [`Parameter`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/framework.py#L421) class. This is because regularization is a property of the parameters and storing regularization properties with Parameters also allows for shared parameters. 
+
+#### High-level API
+
+In the PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we also need to provide regularization functionality in layer functions. The design of these APIs can be postponed for now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and in TensorFlow's [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
+
+
+
+
+
+    
diff --git a/doc/design/selected_rows.md b/doc/design/selected_rows.md
index 9e6f3b20cb..1a98839a95 100644
--- a/doc/design/selected_rows.md
+++ b/doc/design/selected_rows.md
@@ -1,6 +1,6 @@
 # Design Doc: Selected Rows
 
-`SelectedRows` is a kind of sparse tensor data type, which is designed to support `embedding` operators. The gradient of embedding table is a sparse tensor. Only a few rows are non-zero values in that tensor. It is straightforward to represent the sparse tensor by the following sparse tensor data structure:
+`SelectedRows` is a sparse tensor data type designed to support `embedding` operators. The gradient of the embedding table is a sparse tensor in which only a few rows have non-zero values. It is straightforward to represent such a sparse tensor with the following data structure:
 
 ```cpp
 class SelectedRows {
@@ -11,7 +11,7 @@ class SelectedRows {
 };
 ```
 
-The field `height_` shows the first dimension of `SelectedRows`. The `rows` are the indices of which rows of `SelectedRows` are non-zeros. The `value_` field is an N-dim tensor and shape is `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.
+The field `height_` is the first dimension of `SelectedRows`. The `rows` are the indices of the non-zero rows of `SelectedRows`. The `value_` field is an N-dim tensor of shape `[rows.size() /* NUM_ROWS */, ...]`, which supplies values for each row. The dimension of `SelectedRows` satisfies `[height_] + value_.shape[1:]`.
 
 Suppose that a SelectedRows-typed variable `x` has many rows, but only two of them have values -- row 73 is `[1, 2]` and row 84 is `[3, 4]`, the `SelectedRows` representation would be:
 
@@ -25,7 +25,7 @@ x = SelectedRow {
 
 ## SelectedRows in Protobuf
 
-`SelectedRows` is a kind of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described in compile-time since the `rows_` and `value_` are related to training data. 
+`SelectedRows` is a type of `Variable`. `VarDesc` in protobuf should describe the `SelectedRows` information. Only the tensor dimension of a `SelectedRows` will be described at compile time because `rows_` and `value_` depend on the training data.
 So we use `TensorDesc` to unify `data_type` and `dims`. A LodTensorDesc contains a `TensorDesc` and `lod_level`. The description of `SelectedRows` is a Tensor description.
 
 ```proto
@@ -54,7 +54,7 @@ message VarDesc {
 
 ## InferShape for Selected Rows
 
-Just like `LoD` information, `InferShape` method will inference output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.
+Just like `LoD` information, the `InferShape` method will infer the output tensor type as well. The operator should decide whether its output is a `SelectedRows` or `Dense` tensor.
 
 For example, the gradient operator of `TableLookup` will always generate `SelectedRows`. Its `InferShape` method should be like following
 
@@ -68,7 +68,7 @@ void TableLookupGrad::InferShape(context) {
 
 ## Sparse Operators
 
-There are several operators should be written to support `SelectedRows`. They are:
+There are several operators that need to be written to support `SelectedRows`. These are:
 
-1. Operators which generates `SelectedRows` gradient. e.g. Gradient of `TableLookupOp`.
+1. Operators that generate `SelectedRows` gradients, e.g. the gradient of `TableLookupOp`.
 2. Optimize operators which support `SelectedRows` gradient. e.g. `SGD` or `AdaGrad` for `SelectedRows`. However, there should be only one `SGD` operator. `OpWithKernel::Run` should select a suitable kernel for both `dense` tensor or `SelectedRows`.
diff --git a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
index 90dc84718c..1fc58c37cc 100644
--- a/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_android_cn.md
@@ -1,9 +1,46 @@
 # 构建Android平台上的PaddlePaddle库
 
-用户可通过交叉编译的方式，在用户熟悉的开发平台（Linux，Mac OS X和Windows）上编译Android平台上适用的PaddlePaddle库。
+用户可通过如下两种方式，交叉编译Android平台上适用的PaddlePaddle库：
+- 基于Docker容器的编译方式
+- 基于Linux交叉编译环境的编译方式
+
+## 基于Docker容器的编译方式
+Docker能在所有主要操作系统（包括Linux，Mac OS X和Windows）上运行，因此，使用基于Docker容器的编译方式，用户可在自己熟悉的开发平台上编译Android平台上适用的PaddlePaddle库。
+
+### 构建PaddlePaddle的Android开发镜像
+我们把PaddlePaddle的交叉编译环境打包成一个镜像，称为开发镜像，里面涵盖了交叉编译Android版PaddlePaddle库需要的所有编译工具。
+
+```bash
+$ git clone https://github.com/PaddlePaddle/Paddle.git
+$ cd Paddle
+$ docker build -t username/paddle-android:dev . -f Dockerfile.android
+```
+
+### 编译PaddlePaddle C-API库
+构建好开发镜像后，即可使用开发镜像来编译Android版PaddlePaddle C-API库。
+Android的Docker开发镜像向用户提供两个可配置的参数：
+
+| Argument        | Optional Values         | Default |
+|-----------------|-------------------------|---------|
+|`ANDROID_ABI`    |`armeabi-v7a, arm64-v8a` | `armeabi-v7a` |
+|`ANDROID_API`    |`>= 21` | `21` |
+
+- 编译`armeabi-v7a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=armeabi-v7a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+- 编译`arm64-v8a`，`Android API 21`的PaddlePaddle库
+```bash
+$ docker run -it --rm -v $PWD:/paddle -e "ANDROID_ABI=arm64-v8a" -e "ANDROID_API=21" username/paddle-android:dev
+```
+
+执行上述`docker run`命令时，容器默认执行[paddle/scripts/docker/build_android.sh](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build_android.sh)脚本。该脚本中记录了交叉编译Android版PaddlePaddle库常用的CMake配置，并且会根据`ANDROID_ABI`和`ANDROID_API`自动构建独立工具链、进行编译和安装。由于arm64架构要求Android API不小于21，因此当`ANDROID_ABI=arm64-v8a`，`ANDROID_API<21`时，Docker容器中将默认使用`Android API 21`的编译工具链。用户可以参考下文**配置交叉编译参数**章节，根据个人的需求修改定制Docker容器所执行的脚本。编译安装结束之后，PaddlePaddle的C-API库将被安装到`$PWD/install_android`目录，所依赖的第三方库同时也被安装到`$PWD/install_android/third_party`目录。
+
+## 基于Linux交叉编译环境的编译方式
 本文档将以Linux x86-64平台为例，介绍交叉编译Android平台上适用的PaddlePaddle库的方法和步骤。
 
-## 准备交叉编译环境
+### 准备交叉编译环境
 
 从源码交叉编译PaddlePaddle，用户需要提前准备好交叉编译环境。Android平台上使用的C/C++交叉编译工具链为[Android NDK](https://developer.android.com/ndk/downloads/index.html?hl=zh-cn)，用户可自行前往下载预编译好的版本，也可通过以下命令获取：
 
@@ -13,18 +50,27 @@ unzip -q android-ndk-r14b-linux-x86_64.zip
 ```
 
 Android NDK中包含了所有Android API级别、所有架构（arm/arm64/x86/mips）需要用到的编译工具和系统库。用户可根据自己的编译目标架构、所需支持的最低Android API级别，构建[独立工具链](https://developer.android.google.cn/ndk/guides/standalone_toolchain.html?hl=zh-cn)。
-比如：
+
+- 构建`armeabi-v7a`、 `Android API 21`的独立工具链：
 
 ```bash
 your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
-        --arch=arm --platform=android-21 --install-dir=your/path/to/my_standalone_toolchain
+        --arch=arm --platform=android-21 --install-dir=your/path/to/arm_standalone_toolchain
 ```
 
-此命令将在your/path/to/my_standalone_toolchain目录生成一套编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，使用的编译器为arm-linux-androideabi-gcc (GCC) 4.9。
+此命令将在`your/path/to/arm_standalone_toolchain`目录生成一套独立编译工具链，面向架构为32位ARM架构，支持的最小的Android API级别为21，支持编译器`arm-linux-androideabi-gcc (GCC) 4.9`和`clang 3.8`。
 
-注意：**PaddlePaddle要求使用的编译工具链所支持的Andoid API级别不小于21**。
+- 构建`arm64-v8a`、 `Android API 21`的独立工具链：
+```bash
+your/path/to/android-ndk-r14b-linux-x86_64/build/tools/make-standalone-toolchain.sh \
+        --arch=arm64 --platform=android-21 --install-dir=your/path/to/arm64_standalone_toolchain
+```
 
-## 配置交叉编译参数
+此命令将在`your/path/to/arm64_standalone_toolchain`目录生成一套独立编译工具链，面向架构为64位ARM架构，支持的最小Android API级别为21，支持编译器`aarch64-linux-android-gcc (GCC) 4.9`和`clang 3.8`。
+
+注意：**PaddlePaddle要求使用的编译工具链所支持的Android API级别不小于21**。
+
+### 配置交叉编译参数
 
 CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cmake/help/v3.0/manual/cmake-toolchains.7.html#cross-compiling)。为了简化cmake配置，PaddlePaddle为交叉编译提供了工具链配置文档[cmake/cross_compiling/android.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/android.cmake)，以提供一些默认的编译器和编译参数相关配置。注意，从CMake 3.7版本开始，CMake官方对Android平台的交叉编译提供了通用的支持。PaddlePaddle若检测到用户使用的CMake版本不低于3.7时，将会将用户传进来的配置参数传递CMake系统，交由CMake系统本身来处理。有关参数配置的详细说明见[cmake-toolchains](https://cmake.org/cmake/help/v3.7/manual/cmake-toolchains.7.html#cross-compiling)。
 
@@ -36,32 +82,57 @@ CMake系统对交叉编译提供了支持[cmake-toolchains](https://cmake.org/cm
 Android平台可选配置参数：
 
 - `ANDROID_STANDALONE_TOOLCHAIN`，独立工具链所在的绝对路径，或者相对于构建目录的相对路径。PaddlePaddle的CMake系统将根据该值自动推导和设置需要使用的交叉编译器、sysroot、以及Android API级别；否则，用户需要在cmake时手动设置这些值。无默认值。
-- `ANDROID_ABI`，目标架构ABI。目前只支持`armeabi-v7a`，默认值为`armeabi-v7a`。
+- `ANDROID_TOOLCHAIN`，目标工具链。可设置`gcc/clang`，默认值为`clang`。
+	- CMake 3.7以上，将会始终使用`clang`工具链；CMake 3.7以下，可设置`ANDROID_TOOLCHAIN=gcc`以使用`gcc`工具链。
+	- Android官方提供的`clang`编译器要求系统支持`GLIBC 2.15`以上。
+- `ANDROID_ABI`，目标架构ABI。目前支持`armeabi-v7a`和`arm64-v8a`，默认值为`armeabi-v7a`。
 - `ANDROID_NATIVE_API_LEVEL`，工具链的Android API级别。若没有显式设置，PaddlePaddle将根据`ANDROID_STANDALONE_TOOLCHAIN`的值自动推导得到。
-- `ANROID_ARM_MODE`，是否使用ARM模式。可设置`ON/OFF`，默认值为`ON`。
-- `ANDROID_ARM_NEON`，是否使用NEON指令。目前必须设置成`ON`，默认值为`ON`。
+- `ANDROID_ARM_MODE`，是否使用ARM模式。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
+- `ANDROID_ARM_NEON`，是否使用NEON指令。
+	- `ANDROID_ABI=armeabi-v7a`时，可设置`ON/OFF`，默认值为`ON`；
+	- `ANDROID_ABI=arm64-v8a`时，不需要设置。
 
 其他配置参数：
 
+- `USE_EIGEN_FOR_BLAS`，是否使用Eigen库进行矩阵计算。可设置`ON/OFF`，默认值为`OFF`。
 - `HOST_C/CXX_COMPILER`，宿主机的C/C++编译器。在编译宿主机版protoc可执行文件和目标机版OpenBLAS库时需要用到。默认设置成环境变量`CC`的值；若环境变量`CC`没有设置，则设置成`cc`编译器。
 
-一种常用的cmake配置如下：
+常用的cmake配置如下：
 
 ```bash
 cmake -DCMAKE_SYSTEM_NAME=Android \
-      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/my_standalone_toolchain \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm_standalone_toolchain \
       -DANDROID_ABI=armeabi-v7a \
       -DANDROID_ARM_NEON=ON \
       -DANDROID_ARM_MODE=ON \
+      -DUSE_EIGEN_FOR_BLAS=ON \
       -DCMAKE_INSTALL_PREFIX=your/path/to/install \
       -DWITH_C_API=ON \
       -DWITH_SWIG_PY=OFF \
       ..
 ```
 
+```bash
+cmake -DCMAKE_SYSTEM_NAME=Android \
+      -DANDROID_STANDALONE_TOOLCHAIN=your/path/to/arm64_standalone_toolchain \
+      -DANDROID_ABI=arm64-v8a \
+      -DUSE_EIGEN_FOR_BLAS=OFF \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
 用户还可根据自己的需求设置其他编译参数。比如希望最小化生成的库的大小，可以设置`CMAKE_BUILD_TYPE`为`MinSizeRel`；若希望最快的执行速度，则可设置`CMAKE_BUILD_TYPE`为`Release`。亦可以通过手动设置`CMAKE_C/CXX_FLAGS_MINSIZEREL/RELEASE`来影响PaddlePaddle的编译过程。
 
-## 编译和安装
+**性能TIPS**，为了达到最快的计算速度，在CMake参数配置上，有以下建议：
+- 设置`CMAKE_BUILD_TYPE`为`Release`
+- 使用`clang`编译工具链
+- `armeabi-v7a`时，设置`USE_EIGEN_FOR_BLAS=ON`，使用Eigen进行矩阵计算；`arm64-v8a`时，设置`USE_EIGEN_FOR_BLAS=OFF`，使用OpenBLAS进行矩阵计算
+
+### 编译和安装
 
 CMake配置完成后，执行以下命令，PaddlePaddle将自动下载和编译所有第三方依赖库、编译和安装PaddlePaddle预测库。
 
@@ -72,4 +143,4 @@ make install
 
 注意：如果你曾经在源码目录下编译过其他平台的PaddlePaddle库，请先使用`rm -rf`命令删除`third_party`目录和`build`目录，以确保所有的第三方依赖库和PaddlePaddle代码都是针对新的CMake配置重新编译的。
 
-执行完安装命令后，`your/path/to/install`目录中会包含`include`和`lib`目录，其中`include`中包含C-API的头文件，`lib`中包含一个Android版本的库。自此，PaddlePaddle的已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
+执行完安装命令后，`your/path/to/install`目录中会包含`include`、`lib`和`third_party`目录，其中`include`中包含C-API的头文件，`lib`中包含若干个不同Android ABI的PaddlePaddle库，`third_party`中包含所依赖的所有第三方库。至此，PaddlePaddle库已经安装完成，用户可将`your/path/to/install`目录下的生成文件用于深度学习相关Android App中，调用方法见C-API文档。
diff --git a/go/pserver/client/client.go b/go/pserver/client/client.go
index 20d91e7703..e5187ce3df 100644
--- a/go/pserver/client/client.go
+++ b/go/pserver/client/client.go
@@ -137,7 +137,7 @@ func (c *Client) FinishInitParams() error {
 			return err
 		}
 	}
-	return nil
+	return c.sel.Done()
 }
 
 // SendGrads sends gradients to parameter servers for updating
diff --git a/paddle/framework/attribute.cc b/paddle/framework/attribute.cc
index d6a2975aaa..29fe352ca4 100644
--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@@ -19,19 +19,7 @@ limitations under the License. */
 namespace paddle {
 namespace framework {
 
-static ProgramDesc* g_program_desc = nullptr;
-
-ProgramDesc& GetProgramDesc() {
-  if (g_program_desc == nullptr) {
-    g_program_desc = new ProgramDesc();
-    auto root_block = g_program_desc->mutable_blocks()->Add();
-    root_block->set_idx(0);
-    root_block->set_parent_idx(-1);
-  }
-  return *g_program_desc;
-}
-
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
   switch (attr_desc.type()) {
     case framework::AttrType::BOOLEAN: {
       return attr_desc.b();
@@ -74,7 +62,9 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
       return val;
     }
     case framework::AttrType::BLOCK: {
-      return GetProgramDesc().mutable_blocks(attr_desc.block_idx());
+      PADDLE_ENFORCE(program != nullptr,
+                     "Need to specify ProgramDesc when getting a block attr");
+      return program->mutable_blocks(attr_desc.block_idx());
     }
   }
   PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");
diff --git a/paddle/framework/attribute.h b/paddle/framework/attribute.h
index 8a7a949346..9744662b8f 100644
--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -26,16 +26,13 @@ limitations under the License. */
 
 namespace paddle {
 namespace framework {
-
-ProgramDesc& GetProgramDesc();
-
 template <typename T>
 inline AttrType AttrTypeID() {
   Attribute tmp = T();
   return static_cast<AttrType>(tmp.which() - 1);
 }
 
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc);
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program);
 
 class AttrReader {
  public:
diff --git a/paddle/framework/backward.cc b/paddle/framework/backward.cc
index ac80879c54..fb552fe344 100644
--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -309,8 +309,7 @@ static void CreateGradVarInBlock(
 }
 
 std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
-    const std::unique_ptr<OpDescBind>& op_desc,
-    std::unordered_set<std::string>* no_grad_vars,
+    const OpDescBind* op_desc, std::unordered_set<std::string>* no_grad_vars,
     std::unordered_map<std::string, std::string>* grad_to_var) {
   std::vector<std::unique_ptr<OpDescBind>> grad_op_descs;
   // All input gradients of forwarding operator do not need to calculate.
@@ -357,7 +356,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
     std::unordered_set<std::string>* no_grad_vars,
     std::unordered_map<std::string, std::string>* grad_to_var) {
   BlockDescBind* cur_block = program_desc.Block(block_idx);
-  std::deque<std::unique_ptr<OpDescBind>>& op_descs = cur_block->ops_;
+  std::vector<OpDescBind*> op_descs = cur_block->AllOps();
   std::unordered_map<std::string, std::vector<size_t>> dup_out_ops;
   size_t grad_desc_idx = 0;
   std::vector<std::unique_ptr<OpDescBind>> backward_descs;
@@ -375,7 +374,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
           program_desc, step_block_idx, no_grad_vars, grad_to_var);
       BlockDescBind* backward_block = program_desc.AppendBlock(*cur_block);
       for (auto& ptr : backward_block_op_descs) {
-        backward_block->ops_.push_back(std::move(ptr));
+        backward_block->AppendAllocatedOp(std::move(ptr));
       }
       op_grads[0]->SetBlockAttr("step_block", *backward_block);
     }
@@ -432,7 +431,6 @@ ParamGradInfoMap AppendBackward(
 
   const int root_block_idx = 0;
   auto root_block = program_desc.Block(root_block_idx);
-  auto& all_ops = root_block->ops_;
 
   // insert fill one op for target
   // TODO(qiao) add some check to the target.
@@ -447,8 +445,8 @@ ParamGradInfoMap AppendBackward(
                      {{"shape", target_shape},
                       {"value", static_cast<float>(1.0)},
                       {"data_type", framework::DataType::FP32}}));
-  all_ops.push_back(std::move(fill_one_op));
-  size_t forward_op_num = all_ops.size();
+  root_block->AppendAllocatedOp(std::move(fill_one_op));
+  size_t forward_op_num = root_block->OpSize();
   size_t forward_block_num = program_desc.Size();
 
   // Insert backward operators
@@ -457,7 +455,7 @@ ParamGradInfoMap AppendBackward(
                                              &no_grad_var_names, &grad_to_var);
 
   for (auto& ptr : backward_op_descs) {
-    all_ops.push_back(std::move(ptr));
+    root_block->AppendAllocatedOp(std::move(ptr));
   }
   // Create Variable
 
diff --git a/paddle/framework/backward_test.cc b/paddle/framework/backward_test.cc
index 0c35a157bc..10301f7e39 100644
--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -495,19 +495,8 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {
   EXPECT_EQ(bwd_net->ops_[2]->Outputs(all).size(), 0UL);
 }
 
-// =================================== //
-
-f::ProgramDesc *GetNewProgramDesc() {
-  auto *program_desc = new f::ProgramDesc();
-  auto *root_block = program_desc->add_blocks();
-  root_block->set_idx(0);
-  root_block->set_parent_idx(-1);
-  return program_desc;
-}
-
 TEST(Backward, simple_single_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
 
   f::OpDescBind *op = block->AppendOp();
@@ -543,8 +532,7 @@ TEST(Backward, simple_single_op) {
 }
 
 TEST(Backward, default_attribute) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op = block->AppendOp();
   op->SetType("mul");
@@ -570,8 +558,7 @@ TEST(Backward, default_attribute) {
 }
 
 TEST(Backward, simple_mult_op) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -654,8 +641,7 @@ TEST(Backward, simple_mult_op) {
 }
 
 TEST(Backward, intermedia_var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -725,8 +711,7 @@ TEST(Backward, intermedia_var_no_grad) {
 }
 
 TEST(Backward, var_no_grad) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("mult_in_out");
@@ -802,8 +787,7 @@ TEST(Backward, var_no_grad) {
 }
 
 TEST(Backward, shared_var) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   f::OpDescBind *op1 = block->AppendOp();
   op1->SetType("rowwise_add");
@@ -893,8 +877,7 @@ TEST(Backward, shared_var) {
 }
 
 TEST(Backward, half_backward) {
-  f::ProgramDesc *program_desc = GetNewProgramDesc();
-  f::ProgramDescBind &program = f::ProgramDescBind::Instance(program_desc);
+  f::ProgramDescBind program;
   f::BlockDescBind *block = program.Block(0);
   auto *op1 = block->AppendOp();
   op1->SetType("minus");
diff --git a/paddle/framework/block_desc.cc b/paddle/framework/block_desc.cc
index 47b75228cd..92ac302e46 100644
--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -19,11 +19,11 @@ namespace paddle {
 namespace framework {
 
 VarDescBind *BlockDescBind::Var(const std::string &name) {
-  need_update_ = true;
   auto it = vars_.find(name);
   if (it != vars_.end()) {
     return it->second.get();
   }
+  need_update_ = true;
   auto *var = new VarDescBind(name);
   vars_[name].reset(var);
   return var;
@@ -55,6 +55,11 @@ OpDescBind *BlockDescBind::AppendOp() {
   return ops_.back().get();
 }
 
+void BlockDescBind::AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc) {
+  need_update_ = true;
+  ops_.emplace_back(std::move(op_desc));
+}
+
 OpDescBind *BlockDescBind::PrependOp() {
   need_update_ = true;
   ops_.emplace_front(new OpDescBind());
@@ -70,15 +75,19 @@ std::vector<OpDescBind *> BlockDescBind::AllOps() const {
 }
 
 void BlockDescBind::Flush() {
+  for (auto &op_desc : ops_) {
+    op_desc->Flush();
+  }
+
   if (need_update_) {
     auto &op_field = *this->desc_->mutable_ops();
-    op_field.Clear();
+    this->ClearPBOps();
     op_field.Reserve(static_cast<int>(ops_.size()));
     for (auto &op_desc : ops_) {
       op_field.AddAllocated(op_desc->Proto());
     }
     auto &var_field = *this->desc_->mutable_vars();
-    var_field.Clear();
+    this->ClearPBVars();
     var_field.Reserve(static_cast<int>(vars_.size()));
     for (auto &var_desc : vars_) {
       var_field.AddAllocated(var_desc.second->Proto());
@@ -99,5 +108,21 @@ BlockDesc *BlockDescBind::Proto() {
   return desc_;
 }
 
+void BlockDescBind::ClearPBOps() {
+  auto ops = this->desc_->mutable_ops();
+  while (!ops->empty()) {
+    // we do not own the OpDesc, so release the ownership.
+    ops->ReleaseLast();
+  }
+}
+
+void BlockDescBind::ClearPBVars() {
+  auto vars = this->desc_->mutable_vars();
+  while (!vars->empty()) {
+    // we do not own the VarDesc, so release the ownership.
+    vars->ReleaseLast();
+  }
+}
+
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/block_desc.h b/paddle/framework/block_desc.h
index 9fb88f9632..5e1f10c1ae 100644
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@@ -36,6 +36,11 @@ class BlockDescBind {
   BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
       : prog_(prog), desc_(desc), need_update_(false) {}
 
+  ~BlockDescBind() {
+    this->ClearPBVars();
+    this->ClearPBOps();
+  }
+
   int32_t ID() const { return desc_->idx(); }
 
   int32_t Parent() const { return desc_->parent_idx(); }
@@ -52,17 +57,25 @@ class BlockDescBind {
 
   OpDescBind *AppendOp();
 
+  void AppendAllocatedOp(std::unique_ptr<OpDescBind> &&op_desc);
+
   OpDescBind *PrependOp();
 
   std::vector<OpDescBind *> AllOps() const;
 
+  size_t OpSize() const { return ops_.size(); }
+
+  OpDescBind *Op(int idx) { return ops_.at(idx).get(); }
+
   void Flush();
 
   BlockDesc *Proto();
 
-  // FIXME(yuyang18): backward will access private data of BlockDesc.
-  // Mark it public temporary. We can fix it later.
- public:
+ private:
+  void ClearPBOps();
+  void ClearPBVars();
+
+ private:
   ProgramDescBind *prog_;  // not_own
   BlockDesc *desc_;        // not_own
   bool need_update_;
diff --git a/paddle/framework/executor.cc b/paddle/framework/executor.cc
index 8e82e28bac..00caa6e1d5 100644
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -64,99 +64,24 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
   auto& block = pdesc.blocks(block_id);
   auto& device = device_contexts_[0];
 
-  // Instantiate all the vars in the global scope
-  for (auto& var : block.vars()) {
-    scope->Var(var.name());
-  }
-
   Scope& local_scope = scope->NewScope();
 
-  std::vector<bool> should_run = Prune(pdesc, block_id);
-  PADDLE_ENFORCE_EQ(should_run.size(), static_cast<size_t>(block.ops_size()));
-  for (size_t i = 0; i < should_run.size(); ++i) {
-    if (should_run[i]) {
-      for (auto& var : block.ops(i).outputs()) {
-        for (auto& argu : var.arguments()) {
-          if (local_scope.FindVar(argu) == nullptr) {
-            local_scope.Var(argu);
-          }
-        }
-      }
-      auto op = paddle::framework::OpRegistry::CreateOp(block.ops(i));
-      op->Run(local_scope, *device);
+  for (auto& var : block.vars()) {
+    if (var.persistable()) {
+      scope->Var(var.name());
+    } else {
+      local_scope.Var(var.name());
     }
   }
 
-  // TODO(tonyyang-svail):
-  //  - Destroy local_scope
-}
-
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id) {
-  // TODO(tonyyang-svail):
-  //    - will change to use multiple blocks for RNN op and Cond Op
-
-  auto& block = pdesc.blocks(block_id);
-  auto& ops = block.ops();
-
-  bool expect_feed = true;
-  for (auto& op_desc : ops) {
-    PADDLE_ENFORCE(op_desc.type() != kFeedOpType || expect_feed,
-                   "All FeedOps are at the beginning of the ProgramDesc");
-    expect_feed = (op_desc.type() == kFeedOpType);
-  }
-
-  bool expect_fetch = true;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-    PADDLE_ENFORCE(op_desc.type() != kFetchOpType || expect_fetch,
-                   "All FetchOps must at the end of the ProgramDesc");
-    expect_fetch = (op_desc.type() == kFetchOpType);
-  }
-
-  std::set<std::string> dependent_vars;
-  std::vector<bool> should_run;
-  for (auto op_iter = ops.rbegin(); op_iter != ops.rend(); ++op_iter) {
-    auto& op_desc = *op_iter;
-
-    bool found_dependent_vars = false;
-    for (auto& var : op_desc.outputs()) {
-      for (auto& argu : var.arguments()) {
-        if (dependent_vars.count(argu) != 0) {
-          found_dependent_vars = true;
-        }
-      }
-    }
-
-    if (op_desc.type() == kFetchOpType || found_dependent_vars) {
-      // erase its output to the dependency graph
-      for (auto& var : op_desc.outputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.erase(argu);
-        }
-      }
-
-      // insert its input to the dependency graph
-      for (auto& var : op_desc.inputs()) {
-        for (auto& argu : var.arguments()) {
-          dependent_vars.insert(argu);
-        }
-      }
-
-      should_run.push_back(true);
-    } else {
-      should_run.push_back(false);
-    }
+  for (auto& op_desc : block.ops()) {
+    auto op = paddle::framework::OpRegistry::CreateOp(
+        op_desc, const_cast<ProgramDesc*>(&pdesc));
+    op->Run(local_scope, *device);
   }
 
   // TODO(tonyyang-svail):
-  //    - check this after integration of Init
-  // PADDLE_ENFORCE(dependent_vars.empty());
-
-  // since we are traversing the ProgramDesc in reverse order
-  // we reverse the should_run vector
-  std::reverse(should_run.begin(), should_run.end());
-
-  return should_run;
+  //  - Destroy local_scope
 }
 
 }  // namespace framework
diff --git a/paddle/framework/executor.h b/paddle/framework/executor.h
index 4e3bc2c0a5..793ee954e2 100644
--- a/paddle/framework/executor.h
+++ b/paddle/framework/executor.h
@@ -40,16 +40,5 @@ class Executor {
   std::vector<platform::DeviceContext*> device_contexts_;
 };
 
-/* @Brief
- * Pruning the graph
- *
- * @param
- *  ProgramDesc
- *
- * @return
- *  vector<bool> Same size as ops. Indicates whether an op should be run.
- */
-std::vector<bool> Prune(const ProgramDesc& pdesc, int block_id);
-
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/op_registry.cc b/paddle/framework/op_registry.cc
index 504afbd5db..c2f2438edf 100644
--- a/paddle/framework/op_registry.cc
+++ b/paddle/framework/op_registry.cc
@@ -43,12 +43,13 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap(
   return ret_val;
 }
 
-std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc) {
+std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc,
+                                                   ProgramDesc* program) {
   VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs());
   VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs());
   AttributeMap attrs;
   for (auto& attr : op_desc.attrs()) {
-    attrs[attr.name()] = GetAttrValue(attr);
+    attrs[attr.name()] = GetAttrValue(attr, program);
   }
 
   return CreateOp(op_desc.type(), inputs, outputs, attrs);
diff --git a/paddle/framework/op_registry.h b/paddle/framework/op_registry.h
index 0bda87dfa1..d25b4abccb 100644
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -45,18 +45,15 @@ class Registrar {
 
 template <typename... ARGS>
 struct OperatorRegistrar : public Registrar {
-  explicit OperatorRegistrar(const char* op_type) : op_type(op_type) {
+  explicit OperatorRegistrar(const char* op_type) {
     PADDLE_ENFORCE(!OpInfoMap::Instance().Has(op_type),
                    "'%s' is registered more than once.", op_type);
     static_assert(sizeof...(ARGS) != 0,
                   "OperatorRegistrar should be invoked at least by OpClass");
+    OpInfo info;
     details::OperatorRegistrarRecursive<0, false, ARGS...>(op_type, &info);
     OpInfoMap::Instance().Insert(op_type, info);
   }
-
-  const char* op_type;
-
-  OpInfo info;
 };
 
 class OpRegistry {
@@ -77,7 +74,8 @@ class OpRegistry {
                                                 const VariableNameMap& outputs,
                                                 AttributeMap attrs);
 
-  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc);
+  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc,
+                                                ProgramDesc* program);
 
   static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };
diff --git a/paddle/framework/op_registry_test.cc b/paddle/framework/op_registry_test.cc
index b860fe6cac..6289125d7c 100644
--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -74,7 +74,7 @@ TEST(OpRegistry, CreateOp) {
   attr->set_type(paddle::framework::AttrType::FLOAT);
   attr->set_f(scale);
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::framework::Scope scope;
   paddle::platform::CPUDeviceContext dev_ctx;
   op->Run(scope, dev_ctx);
@@ -95,7 +95,7 @@ TEST(OpRegistry, IllegalAttr) {
 
   bool caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "larger_than check fail";
@@ -115,7 +115,7 @@ TEST(OpRegistry, DefaultValue) {
 
   ASSERT_TRUE(op_desc.IsInitialized());
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::framework::Scope scope;
   paddle::platform::CPUDeviceContext dev_ctx;
   op->Run(scope, dev_ctx);
@@ -131,7 +131,7 @@ TEST(OpRegistry, CustomChecker) {
   // attr 'test_attr' is not set
   bool caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "Attribute 'test_attr' is required!";
@@ -149,7 +149,7 @@ TEST(OpRegistry, CustomChecker) {
   attr->set_i(3);
   caught = false;
   try {
-    paddle::framework::OpRegistry::CreateOp(op_desc);
+    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   } catch (paddle::platform::EnforceNotMet err) {
     caught = true;
     std::string msg = "'test_attr' must be even!";
@@ -166,7 +166,7 @@ TEST(OpRegistry, CustomChecker) {
   attr->set_name("test_attr");
   attr->set_type(paddle::framework::AttrType::INT);
   attr->set_i(4);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   paddle::platform::CPUDeviceContext dev_ctx;
   paddle::framework::Scope scope;
   op->Run(scope, dev_ctx);
diff --git a/paddle/framework/operator_test.cc b/paddle/framework/operator_test.cc
index d7890ac8d0..c358f1a2b6 100644
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -83,7 +83,7 @@ TEST(OperatorBase, all) {
   paddle::platform::CPUDeviceContext device_context;
   paddle::framework::Scope scope;
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   scope.Var("OUT1");
   ASSERT_EQ(paddle::framework::op_run_num, 0);
   op->Run(scope, device_context);
@@ -208,7 +208,7 @@ TEST(OpKernel, all) {
   paddle::platform::CPUDeviceContext cpu_device_context;
   paddle::framework::Scope scope;
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0);
   op->Run(scope, cpu_device_context);
   ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1);
@@ -244,7 +244,7 @@ TEST(OpKernel, multi_inputs) {
   scope.Var("y0")->GetMutable<Tensor>();
   scope.Var("y1")->GetMutable<Tensor>();
 
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
   op->Run(scope, cpu_device_context);
 }
 
diff --git a/paddle/framework/program_desc.cc b/paddle/framework/program_desc.cc
index fcb7292884..df846f115a 100644
--- a/paddle/framework/program_desc.cc
+++ b/paddle/framework/program_desc.cc
@@ -18,27 +18,10 @@ limitations under the License. */
 namespace paddle {
 namespace framework {
 
-using ProgDescMap =
-    std::unordered_map<ProgramDesc *, std::unique_ptr<ProgramDescBind>>;
-static ProgDescMap *g_bind_map = nullptr;
-
-ProgramDescBind &ProgramDescBind::Instance(ProgramDesc *prog) {
-  if (g_bind_map == nullptr) {
-    g_bind_map = new ProgDescMap();
-  }
-  auto &map = *g_bind_map;
-  auto &ptr = map[prog];
-
-  if (ptr == nullptr) {
-    ptr.reset(new ProgramDescBind(prog));
-  }
-  return *ptr;
-}
-
 BlockDescBind *ProgramDescBind::AppendBlock(const BlockDescBind &parent) {
-  auto *b = prog_->add_blocks();
+  auto *b = prog_.add_blocks();
   b->set_parent_idx(parent.ID());
-  b->set_idx(prog_->blocks_size() - 1);
+  b->set_idx(prog_.blocks_size() - 1);
   blocks_.emplace_back(new BlockDescBind(this, b));
   return blocks_.back().get();
 }
@@ -47,14 +30,14 @@ ProgramDesc *ProgramDescBind::Proto() {
   for (auto &block : blocks_) {
     block->Flush();
   }
-  return prog_;
+  return &prog_;
 }
 
-ProgramDescBind::ProgramDescBind(ProgramDesc *prog) {
-  prog_ = prog;
-  for (auto &block : *prog->mutable_blocks()) {
-    blocks_.emplace_back(new BlockDescBind(this, &block));
-  }
+ProgramDescBind::ProgramDescBind() {
+  auto *block = prog_.mutable_blocks()->Add();
+  block->set_idx(0);
+  block->set_parent_idx(-1);
+  blocks_.emplace_back(new BlockDescBind(this, block));
 }
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/framework/program_desc.h b/paddle/framework/program_desc.h
index f29b1c54e7..514b62654d 100644
--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -26,7 +26,7 @@ class BlockDescBind;
 
 class ProgramDescBind {
  public:
-  static ProgramDescBind &Instance(ProgramDesc *prog);
+  ProgramDescBind();
 
   BlockDescBind *AppendBlock(const BlockDescBind &parent);
 
@@ -37,10 +37,7 @@ class ProgramDescBind {
   ProgramDesc *Proto();
 
  private:
-  explicit ProgramDescBind(ProgramDesc *prog);
-
-  // Not owned
-  ProgramDesc *prog_;
+  ProgramDesc prog_;
 
   std::vector<std::unique_ptr<BlockDescBind>> blocks_;
 
diff --git a/paddle/framework/var_type_inference_test.cc b/paddle/framework/var_type_inference_test.cc
index 87399208e9..918de1fd05 100644
--- a/paddle/framework/var_type_inference_test.cc
+++ b/paddle/framework/var_type_inference_test.cc
@@ -62,7 +62,7 @@ namespace paddle {
 namespace framework {
 
 TEST(InferVarType, sum_op) {
-  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  ProgramDescBind prog;
   auto *op = prog.Block(0)->AppendOp();
   op->SetType("sum");
   op->SetInput("X", {"test_a", "test_b", "test_c"});
@@ -83,7 +83,7 @@ TEST(InferVarType, sum_op) {
 }
 
 TEST(InferVarType, sum_op_without_infer_var_type) {
-  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  ProgramDescBind prog;
   auto *op = prog.Block(0)->AppendOp();
   op->SetType("sum_without_infer_var_type");
   op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
diff --git a/paddle/operators/batch_norm_op.md b/paddle/operators/batch_norm_op.md
new file mode 100644
index 0000000000..80948adf2b
--- /dev/null
+++ b/paddle/operators/batch_norm_op.md
@@ -0,0 +1,134 @@
+# Batch Normalization
+
+## What is batch normalization
+
+Batch normalization is a frequently used method in deep network training. It adjusts the mean and variance of a layer's output, making the data distribution easier for the next layer to train on.
+
+The principle of batch normalization can be summarized into a simple function:
+
+```
+y = ((x - E[x]) / STD[x]) * scale + bias
+```
+
+`x` is a batch of output data of a certain layer. `E[x]` and `STD[x]` are the mean and standard deviation of `x`, respectively. `scale` and `bias` are two trainable parameters. Training a batch normalization layer amounts to learning the best values of `scale` and `bias`.
+
+In our design, we use a single operator (`batch_norm_op`) to implement the whole batch normalization in C++, and wrap it as a layer in Python.
+
+## Differences with normal operators
+
+`batch_norm_op` is a single operator. However, there are a few differences between `BatchNormOp` and normal operators, which we shall take into consideration in our design.
+
+1. `batch_norm_op` shall behave differently in training and inferencing. For example, during inferencing, there is no batch data and it's impossible to compute `E[x]` and `STD[x]`, so we have to use an `estimated_mean` and an `estimated_variance` instead. This requires our framework to be able to inform operators of the current running mode (training/inferencing), so that operators can switch their behavior.
+
+2. `batch_norm_op` shall have the ability to maintain `estimated_mean` and `estimated_variance` across mini-batches. In each mini-batch, `estimated_mean` is updated by the following equations:
+
+```
+if batch_id == 0
+  estimated_mean = E[x]
+else
+  estimated_mean = estimated_mean * momentum + (1.0 - momentum) * E[x]
+```
+
+The update of `estimated_variance` is similar. `momentum` is an attribute that controls how fast `estimated_mean` is updated.
+
+## Implementation
+
+Batch normalization is designed as a single operator in C++, and then wrapped as a layer in Python.
+
+### C++
+
+As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attributes and compute kernels.
+
+#### Inputs
+
+- `x`: The input data, which is generated by the previous layer.
+- `estimated_mean`: The estimated mean of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `E[x]`.
+- `estimated_var`: The estimated standard deviation of all previous data batches. It is updated in each forward propagation and will be used in inferencing to take the role of `STD[x]`.
+- `scale`: trainable parameter 'scale'
+- `bias`: trainable parameter 'bias'
+
+#### Outputs
+
+- `y`: The output data.
+- `batch_mean`: The mean value of batch data.
+- `batch_var`: The standard deviation value of batch data.
+- `saved_mean`: Updated `estimated_mean` with current batch data. It's supposed to share the memory with input `estimated_mean`.
+- `saved_var`: Updated `estimated_var` with current batch data. It's supposed to share the memory with input `estimated_var`.
+
+#### Attributes
+
+- `is_infer`: *bool*. If true, run `batch_norm_op` in inferencing mode.
+- `use_global_est`: *bool*. If true, use `saved_mean` and `saved_var` instead of `E[x]` and `STD[x]` in training.
+- `epsilon`: *float*. The epsilon value to avoid division by zero.
+- `momentum`: *float*. Factor used in `estimated_mean` and `estimated_var` updating. The usage is shown above.
+
+#### Kernels
+
+The following graph shows the training computational process of `batch_norm_op`:
+
+<img src="./images/batch_norm_op_kernel.png" width="800"/>
+
+cuDNN provides APIs that perform this whole series of computations, so we can use them in our GPU kernel.
+
+### Python
+
+`batch_norm_op` is wrapped as a layer in Python:
+
+```python 
+def batch_norm_layer(net, 
+                     input,
+                     output, 
+                     scale, 
+                     bias, 
+                     use_global_est = False, 
+                     epsilon = 1e-6,
+                     momentum = 0.99):
+	mean_cache = scope.new_var(name = 'estimated_mean', trainable = False)
+	var_cache = scope.new_var(name = 'estimated_var', trainable = False)
+	batch_mean = scope.new_var(name = 'batch_mean')
+	batch_var = scope.new_var(name = 'batch_var')
+	batch_norm_op = Operator('batch_norm_op',
+	                         x = input,
+	                         estimated_mean = mean_cache,
+	                         estimated_var = var_cache,
+	                         scale = scale,
+	                         bias = bias,
+	                         y = output,
+	                         batch_mean = batch_mean,
+	                         batch_var = batch_var,
+	                         saved_mean = mean_cache,
+	                         saved_var = var_cache,
+	                         is_infer = False,
+	                         use_global_est = use_global_est,
+	                         epsilon = epsilon,
+	                         momentum = momentum)
+	net.append_op(batch_norm_op)
+	return output
+```
+
+Because the Python API has not been finalized, the code above should be regarded as pseudo code. There are a few key points we shall note:
+
+1. `estimated_mean` and `estimated_var` are assigned the same variables as `saved_mean` and `saved_var`, respectively, so they share the same memory. The output mean and variance values (`saved_mean` and `saved_var`) of a certain batch will be the inputs (`estimated_mean` and `estimated_var`) of the next batch.
+
+2. `is_infer` decides whether `batch_norm_op` will run in training mode or inferencing mode. However, a network may contain both training and inferencing parts, and the user may switch `batch_norm_op`'s running mode in a Python `for` loop like this:
+
+```python
+for pass_id in range(PASS_NUM):
+    # ...
+    net.train()  # run training model
+    if pass_id % 100 == 0:
+        net.infer(test_image)    # run inferencing model
+    # ...
+``` 
+
+`is_infer` is an attribute. Once an operator is created, its attributes cannot be changed. This suggests that we shall maintain two `batch_norm_op`s in the model: one whose `is_infer` is `True` (we call it `infer_batch_norm_op`) and another whose `is_infer` is `False` (we call it `train_batch_norm_op`). They share all parameters and variables, but are placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one going through `train_batch_norm_op` and the other going through `infer_batch_norm_op`:
+
+<div align=center>
+<img src="./images/batch_norm_fork.png" width="500"/>
+</div>
+
+As shown in the above graph, the net forks before `batch_norm_op` and never merges again. All the operators after `batch_norm_op` are duplicated.
+
+When the net runs in training mode, the end of the left branch will be set as the running target, so the dependency tracking process will ignore the right branch automatically. When the net runs in inferencing mode, the process is reversed.
+
+How to set a target is related to the Python API design, so I will leave it open here for further discussion.
diff --git a/paddle/operators/dynamic_recurrent_op_test.cc b/paddle/operators/dynamic_recurrent_op_test.cc
index 83a5ba36d9..36f405568d 100644
--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
@@ -51,7 +51,7 @@ class DynamicRecurrentOpTestHelper : public ::testing::Test {
     CreateGlobalVariables();
 
     auto op_desc = CreateOpDesc();
-    op = paddle::framework::OpRegistry::CreateOp(op_desc);
+    op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
     dop = dynamic_cast<DynamicRecurrentOp*>(op.get());
     InitCacheManually();
     InitStepNet();
diff --git a/paddle/operators/images/batch_norm_fork.dot b/paddle/operators/images/batch_norm_fork.dot
new file mode 100644
index 0000000000..4bc47713cb
--- /dev/null
+++ b/paddle/operators/images/batch_norm_fork.dot
@@ -0,0 +1,25 @@
+digraph ImageBatchNormForkGraph {
+  subgraph cluster_before {
+    Prev [label="...", shape=plaintext];
+    Rnn [label="rnn_op", shape=box];
+    BatchNorm [label="batch_norm_op", shape=box];
+    Fc [label="fc_op", shape=box];
+    After [label="...", shape=plaintext];
+    Prev -> Rnn -> BatchNorm -> Fc -> After;
+    label="original";
+  }
+
+  subgraph cluster_after {
+    Prev2 [label="...", shape=plaintext];
+    Rnn2 [label="rnn_op", shape=box];
+    BatchNorm2_1 [label="train_batch_norm_op", shape=box];
+    BatchNorm2_2 [label="infer_batch_norm_op", shape=box];
+    Fc2_1 [label="fc_op", shape=box];
+    Fc2_2 [label="fc_op", shape=box];
+    After2_1 [label="...", shape=plaintext];
+    After2_2 [label="...", shape=plaintext];
+    Prev2 -> Rnn2 -> BatchNorm2_1 -> Fc2_1 -> After2_1;
+    Rnn2 -> BatchNorm2_2 ->Fc2_2 ->After2_2
+    label="forked";
+  }
+}
diff --git a/paddle/operators/images/batch_norm_fork.png b/paddle/operators/images/batch_norm_fork.png
new file mode 100644
index 0000000000..aded62bce5
Binary files /dev/null and b/paddle/operators/images/batch_norm_fork.png differ
diff --git a/paddle/operators/images/batch_norm_op_kernel.png b/paddle/operators/images/batch_norm_op_kernel.png
new file mode 100644
index 0000000000..a99ce81ff3
Binary files /dev/null and b/paddle/operators/images/batch_norm_op_kernel.png differ
diff --git a/paddle/operators/math/math_function.cc b/paddle/operators/math/math_function.cc
index 77a1e22b41..aad1357598 100644
--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
@@ -130,6 +130,87 @@ void matmul<platform::CPUPlace, double>(
       matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
 
+#ifdef PADDLE_USE_MKLML
+// Use cblas_{s,d}gemm_batch if available: run with 1 group of size batchCount.
+template <>
+void batched_gemm<platform::CPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  auto a_array = std::vector<const float*>(batchCount);
+  auto b_array = std::vector<const float*>(batchCount);
+  auto c_array = std::vector<float*>(batchCount);
+  for (int k = 0; k < batchCount; ++k) {
+    a_array[k] = &A[k * strideA];
+    b_array[k] = &B[k * strideB];
+    c_array[k] = &C[k * M * N];
+  }
+  cblas_sgemm_batch(CblasRowMajor, &transA, &transB, &M, &N, &K, &alpha,
+                    a_array.data(), &lda, b_array.data(), &ldb, &beta,
+                    c_array.data(), &ldc, 1 /* group_count */, &batchCount);
+}
+
+template <>
+void batched_gemm<platform::CPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  auto a_array = std::vector<const double*>(batchCount);
+  auto b_array = std::vector<const double*>(batchCount);
+  auto c_array = std::vector<double*>(batchCount);
+  for (int k = 0; k < batchCount; ++k) {
+    a_array[k] = &A[k * strideA];
+    b_array[k] = &B[k * strideB];
+    c_array[k] = &C[k * M * N];
+  }
+  cblas_dgemm_batch(CblasRowMajor, &transA, &transB, &M, &N, &K, &alpha,
+                    a_array.data(), &lda, b_array.data(), &ldb, &beta,
+                    c_array.data(), &ldc, 1 /* group_count */, &batchCount);
+}
+#else
+// The below is a naive but correct serial implementation that just loops
+// over the batch dimension. This is a fallback for when the batched gemm
+// functions of Intel MKL are not available. In the future, this computation
+// should be parallelized.
+template <>
+void batched_gemm<platform::CPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  for (int k = 0; k < batchCount; ++k) {
+    const float* Ak = &A[k * strideA];
+    const float* Bk = &B[k * strideB];
+    float* Ck = &C[k * M * N];
+    gemm<platform::CPUPlace, float>(context, transA, transB, M, N, K, alpha, Ak,
+                                    Bk, beta, Ck);
+  }
+}
+
+template <>
+void batched_gemm<platform::CPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  for (int k = 0; k < batchCount; ++k) {
+    const double* Ak = &A[k * strideA];
+    const double* Bk = &B[k * strideB];
+    double* Ck = &C[k * M * N];
+    gemm<platform::CPUPlace, double>(context, transA, transB, M, N, K, alpha,
+                                     Ak, Bk, beta, Ck);
+  }
+}
+#endif
+
 template struct SetConstant<platform::CPUPlace, float>;
 
 }  // namespace math
diff --git a/paddle/operators/math/math_function.cu b/paddle/operators/math/math_function.cu
index 7fbc03acf2..5583683c6e 100644
--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
@@ -155,6 +155,54 @@ void matmul<platform::GPUPlace, double>(
       matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
 
+template <>
+void batched_gemm<platform::GPUPlace, float>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const float alpha, const float* A, const float* B, const float beta,
+    float* C, const int batchCount, const int strideA, const int strideB) {
+  // Note that cuBLAS follows Fortran (column-major) order, so the argument
+  // order is different from the cblas convention.
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  cublasOperation_t cuTransA =
+      (transA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  cublasOperation_t cuTransB =
+      (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  const int strideC = M * N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasSgemmStridedBatched(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransB, cuTransA, N, M, K, &alpha, B, ldb, strideB, A, lda, strideA,
+      &beta, C, ldc, strideC, batchCount));
+}
+
+template <>
+void batched_gemm<platform::GPUPlace, double>(
+    const platform::DeviceContext& context, const CBLAS_TRANSPOSE transA,
+    const CBLAS_TRANSPOSE transB, const int M, const int N, const int K,
+    const double alpha, const double* A, const double* B, const double beta,
+    double* C, const int batchCount, const int strideA, const int strideB) {
+  // Note that cuBLAS follows Fortran (column-major) order, so the argument
+  // order is different from the cblas convention.
+  int lda = (transA == CblasNoTrans) ? K : M;
+  int ldb = (transB == CblasNoTrans) ? N : K;
+  int ldc = N;
+  cublasOperation_t cuTransA =
+      (transA == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  cublasOperation_t cuTransB =
+      (transB == CblasNoTrans) ? CUBLAS_OP_N : CUBLAS_OP_T;
+  const int strideC = M * N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasDgemmStridedBatched(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransB, cuTransA, N, M, K, &alpha, B, ldb, strideB, A, lda, strideA,
+      &beta, C, ldc, strideC, batchCount));
+}
+
 template struct SetConstant<platform::GPUPlace, float>;
 
 }  // namespace math
diff --git a/paddle/operators/math/math_function.h b/paddle/operators/math/math_function.h
index 6f92d83aab..9777ebfd15 100644
--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
@@ -63,7 +63,7 @@ namespace math {
 
 // Support continuous memory now
 // If transA = N, and transB = N
-// Then matrixA: M * K, matrixB: K * N matrixC : M * N
+// Then matrixA: M * K, matrixB: K * N, matrixC : M * N
 // For more detailed info, please refer to
 // http://www.netlib.org/lapack/explore-html/d4/de2/sgemm_8f.html
 template <typename Place, typename T>
@@ -85,6 +85,14 @@ void matmul(const platform::DeviceContext& context,
             const framework::Tensor& matrix_b, bool trans_b, T alpha,
             framework::Tensor* matrix_out, T beta);
 
+// Batched gemm
+template <typename Place, typename T>
+void batched_gemm(const platform::DeviceContext& context,
+                  const CBLAS_TRANSPOSE transA, const CBLAS_TRANSPOSE transB,
+                  const int M, const int N, const int K, const T alpha,
+                  const T* A, const T* B, const T beta, T* C,
+                  const int batchCount, const int strideA, const int strideB);
+
 template <typename Place, typename T>
 struct SetConstant {
   void operator()(const platform::DeviceContext& context,
diff --git a/paddle/operators/math/matmul.h b/paddle/operators/math/matmul.h
new file mode 100644
index 0000000000..6ba9a0ba9a
--- /dev/null
+++ b/paddle/operators/math/matmul.h
@@ -0,0 +1,124 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/operators/math/math_function.h"
+
+namespace paddle {
+namespace operators {
+namespace math {
+
+// Implements the logic of numpy matmul:
+// https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html
+//
+// but allowing also for a, b to be transposed
+//
+// Both a & b can be 1- to 3-dimensional. Higher rank tensors are not supported
+// yet.
+template <typename Place, typename T>
+class MatMulFunctor {
+ public:
+  void operator()(const platform::DeviceContext& context,
+                  const framework::Tensor& a, bool trans_a,
+                  const framework::Tensor& b, bool trans_b, T alpha,
+                  framework::Tensor* out, T beta) {
+    auto dim_a = a.dims();
+    auto dim_b = b.dims();
+
+    PADDLE_ENFORCE(a.place() == b.place() && b.place() == out->place(),
+                   "Tensors must all be in the same place.");
+    PADDLE_ENFORCE_GE(dim_a.size(), 1,
+                      "Input tensor a must be at least 1-dimensional.");
+    PADDLE_ENFORCE_GE(dim_b.size(), 1,
+                      "Input tensor b must be at least 1-dimensional.");
+    PADDLE_ENFORCE_LE(dim_a.size(), 3,
+                      "Input tensor a must be at most 3-dimensional.");
+    PADDLE_ENFORCE_LE(dim_b.size(), 3,
+                      "Input tensor b must be at most 3-dimensional.");
+
+    int M = 0, N = 0, kA = 0, kB = 0, batchCountA = 0, batchCountB = 0,
+        strideA = 0, strideB = 0;
+
+    switch (dim_a.size()) {
+      case 1:
+        // similar to np.matmul:
+        // prepend dimension 1 (no transpose) or append dimension 1 (transpose)
+        M = trans_a ? dim_a[0] : 1;
+        kA = trans_a ? 1 : dim_a[0];
+        break;
+      case 2:
+        M = trans_a ? dim_a[1] : dim_a[0];
+        kA = trans_a ? dim_a[0] : dim_a[1];
+        break;
+      case 3:
+        batchCountA = dim_a[0];
+        M = trans_a ? dim_a[2] : dim_a[1];
+        kA = trans_a ? dim_a[1] : dim_a[2];
+        strideA = M * kA;
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (dim_b.size()) {
+      case 1:
+        // similar to np.matmul:
+        // append dimension 1 (no transpose) or prepend dimension 1 (transpose)
+        kB = trans_b ? 1 : dim_b[0];
+        N = trans_b ? dim_b[0] : 1;
+        break;
+      case 2:
+        kB = trans_b ? dim_b[1] : dim_b[0];
+        N = trans_b ? dim_b[0] : dim_b[1];
+        break;
+      case 3:
+        batchCountB = dim_b[0];
+        kB = trans_b ? dim_b[2] : dim_b[1];
+        N = trans_b ? dim_b[1] : dim_b[2];
+        strideB = kB * N;
+        break;
+      default:
+        assert(false);
+    }
+
+    PADDLE_ENFORCE_EQ(
+        kA, kB,
+        "First matrix's width must be equal to second matrix's height.");
+    if (batchCountA && batchCountB) {
+      PADDLE_ENFORCE_EQ(
+          batchCountA, batchCountB,
+          "When input tensors a and b are both batched, they must have the "
+          "same batch dimension.");
+    }
+    int batchCount = std::max(batchCountA, batchCountB);
+
+    CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+    CBLAS_TRANSPOSE transB = (trans_b == false) ? CblasNoTrans : CblasTrans;
+
+    if (!batchCount) {
+      // regular matrix multiplication
+      gemm<Place, T>(context, transA, transB, M, N, kA, alpha, a.data<T>(),
+                     b.data<T>(), beta, out->data<T>());
+    } else {
+      // batched matrix multiplication
+      batched_gemm<Place, T>(context, transA, transB, M, N, kA, alpha,
+                             a.data<T>(), b.data<T>(), beta, out->data<T>(),
+                             batchCount, strideA, strideB);
+    }
+  }
+};
+
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
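Note (illustrative, not part of the patch): `MatMulFunctor` leaves `strideA` or `strideB` at 0 when the corresponding input is not rank-3, and passes both strides on to `batched_gemm`. Presumably a stride of 0 makes every batch entry reuse the same matrix. The sketch below shows that idea with a naive reference loop on row-major data, without transposition. `NaiveBatchedGemm` is a hypothetical helper written for this note; it is not the PaddlePaddle `batched_gemm`, whose implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference for a strided batched GEMM on row-major data (no transpose):
//   C[b] = alpha * A[b] * B[b] + beta * C[b]   for b in [0, batch_count)
// A stride of 0 means the same matrix is reused for every batch entry.
// Hypothetical helper for illustration only.
template <typename T>
void NaiveBatchedGemm(int M, int N, int K, T alpha, const T* A, const T* B,
                      T beta, T* C, int batch_count, int stride_a,
                      int stride_b) {
  for (int b = 0; b < batch_count; ++b) {
    const T* a = A + static_cast<std::ptrdiff_t>(b) * stride_a;
    const T* bm = B + static_cast<std::ptrdiff_t>(b) * stride_b;
    T* c = C + static_cast<std::ptrdiff_t>(b) * M * N;
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
        T acc = T(0);
        for (int k = 0; k < K; ++k) {
          acc += a[i * K + k] * bm[k * N + j];
        }
        c[i * N + j] = alpha * acc + beta * c[i * N + j];
      }
    }
  }
}
```

Calling it with `stride_a = M * K` and `stride_b = 0` multiplies each batch entry of `A` by one shared `B`, which corresponds to the `X: [B, M, K], Y: [K, N]` case.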
diff --git a/paddle/operators/matmul_op.cc b/paddle/operators/matmul_op.cc
new file mode 100644
index 0000000000..5ecbee3b41
--- /dev/null
+++ b/paddle/operators/matmul_op.cc
@@ -0,0 +1,208 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/matmul_op.h"
+
+namespace paddle {
+namespace operators {
+
+using framework::Tensor;
+
+class MatMulOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext* context) const override {
+    PADDLE_ENFORCE(context->HasInput("X"),
+                   "Input(X) of MatMulOp should not be null.");
+    PADDLE_ENFORCE(context->HasInput("Y"),
+                   "Input(Y) of MatMulOp should not be null.");
+    PADDLE_ENFORCE(context->HasOutput("Out"),
+                   "Output(Out) of MatMulOp should not be null.");
+
+    auto dim_x = context->GetInputDim("X");
+    auto dim_y = context->GetInputDim("Y");
+    bool transpose_x = context->Attrs().Get<bool>("transpose_X");
+    bool transpose_y = context->Attrs().Get<bool>("transpose_Y");
+
+    PADDLE_ENFORCE_GE(dim_x.size(), 1,
+                      "Input tensor X must be at least 1-dimensional.");
+    PADDLE_ENFORCE_GE(dim_y.size(), 1,
+                      "Input tensor Y must be at least 1-dimensional.");
+    PADDLE_ENFORCE_LE(dim_x.size(), 3,
+                      "Input tensor X must be at most 3-dimensional.");
+    PADDLE_ENFORCE_LE(dim_y.size(), 3,
+                      "Input tensor Y must be at most 3-dimensional.");
+
+    int M = 0, N = 0, KX = 0, KY = 0, batchCountX = 0, batchCountY = 0;
+    bool remove_initial_dim = false, remove_final_dim = false;
+
+    switch (dim_x.size()) {
+      case 1:
+        if (transpose_x) {
+          M = dim_x[0];
+          KX = 1;
+        } else {
+          M = 1;
+          KX = dim_x[0];
+          remove_initial_dim = true;
+        }
+        break;
+      case 2:
+        M = transpose_x ? dim_x[1] : dim_x[0];
+        KX = transpose_x ? dim_x[0] : dim_x[1];
+        break;
+      case 3:
+        batchCountX = dim_x[0];
+        M = transpose_x ? dim_x[2] : dim_x[1];
+        KX = transpose_x ? dim_x[1] : dim_x[2];
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (dim_y.size()) {
+      case 1:
+        if (transpose_y) {
+          N = dim_y[0];
+          KY = 1;
+        } else {
+          N = 1;
+          KY = dim_y[0];
+          remove_final_dim = true;
+        }
+        break;
+      case 2:
+        KY = transpose_y ? dim_y[1] : dim_y[0];
+        N = transpose_y ? dim_y[0] : dim_y[1];
+        break;
+      case 3:
+        batchCountY = dim_y[0];
+        KY = transpose_y ? dim_y[2] : dim_y[1];
+        N = transpose_y ? dim_y[1] : dim_y[2];
+        break;
+      default:
+        assert(false);
+    }
+
+    PADDLE_ENFORCE_EQ(
+        KX, KY,
+        "First matrix's width must be equal to second matrix's height.");
+    if (batchCountX && batchCountY) {
+      PADDLE_ENFORCE_EQ(
+          batchCountX, batchCountY,
+          "When Input(X) and Input(Y) are both three dimensional, they "
+          "must have the same batch dimension.");
+    }
+    int batchCount = std::max(batchCountX, batchCountY);
+
+    std::vector<int64_t> dim_out;
+    if (batchCount) {
+      dim_out.push_back(batchCount);
+    }
+    if (!remove_initial_dim) {
+      dim_out.push_back(M);
+    }
+    if (!remove_final_dim) {
+      dim_out.push_back(N);
+    }
+    if (dim_out.size() == 0) {
+      // We don't support 0-dimensional Tensors (scalars), so instead
+      // treat the output as a Tensor of shape (1, ) in this case.
+      dim_out.push_back(1);
+    }
+    context->SetOutputDim("Out", framework::make_ddim(dim_out));
+    context->ShareLoD("X", /*->*/ "Out");
+  }
+};
+
+class MatMulOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  MatMulOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The first input of MatMul op");
+    AddInput("Y", "The second input of MatMul op");
+    AddOutput("Out", "The output of MatMul op");
+    AddAttr<bool>("transpose_X",
+                  R"DOC(If true, use the transpose of `X`.
+        )DOC")
+        .SetDefault(false);
+    AddAttr<bool>("transpose_Y",
+                  R"DOC(If true, use the transpose of `Y`.
+        )DOC")
+        .SetDefault(false);
+    AddComment(R"DOC(
+The MatMul operator is used to perform (batched) matrix multiplication
+over the last two dimensions of the input tensors `X` and `Y`.
+
+If a transpose flag is specified, the last two dimensions of the
+tensor are transposed. If the tensor is rank-1 of shape [D], then
+for `X` it is treated as [1, D] in nontransposed form and as [D, 1]
+in transposed form, whereas for `Y` it is the opposite: It is treated
+as [D, 1] in nontransposed form and as [1, D] in transposed form.
+
+Examples without transpose:
+- X: [K], Y: [K] => Out: [1]
+- X: [K], Y: [K, N] => Out: [N]
+- X: [B, M, K], Y: [K] => Out: [B, M]
+- X: [M, K], Y: [B, K, N] => Out: [B, M, N]
+- X: [B, M, K], Y: [B, K, N] => Out: [B, M, N]
+
+The behavior is designed to be similar to the `numpy.matmul` function.
+The differences are:
+- Currently only rank 1 to rank 3 input tensors are supported.
+- We add `transpose_X` and `transpose_Y` flags.
+
+Both inputs `X` and `Y` may optionally carry LoD (Level of Details)
+information, but the output shares the LoD only with input `X`.
+)DOC");
+  }
+};
+
+class MatMulOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext* context) const override {
+    PADDLE_ENFORCE(context->HasInput("X"), "Input(X) should not be null");
+    PADDLE_ENFORCE(context->HasInput("Y"), "Input(Y) should not be null");
+    PADDLE_ENFORCE(context->HasInput(framework::GradVarName("Out")),
+                   "Input(Out@GRAD) should not be null");
+    auto x_dims = context->GetInputDim("X");
+    auto y_dims = context->GetInputDim("Y");
+
+    auto x_grad_name = framework::GradVarName("X");
+    auto y_grad_name = framework::GradVarName("Y");
+
+    if (context->HasOutput(x_grad_name)) {
+      context->SetOutputDim(x_grad_name, x_dims);
+    }
+    if (context->HasOutput(y_grad_name)) {
+      context->SetOutputDim(y_grad_name, y_dims);
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(matmul, ops::MatMulOp, ops::MatMulOpMaker, matmul_grad,
+            ops::MatMulOpGrad);
+REGISTER_OP_CPU_KERNEL(matmul,
+                       ops::MatMulKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    matmul_grad, ops::MatMulGradKernel<paddle::platform::CPUPlace, float>);
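Note (illustrative, not part of the patch): the output-shape rules in `MatMulOp::InferShape` can be exercised without the framework. The sketch below restates them over plain `std::vector<int64_t>` shapes. `InferMatMulOutShape` is a hypothetical name, and the function assumes its inputs are already valid (rank 1 to 3, matching inner and batch dimensions), so it performs none of the `PADDLE_ENFORCE` checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-alone restatement of the matmul output-shape rules.
// Assumes valid inputs: rank 1 to 3, matching inner and batch dimensions.
std::vector<int64_t> InferMatMulOutShape(const std::vector<int64_t>& x,
                                         const std::vector<int64_t>& y,
                                         bool transpose_x, bool transpose_y) {
  int64_t batch = 0, m = 0, n = 0;
  bool drop_m = false, drop_n = false;

  if (x.size() == 1) {
    // [D] acts as [1, D]; the prepended 1 is dropped from the output.
    // With transpose_x it acts as [D, 1] and M = D is kept.
    m = transpose_x ? x[0] : 1;
    drop_m = !transpose_x;
  } else {
    const std::size_t r = x.size();
    if (r == 3) batch = x[0];
    m = transpose_x ? x[r - 1] : x[r - 2];
  }

  if (y.size() == 1) {
    // [D] acts as [D, 1]; the appended 1 is dropped from the output.
    // With transpose_y it acts as [1, D] and N = D is kept.
    n = transpose_y ? y[0] : 1;
    drop_n = !transpose_y;
  } else {
    const std::size_t r = y.size();
    if (r == 3) batch = std::max(batch, y[0]);
    n = transpose_y ? y[r - 2] : y[r - 1];
  }

  std::vector<int64_t> out;
  if (batch) out.push_back(batch);
  if (!drop_m) out.push_back(m);
  if (!drop_n) out.push_back(n);
  if (out.empty()) out.push_back(1);  // no 0-dimensional tensors: use shape (1,)
  return out;
}
```

The five "Examples without transpose" listed in the operator comment map directly onto calls to this function with both flags set to `false`.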
diff --git a/paddle/operators/matmul_op.cu b/paddle/operators/matmul_op.cu
new file mode 100644
index 0000000000..b7e66382f0
--- /dev/null
+++ b/paddle/operators/matmul_op.cu
@@ -0,0 +1,21 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/matmul_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(matmul,
+                       ops::MatMulKernel<paddle::platform::GPUPlace, float>);
+REGISTER_OP_GPU_KERNEL(
+    matmul_grad, ops::MatMulGradKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/matmul_op.h b/paddle/operators/matmul_op.h
new file mode 100644
index 0000000000..8ae54e1eec
--- /dev/null
+++ b/paddle/operators/matmul_op.h
@@ -0,0 +1,228 @@
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+
+#include "paddle/framework/op_registry.h"
+#include "paddle/operators/math/matmul.h"
+#include "paddle/operators/transpose_op.h"
+
+namespace paddle {
+namespace operators {
+namespace matmul_detail {
+
+using Tensor = framework::Tensor;
+using DDim = framework::DDim;
+using framework::make_ddim;
+using framework::vectorize;
+
+template <typename Place, typename T>
+class MatMulKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    const Tensor& x = *context.Input<Tensor>("X");
+    const Tensor& y = *context.Input<Tensor>("Y");
+    Tensor* out = context.Output<Tensor>("Out");
+    out->mutable_data<T>(context.GetPlace());
+    bool transpose_x = context.Attr<bool>("transpose_X");
+    bool transpose_y = context.Attr<bool>("transpose_Y");
+
+    math::MatMulFunctor<Place, T>()(context.device_context(), x, transpose_x, y,
+                                    transpose_y, T(1), out, T(0));
+  }
+};
+
+template <typename T>
+inline Tensor Reshape(const Tensor& input, const DDim& dims) {
+  Tensor output;
+  output.ShareDataWith<T>(input);
+  output.Resize(dims);
+  return output;
+}
+
+// Reshape a rank-3 tensor from P x M x N to (P * M) x N.
+// Identity op if the tensor is not of rank 3.
+template <typename T>
+Tensor CombineBatchAndM(const Tensor& input) {
+  Tensor output;
+  output.ShareDataWith<T>(input);
+  auto in_dims = input.dims();
+  if (in_dims.size() == 3) {
+    std::vector<int64_t> out_dims = {in_dims[0] * in_dims[1], in_dims[2]};
+    output.Resize(make_ddim(out_dims));
+  }
+  return output;
+}
+
+// Reshape a rank-3 tensor from P x M x N to M x (P * N).
+// (Warning: This requires transposing data and writes into new memory.)
+// Identity op if the tensor is not of rank 3.
+template <typename Place, typename T>
+Tensor CombineBatchAndN(const framework::ExecutionContext& context,
+                        const Tensor& input) {
+  Tensor output;
+  auto in_dims = input.dims();
+  if (in_dims.size() == 3) {
+    output.Resize(in_dims);
+    output.mutable_data<T>(context.GetPlace());
+    EigenTranspose<Place, T, 3>(context, input, output, {1, 0, 2});
+    std::vector<int64_t> out_dims = {in_dims[1], in_dims[0] * in_dims[2]};
+    output.Resize(make_ddim(out_dims));
+  } else {
+    output.ShareDataWith<T>(input);
+  }
+  return output;
+}
+
+// Using the dimensional constraints of matrix multiplication, it is
+// straightforward to verify the following table of gradients for the case
+// where X and Y are both matrices.
+//
+// transpose_X | False    | True     | False    | True
+// transpose_Y | False    | False    | True     | True
+// ------------+----------+----------+----------+-----------
+//        dX = | dOut Y^T | Y dOut^T | dOut Y   | Y^T dOut^T
+//        dY = | X^T dOut | X dOut   | dOut^T X | dOut^T X^T
+//
+// When X is a vector of size K, we treat it instead as a matrix of shape
+// (1, K). Similarly, when Y is a vector of size K, we treat it instead as
+// a matrix of shape (K, 1).
+//
+// When X and Y are both 3-dimensional tensors, the first dimension
+// (the batch dimension) can be ignored and the exact same formulas apply
+// as for two matrices.
+//
+// Finally, when, e.g., X is a 3-dimensional tensor but Y is a matrix, we end
+// up with formulas like
+//
+//   dY_{ij} = \sum_{p, m} X_{pmi} dOut_{pmj}
+//
+// To handle this sort of scenario, we reshape X : P x M x K, dOut: P x M x N
+// to X: (P * M) x K, dOut: (P * M) x N.
+template <typename Place, typename T>
+class MatMulGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& context) const override {
+    const Tensor& x = *context.Input<Tensor>("X");
+    const Tensor& y = *context.Input<Tensor>("Y");
+    const Tensor& dout = *context.Input<Tensor>(framework::GradVarName("Out"));
+    Tensor* dx = context.Output<Tensor>(framework::GradVarName("X"));
+    Tensor* dy = context.Output<Tensor>(framework::GradVarName("Y"));
+    bool transpose_x = context.Attr<bool>("transpose_X");
+    bool transpose_y = context.Attr<bool>("transpose_Y");
+
+    std::vector<int64_t> x_dims = vectorize(x.dims());
+    std::vector<int64_t> y_dims = vectorize(y.dims());
+
+    // If X is a vector, reshape it to a matrix.
+    if (x_dims.size() == 1) {
+      x_dims.insert(x_dims.begin(), 1);
+    }
+
+    // If Y is a vector, reshape it to a matrix.
+    if (y_dims.size() == 1) {
+      y_dims.push_back(1);
+    }
+
+    // Fix the dOut dimensions.
+    int M = 0, N = 0, batchCountX = 0, batchCountY = 0;
+
+    switch (x_dims.size()) {
+      case 2:
+        M = transpose_x ? x_dims[1] : x_dims[0];
+        break;
+      case 3:
+        batchCountX = x_dims[0];
+        M = transpose_x ? x_dims[2] : x_dims[1];
+        break;
+      default:
+        assert(false);
+    }
+
+    switch (y_dims.size()) {
+      case 2:
+        N = transpose_y ? y_dims[0] : y_dims[1];
+        break;
+      case 3:
+        batchCountY = y_dims[0];
+        N = transpose_y ? y_dims[1] : y_dims[2];
+        break;
+      default:
+        assert(false);
+    }
+    if (batchCountX && batchCountY) {
+      PADDLE_ENFORCE_EQ(
+          batchCountX, batchCountY,
+          "When Input(X) and Input(Y) are both three dimensional, they "
+          "must have the same batch dimension.");
+    }
+    int batchCount = std::max(batchCountX, batchCountY);
+    std::vector<int64_t> dout_dims = {M, N};
+    if (batchCount) {
+      dout_dims.insert(dout_dims.begin(), batchCount);
+    }
+    Tensor X = Reshape<T>(x, make_ddim(x_dims));
+    Tensor Y = Reshape<T>(y, make_ddim(y_dims));
+    Tensor dOut = Reshape<T>(dout, make_ddim(dout_dims));
+
+    if (dx) {
+      dx->mutable_data<T>(context.GetPlace());
+      const Tensor& dOut_for_dX =
+          (x_dims.size() == 2 && y_dims.size() == 3)
+              ? CombineBatchAndN<Place, T>(context, dOut)
+              : dOut;
+      if (x_dims.size() == 2 && y_dims.size() == 3) {
+        Y = transpose_y ? CombineBatchAndM<T>(Y)
+                        : CombineBatchAndN<Place, T>(context, Y);
+      }
+      if (transpose_x) {
+        math::MatMulFunctor<Place, T>()(context.device_context(), Y,
+                                        transpose_y, dOut_for_dX, transpose_x,
+                                        T(1), dx, T(0));
+      } else {
+        math::MatMulFunctor<Place, T>()(context.device_context(), dOut_for_dX,
+                                        transpose_x, Y, !transpose_y, T(1), dx,
+                                        T(0));
+      }
+    }
+
+    if (dy) {
+      dy->mutable_data<T>(context.GetPlace());
+      const Tensor& dOut_for_dY = (y_dims.size() == 2 && x_dims.size() == 3)
+                                      ? CombineBatchAndM<T>(dOut)
+                                      : dOut;
+      if (y_dims.size() == 2 && x_dims.size() == 3) {
+        X = transpose_x ? CombineBatchAndN<Place, T>(context, X)
+                        : CombineBatchAndM<T>(X);
+        dOut = CombineBatchAndM<T>(dOut);
+      }
+      if (transpose_y) {
+        math::MatMulFunctor<Place, T>()(context.device_context(), dOut_for_dY,
+                                        transpose_y, X, transpose_x, T(1), dy,
+                                        T(0));
+      } else {
+        math::MatMulFunctor<Place, T>()(context.device_context(), X,
+                                        !transpose_x, dOut_for_dY, transpose_y,
+                                        T(1), dy, T(0));
+      }
+    }
+  }
+};
+}  // namespace matmul_detail
+
+using matmul_detail::MatMulKernel;
+using matmul_detail::MatMulGradKernel;
+
+}  // namespace operators
+}  // namespace paddle
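Note (illustrative, not part of the patch): the first column of the gradient table in `matmul_op.h` (both transpose flags false) gives `dX = dOut Y^T` and `dY = X^T dOut`. The sketch below checks that column on small dense matrices. `Matrix`, `MatMul`, `GradX`, and `GradY` are hypothetical names used only for this note; they stand in for `framework::Tensor` and `math::MatMulFunctor` and do not reproduce their interfaces.

```cpp
#include <cassert>
#include <vector>

// Row-major matrix with an explicit shape; a hypothetical stand-in for Tensor.
struct Matrix {
  int rows;
  int cols;
  std::vector<float> data;
  float at(int r, int c) const { return data[r * cols + c]; }
};

// Computes op(a) * op(b), where op transposes its argument if the flag is set.
Matrix MatMul(const Matrix& a, bool trans_a, const Matrix& b, bool trans_b) {
  const int m = trans_a ? a.cols : a.rows;
  const int k = trans_a ? a.rows : a.cols;
  const int n = trans_b ? b.rows : b.cols;
  Matrix out{m, n, std::vector<float>(m * n, 0.0f)};
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int p = 0; p < k; ++p) {
        out.data[i * n + j] += (trans_a ? a.at(p, i) : a.at(i, p)) *
                               (trans_b ? b.at(j, p) : b.at(p, j));
      }
    }
  }
  return out;
}

// Gradients of Out = X * Y with both transpose flags false.
Matrix GradX(const Matrix& dout, const Matrix& y) {
  return MatMul(dout, false, y, true);  // dX = dOut * Y^T
}
Matrix GradY(const Matrix& x, const Matrix& dout) {
  return MatMul(x, true, dout, false);  // dY = X^T * dOut
}
```

Each gradient has the same shape as the input it belongs to, which matches what `MatMulOpGrad::InferShape` sets for `X@GRAD` and `Y@GRAD`.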
diff --git a/paddle/operators/momentum_op.cc b/paddle/operators/momentum_op.cc
new file mode 100644
index 0000000000..9be4d15a43
--- /dev/null
+++ b/paddle/operators/momentum_op.cc
@@ -0,0 +1,94 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/momentum_op.h"
+
+namespace paddle {
+namespace operators {
+
+class MomentumOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Param"),
+                   "Input(Param) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Grad"),
+                   "Input(Grad) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Velocity"),
+                   "Input(Velocity) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("LearningRate"),
+                   "Input(LearningRate) of Momentum should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
+                   "Output(ParamOut) of Momentum should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("VelocityOut"),
+                   "Output(VelocityOut) of Momentum should not be null.");
+
+    auto param_dim = ctx->GetInputDim("Param");
+    PADDLE_ENFORCE_EQ(
+        param_dim, ctx->GetInputDim("Grad"),
+        "Param and Grad input of MomentumOp should have the same dimension.");
+    PADDLE_ENFORCE_EQ(
+        param_dim, ctx->GetInputDim("Velocity"),
+        "Param and Velocity of MomentumOp should have the same dimension.");
+    PADDLE_ENFORCE_EQ(framework::product(ctx->GetInputDim("LearningRate")), 1,
+                      "LearningRate should be a scalar.");
+
+    ctx->SetOutputDim("ParamOut", param_dim);
+    ctx->SetOutputDim("VelocityOut", param_dim);
+  }
+};
+
+class MomentumOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  MomentumOpMaker(framework::OpProto *proto,
+                  framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Param",
+             "(Tensor, default Tensor<float>) "
+             "Input parameter that has to be updated");
+    AddInput("Grad",
+             "(Tensor, default Tensor<float>) "
+             "Input gradient of the parameter");
+    AddInput("Velocity",
+             "(Tensor, default Tensor<float>) "
+             "Input velocity (corresponding to the parameter) "
+             "that has to be updated");
+    AddInput("LearningRate",
+             "(Tensor, default Tensor<float>) "
+             "Input learning rate");
+
+    AddOutput("ParamOut", "(Tensor) Output updated parameter");
+    AddOutput("VelocityOut", "(Tensor) Output updated velocity");
+
+    AddAttr<float>("mu", "(float) Momentum coefficient");
+    AddComment(R"DOC(
+
+Momentum Algorithm (momentum).
+
+velocity = mu * velocity + gradient
+param = param - learning_rate * velocity
+
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(momentum, ops::MomentumOp, ops::MomentumOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    momentum, ops::MomentumOpKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/momentum_op.cu b/paddle/operators/momentum_op.cu
new file mode 100644
index 0000000000..efc24e795e
--- /dev/null
+++ b/paddle/operators/momentum_op.cu
@@ -0,0 +1,20 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/momentum_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    momentum, ops::MomentumOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/momentum_op.h b/paddle/operators/momentum_op.h
new file mode 100644
index 0000000000..f7a724f048
--- /dev/null
+++ b/paddle/operators/momentum_op.h
@@ -0,0 +1,55 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename Place, typename T>
+class MomentumOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto param_out = ctx.Output<framework::Tensor>("ParamOut");
+    auto velocity_out = ctx.Output<framework::Tensor>("VelocityOut");
+    auto param = ctx.Input<framework::Tensor>("Param");
+    auto velocity = ctx.Input<framework::Tensor>("Velocity");
+    auto grad = ctx.Input<framework::Tensor>("Grad");
+    auto learning_rate = ctx.Input<framework::Tensor>("LearningRate");
+
+    param_out->mutable_data<T>(ctx.GetPlace());
+    velocity_out->mutable_data<T>(ctx.GetPlace());
+
+    float mu = ctx.Attr<float>("mu");
+
+    auto p_out = framework::EigenVector<T>::Flatten(*param_out);
+    auto v_out = framework::EigenVector<T>::Flatten(*velocity_out);
+
+    auto p = framework::EigenVector<T>::Flatten(*param);
+    auto v = framework::EigenVector<T>::Flatten(*velocity);
+    auto g = framework::EigenVector<T>::Flatten(*grad);
+    auto lr = framework::EigenVector<T>::Flatten(*learning_rate);
+
+    auto place = ctx.GetEigenDevice<Place>();
+
+    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
+    v_out.device(place) = v * mu + g;
+    p_out.device(place) = p - lr.broadcast(grad_dsize) * v_out;
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
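Note (illustrative, not part of the patch): the Eigen expressions in `MomentumOpKernel` implement `velocity = mu * velocity + gradient` followed by `param = param - learning_rate * velocity`, where the second line uses the updated velocity. The sketch below writes the same update as a plain element-wise loop. `MomentumStep` is a hypothetical name, it updates its arguments in place rather than writing to separate outputs, and it takes the learning rate as a plain `float` instead of a one-element tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One momentum step, applied in place (hypothetical helper):
//   velocity = mu * velocity + grad
//   param    = param - learning_rate * velocity   (uses the updated velocity)
// All three vectors are assumed to have the same length.
void MomentumStep(std::vector<float>* param, std::vector<float>* velocity,
                  const std::vector<float>& grad, float learning_rate,
                  float mu) {
  for (std::size_t i = 0; i < param->size(); ++i) {
    (*velocity)[i] = mu * (*velocity)[i] + grad[i];
    (*param)[i] -= learning_rate * (*velocity)[i];
  }
}
```

With `mu = 0` the velocity equals the gradient and the step reduces to plain gradient descent.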
diff --git a/paddle/operators/proximal_gd_op.cc b/paddle/operators/proximal_gd_op.cc
new file mode 100644
index 0000000000..e4b014b9f5
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.cc
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/proximal_gd_op.h"
+
+namespace paddle {
+namespace operators {
+
+class ProximalGDOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Param"),
+                   "Input(Param) of ProximalGDOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Grad"),
+                   "Input(Grad) of ProximalGDOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("LearningRate"),
+                   "Input(LearningRate) of ProximalGDOp should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
+                   "Output(ParamOut) of ProximalGDOp should not be null.");
+
+    auto param_dim = ctx->GetInputDim("Param");
+    PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("Grad"),
+                      "Param and Grad must have the same dimension.");
+
+    auto lr_dim = ctx->GetInputDim("LearningRate");
+    PADDLE_ENFORCE_EQ(framework::product(lr_dim), 1,
+                      "Learning Rate should be a scalar.");
+
+    ctx->SetOutputDim("ParamOut", param_dim);
+  }
+};
+
+class ProximalGDOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  ProximalGDOpMaker(framework::OpProto *proto,
+                    framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Param",
+             "(Tensor, default Tensor<float>) "
+             "Input parameter value that has to be updated.");
+    AddInput("Grad",
+             "(Tensor, default Tensor<float>) "
+             "Input gradient of the parameter.");
+    AddInput("LearningRate",
+             "(Tensor, default Tensor<float>) "
+             "The learning rate should be a tensor of size 1.");
+
+    AddOutput("ParamOut", "(Tensor) Output updated parameter value.");
+
+    AddAttr<float>("l1",
+                   "(float, default 0.0) "
+                   "L1 regularization strength.")
+        .SetDefault(0.0f);
+    AddAttr<float>("l2",
+                   "(float, default 0.0) "
+                   "L2 regularization strength.")
+        .SetDefault(0.0f);
+    AddComment(R"DOC(
+
+Optimizer that implements the proximal gradient descent algorithm.
+
+prox_param = param - learning_rate * grad
+param = sign(prox_param) / (1 + learning_rate * l2) *
+        max { |prox_param| - learning_rate * l1 , 0 }
+
+The paper that proposed Proximal Gradient Descent:
+(http://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting.pdf)
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(proximal_gd, ops::ProximalGDOp,
+                             ops::ProximalGDOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    proximal_gd, ops::ProximalGDOpKernel<paddle::platform::CPUPlace, float>);
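Note (illustrative, not part of the patch): the update rule documented in `ProximalGDOpMaker` first takes a plain gradient step, then soft-thresholds the result by `learning_rate * l1` and scales it by `1 / (1 + learning_rate * l2)`. The sketch below applies that formula to a single scalar parameter. `ProximalGDStep` is a hypothetical name, and the function follows the documented formula rather than the kernel's Eigen code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// One proximal gradient descent step for a single scalar parameter
// (hypothetical helper following the formula in the operator comment):
//   prox_param = param - learning_rate * grad
//   param      = sign(prox_param) / (1 + learning_rate * l2) *
//                max(|prox_param| - learning_rate * l1, 0)
float ProximalGDStep(float param, float grad, float learning_rate, float l1,
                     float l2) {
  const float prox = param - learning_rate * grad;
  // sign is -1, 0, or +1.
  const float sign = static_cast<float>((prox > 0.0f) - (prox < 0.0f));
  const float shrunk = std::max(std::fabs(prox) - learning_rate * l1, 0.0f);
  return sign * shrunk / (1.0f + learning_rate * l2);
}
```

Parameters whose magnitude after the gradient step is below `learning_rate * l1` are set exactly to zero, which is how the L1 term produces sparse weights.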
diff --git a/paddle/operators/proximal_gd_op.cu b/paddle/operators/proximal_gd_op.cu
new file mode 100644
index 0000000000..26f4ebaa0f
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.cu
@@ -0,0 +1,19 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed
+under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/proximal_gd_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    proximal_gd, ops::ProximalGDOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/proximal_gd_op.h b/paddle/operators/proximal_gd_op.h
new file mode 100644
index 0000000000..bebda02041
--- /dev/null
+++ b/paddle/operators/proximal_gd_op.h
@@ -0,0 +1,64 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
+
+template <typename Place, typename T>
+class ProximalGDOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* param_out = ctx.Output<Tensor>("ParamOut");
+
+    param_out->mutable_data<T>(ctx.GetPlace());
+
+    auto grad = ctx.Input<Tensor>("Grad");
+
+    auto l1 = static_cast<T>(ctx.Attr<float>("l1"));
+    auto l2 = static_cast<T>(ctx.Attr<float>("l2"));
+
+    auto p = EigenVector<T>::Flatten(*ctx.Input<Tensor>("Param"));
+    auto g = EigenVector<T>::Flatten(*grad);
+    auto lr = EigenVector<T>::Flatten(*ctx.Input<Tensor>("LearningRate"));
+
+    auto p_out = EigenVector<T>::Flatten(*param_out);
+    auto place = ctx.GetEigenDevice<Place>();
+
+    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
+
+    auto prox_param = p - lr.broadcast(grad_dsize) * g;
+    if (l1 > 0) {
+      p_out.device(place) =
+          prox_param.sign() *
+          (((prox_param.abs() - (lr * l1).broadcast(grad_dsize))
+                .cwiseMax(T(0.0))) /
+           (1.0 + (lr * l2).broadcast(grad_dsize)));
+    } else {
+      p_out.device(place) =
+          prox_param / (1.0 + (lr * l2).broadcast(grad_dsize));
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
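The two branches of the kernel above compute the same update; the `else` branch just skips the soft-threshold when `l1` is zero. A minimal NumPy sketch, assuming dense arrays and a scalar learning rate (the name `proximal_gd_step` is illustrative and not part of the operator):

```python
import numpy as np


def proximal_gd_step(param, grad, lr, l1, l2):
    """Reference proximal GD update mirroring the two kernel branches."""
    prox_param = param - lr * grad
    if l1 > 0:
        # Soft-threshold by lr * l1, then shrink by the L2 factor.
        shrunk = np.maximum(np.abs(prox_param) - lr * l1, 0.0)
        return np.sign(prox_param) * shrunk / (1.0 + lr * l2)
    # With l1 == 0 the soft-threshold is the identity, so only the
    # L2 shrinkage remains.
    return prox_param / (1.0 + lr * l2)


param = np.array([1.0, -0.5, 0.015])
grad = np.array([0.5, 0.5, 0.1])
out = proximal_gd_step(param, grad, lr=0.1, l1=0.1, l2=0.2)
print(out)
```

The third entry ends up inside the threshold `lr * l1` after the gradient step, so it is set exactly to zero; this is the sparsity effect of the L1 term.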
diff --git a/paddle/platform/dynload/cublas.h b/paddle/platform/dynload/cublas.h
index 9d8343c0b5..6b64539b0a 100644
--- a/paddle/platform/dynload/cublas.h
+++ b/paddle/platform/dynload/cublas.h
@@ -77,6 +77,10 @@ extern void *cublas_dso_handle;
   __macro(cublasDgemmBatched);            \
   __macro(cublasCgemmBatched);            \
   __macro(cublasZgemmBatched);            \
+  __macro(cublasSgemmStridedBatched);     \
+  __macro(cublasDgemmStridedBatched);     \
+  __macro(cublasCgemmStridedBatched);     \
+  __macro(cublasZgemmStridedBatched);     \
   __macro(cublasSgetrfBatched);           \
   __macro(cublasSgetriBatched);           \
   __macro(cublasDgetrfBatched);           \
diff --git a/paddle/pybind/protobuf.cc b/paddle/pybind/protobuf.cc
index 82aae72ba9..d9647717d2 100644
--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
@@ -100,21 +100,7 @@ using namespace paddle::framework;  // NOLINT
 // Bind Methods
 void BindProgramDesc(py::module &m) {
   py::class_<ProgramDescBind>(m, "ProgramDesc", "")
-      .def_static("instance",
-                  []() -> ProgramDescBind * {
-                    return &ProgramDescBind::Instance(&GetProgramDesc());
-                  },
-                  py::return_value_policy::reference)
-      .def_static("__create_program_desc__",
-                  []() -> ProgramDescBind * {
-                    // Only used for unit-test
-                    auto *prog_desc = new ProgramDesc;
-                    auto *block = prog_desc->mutable_blocks()->Add();
-                    block->set_idx(0);
-                    block->set_parent_idx(-1);
-                    return &ProgramDescBind::Instance(prog_desc);
-                  },
-                  py::return_value_policy::reference)
+      .def(py::init<>())
       .def("append_block", &ProgramDescBind::AppendBlock,
            py::return_value_policy::reference)
       .def("append_backward",
@@ -176,8 +162,8 @@ void BindBlockDesc(py::module &m) {
            py::return_value_policy::reference)
       .def("all_vars", &BlockDescBind::AllVars,
            py::return_value_policy::reference)
-      .def("all_ops", &BlockDescBind::AllOps,
-           py::return_value_policy::reference)
+      .def("op_size", &BlockDescBind::OpSize)
+      .def("op", &BlockDescBind::Op, py::return_value_policy::reference)
       .def("serialize_to_string", [](BlockDescBind &block_desc) -> py::bytes {
         const BlockDesc *desc = block_desc.Proto();
         PADDLE_ENFORCE(desc->IsInitialized(),
diff --git a/paddle/pybind/pybind.cc b/paddle/pybind/pybind.cc
index fcae92ad99..9eb1bf4a16 100644
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -17,6 +17,7 @@ limitations under the License. */
 #include "paddle/framework/backward.h"
 #include "paddle/framework/executor.h"
 #include "paddle/framework/feed_fetch_method.h"
+#include "paddle/framework/framework.pb.h"
 #include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/selected_rows.h"
 #include "paddle/framework/tensor_array.h"
@@ -259,7 +260,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    return OpRegistry::CreateOp(desc);
+                    return OpRegistry::CreateOp(desc, nullptr);
                   })
       .def("backward",
            [](const OperatorBase &forwardOp,
@@ -363,7 +364,7 @@ All parameter, weight, gradient are variables in Paddle.
             PADDLE_ENFORCE(desc.IsInitialized(),
                            "User OpDesc is not initialized, reason %s",
                            desc.InitializationErrorString());
-            auto rnn_op = OpRegistry::CreateOp(desc);
+            auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
             return static_cast<operators::RecurrentOp *>(rnn_op.release());
           })
       .def("set_stepnet", [](operators::RecurrentOp &self,
@@ -381,7 +382,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    auto rnn_op = OpRegistry::CreateOp(desc);
+                    auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
                     return static_cast<operators::DynamicRecurrentOp *>(
                         rnn_op.release());
                   })
@@ -408,7 +409,7 @@ All parameter, weight, gradient are variables in Paddle.
                     PADDLE_ENFORCE(desc.IsInitialized(),
                                    "User OpDesc is not initialized, reason %s",
                                    desc.InitializationErrorString());
-                    auto cond_op = OpRegistry::CreateOp(desc);
+                    auto cond_op = OpRegistry::CreateOp(desc, nullptr);
                     return static_cast<operators::CondOp *>(cond_op.release());
                   })
       .def("set_truenet",
diff --git a/python/paddle/v2/framework/framework.py b/python/paddle/v2/framework/framework.py
index 3fb6efe42a..5a8ded46ea 100644
--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
@@ -232,7 +232,7 @@ class Operator(object):
         if attrs is not None:
             for attr in proto.attrs:
                 attr_name = attr.name
-                if not attr_name in attrs:
+                if (not attr_name in attrs) or (attrs[attr_name] is None):
                     continue
                 if not isinstance(attrs[attr_name], Block):
                     self.desc.set_attr(attr_name, attrs[attr_name])
@@ -344,7 +344,10 @@ class Block(object):
                 self.create_var(name=var.name(), desc=var, type=var.type())
 
         # sync operators from cpp
-        ops_in_cpp = self.desc.all_ops()
+        ops_in_cpp = []
+        for op_idx in range(0, self.desc.op_size()):
+            ops_in_cpp.append(self.desc.op(op_idx))
+
         first_op_in_python = self.ops[0].desc
         last_op_in_python = self.ops[len(self.ops) - 1].desc
         start_index = None
@@ -384,10 +387,8 @@ class Program(object):
             cls._instance = cls()
         return cls._instance
 
-    def __init__(self, desc=None):
-        if desc is None:
-            desc = core.ProgramDesc.instance()
-        self.desc = desc
+    def __init__(self):
+        self.desc = core.ProgramDesc()
         self.blocks = [Block(self, 0)]
         self.current_block_idx = 0
 
diff --git a/python/paddle/v2/framework/layer_helper.py b/python/paddle/v2/framework/layer_helper.py
index 26d3e04310..6615bdcd3b 100644
--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
@@ -66,15 +66,15 @@ class LayerHelper(object):
         actual = self.kwargs.get('param_attr', None)
         return actual if actual is not None else default
 
-    def bias_attr(self, size, dtype):
-        bias_attr = self.kwargs.get('bias_attr', False)
-        if bias_attr is None or bias_attr:
+    def bias_attr(self, shape, dtype):
+        bias_attr = self.kwargs.get('bias_attr', None)
+        if bias_attr is True:
             bias_attr = {
                 'name': None,
                 'init_attr': {
                     'type': 'fill_constant',
                     'value': 0.0,
-                    'shape': [size],
+                    'shape': shape,
                     'dataType': dtype
                 }
             }
@@ -127,15 +127,13 @@ class LayerHelper(object):
         return self.program.global_block().create_var(*args, **kwargs)
 
     def append_bias_op(self, input_var):
-        bias_attr = self.bias_attr(
-            self.kwargs['size'], dtype=input_var.data_type)
+        size = list(input_var.shape[1:])
+        bias_attr = self.bias_attr(size, dtype=input_var.data_type)
         if not bias_attr:
             return input_var
+
         b = self.create_parameter(
-            attr=bias_attr,
-            shape=[self.kwargs['size']],
-            dtype=input_var.data_type,
-            suffix='b')
+            attr=bias_attr, shape=size, dtype=input_var.data_type, suffix='b')
         tmp = self.create_tmp_variable(dtype=input_var.data_type)
         self.append_op(
             type='elementwise_add',
diff --git a/python/paddle/v2/framework/layers.py b/python/paddle/v2/framework/layers.py
index 44b587b116..c7397716c4 100644
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
@@ -3,17 +3,17 @@ import paddle.v2.framework.core as core
 from paddle.v2.framework.framework import OpProtoHolder, Variable
 import re
 
-__all__ = ['fc_layer', 'data_layer', 'cross_entropy']
+__all__ = ['fc', 'data', 'cross_entropy', 'conv2d']
 
 
-def fc_layer(input,
-             size,
-             param_attr=None,
-             bias_attr=True,
-             name=None,
-             act=None,
-             num_flatten_dims=1,
-             program=None):
+def fc(input,
+       size,
+       param_attr=None,
+       bias_attr=True,
+       name=None,
+       act=None,
+       num_flatten_dims=1,
+       program=None):
     # create helper
     helper = LayerHelper('fc', **locals())
 
@@ -24,6 +24,7 @@ def fc_layer(input,
     for input_var, param_attr in helper.iter_inputs_and_params():
         input_shape = input_var.shape
         param_shape = list(input_shape[num_flatten_dims:]) + [size]
+
         w = helper.create_parameter(
             attr=param_attr, shape=param_shape, dtype=dtype)
         tmp = helper.create_tmp_variable(dtype)
@@ -50,11 +51,11 @@ def fc_layer(input,
     return helper.append_activation(pre_activation)
 
 
-def data_layer(name,
-               shape,
-               data_type='float32',
-               type=core.VarDesc.VarType.LOD_TENSOR,
-               program=None):
+def data(name,
+         shape,
+         data_type='float32',
+         type=core.VarDesc.VarType.LOD_TENSOR,
+         program=None):
     helper = LayerHelper('data', **locals())
     shape = [-1] + shape  # append batch size as -1
     return helper.create_global_variable(
@@ -111,6 +112,7 @@ def _create_op_func_(op_type):
 
 
 _create_op_func_('mean')
+_create_op_func_('pool2d')
 
 
 def cross_entropy(input, label, **kwargs):
@@ -141,3 +143,47 @@ def square_error_cost(input, label, **kwargs):
         outputs={'Y': [square_out]},
         attrs={'factor': 2.0})
     return square_out
+
+
+def conv2d(input,
+           num_filters,
+           name=None,
+           filter_size=[1, 1],
+           act=None,
+           groups=None,
+           stride=[1, 1],
+           padding=None,
+           bias_attr=None,
+           param_attr=None,
+           program=None):
+    helper = LayerHelper('conv2d', **locals())
+    dtype = helper.input_dtype()
+
+    num_channels = input.shape[1]
+    if groups is None:
+        num_filter_channels = num_channels
+    else:
+        if num_channels % groups != 0:
+            raise ValueError("num_channels must be divisible by groups.")
+        num_filter_channels = num_channels // groups
+
+    input_shape = input.shape
+    filter_shape = [num_filters, num_filter_channels] + filter_size
+    filter = helper.create_parameter(
+        attr=helper.param_attr, shape=filter_shape, dtype=dtype)
+    pre_bias = helper.create_tmp_variable(dtype)
+
+    helper.append_op(
+        type='conv2d',
+        inputs={
+            'Input': input,
+            'Filter': filter,
+        },
+        outputs={"Output": pre_bias},
+        attrs={'strides': stride,
+               'paddings': padding,
+               'groups': groups})
+
+    pre_act = helper.append_bias_op(pre_bias)
+
+    return helper.append_activation(pre_act)
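The filter-shape computation in `conv2d` can be sketched in isolation. The helper name `conv2d_filter_shape` is hypothetical and exists only for this illustration. The sketch compares the remainder by value (`!= 0`) rather than by identity, since `is` on integers only works by accident of CPython's small-integer caching, and it uses `//` so the channel count stays an integer under both Python 2 and Python 3.

```python
def conv2d_filter_shape(num_channels, num_filters, filter_size, groups=None):
    """Hypothetical helper: shape of the filter parameter built by conv2d."""
    if groups is None:
        num_filter_channels = num_channels
    else:
        if num_channels % groups != 0:
            raise ValueError("num_channels must be divisible by groups.")
        # Each group only sees its own slice of the input channels.
        num_filter_channels = num_channels // groups
    return [num_filters, num_filter_channels] + list(filter_size)


print(conv2d_filter_shape(3, 3, [4, 4]))
print(conv2d_filter_shape(8, 16, [3, 3], groups=4))
```

The first call matches the shapes used in `test_simple_conv2d` (3 input channels, 3 filters, a 4x4 kernel).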
diff --git a/python/paddle/v2/framework/tests/test_infer_shape.py b/python/paddle/v2/framework/tests/test_infer_shape.py
index 19bb45acef..5cfb9e6687 100644
--- a/python/paddle/v2/framework/tests/test_infer_shape.py
+++ b/python/paddle/v2/framework/tests/test_infer_shape.py
@@ -5,7 +5,7 @@ import paddle.v2.framework.core as core
 
 class TestInferShape(unittest.TestCase):
     def test_sum_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -33,7 +33,7 @@ class TestInferShape(unittest.TestCase):
         self.assertEqual(out.shape(), shape)
 
     def test_mul_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
diff --git a/python/paddle/v2/framework/tests/test_layers.py b/python/paddle/v2/framework/tests/test_layers.py
index 1ef2591cca..dbbb653538 100644
--- a/python/paddle/v2/framework/tests/test_layers.py
+++ b/python/paddle/v2/framework/tests/test_layers.py
@@ -1,4 +1,4 @@
-from paddle.v2.framework.layers import fc_layer, data_layer, cross_entropy, mean, square_error_cost
+import paddle.v2.framework.layers as layers
 from paddle.v2.framework.framework import Program, g_program
 import paddle.v2.framework.core as core
 import unittest
@@ -6,36 +6,47 @@ import unittest
 
 class TestBook(unittest.TestCase):
     def test_fit_a_line(self):
-        pd = core.ProgramDesc.__create_program_desc__()
-        program = Program(desc=pd)
-        x = data_layer(
+        program = Program()
+        x = layers.data(
             name='x', shape=[13], data_type='float32', program=program)
-        y_predict = fc_layer(input=x, size=1, act=None, program=program)
+        y_predict = layers.fc(input=x, size=1, act=None, program=program)
 
-        y = data_layer(
+        y = layers.data(
             name='y', shape=[1], data_type='float32', program=program)
-        cost = square_error_cost(input=y_predict, label=y, program=program)
+        cost = layers.square_error_cost(
+            input=y_predict, label=y, program=program)
 
-        avg_cost = mean(x=cost, program=program)
+        avg_cost = layers.mean(x=cost, program=program)
         self.assertIsNotNone(avg_cost)
+        program.append_backward(avg_cost, set())
         print str(program)
 
     def test_recognize_digits_mlp(self):
-        pd = core.ProgramDesc.__create_program_desc__()
-        program = Program(desc=pd)
+        program = Program()
 
         # Change g_program, so the rest layers use `g_program`
-        images = data_layer(
+        images = layers.data(
             name='pixel', shape=[784], data_type='float32', program=program)
-        label = data_layer(
+        label = layers.data(
             name='label', shape=[1], data_type='int32', program=program)
-        hidden1 = fc_layer(input=images, size=128, act='relu', program=program)
-        hidden2 = fc_layer(input=hidden1, size=64, act='relu', program=program)
-        predict = fc_layer(
-            input=hidden2, size=10, act='softmax', program=program)
-        cost = cross_entropy(input=predict, label=label, program=program)
-        avg_cost = mean(x=cost, program=program)
+        hidden1 = layers.fc(input=images, size=128, act='relu', program=program)
+        hidden2 = layers.fc(input=hidden1, size=64, act='relu', program=program)
+        predict = layers.fc(input=hidden2,
+                            size=10,
+                            act='softmax',
+                            program=program)
+        cost = layers.cross_entropy(input=predict, label=label, program=program)
+        avg_cost = layers.mean(x=cost, program=program)
         self.assertIsNotNone(avg_cost)
+        # print str(program)
+
+    def test_simple_conv2d(self):
+        program = Program()
+        images = layers.data(
+            name='pixel', shape=[3, 48, 48], data_type='int32', program=program)
+        layers.conv2d(
+            input=images, num_filters=3, filter_size=[4, 4], program=program)
+
         print str(program)
 
 
diff --git a/python/paddle/v2/framework/tests/test_matmul_op.py b/python/paddle/v2/framework/tests/test_matmul_op.py
new file mode 100644
index 0000000000..d51572c8ab
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_matmul_op.py
@@ -0,0 +1,119 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+def generate_compatible_shapes(dim_X, dim_Y, transpose_X, transpose_Y):
+    BATCH_SIZE = 2
+    M = 3
+    N = 4
+    K = 5
+    if (dim_X == 1 and transpose_X) or (dim_Y == 1 and transpose_Y):
+        K = 1
+    if dim_X == 1:
+        if transpose_X:
+            shape_X = [M]
+        else:
+            shape_X = [K]
+    if dim_Y == 1:
+        if transpose_Y:
+            shape_Y = [N]
+        else:
+            shape_Y = [K]
+    if dim_X >= 2:
+        if transpose_X:
+            shape_X = [K, M]
+        else:
+            shape_X = [M, K]
+    if dim_X == 3:
+        shape_X = [BATCH_SIZE] + shape_X
+    if dim_Y >= 2:
+        if transpose_Y:
+            shape_Y = [N, K]
+        else:
+            shape_Y = [K, N]
+    if dim_Y == 3:
+        shape_Y = [BATCH_SIZE] + shape_Y
+    return shape_X, shape_Y
+
+
+def reference_matmul(X, Y, transpose_X=False, transpose_Y=False):
+    """Reference forward implementation using np.matmul."""
+    # np.matmul does not support the transpose flags, so we manually
+    # transpose X and Y appropriately.
+    if transpose_X:
+        if X.ndim == 1:
+            X = X.reshape((X.size, 1))
+        elif X.ndim == 2:
+            X = X.T
+        elif X.ndim == 3:
+            X = np.transpose(X, (0, 2, 1))
+        else:
+            raise ValueError('X must have between 1 and 3 dimensions')
+    if transpose_Y:
+        if Y.ndim == 1:
+            Y = Y.reshape((1, Y.size))
+        elif Y.ndim == 2:
+            Y = Y.T
+        elif Y.ndim == 3:
+            Y = np.transpose(Y, (0, 2, 1))
+        else:
+            raise ValueError('Y must have between 1 and 3 dimensions')
+    Out = np.matmul(X, Y)
+    if not Out.shape:
+        # We do not support 0-dimensional Tensors (scalars). So where
+        # np.matmul outputs a scalar, we must convert to a Tensor of
+        # shape (1, ) instead.
+        # Everywhere else, we are compatible with np.matmul.
+        Out = np.array([Out], dtype="float32")
+    return Out
+
+
+class Generator(object):
+    def setUp(self):
+        self.op_type = "matmul"
+        X = np.random.random(self.shape_X).astype("float32")
+        Y = np.random.random(self.shape_Y).astype("float32")
+        Out = reference_matmul(X, Y, self.transpose_X, self.transpose_Y)
+        self.inputs = {'X': X, 'Y': Y}
+        self.attrs = {
+            'transpose_X': self.transpose_X,
+            'transpose_Y': self.transpose_Y
+        }
+        self.outputs = {'Out': Out}
+
+    def test_check_output(self):
+        self.check_output(atol=1e-2)
+
+    def test_check_grad_normal(self):
+        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
+
+    def test_check_grad_ignore_x(self):
+        self.check_grad(
+            ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
+
+    def test_check_grad_ignore_y(self):
+        self.check_grad(
+            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
+
+
+# Generate test cases for all possibilities
+for dim_X in [1, 2, 3]:
+    for dim_Y in [1, 2, 3]:
+        for transpose_X in [False, True]:
+            for transpose_Y in [False, True]:
+                test_name = (
+                    'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format(
+                        dim_X, dim_Y, transpose_X, transpose_Y))
+                shape_X, shape_Y = generate_compatible_shapes(
+                    dim_X, dim_Y, transpose_X, transpose_Y)
+                test_class = type(test_name, (Generator, OpTest), {
+                    'shape_X': shape_X,
+                    'shape_Y': shape_Y,
+                    'transpose_X': transpose_X,
+                    'transpose_Y': transpose_Y,
+                })
+                globals()[test_name] = test_class
+
+if __name__ == "__main__":
+    unittest.main()
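The 1-D cases handled by `generate_compatible_shapes` and `reference_matmul` are the least obvious ones. The sketch below uses only NumPy, with illustrative variable names, to show the two behaviors the test relies on. First, `np.matmul` on two 1-D arrays returns a 0-dimensional result, which has to be wrapped into shape `(1,)`. Second, "transposing" 1-D operands turns X into a column and Y into a row, so the shared dimension K is 1 and the product is an outer product.

```python
import numpy as np

x = np.arange(3, dtype="float32")  # shape (3,)
y = np.arange(4, dtype="float32")  # shape (4,)

# 1-D times 1-D is an inner product; np.matmul returns a 0-d result.
inner = np.matmul(x, x)
wrapped = np.array([inner], dtype="float32")  # shape (1,), as in reference_matmul

# With both transpose flags set, X becomes (3, 1) and Y becomes (1, 4),
# so the shared dimension K is 1 and the result is a (3, 4) outer product.
outer = np.matmul(x.reshape((x.size, 1)), y.reshape((1, y.size)))

print(np.shape(inner), wrapped.shape, outer.shape)
```

This is why `generate_compatible_shapes` forces `K = 1` whenever a 1-D operand is transposed.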
diff --git a/python/paddle/v2/framework/tests/test_momentum_op.py b/python/paddle/v2/framework/tests/test_momentum_op.py
new file mode 100644
index 0000000000..d3353ff6e4
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_momentum_op.py
@@ -0,0 +1,35 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestMomentumOp(OpTest):
+    def setUp(self):
+        self.op_type = "momentum"
+
+        param = np.random.random((123, 321)).astype("float32")
+        grad = np.random.random((123, 321)).astype("float32")
+        velocity = np.zeros((123, 321)).astype("float32")
+        learning_rate = np.array([0.001]).astype("float32")
+        mu = 0.0001
+
+        self.inputs = {
+            'Param': param,
+            'Grad': grad,
+            'Velocity': velocity,
+            'LearningRate': learning_rate
+        }
+
+        self.attrs = {'mu': mu}
+
+        velocity_out = mu * velocity + grad
+        param_out = param - learning_rate * velocity_out
+
+        self.outputs = {'ParamOut': param_out, 'VelocityOut': velocity_out}
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == "__main__":
+    unittest.main()
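Because the test starts from a zero velocity, it only exercises the first update, where the momentum term contributes nothing. A short sketch shows how velocity accumulates over two steps, using the same formulas as `setUp`. The function name `momentum_step` and the values below are chosen for illustration and are not part of the operator.

```python
import numpy as np


def momentum_step(param, grad, velocity, lr, mu):
    """One momentum update, written exactly as in the test's reference code."""
    velocity_out = mu * velocity + grad
    param_out = param - lr * velocity_out
    return param_out, velocity_out


param = np.array([1.0])
grad = np.array([1.0])
velocity = np.zeros(1)

# Step 1: velocity is zero, so this is plain gradient descent.
p1, v1 = momentum_step(param, grad, velocity, lr=0.1, mu=0.5)
# Step 2: the previous velocity is carried over, enlarging the step.
p2, v2 = momentum_step(p1, grad, v1, lr=0.1, mu=0.5)

print(p1, v1, p2, v2)
```

With a constant gradient, the second step is larger than the first (0.15 versus 0.1 here); this carry-over is the effect the momentum term is meant to produce.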
diff --git a/python/paddle/v2/framework/tests/test_protobuf_descs.py b/python/paddle/v2/framework/tests/test_protobuf_descs.py
index c775b1a398..2fd3d5d165 100644
--- a/python/paddle/v2/framework/tests/test_protobuf_descs.py
+++ b/python/paddle/v2/framework/tests/test_protobuf_descs.py
@@ -4,7 +4,7 @@ import paddle.v2.framework.core as core
 
 class TestOpDesc(unittest.TestCase):
     def test_op_desc(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -64,16 +64,16 @@ class TestOpDesc(unittest.TestCase):
 
 class TestProgramDesc(unittest.TestCase):
     def test_instance(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         self.assertIsNotNone(program_desc)
         del program_desc
-        program_desc = core.ProgramDesc.instance()
+        program_desc = core.ProgramDesc()
         self.assertIsNotNone(program_desc)
         self.assertIsNotNone(program_desc.block(0))
         del program_desc
 
     def test_append_block(self):
-        prog_desc = core.ProgramDesc.__create_program_desc__()
+        prog_desc = core.ProgramDesc()
         self.assertIsNotNone(prog_desc)
         block_root = prog_desc.block(0)
         self.assertIsNotNone(block_root)
@@ -91,7 +91,7 @@ class TestProgramDesc(unittest.TestCase):
 
 class TestVarDesc(unittest.TestCase):
     def test_shape(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         block = program_desc.block(0)
         var = block.var('my_var')
         var.set_type(core.VarDesc.VarType.SELECTED_ROWS)
@@ -102,7 +102,7 @@ class TestVarDesc(unittest.TestCase):
         self.assertEqual(core.VarDesc.VarType.SELECTED_ROWS, var.type())
 
     def test_data_type(self):
-        program_desc = core.ProgramDesc.__create_program_desc__()
+        program_desc = core.ProgramDesc()
         block = program_desc.block(0)
         var = block.var('my_var')
         var.set_type(core.VarDesc.VarType.LOD_TENSOR)
@@ -113,7 +113,7 @@ class TestVarDesc(unittest.TestCase):
 
 class TestBlockDesc(unittest.TestCase):
     def test_add_var(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
@@ -121,19 +121,21 @@ class TestBlockDesc(unittest.TestCase):
         var2 = block.var("var2")
         var3 = block.var("var3")
         all_vars = block.all_vars()
-        self.assertEqual(set(all_vars), set([var1, var2, var3]))
+        self.assertEqual(set(all_vars), {var1, var2, var3})
         var2_re = block.find_var("var2")
         self.assertEqual(var2_re, var2)
 
     def test_add_op(self):
-        prog = core.ProgramDesc.__create_program_desc__()
+        prog = core.ProgramDesc()
         self.assertIsNotNone(prog)
         block = prog.block(0)
         self.assertIsNotNone(block)
         op1 = block.append_op()
         op2 = block.append_op()
         op0 = block.prepend_op()
-        all_ops = block.all_ops()
+        all_ops = []
+        for idx in xrange(0, block.op_size()):
+            all_ops.append(block.op(idx))
         self.assertEqual(all_ops, [op0, op1, op2])
 
 
diff --git a/python/paddle/v2/framework/tests/test_proximal_gd_op.py b/python/paddle/v2/framework/tests/test_proximal_gd_op.py
new file mode 100644
index 0000000000..9ca79ce6b3
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_proximal_gd_op.py
@@ -0,0 +1,33 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestProximalGDOp(OpTest):
+    def setUp(self):
+        self.op_type = "proximal_gd"
+        w = np.random.random((102, 105)).astype("float32")
+        g = np.random.random((102, 105)).astype("float32")
+        lr = np.array([0.1]).astype("float32")
+        l1 = 0.1
+        l2 = 0.2
+
+        self.inputs = {'Param': w, 'Grad': g, 'LearningRate': lr}
+        self.attrs = {'l1': l1, 'l2': l2}
+        prox_param = w - lr * g
+        param_out = 0.0
+        if l1 > 0.0:
+            x = np.abs(prox_param) - lr * l1
+            x[x < 0] = 0
+            param_out = np.sign(prox_param) * (x / (1.0 + lr * l2))
+        else:
+            param_out = prox_param / (1.0 + lr * l2)
+
+        self.outputs = {'ParamOut': param_out}
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == "__main__":
+    unittest.main()