diff --git a/RELEASE.cn.md b/RELEASE.cn.md
index 5deaf230a8..494c59730d 100644
--- a/RELEASE.cn.md
+++ b/RELEASE.cn.md
@@ -1,3 +1,62 @@
+# v0.11.0版本
+
+## PaddlePaddle Fluid
+
+- PaddlePaddle v0.11.0版本包含一个新特性*PaddlePaddle Fluid*。Fluid的设计目标是让用户能够像使用PyTorch和TensorFlow Eager Execution那样编写程序。在这些系统中,不再有*模型*这个概念,应用也不再包含一个用于描述Operator图或者一系列层的符号描述,而是像通用程序那样描述训练或者预测的过程。Fluid与PyTorch或Eager Execution的区别在于:Fluid不依赖Python提供的控制流(例如 `if-then-else` 或者 `for`),而是提供了基于C++实现的控制流,并暴露了对应的、用 `with` 语法实现的Python接口。例如:
+
+ https://github.com/PaddlePaddle/Paddle/blob/3df78ed2a98d37f7ae6725894cc7514effd5664b/python/paddle/v2/fluid/tests/test_while_op.py#L36-L44
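+
+  下面给出一个仅作示意的最小草图(假设使用v0.11.0的 `paddle.v2.fluid` Python接口,`limit` 等变量名仅为示例,具体写法请以上面链接中的测试为准):
+
+  ```python
+  import paddle.v2.fluid as fluid
+
+  i = fluid.layers.zeros(shape=[1], dtype='int64')
+  limit = fluid.layers.fill_constant(shape=[1], dtype='int64', value=10)
+  cond = fluid.layers.less_than(x=i, y=limit)
+
+  while_op = fluid.layers.While(cond=cond)
+  with while_op.block():  # 循环体由C++端执行,而不是Python
+      i = fluid.layers.increment(x=i, in_place=True)
+      fluid.layers.less_than(x=i, y=limit, cond=cond)  # 更新循环条件
+  ```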
+
+- 在v0.11.0版本中,我们提供了一个C++类`Executor`用于运行Fluid程序。Executor类似一个解释器。在未来的版本中,我们会把`Executor`提升和优化成一个像GDB那样的调试器,并且可能提供一些编译器:这类编译器读取一个如上文所描述的应用,然后编译成一个等价的C++源程序,该源程序可以被`nvcc`编译成使用CUDA的二进制,或者被`icc`编译成充分利用Intel CPU的二进制。
+
+
+## 新特点
+
+* 发布 `PaddlePaddle Fluid`。
+* 增加了用于模型预测的C-API。
+* 用Fluid API实现了一个简单的GAN例子。
+* 增加了关于性能调优的文档。
+* 为`paddle.v2.dataset`下载数据集提供了重试机制。
+* 在C++中使用protobuf-lite替换protobuf,减小了二进制文件的大小。
+* 发布了新特性 [Elastic Deep Learning (EDL)](https://github.com/PaddlePaddle/cloud/tree/develop/doc/autoscale/experiment)。
+* 参考Bazel API,基于cmake实现了一套新的构建系统函数。
+* 当使用编译选项`WITH_MKL=ON`时,自动下载和编译Intel® [MKLML](https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz) 函数库作为CBLAS使用。
+* [Intel® MKL-DNN on PaddlePaddle](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/mkldnn):
+  - 完成了 11个 MKL-DNN 层: Convolution, Fully Connected, Pooling, ReLU, Tanh, ELU, Softmax, BatchNorm, AddTo, Concat, LRN。
+  - 完成了 3个 MKL-DNN 网络: VGG-19, ResNet-50, GoogLeNet。
+ - 基于Intel Skylake 6148 CPU的[性能测试](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/IntelOptimizedPaddle.md) : 相对于MKLML有2~3倍的训练加速。
+* 增加 [softsign activation](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/activation.html#softsign)。
+* 增加 [dot product layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#dot-prod)。
+* 增加 [L2 distance layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#l2-distance)。
+* 增加 [sub-nested sequence layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#sub-nested-seq)。
+* 增加 [kmax sequence score layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#kmax-sequence-score)。
+* 增加 [sequence slice layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#seq-slice)。
+* 增加 [row convolution layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#row-conv)。
+* 增加了移动端友好的网页。
+
+## 改进
+
+* 使用单个Python `whl` 包即可完成构建和安装。
+* [V2 API可以实现用户定制化评估](https://github.com/PaddlePaddle/models/tree/develop/ltr#训练过程中输出自定义评估指标)。
+* 将 `PADDLE_ONLY_CPU` 改为 `PADDLE_WITH_GPU`,因为我们将支持多种设备。
+* 删除了有bug的BarrierStat。
+* 清理和删除了paddle::Parameter中未使用的函数。
+* 删除了ProtoDataProvider。
+* Huber loss同时支持回归和分类。
+* 为sequence pooling 层增加`stride`参数。
+* v2 API自动使用cudnn batch normalization。
+* 可以使用一个固定的参数名共享BN层的参数。
+* 2D convolution operation支持variable-dimension input特性。
+* 重构cmake中关于CUDA的部分并实现自动检测GPU架构的功能。
+* 优化网页导航。
+
+## 错误修复
+
+* 修复ROI pooling的Bug。cc9a761
+* 修复当label是dense vector时AUC变成0的问题。#5274
+* 修复WarpCTC层的Bug。
+
+
# v0.10.0版本
我们非常高兴发布了PaddlePaddle V0.10.0版,并开发了新的[Python API](http://research.baidu.com/paddlepaddles-new-api-simplifies-deep-learning-programs/)。
diff --git a/RELEASE.md b/RELEASE.md
index 146f7afa7d..5a62c95513 100644
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,3 +1,75 @@
+# Release v0.11.0
+
+## PaddlePaddle Fluid
+
+- Release 0.11.0 includes a new feature *PaddlePaddle Fluid*. Fluid is
+  designed to allow users to program like PyTorch and TensorFlow Eager Execution.
+  In these systems, there is no longer the concept of a *model*, and applications
+  no longer include a symbolic description of a graph of operators or a sequence
+  of layers. Instead, an application looks exactly like a usual program that
+  describes a process of training or inference. The difference between
+  Fluid and PyTorch or Eager Execution is that Fluid doesn't rely on Python's
+  control-flow constructs such as `if-then-else` or `for`. Instead, Fluid provides
+  control-flow implemented in C++ and exposes Python bindings via the `with` statement. For example:
+
+ https://github.com/PaddlePaddle/Paddle/blob/3df78ed2a98d37f7ae6725894cc7514effd5664b/python/paddle/v2/fluid/tests/test_while_op.py#L36-L44
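+
+  A minimal, illustrative sketch of this style, loosely based on the linked test
+  (assuming the v0.11.0 `paddle.v2.fluid` Python API; names such as `limit` are
+  made up for this example):
+
+  ```python
+  import paddle.v2.fluid as fluid
+
+  i = fluid.layers.zeros(shape=[1], dtype='int64')
+  limit = fluid.layers.fill_constant(shape=[1], dtype='int64', value=10)
+  cond = fluid.layers.less_than(x=i, y=limit)
+
+  while_op = fluid.layers.While(cond=cond)
+  with while_op.block():  # the loop body is executed by C++, not by Python
+      i = fluid.layers.increment(x=i, in_place=True)
+      fluid.layers.less_than(x=i, y=limit, cond=cond)  # refresh the loop condition
+  ```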
+
+- In 0.11.0, we provide a C++ class `Executor` to run a Fluid program.
+Executor works like an interpreter. In future versions, we will improve
+`Executor` into a debugger like GDB, and we might provide compilers
+that, for example, take an application like the one above and output
+an equivalent C++ source program, which can be compiled using
+[`nvcc`](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html)
+to generate binaries that use CUDA, or using
+[`icc`](https://software.intel.com/en-us/c-compilers) to generate binaries
+that make full use of Intel CPUs.
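+
+  As a self-contained sketch of this interpreter-style usage (assuming the
+  v0.11.0 `paddle.v2.fluid` Python API; the one-layer network is illustrative):
+
+  ```python
+  import numpy as np
+  import paddle.v2.fluid as fluid
+
+  x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+  y_predict = fluid.layers.fc(input=x, size=1, act=None)
+
+  place = fluid.CPUPlace()
+  exe = fluid.Executor(place)
+  exe.run(fluid.default_startup_program())  # allocate memory, initialize parameters
+
+  feeder = fluid.DataFeeder(place=place, feed_list=[x])
+  out, = exe.run(fluid.default_main_program(),  # interpret the main program
+                 feed=feeder.feed([(np.random.rand(13).astype('float32'),)]),
+                 fetch_list=[y_predict])
+  ```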
+
+## New Features
+
+* Release `PaddlePaddle Fluid`.
+* Add C-API for model inference.
+* Use the Fluid API to create a simple GAN demo.
+* Add a development guide about performance tuning.
+* Add retry when downloading `paddle.v2.dataset` datasets.
+* Link protobuf-lite instead of protobuf in C++ to reduce the binary size.
+* Release the [Elastic Deep Learning (EDL)](https://github.com/PaddlePaddle/cloud/tree/develop/doc/autoscale/experiment) feature.
+* Add a new set of cmake build functions for Paddle, modeled on the Bazel API.
+* Automatically download and compile the Intel® [MKLML](https://github.com/01org/mkl-dnn/releases/download/v0.11/mklml_lnx_2018.0.1.20171007.tgz) library as CBLAS when building with `WITH_MKL=ON`.
+* [Intel® MKL-DNN on PaddlePaddle](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/mkldnn):
+  - Complete 11 MKL-DNN layers: Convolution, Fully Connected, Pooling, ReLU, Tanh, ELU, Softmax, BatchNorm, AddTo, Concat, LRN.
+  - Complete 3 MKL-DNN networks: VGG-19, ResNet-50, GoogLeNet.
+ - [Benchmark](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/IntelOptimizedPaddle.md) on Intel Skylake 6148 CPU: 2~3x training speedup compared with MKLML.
+* Add the [`softsign` activation](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/activation.html#softsign).
+* Add the [dot product layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#dot-prod).
+* Add the [L2 distance layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#l2-distance).
+* Add the [sub-nested sequence layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#sub-nested-seq).
+* Add the [kmax sequence score layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#kmax-sequence-score).
+* Add the [sequence slice layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#seq-slice).
+* Add the [row convolution layer](http://www.paddlepaddle.org/docs/develop/documentation/zh/api/v2/config/layer.html#row-conv).
+* Add mobile-friendly webpages.
+
+## Improvements
+
+* Build and install using a single `whl` package.
+* [Custom evaluating in V2 API](https://github.com/PaddlePaddle/models/tree/develop/ltr#训练过程中输出自定义评估指标).
+* Change `PADDLE_ONLY_CPU` to `PADDLE_WITH_GPU`, since we will support many kinds of devices.
+* Remove buggy BarrierStat.
+* Clean and remove unused functions in paddle::Parameter.
+* Remove ProtoDataProvider.
+* Huber loss supports both regression and classification.
+* Add the `stride` parameter for sequence pooling layers.
+* Enable the v2 API to use cuDNN batch normalization automatically.
+* The BN layer's parameters can be shared by specifying a fixed parameter name.
+* Support variable-dimension input feature for 2D convolution operation.
+* Refine cmake about CUDA to automatically detect GPU architecture.
+* Improve website navigation.
+
+## Bug Fixes
+
+* Fix bug in ROI pooling. cc9a761
+* Fix AUC being zero when the label is a dense vector. #5274
+* Fix bug in WarpCTC layer.
+
# Release v0.10.0
We are glad to release version 0.10.0. In this version, we are happy to release the new
diff --git a/benchmark/IntelOptimizedPaddle.md b/benchmark/IntelOptimizedPaddle.md
index 26930a7637..8ee7fd28c5 100644
--- a/benchmark/IntelOptimizedPaddle.md
+++ b/benchmark/IntelOptimizedPaddle.md
@@ -19,6 +19,8 @@ On each machine, we will test and compare the performance of training on single
## Benchmark Model
### Server
+
+#### Training
Test on batch size 64, 128, 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Input image size - 3 * 224 * 224, Time: images/second
@@ -53,5 +55,33 @@ Input image size - 3 * 224 * 224, Time: images/second
+#### Inference
+Test on batch sizes 1, 2, 4, 8, 16 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
+Input image size - 3 * 224 * 224, Time: images/second
+- VGG-19
+
+| BatchSize | 1 | 2 | 4 | 8 | 16 |
+|-----------|-------|-------|-------|-------|-------|
+| OpenBLAS | 1.07 | 1.08 | 1.06 | 0.88 | 0.65 |
+| MKLML | 5.58 | 9.80 | 15.15 | 21.21 | 28.67 |
+| MKL-DNN | 75.07 | 88.64 | 82.58 | 92.29 | 96.75 |
+
+- ResNet-50
+
+| BatchSize | 1 | 2 | 4 | 8 | 16 |
+|-----------|-------|--------|--------|--------|--------|
+| OpenBLAS | 3.35 | 3.19 | 3.09 | 2.55 | 1.96 |
+| MKLML | 6.33 | 12.02 | 22.88 | 40.53 | 63.09 |
+| MKL-DNN | 107.83| 148.84 | 177.78 | 189.35 | 217.69 |
+
+
+- GoogLeNet
+
+| BatchSize | 1 | 2 | 4 | 8 | 16 |
+|-----------|--------|--------|--------|--------|--------|
+| OpenBLAS | 12.04 | 11.31 | 10.00 | 9.07 | 4.34 |
+| MKLML | 22.74 | 41.56 | 81.22 | 133.47 | 210.53 |
+| MKL-DNN | 175.10 | 272.92 | 450.70 | 512.00 | 600.94 |
+
+
### Laptop
TBD
diff --git a/cmake/cblas.cmake b/cmake/cblas.cmake
index b21fc43904..13294c0548 100644
--- a/cmake/cblas.cmake
+++ b/cmake/cblas.cmake
@@ -17,7 +17,7 @@ if(WITH_MKLML AND MKLML_INC_DIR AND MKLML_LIB)
set(CBLAS_INC_DIR ${MKLML_INC_DIR})
set(CBLAS_LIBRARIES ${MKLML_LIB})
- add_definitions(-DPADDLE_USE_MKLML)
+ add_definitions(-DPADDLE_WITH_MKLML)
add_definitions(-DLAPACK_FOUND)
message(STATUS "Found cblas and lapack in MKLML "
diff --git a/cmake/external/mkldnn.cmake b/cmake/external/mkldnn.cmake
index fc52d339d7..5d24caebdc 100644
--- a/cmake/external/mkldnn.cmake
+++ b/cmake/external/mkldnn.cmake
@@ -67,5 +67,5 @@ ADD_LIBRARY(mkldnn SHARED IMPORTED GLOBAL)
SET_PROPERTY(TARGET mkldnn PROPERTY IMPORTED_LOCATION ${MKLDNN_LIB})
ADD_DEPENDENCIES(mkldnn ${MKLDNN_PROJECT})
MESSAGE(STATUS "MKLDNN library: ${MKLDNN_LIB}")
-add_definitions(-DPADDLE_USE_MKLDNN)
+add_definitions(-DPADDLE_WITH_MKLDNN)
LIST(APPEND external_project_dependencies mkldnn)
diff --git a/doc/howto/dev/contribute_to_paddle_cn.md b/doc/howto/dev/contribute_to_paddle_cn.md
index 6993901452..3e0bf7b397 100644
--- a/doc/howto/dev/contribute_to_paddle_cn.md
+++ b/doc/howto/dev/contribute_to_paddle_cn.md
@@ -76,18 +76,18 @@ no changes added to commit (use "git add" and/or "git commit -a")
## 构建和测试
-编译 PaddlePaddle 的源码以及生成文档需要多种开发工具。为了方便大家,我们的标准开发流程是把这些工具都装进一个Docker image,称为*开发镜像*,通常名字是 `paddle:dev`。然后所有用 `cmake && make` 的地方(比如IDE配置里)都用 `docker run paddle:dev`来代替。
+编译 PaddlePaddle 的源码以及生成文档需要多种开发工具。为了方便大家,我们的标准开发流程是把这些工具都装进一个Docker image,称为*开发镜像*,通常名字是 `paddle:latest-dev` 或者 `paddle:[version tag]-dev` 如 `paddle:0.11.0-dev`。然后所有用 `cmake && make` 的地方(比如IDE配置里)都用 `docker run paddle:latest-dev`来代替。
如要build这个开发镜像,在源码目录树的根目录中运行:
```bash
-➜ docker build -t paddle:dev .
+➜ docker build -t paddle:latest-dev .
```
随后可以用这个开发镜像开始build PaddlePaddle的源码。比如,如果要build一个不依赖GPU,但是支持AVX指令集,并且包括unit tests的PaddlePaddle,可以:
```bash
-➜ docker run -v $(pwd):/paddle -e "WITH_GPU=OFF" -e "WITH_AVX=ON" -e "WITH_TEST=ON" paddle:dev
+➜ docker run -v $(pwd):/paddle -e "WITH_GPU=OFF" -e "WITH_AVX=ON" -e "WITH_TESTING=ON" paddle:latest-dev
```
这个过程除了编译PaddlePaddle为 `./build/libpaddle.so`,并且输出一个 `./build/paddle.deb`文件之外,还会输出一个 `build/Dockerfile`。我们只需要运行下面命令把编译好的PaddlePaddle打包成一个*生产镜像*(`paddle:prod`):
@@ -99,7 +99,7 @@ no changes added to commit (use "git add" and/or "git commit -a")
如果要运行所有的单元测试,可以用如下命令:
```bash
-➜ docker run -it -v $(pwd):/paddle paddle:dev bash -c "cd /paddle/build && ctest"
+➜ docker run -it -v $(pwd):/paddle paddle:latest-dev bash -c "cd /paddle/build && ctest"
```
关于构建和测试的更多信息,请参见[这篇文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_cn.rst)。
diff --git a/doc/howto/dev/new_op_cn.md b/doc/howto/dev/new_op_cn.md
index 6cfc9536f2..757a5840bc 100644
--- a/doc/howto/dev/new_op_cn.md
+++ b/doc/howto/dev/new_op_cn.md
@@ -1,17 +1,18 @@
# 如何写新的Operator
- [概念简介](#概念简介)
- - [实现C++类](#实现C++类)
- - [定义ProtoMaker类](#定义ProtoMaker类)
- - [定义Operator类](#定义Operator类)
- - [定义OpKernel类](#定义OpKernel类)
- - [注册Operator](#注册Operator)
+ - [实现C++类](#实现c类)
+ - [定义ProtoMaker类](#定义protomaker类)
+ - [定义Operator类](#定义operator类)
+ - [定义OpKernel类](#定义opkernel类)
+ - [注册Operator](#注册operator)
- [编译](#编译)
- - [绑定Python](#绑定Python)
+ - [绑定Python](#绑定python)
- [实现单元测试](#实现单元测试)
- - [前向Operator单测](#前向Operator单测)
- - [反向Operator单测](#反向Operator单测)
+ - [前向Operator单测](#前向operator单测)
+ - [反向Operator单测](#反向operator单测)
- [编译和执行](#编译和执行)
+ - [注意事项](#注意事项)
## 概念简介
@@ -30,8 +31,8 @@
-------------- | :----------------------
OpProtoMake定义 | `.cc`文件,Backward Op不需要定义OpProtoMake
Op定义 | `.cc`文件
-Kernel实现 | CPU、GPU共享Kernel实现在`.h`文件中,否则,CPU 实现在`.cc`文件中,GPU 实现在`.cu`文件中。
-注册Op | Op注册实现在`.cc`文件;Kernel注册CPU实现在`.cc`文件中,GPU实现在`.cu`文件中
+Kernel实现 | CPU、CUDA共享Kernel实现在`.h`文件中,否则,CPU 实现在`.cc`文件中,CUDA 实现在`.cu`文件中。
+注册Op | Op注册实现在`.cc`文件;Kernel注册CPU实现在`.cc`文件中,CUDA实现在`.cu`文件中
实现新的op都添加至目录[paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators)下,文件命名以`*_op.h`(如有) 、 `*_op.cc` 、`*_op.cu`(如有)结尾。**系统会根据文件名自动构建op和其对应的Python扩展。**
@@ -43,7 +44,7 @@ Kernel实现 | CPU、GPU共享Kernel实现在`.h`文件中,否则,CPU
## 实现C++类
-### 1. 定义ProtoMaker类
+### 定义ProtoMaker类
矩阵乘法的公式:$Out = X * Y$, 可见该计算由两个输入,一个输出组成。
@@ -100,7 +101,7 @@ The equation is: Out = scale*X
- `AddAttr("scale", "...").SetDefault(1.0);` : 增加`scale`系数,作为参数属性,并且设置默认值为1.0。
-### 2. 定义Operator类
+### 定义Operator类
下面的点实现了MulOp的定义:
@@ -149,11 +150,11 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和下面将要介绍的注册函数一起放在`.cc`中
-### 3. 定义OpKernel类
+### 定义OpKernel类
`MulKernel`继承自`framework::OpKernel`,带有下面两个模板参数:
-- `typename Place`: 表示设备类型,不同设备(CPU、GPU)共享同一个Kernel时,需加该模板参数,不共享则不加,一个不共享的例子是[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。
+- `typename DeviceContext`: 表示设备类型,不同设备(CPU、CUDA)共享同一个Kernel时,需加该模板参数,不共享则不加,一个不共享的例子是[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。
- `typename T` : 表示数据类型,如`float`, `double`等。
@@ -165,7 +166,7 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
下面是 `MulKernel` `Compute`的实现:
```cpp
- template <typename Place, typename T>
+ template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
@@ -173,33 +174,32 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
auto* Y = context.Input<Tensor>("Y");
auto* Z = context.Output<Tensor>("Out");
Z->mutable_data<T>(context.GetPlace());
- auto* device_context =
- const_cast<platform::DeviceContext*>(context.device_context_);
- math::matmul<Place, T>(*X, false, *Y, false, 1, Z, 0, device_context);
+ auto& device_context = context.template device_context<DeviceContext>();
+ math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
}
};
```
-需要注意:**不同设备(CPU、GPU)共享一个Op定义,是否则共享同一个`OpKernel`,取决于`Compute`调用的函数是否支持不同设备。**
+需要注意:**不同设备(CPU、CUDA)共享一个Op定义,是否共享同一个`OpKernel`,取决于`Compute`调用的函数是否支持不同设备。**
-`MulOp`的CPU、GPU实现共享同一个`Kernel`。`OpKernel`不共享的例子可以参考:[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。
+`MulOp`的CPU、CUDA实现共享同一个`Kernel`。`OpKernel`不共享的例子可以参考:[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。
-为了使`OpKernel`的计算过程书写更加简单,并且CPU、GPU的代码可以复用,我们通常借助 Eigen unsupported Tensor模块来实现`Compute`接口。关于在PaddlePaddle中如何使用Eigen库,请参考[使用文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md)。
+为了使`OpKernel`的计算过程书写更加简单,并且CPU、CUDA的代码可以复用,我们通常借助 Eigen unsupported Tensor模块来实现`Compute`接口。关于在PaddlePaddle中如何使用Eigen库,请参考[使用文档](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md)。
到此,前向Op实现完成。接下来,需要在`.cc`文件中注册该op和kernel。
反向Op类的定义,反向OpKernel的定义与前向Op类似,这里不再赘述。**但需注意反向Op没有`ProtoMaker`**。
-### 4. 注册Operator
+### 注册Operator
- 在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。
```cpp
namespace ops = paddle::operators;
REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad);
- REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUPlace, float>);
+ REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
- ops::MulGradKernel<paddle::platform::CPUPlace, float>);
+ ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
```
在上面的代码中:
@@ -209,20 +209,20 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
- `REGISTER_OP_CPU_KERNEL` :注册`ops::MulKernel`类,并特化模板参数为`paddle::platform::CPUPlace`和`float`类型,同理,注册`ops::MulGradKernel`类。
-- 在 `.cu`文件中注册GPU Kernel。
- - 请注意,如果GPU Kernel的实现基于Eigen unsupported模块,那么在 `.cu`的开始请加上宏定义 `#define EIGEN_USE_GPU`,代码示例如下:
+- 在 `.cu`文件中注册CUDA Kernel。
+ - 请注意,如果CUDA Kernel的实现基于Eigen unsupported模块,那么在 `.cu`的开始请加上宏定义 `#define EIGEN_USE_GPU`,代码示例如下:
```cpp
// if use Eigen unsupported module before include head files
- // #define EIGEN_USE_GPU
+ #define EIGEN_USE_GPU
namespace ops = paddle::operators;
- REGISTER_OP_GPU_KERNEL(mul, ops::MulKernel<paddle::platform::GPUPlace, float>);
- REGISTER_OP_GPU_KERNEL(mul_grad,
- ops::MulGradKernel<paddle::platform::GPUPlace, float>);
+ REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
+ REGISTER_OP_CUDA_KERNEL(mul_grad,
+ ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
```
-### 5. 编译
+### 编译
运行下面命令可以进行编译:
@@ -236,71 +236,57 @@ make mul_op
## 实现单元测试
-单测包括对比前向Op不同设备(CPU、GPU)的实现、对比反向OP不同设备(CPU、GPU)的实现、反向Op的梯度测试。下面介绍介绍[`MulOp`的单元测试](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py)。
+单测包括对比前向Op不同设备(CPU、CUDA)的实现、对比反向Op不同设备(CPU、CUDA)的实现、反向Op的梯度测试。下面介绍[`MulOp`的单元测试](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py)。
-### 前向Operator单元测试
+### 前向Operator单测
-前向Op单元测试继承自`unittest.TestCase`,并定义元类`__metaclass__ = OpTestMeta`。各项更加具体的单元测试在`OpTestMeta`里完成。测试前向Operator,需要:
+Op单元测试继承自`OpTest`。各项更加具体的单元测试在`TestMulOp`里完成。测试Operator,需要:
1. 在`setUp`函数定义输入、输出,以及相关的属性参数。
2. 生成随机的输入数据。
3. 在Python脚本中实现与前向operator相同的计算逻辑,得到输出值,与operator前向计算的输出进行对比。
+4. 反向计算已经自动集成进测试框架,直接调用相应接口即可。
```python
import unittest
import numpy as np
- from gradient_checker import GradientChecker, create_op
- from op_test_util import OpTestMeta
+ from op_test import OpTest
- class TestMulOp(unittest.TestCase):
- __metaclass__ = OpTestMeta
+ class TestMulOp(OpTest):
def setUp(self):
- self.type = "mul"
+ self.op_type = "mul"
self.inputs = {
'X': np.random.random((32, 84)).astype("float32"),
'Y': np.random.random((84, 100)).astype("float32")
}
self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
- ```
-
-上面的代码首先导入依赖的包,下面是对`setUp`函数中操作的重要变量的详细解释:
-
-- `self.type = "mul" ` : 定义类型,与operator注册时注册的类型一致。
-- `self.inputs` : 定义输入,类型为`numpy.array`,并初始化。
-- `self.outputs` : 定义输出,并在Python脚本中完成与operator同样的计算逻辑,返回Python端的计算结果。
+ def test_check_output(self):
+ self.check_output()
-### 反向Operator单元测试
+ def test_check_grad_normal(self):
+ self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
-反向Op单元测试继承自`GradientChecker`,而`GradientChecker`继承自`unittest.TestCase`,因此,**反向单元测试函数需要以`test_`开头**。
+ def test_check_grad_ingore_x(self):
+ self.check_grad(
+ ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
-```python
-class TestMulGradOp(GradientChecker):
- def setUp(self):
- self.op = create_op("mul")
- self.inputs = {
- 'X': np.random.random((32, 84)).astype("float32"),
- 'Y': np.random.random((84, 100)).astype("float32")
- }
-
- def test_check_grad_normal(self):
- # mul op will enlarge the relative error
- self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
+ def test_check_grad_ingore_y(self):
+ self.check_grad(
+ ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
+ ```
- def test_check_grad_ingore_x(self):
- self.check_grad(
- ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
+上面的代码首先导入依赖的包,下面是对`setUp`函数中操作的重要变量的详细解释:
- def test_check_grad_ingore_y(self):
- self.check_grad(
- ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
-```
+- `self.op_type = "mul" ` : 定义类型,与operator注册时注册的类型一致。
+- `self.inputs` : 定义输入,类型为`numpy.array`,并初始化。
+- `self.outputs` : 定义输出,并在Python脚本中完成与operator同样的计算逻辑,返回Python端的计算结果。
-下面解释代码中一些关键的地方:
+### 反向Operator单测
-- 调用`create_op("mul")`创建反向Op对应的前向Op。
+反向测试中:
- `test_check_grad_normal`中调用`check_grad`使用数值法检测梯度正确性和稳定性。
- 第一个参数`["X", "Y"]` : 指定对输入变量`X`、`Y`做梯度检测。
- 第二个参数`"Out"` : 指定前向网络最终的输出目标变量`Out`。
@@ -308,7 +294,7 @@ class TestMulGradOp(GradientChecker):
- `test_check_grad_ingore_x`和`test_check_grad_ingore_y`分支用来测试只需要计算一个输入梯度的情况。
-### 编译和执行单元测试
+### 编译和执行
`python/paddle/v2/framework/tests` 目录下新增的 `test_*.py` 单元测试会被自动加入工程进行编译。
@@ -328,5 +314,5 @@ ctest -R test_mul_op
- 为每个Op创建单独的`*_op.h`(如有)、`*_op.cc`和`*_op.cu`(如有)。不允许一个文件中包含多个Op,这将会导致编译出错。
- 注册Op时的类型名,需要和该Op的名字一样。即不允许在`A_op.cc`里面,注册`REGISTER_OP(B, ...)`等,这将会导致单元测试出错。
-- 如果Op没有实现GPU Kernel,请不要创建空的`*_op.cu`,这将会导致单元测试出错。
+- 如果Op没有实现CUDA Kernel,请不要创建空的`*_op.cu`,这将会导致单元测试出错。
- 如果多个Op依赖一些共用的函数,可以创建非`*_op.*`格式的文件来存放,如`gather.h`文件。
diff --git a/doc/howto/dev/new_op_en.md b/doc/howto/dev/new_op_en.md
index 1e88e1f5b4..fe86936bc1 100644
--- a/doc/howto/dev/new_op_en.md
+++ b/doc/howto/dev/new_op_en.md
@@ -1,8 +1,8 @@
# How to write a new operator
- [Background](#background)
- - [Implementing C++ Types](#implementing-c++-types)
- - [Defining ProtoMaker](#defining-protoMaker)
+ - [Implementing C++ Types](#implementing-c-types)
+ - [Defining ProtoMaker](#defining-protomaker)
- [Defining Operator](#defining-operator)
- [Registering Operator](#registering-operator)
- [Compilation](#compilation)
@@ -28,8 +28,8 @@ An operator can be differentiated by whether it has kernel methods. An operator
-------------- | :----------------------
OpProtoMake definition | `.cc`files, Backward Op does not need an OpProtoMake interface.
Op definition | `.cc` files
-Kernel implementation | The kernel methods shared between CPU and GPU are defined in `.h` files. CPU-specific kernels live in `.cc` files, while GPU-specific kernels are implemented in `.cu`files.
-Registering the Op | Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the GPU implementation.
+Kernel implementation | The kernel methods shared between CPU and CUDA are defined in `.h` files. CPU-specific kernels live in `.cc` files, while CUDA-specific kernels are implemented in `.cu`files.
+Registering the Op | Ops are registered in `.cc` files; For Kernel registration, `.cc` files contain the CPU implementation, while `.cu` files contain the CUDA implementation.
New Operator implementations are added to the list [paddle/operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators), with file names in the format `*_op.h` (if applicable), `*_op.cc`, `*_op.cu` (if applicable). **The system will use the naming scheme to automatically build operators and their corresponding Python extensions.**
@@ -41,7 +41,7 @@ Let's take matrix multiplication operator, [MulOp](https://github.com/PaddlePadd
## Implementing C++ Types
-### 1. Defining Class ProtoMaker
+### Defining ProtoMaker
Matrix Multiplication can be written as $Out = X * Y$, meaning that the operation consists of two inputs and one output.
@@ -98,7 +98,7 @@ There are two changes in this example:
- `AddAttr("scale", "...").SetDefault(1.0);` adds `scale`constant as an attribute, and sets the default value to 1.0.
-### 2. Defining Operator
+### Defining Operator
The following code defines the interface for MulOp:
@@ -147,11 +147,11 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs,
Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, which also include the registration methods introduced later.
-### 3. Defining OpKernel
+### Defining OpKernel
`MulKernel` inherits `framework::OpKernel`, which includes the following templates:
-- `typename Place` denotes device type. When different devices, namely the CPU and the GPU, share the same kernel, this template needs to be added. If they don't share kernels, this must not be added. An example of a non-sharing kernel is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).
+- `typename DeviceContext` denotes device context type. When different devices, namely the CPUDeviceContext and the CUDADeviceContext, share the same kernel, this template needs to be added. If they don't share kernels, this must not be added. An example of a non-sharing kernel is [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).
- `typename T` denotes data type, such as `float` or `double`.
@@ -163,7 +163,7 @@ Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, w
`MulKernel`'s implementation of `Compute` is as follows:
```cpp
- template <typename Place, typename T>
+ template <typename DeviceContext, typename T>
class MulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
@@ -171,16 +171,15 @@ Usually `OpProtoMaker` and `Op`'s type definitions are written in `.cc` files, w
auto* Y = context.Input<Tensor>("Y");
auto* Z = context.Output<Tensor>("Out");
Z->mutable_data<T>(context.GetPlace());
- auto* device_context =
- const_cast<platform::DeviceContext*>(context.device_context_);
- math::matmul<Place, T>(*X, false, *Y, false, 1, Z, 0, device_context);
+ auto& device_context = context.template device_context<DeviceContext>();
+ math::matmul<DeviceContext, T>(*X, false, *Y, false, 1, Z, 0, device_context);
}
};
```
-Note that **different devices (CPU, GPU)share an Op definition; whether or not they share the same `OpKernel` depends on whether `Compute` calls functions that support both devices.**
+Note that **different devices (CPU, CUDA) share an Op definition; whether or not they share the same `OpKernel` depends on whether `Compute` calls functions that support both devices.**
-`MulOp`'s CPU and GPU share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).
+`MulOp`'s CPU and CUDA share the same `Kernel`. A non-sharing `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).
To ease the writing of `OpKernel` compute, and for reusing code cross-device, [`Eigen-unsupported Tensor`](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default) module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md).
@@ -189,16 +188,16 @@ This concludes the forward implementation of an operator. Next its operation and
The definition of its corresponding backward operator, if applicable, is similar to that of a forward operator. **Note that a backward operator does not include a `ProtoMaker`**.
-### 4. Registering Operator
+### Registering Operator
- In `.cc` files, register forward and backward operator classes and the CPU kernel.
```cpp
namespace ops = paddle::operators;
REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad);
- REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUPlace, float>);
+ REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
REGISTER_OP_CPU_KERNEL(mul_grad,
- ops::MulGradKernel<paddle::platform::CPUPlace, float>);
+ ops::MulGradKernel<paddle::platform::CPUDeviceContext, float>);
```
In that code block,
@@ -208,20 +207,20 @@ The definition of its corresponding backward operator, if applicable, is similar
- `REGISTER_OP_CPU_KERNEL` registers `ops::MulKernel` class and specialized template types `paddle::platform::CPUPlace` and `float`, which also registers `ops::MulGradKernel`.
-- Registering GPU Kernel in `.cu` files
- - Note that if GPU Kernel is implemented using the `Eigen unsupported` module, then on top of `.cu`, a macro definition `#define EIGEN_USE_GPU` is needed, such as
+- Registering CUDA Kernel in `.cu` files
+ - Note that if CUDA Kernel is implemented using the `Eigen unsupported` module, then on top of `.cu`, a macro definition `#define EIGEN_USE_GPU` is needed, such as
```cpp
// if use Eigen unsupported module before include head files
#define EIGEN_USE_GPU
namespace ops = paddle::operators;
- REGISTER_OP_GPU_KERNEL(mul, ops::MulKernel<paddle::platform::GPUPlace, float>);
- REGISTER_OP_GPU_KERNEL(mul_grad,
- ops::MulGradKernel<paddle::platform::GPUPlace, float>);
+ REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
+ REGISTER_OP_CUDA_KERNEL(mul_grad,
+ ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
```
-### 5. Compilation
+### Compilation
Run the following commands to compile.
@@ -253,62 +252,51 @@ A forward operator unit test inherits `unittest.TestCase` and defines metaclass
2. Generating random input data.
-3. Implementing the same computation logic in a Python script:
+3. Implementing the same computation logic in a Python script.
+
+4. Calling the gradient-check functions to verify the backward operator.
```python
import unittest
import numpy as np
- from gradient_checker import GradientChecker, create_op
- from op_test_util import OpTestMeta
+ from op_test import OpTest
- class TestMulOp(unittest.TestCase):
- __metaclass__ = OpTestMeta
+ class TestMulOp(OpTest):
def setUp(self):
- self.type = "mul"
+ self.op_type = "mul"
self.inputs = {
'X': np.random.random((32, 84)).astype("float32"),
'Y': np.random.random((84, 100)).astype("float32")
}
self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])}
+
+ def test_check_output(self):
+ self.check_output()
+
+ def test_check_grad_normal(self):
+ self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
+
+ def test_check_grad_ingore_x(self):
+ self.check_grad(
+ ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
+
+ def test_check_grad_ingore_y(self):
+ self.check_grad(
+ ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
```
Get its output, and compare it with the forward operator's own output.
The code above first loads required packages. In addition, we have
-- `self.type = "mul" ` defines the type that is identical to what the operator's registered type.
+- `self.op_type = "mul" ` defines the type that is identical to what the operator's registered type.
- `self.inputs` defines input, with type `numpy.array` and initializes it.
- `self.outputs` defines output and completes the same operator computation in the Python script, and returns its result from the Python script.
### Testing Backward Operators
-A backward operator unit test inherits `GradientChecker`, which inherits `unittest.TestCase`. As a result, **a backward operator unit test needs to be have the prefix `test_`**.
-
-```python
-class TestMulGradOp(GradientChecker):
- def setUp(self):
- self.op = create_op("mul")
- self.inputs = {
- 'X': np.random.random((32, 84)).astype("float32"),
- 'Y': np.random.random((84, 100)).astype("float32")
- }
-
- def test_check_grad_normal(self):
- # mul op will enlarge the relative error
- self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
-
- def test_check_grad_ingore_x(self):
- self.check_grad(
- ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
-
- def test_check_grad_ingore_y(self):
- self.check_grad(
- ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
-```
-
-Some key points in the code above include:
+Some key points in the gradient checks above include:
-- `create_op("mul")` creates the backward operator's corresponding forward operator.
- `test_check_grad_normal` calls `check_grad` to validate the gradients' correctness and stability through numerical methods.
  - The first argument `["X", "Y"]` specifies gradient checking for the input variables `X` and `Y`.
  - The second argument `"Out"` specifies the network's final output target `Out`.
@@ -338,5 +326,5 @@ ctest -R test_mul_op
- Every `*_op.h` (if applicable), `*_op.cc`, and `*_op.cu` (if applicable) must be created for a unique Op. Compiling will fail if multiple operators are included per file.
- The type with which an operator is registered needs to be identical to the Op's name. Registering `REGISTER_OP(B, ...)` in `A_op.cc` will cause unit testing failures.
-- If the operator does not implement a GPU kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail.
+- If the operator does not implement a CUDA kernel, please refrain from creating an empty `*_op.cu` file, or else unit tests will fail.
- If multiple operators rely on some shared methods, a file NOT named `*_op.*` can be created to store them, such as `gather.h`.
diff --git a/doc/howto/read_source.md b/doc/howto/read_source.md
new file mode 100644
index 0000000000..383acb0c82
--- /dev/null
+++ b/doc/howto/read_source.md
@@ -0,0 +1,67 @@
+# PaddlePaddle Fluid Source Code Overview
+
+Examples: https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/v2/fluid/tests/book
+
+Core: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework
+
+Operator: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators
+
+Optimizer: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/optimizer
+
+Memory: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory
+
+# Compile Time
+
+The following **defines** the NN. The definition goes into this [protocol buffer](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto).
+
+```python
+x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+y = fluid.layers.data(name='y', shape=[1], dtype='float32')
+
+y_predict = fluid.layers.fc(input=x, size=1, act=None)
+cost = fluid.layers.square_error_cost(input=y_predict, label=y)
+avg_cost = fluid.layers.mean(x=cost)
+
+sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
+sgd_optimizer.minimize(avg_cost)
+```
+
+- Variables: `x`, `y`, `y_predict`, `cost` and `avg_cost`. [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/framework.py#L93)
+- Layers: `fluid.layers.data`, `fluid.layers.fc` and `fluid.layers.mean` are layers. [Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/layers.py)
+ - Every Layer has one or more operators and variables/parameters
+  - All the operators are defined at [`paddle/operators/`](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators). Other files worth reading:
+ - Base class: [`paddle/framework/operator.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/operator.h)
+ - Operator Registration: [`paddle/framework/op_registry.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/op_registry.h)
+ - Operator Lookup: [`paddle/framework/op_info.h`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/op_info.h)
+- Optimizer: `fluid.optimizer.SGD`. It does the following (see the sketch after this list):
+ - Add backward operators. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/backward.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/backward.cc)]
+ - Add optimizer operators. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/optimizer.py), [C++](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/optimizer)]
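+
+As a quick, illustrative check (a sketch assuming the snippet above has been run;
+`global_block()`, `ops`, and `op.type` come from the Python `framework.py` linked
+above), one can list the operators that `minimize` appended:
+
+```python
+for op in fluid.default_main_program().global_block().ops:
+    print(op.type)  # forward ops, then *_grad ops, then optimizer ops such as sgd
+```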
+
+# Run Time
+
+The following **evaluates** the NN. It instantiates all the variables and operators.
+
+```python
+place = fluid.CPUPlace()
+feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
+exe = fluid.Executor(place)
+
+# Allocate memory. Initialize Parameter.
+exe.run(fluid.default_startup_program())
+
+# Allocate memory. Do computation.
+exe.run(fluid.default_main_program(),
+ feed=feeder.feed(data),
+ fetch_list=[avg_cost])
+```
+
+- Place: `place`. One of CPU, GPU, or FPGA. [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/place.h)
+  - The device handles are at [paddle/platform/device_context.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/platform/device_context.h)
+- Executor: `fluid.Executor(place)`. [[Python](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/executor.py), [C++](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.cc)]
+ - Feeds the data: `feed=feeder.feed(data)`
+ - Evaluates all the operators
+ - Fetches the result: `fetch_list=[avg_cost]`
+- Other files worth reading:
+ - Scope: [paddle/framework/scope.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/scope.h). Where all the variables live
+ - Variable: [paddle/framework/variable.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/variable.h). Where all the data (most likely tensors) live
+ - Tensor: [paddle/framework/tensor.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/tensor.h). Where we allocate memory through [`paddle/memory/`](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory)
diff --git a/paddle/api/CMakeLists.txt b/paddle/api/CMakeLists.txt
index d6b8464100..cf84568ecd 100644
--- a/paddle/api/CMakeLists.txt
+++ b/paddle/api/CMakeLists.txt
@@ -25,8 +25,18 @@ FILE(GLOB PY_PADDLE_PYTHON_FILES ${PADDLE_SOURCE_DIR}/paddle/py_paddle/*.py)
SET_SOURCE_FILES_PROPERTIES(Paddle.i PROPERTIES CPLUSPLUS ON)
+SET(SWIG_NEED_FLAGS
+ -ftls-model=global-dynamic
+ -Wno-parentheses-equality
+ -Wno-self-assign
+ -Wno-maybe-uninitialized
+ -Wno-missing-field-initializers)
+FOREACH(flag ${SWIG_NEED_FLAGS})
+ safe_set_cxxflag(SWIG_CXX_FLAGS ${flag})
+ENDFOREACH()
+
SET(CMAKE_SWIG_OUTDIR ${CMAKE_CURRENT_BINARY_DIR})
-SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality -Wno-missing-field-initializers -Wno-self-assign -ftls-model=global-dynamic")
+SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${SWIG_CXX_FLAGS}")
SET(SWIG_MODULE_swig_paddle_EXTRA_DEPS
paddle_parameter
diff --git a/paddle/framework/backward.cc b/paddle/framework/backward.cc
index 7294ba1a9c..a17036c652 100644
--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -190,8 +190,9 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
// collect all the offset for each alias,
// insert a sum operator to add all aliases to output
insert_position.push_back(
- {dup_op.back(), OpRegistry::CreateOp("sum", {{"X", dup_outputs}},
- {{"Out", {name}}}, {})});
+ {dup_op.back(),
+ OpRegistry::CreateOp("sum", {{"X", dup_outputs}}, {{"Out", {name}}},
+ AttributeMap{})});
}
// make sure the inserted `sum` ops follow the BFS order.
@@ -216,7 +217,8 @@ static std::unique_ptr BackwardRecursive(
// If part of input gradient of that operator is not calculated, fill
// zero variables to that input gradient.
net->AppendOp(OpRegistry::CreateOp("fill_zeros_like", {{"X", {prefix}}},
- {{"Y", {grad_input}}}, {}));
+ {{"Y", {grad_input}}},
+ AttributeMap{}));
}
return false;
});
@@ -392,8 +394,9 @@ std::vector<std::unique_ptr<OpDescBind>> MakeOpGrad(
0, in_name.size() - sizeof(kGradVarSuffix) / sizeof(char) + 1);
std::string new_name = prefix + kZeroVarSuffix;
desc->Rename(in_name, new_name);
- std::unique_ptr<OpDescBind> fill_zeros_op(new OpDescBind(
- "fill_zeros_like", {{"X", {prefix}}}, {{"Y", {new_name}}}, {}));
+ std::unique_ptr<OpDescBind> fill_zeros_op(
+ new OpDescBind("fill_zeros_like", {{"X", {prefix}}},
+ {{"Y", {new_name}}}, AttributeMap{}));
pending_fill_zeros_ops.push_back(std::move(fill_zeros_op));
}
}
@@ -483,8 +486,9 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
sum_op_inputs.emplace_back(new_name);
next_g_name = sum_op_inputs.back();
}
- std::unique_ptr<OpDescBind> sum_op(new OpDescBind(
- "sum", {{"X", sum_op_inputs}}, {{"Out", {out_name}}}, {}));
+ std::unique_ptr<OpDescBind> sum_op(
+ new OpDescBind("sum", {{"X", sum_op_inputs}}, {{"Out", {out_name}}},
+ AttributeMap{}));
pending_sum_ops.push_back({dup_op.back(), std::move(sum_op)});
}
}
diff --git a/paddle/framework/backward_test.cc b/paddle/framework/backward_test.cc
index 2b858f5ea0..9fe49881d5 100644
--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -106,15 +106,15 @@ class FcOp : public operators::NetOp {
FcOp(const std::string &type, const VariableNameMap &inputs,
const VariableNameMap &outputs, const AttributeMap &attrs)
: NetOp(type, inputs, outputs, attrs) {
- AppendOp(OpRegistry::CreateOp("mul",
- {{"X", {Input("X")}}, {"Y", {Input("W")}}},
- {{"Out", {Output("mul_result")}}}, {}));
+ AppendOp(OpRegistry::CreateOp(
+ "mul", {{"X", {Input("X")}}, {"Y", {Input("W")}}},
+ {{"Out", {Output("mul_result")}}}, AttributeMap{}));
auto input_b = Inputs("b");
std::string before_act = "mul_result";
if (input_b.size() != 0) {
AppendOp(OpRegistry::CreateOp(
"rowwise_add", {{"X", {Output("mul_result")}}, {"b", {input_b[0]}}},
- {{"Out", {Output("add_result")}}}, {}));
+ {{"Out", {Output("add_result")}}}, AttributeMap{}));
before_act = "add_result";
} else {
auto out_varname = Output("add_result");
@@ -124,7 +124,7 @@ class FcOp : public operators::NetOp {
}
AppendOp(OpRegistry::CreateOp("sigmoid", {{"X", {Output(before_act)}}},
- {{"Out", {Output("Out")}}}, {}));
+ {{"Out", {Output("Out")}}}, AttributeMap{}));
CompleteAddOp(false);
}
};
@@ -278,8 +278,9 @@ REGISTER_OPERATOR(scale, f::NoneOp);
REGISTER_OP_CPU_KERNEL(scale, f::NoneKernel<paddle::platform::CPUPlace, float>);
TEST(Backward, simple_op_not_need_grad) {
- auto fwd = f::OpRegistry::CreateOp(
- "rowwise_add", {{"X", {"x"}}, {"b", {"b"}}}, {{"Out", {"out"}}}, {});
+ auto fwd =
+ f::OpRegistry::CreateOp("rowwise_add", {{"X", {"x"}}, {"b", {"b"}}},
+ {{"Out", {"out"}}}, f::AttributeMap{});
ASSERT_NE(fwd, nullptr);
auto gop = f::Backward(*fwd, {"x"});
ASSERT_EQ(gop->Output(f::GradVarName("X")), f::kEmptyVarName);
@@ -296,9 +297,10 @@ TEST(Backward, net_fc_backward_normal) {
{{"mul_result", {"mul_res"}},
{"add_result", {"add_re"}},
{"Out", {"out"}}},
- {});
+ f::AttributeMap{});
ASSERT_NE(fwd, nullptr);
- std::shared_ptr<f::OperatorBase> gop = f::Backward(*fwd, {});
+ std::shared_ptr<f::OperatorBase> gop =
+ f::Backward(*fwd, std::unordered_set<std::string>{});
ASSERT_TRUE(gop->IsNetOp());
auto net = static_cast<ops::NetOp*>(gop.get());
@@ -322,9 +324,10 @@ TEST(Backward, net_fc_backward_not_have_b) {
{{"mul_result", {"mul_res"}},
{"add_result", {"add_res"}},
{"Out", {"tmp"}}},
- {});
+ f::AttributeMap{});
ASSERT_NE(fwd, nullptr);
- std::shared_ptr<f::OperatorBase> gop = f::Backward(*fwd, {});
+ std::shared_ptr<f::OperatorBase> gop =
+ f::Backward(*fwd, std::unordered_set<std::string>{});
ASSERT_TRUE(gop->IsNetOp());
auto net = static_cast<ops::NetOp*>(gop.get());
@@ -346,13 +349,13 @@ TEST(Backward, net_input_of_network_not_need_grad) {
{{"mul_result", {"mul_tmp_0"}},
{"add_result", {"add_tmp_0"}},
{"Out", {"hidden0"}}},
- {}));
+ f::AttributeMap{}));
net.AppendOp(f::OpRegistry::CreateOp(
"fc", {{"X", {"hidden0"}}, {"W", {"W2"}}, {"b", {"b2"}}},
{{"mul_result", {"mul_tmp_1"}},
{"add_result", {"add_tmp_1"}},
{"Out", {"hidden1"}}},
- {}));
+ f::AttributeMap{}));
net.CompleteAddOp();
auto bwd = Backward(net, {"x"}); // x@GRAD is not need.
ASSERT_TRUE(bwd->IsNetOp());
@@ -381,12 +384,13 @@ TEST(Backward, net_input_of_network_not_need_grad) {
TEST(Backward, net_shared_weight) {
ops::NetOp net;
net.AppendOp(f::OpRegistry::CreateOp("mul", {{"X", {"x"}}, {"Y", {"w"}}},
- {{"Out", {"out"}}}, {}));
+ {{"Out", {"out"}}}, f::AttributeMap{}));
net.AppendOp(f::OpRegistry::CreateOp("mul", {{"X", {"out"}}, {"Y", {"w"}}},
- {{"Out", {"FinalOut"}}}, {}));
+ {{"Out", {"FinalOut"}}},
+ f::AttributeMap{}));
net.CompleteAddOp();
- auto bwd = f::Backward(net, {});
+ auto bwd = f::Backward(net, std::unordered_set<std::string>{});
ASSERT_TRUE(bwd->IsNetOp());
auto bwd_net = static_cast<ops::NetOp*>(bwd.get());
ASSERT_EQ(3UL, bwd_net->ops_.size());
@@ -394,8 +398,9 @@ TEST(Backward, net_shared_weight) {
}
TEST(Backward, op_all_input_are_not_need) {
- auto fwd = f::OpRegistry::CreateOp(
- "rowwise_add", {{"X", {"x"}}, {"b", {"b"}}}, {{"Out", {"out"}}}, {});
+ auto fwd =
+ f::OpRegistry::CreateOp("rowwise_add", {{"X", {"x"}}, {"b", {"b"}}},
+ {{"Out", {"out"}}}, f::AttributeMap{});
auto backward = f::Backward(*fwd, {"x", "b"});
ASSERT_TRUE(backward->IsNetOp());
auto net = static_cast<ops::NetOp*>(backward.get());
@@ -403,8 +408,9 @@ TEST(Backward, op_all_input_are_not_need) {
}
TEST(Backward, op_all_output_are_not_need) {
- auto fwd = f::OpRegistry::CreateOp(
- "rowwise_add", {{"X", {"x"}}, {"b", {"b"}}}, {{"Out", {"out"}}}, {});
+ auto fwd =
+ f::OpRegistry::CreateOp("rowwise_add", {{"X", {"x"}}, {"b", {"b"}}},
+ {{"Out", {"out"}}}, f::AttributeMap{});
auto backward = f::Backward(*fwd, {"out"});
ASSERT_TRUE(backward->IsNetOp());
auto net = static_cast<ops::NetOp*>(backward.get());
@@ -412,8 +418,9 @@ TEST(Backward, op_all_output_are_not_need) {
}
TEST(Backward, op_part_of_output_are_not_need) {
- auto fwd = f::OpRegistry::CreateOp("many_output_op", {{"x", {"X"}}},
- {{"y", {"Y"}}, {"z", {"Z"}}}, {});
+ auto fwd =
+ f::OpRegistry::CreateOp("many_output_op", {{"x", {"X"}}},
+ {{"y", {"Y"}}, {"z", {"Z"}}}, f::AttributeMap{});
auto backward = f::Backward(*fwd, {"Z"});
ASSERT_TRUE(backward->IsNetOp());
auto net = static_cast<ops::NetOp*>(backward.get());
@@ -437,7 +444,7 @@ TEST(Backward, op_part_of_output_are_not_need) {
TEST(Backward, op_part_of_input_are_not_need) {
auto fwd = f::OpRegistry::CreateOp("mul", {{"X", {"a"}}, {"Y", {"b"}}},
- {{"Out", {"out"}}}, {});
+ {{"Out", {"out"}}}, f::AttributeMap{});
auto backward = f::Backward(*fwd, {"a"});
auto &grad_mul = *backward;
ASSERT_EQ(grad_mul.Type(), "mul_grad");
@@ -458,19 +465,19 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {
{{"mul_result", {"mul_out1"}},
{"add_result", {"add_out1"}},
{"Out", {"out1"}}},
- {}));
+ f::AttributeMap{}));
net.AppendOp(f::OpRegistry::CreateOp(
"fc", {{"X", {"out1"}}, {"W", {"w2"}}, {"b", {"b2"}}},
{{"mul_result", {"mul_out2"}},
{"add_result", {"tmp_out2"}},
{"Out", {"out2"}}},
- {}));
+ f::AttributeMap{}));
net.AppendOp(f::OpRegistry::CreateOp(
"fc", {{"X", {"out2"}}, {"W", {"w3"}}, {"b", {"b3"}}},
{{"mul_result", {"mul_out3"}},
{"add_result", {"tmp_out3"}},
{"Out", {"out3"}}},
- {}));
+ f::AttributeMap{}));
net.CompleteAddOp();
auto backward = f::Backward(net, {"mul_out2", "tmp_out2", "out2"});
@@ -509,7 +516,8 @@ TEST(Backward, simple_single_op) {
auto target = f::VarDescBind("out");
target.SetShape({1});
- auto var_to_grad = AppendBackward(program, target, {});
+ auto var_to_grad =
+ AppendBackward(program, target, std::unordered_set<std::string>{});
ASSERT_EQ(block->AllOps().size(), 3UL);
f::OpDescBind *fill_op = block->AllOps()[1];
@@ -546,7 +554,7 @@ TEST(Backward, default_attribute) {
auto target = f::VarDescBind("out");
target.SetShape({1});
- AppendBackward(program, target, {});
+ AppendBackward(program, target, std::unordered_set<std::string>{});
ASSERT_EQ(block->AllOps().size(), 3UL);
EXPECT_EQ(boost::get(op->GetAttr("x_num_col_dims")), 1);
@@ -585,7 +593,8 @@ TEST(Backward, simple_mult_op) {
auto target = f::VarDescBind("out3");
target.SetShape({1});
size_t forward_len = block->AllOps().size();
- auto var_to_grad = AppendBackward(program, target, {});
+ auto var_to_grad =
+ AppendBackward(program, target, std::unordered_set<std::string>{});
ASSERT_EQ(block->AllOps().size(), 6UL + 1);
f::OpDescBind *fill_op = block->AllOps()[forward_len];
@@ -817,7 +826,8 @@ TEST(Backward, shared_var) {
auto target = f::VarDescBind("out3");
target.SetShape({1});
size_t forward_len = block->AllOps().size();
- auto var_to_grad = AppendBackward(program, target, {});
+ auto var_to_grad =
+ AppendBackward(program, target, std::unordered_set<std::string>{});
ASSERT_EQ(block->AllOps().size(), 8UL);
f::OpDescBind *fill_op = block->AllOps()[forward_len];
diff --git a/paddle/framework/op_desc.cc b/paddle/framework/op_desc.cc
index cde3f1ac2e..7ba1e3e4e3 100644
--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -316,8 +316,8 @@ static void InitInferShapeFuncs() {
for (auto &kern_pair : OperatorWithKernel::AllOpKernels()) {
auto op_type = kern_pair.first;
auto &op_info = info_map.at(op_type);
- auto op =
- static_cast<OperatorWithKernel*>(op_info.Creator()("", {}, {}, {}));
+ auto op = static_cast<OperatorWithKernel*>(op_info.Creator()(
+ "", VariableNameMap{}, VariableNameMap{}, AttributeMap{}));
if (op_info.infer_shape_) { // infer_shape has been registered.
continue;
}
diff --git a/paddle/framework/op_registry.h b/paddle/framework/op_registry.h
index daade439e5..b29238432b 100644
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -181,8 +181,8 @@ class OpKernelRegistrar : public Registrar {
return 0; \
}
-#define REGISTER_OP_GPU_KERNEL(op_type, ...) \
- REGISTER_OP_KERNEL(op_type, GPU, ::paddle::platform::GPUPlace, __VA_ARGS__)
+#define REGISTER_OP_CUDA_KERNEL(op_type, ...) \
+ REGISTER_OP_KERNEL(op_type, CUDA, ::paddle::platform::GPUPlace, __VA_ARGS__)
#define REGISTER_OP_CPU_KERNEL(op_type, ...) \
REGISTER_OP_KERNEL(op_type, CPU, ::paddle::platform::CPUPlace, __VA_ARGS__)
@@ -217,7 +217,7 @@ class OpKernelRegistrar : public Registrar {
#else
#define USE_OP_KERNEL(op_type) \
USE_OP_DEVICE_KERNEL(op_type, CPU); \
- USE_OP_DEVICE_KERNEL(op_type, GPU)
+ USE_OP_DEVICE_KERNEL(op_type, CUDA)
#endif
#define USE_NO_KERNEL_OP(op_type) USE_OP_ITSELF(op_type);
@@ -226,9 +226,9 @@ class OpKernelRegistrar : public Registrar {
USE_OP_ITSELF(op_type); \
USE_OP_DEVICE_KERNEL(op_type, CPU);
-#define USE_GPU_ONLY_OP(op_type) \
- USE_OP_ITSELF(op_type); \
- USE_OP_DEVICE_KERNEL(op_type, GPU)
+#define USE_CUDA_ONLY_OP(op_type) \
+ USE_OP_ITSELF(op_type); \
+ USE_OP_DEVICE_KERNEL(op_type, CUDA)
#define USE_OP(op_type) \
USE_OP_ITSELF(op_type); \
diff --git a/paddle/framework/operator.cc b/paddle/framework/operator.cc
index f1444eeee9..e83d754783 100644
--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -22,20 +22,6 @@ limitations under the License. */
namespace paddle {
namespace framework {
-template <>
-Eigen::DefaultDevice& ExecutionContext::GetEigenDevice<
- platform::CPUPlace, Eigen::DefaultDevice>() const {
- return *device_context_.GetEigenDevice<platform::CPUPlace>();
-}
-
-#ifdef PADDLE_WITH_CUDA
-template <>
-Eigen::GpuDevice&
-ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
- return *device_context_.GetEigenDevice<platform::GPUPlace>();
-}
-#endif
-
std::string OperatorBase::Input(const std::string& name) const {
auto& ins = Inputs(name);
PADDLE_ENFORCE_LE(ins.size(), 1UL,
@@ -429,7 +415,7 @@ void OperatorWithKernel::Run(const Scope& scope,
}
OpKernelType OperatorWithKernel::GetKernelType(
const ExecutionContext& ctx) const {
- return OpKernelType(IndicateDataType(ctx), ctx.device_context());
+ return OpKernelType(IndicateDataType(ctx), ctx.GetPlace());
}
DataType OperatorWithKernel::IndicateDataType(
const ExecutionContext& ctx) const {
diff --git a/paddle/framework/operator.h b/paddle/framework/operator.h
index 60861d9293..e60dbfc313 100644
--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -276,17 +276,25 @@ class ExecutionContext {
out_tensor->set_lod(in_tensor.lod());
}
- template <typename PlaceType,
- typename DeviceType = typename platform::EigenDeviceConverter<
- PlaceType>::EigenDeviceType>
- DeviceType& GetEigenDevice() const;
-
platform::Place GetPlace() const { return device_context_.GetPlace(); }
+ template <typename DeviceContextType>
+ const DeviceContextType& device_context() const {
+ return *reinterpret_cast<const DeviceContextType*>(&device_context_);
+ }
+
const platform::DeviceContext& device_context() const {
return device_context_;
}
+#ifdef PADDLE_WITH_CUDA
+ const inline platform::CUDADeviceContext& cuda_device_context() const {
+ PADDLE_ENFORCE(platform::is_gpu_place(device_context_.GetPlace()));
+ return *reinterpret_cast<const platform::CUDADeviceContext*>(
+ &device_context_);
+ }
+#endif
+
//! Get actual name vector for this input.
const std::vector& Inputs(const std::string& name) const {
return op_.Inputs(name);
@@ -297,14 +305,6 @@ class ExecutionContext {
return op_.Outputs(name);
}
-#ifdef PADDLE_WITH_CUDA
- const inline platform::CUDADeviceContext& cuda_device_context() const {
- PADDLE_ENFORCE(platform::is_gpu_place(device_context_.GetPlace()));
- return *reinterpret_cast<const platform::CUDADeviceContext*>(
- &device_context_);
- }
-#endif
-
private:
const OperatorBase& op_;
const Scope& scope_;
diff --git a/paddle/framework/operator_test.cc b/paddle/framework/operator_test.cc
index 1e19f82b34..b678178454 100644
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -115,7 +115,7 @@ class OpWithKernelTest : public OperatorWithKernel {
protected:
void InferShape(framework::InferShapeContext* ctx) const override {}
OpKernelType GetKernelType(const ExecutionContext& ctx) const override {
- return OpKernelType(DataType::FP32, ctx.device_context());
+ return OpKernelType(DataType::FP32, ctx.GetPlace());
}
};
@@ -261,7 +261,9 @@ class OperatorClone : public paddle::framework::OperatorBase {
};
TEST(Operator, Clone) {
- OperatorClone a("ABC", {}, {}, {});
+ OperatorClone a("ABC", paddle::framework::VariableNameMap{},
+ paddle::framework::VariableNameMap{},
+ paddle::framework::AttributeMap{});
auto b = a.Clone();
ASSERT_EQ(a.Type(), b->Type());
}
diff --git a/paddle/framework/prune_test.cc b/paddle/framework/prune_test.cc
index 5988874809..f21df37a29 100644
--- a/paddle/framework/prune_test.cc
+++ b/paddle/framework/prune_test.cc
@@ -54,7 +54,8 @@ TEST(Prune, one_operator) {
f::ProgramDescBind program;
f::BlockDescBind *block = program.MutableBlock(0);
- AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
+ AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, f::AttributeMap{},
+ block);
f::ProgramDesc *pdesc = program.Proto();
f::ProgramDesc pruned;
@@ -71,10 +72,14 @@ TEST(Prune, forward) {
f::ProgramDescBind program;
f::BlockDescBind *block = program.MutableBlock(0);
- AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
- AddOp("one_one", {{"input", {"b"}}}, {{"output", {"c"}}}, {}, block);
- AddOp("one_one", {{"input", {"c"}}}, {{"output", {"d"}}}, {}, block);
- AddOp("one_one", {{"input", {"d"}}}, {{"output", {"e"}}}, {}, block);
+ AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"b"}}}, {{"output", {"c"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"c"}}}, {{"output", {"d"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"d"}}}, {{"output", {"e"}}}, f::AttributeMap{},
+ block);
f::ProgramDesc *pdesc = program.Proto();
@@ -90,11 +95,14 @@ TEST(Prune, multi_input_op) {
f::ProgramDescBind program;
f::BlockDescBind *block = program.MutableBlock(0);
- AddOp("one_one", {{"input", {"a0"}}}, {{"output", {"b0"}}}, {}, block);
- AddOp("one_one", {{"input", {"a1"}}}, {{"output", {"b1"}}}, {}, block);
- AddOp("one_one", {{"input", {"a2"}}}, {{"output", {"b2"}}}, {}, block);
- AddOp("three_one", {{"input", {"b0", "b1", "b2"}}}, {{"output", {"c"}}}, {},
+ AddOp("one_one", {{"input", {"a0"}}}, {{"output", {"b0"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"a1"}}}, {{"output", {"b1"}}}, f::AttributeMap{},
block);
+ AddOp("one_one", {{"input", {"a2"}}}, {{"output", {"b2"}}}, f::AttributeMap{},
+ block);
+ AddOp("three_one", {{"input", {"b0", "b1", "b2"}}}, {{"output", {"c"}}},
+ f::AttributeMap{}, block);
f::ProgramDesc *pdesc = program.Proto();
pdesc->mutable_blocks(0)->mutable_ops(3)->set_is_target(true);
@@ -108,9 +116,12 @@ TEST(Prune, multi_output_op) {
f::ProgramDescBind program;
f::BlockDescBind *block = program.MutableBlock(0);
- AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
- AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
- AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, {}, block);
+ AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}},
+ f::AttributeMap{}, block);
+ AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, f::AttributeMap{},
+ block);
f::ProgramDesc *pdesc = program.Proto();
pdesc->mutable_blocks(0)->mutable_ops(2)->set_is_target(true);
@@ -124,9 +135,12 @@ TEST(Prune, multi_target) {
f::ProgramDescBind program;
f::BlockDescBind *block = program.MutableBlock(0);
- AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
- AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
- AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, {}, block);
+ AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}},
+ f::AttributeMap{}, block);
+ AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, f::AttributeMap{},
+ block);
+ AddOp("one_one", {{"input", {"c"}}}, {{"output", {"c1"}}}, f::AttributeMap{},
+ block);
f::ProgramDesc *pdesc = program.Proto();
pdesc->mutable_blocks(0)->mutable_ops(1)->set_is_target(true);
diff --git a/paddle/gserver/activations/ActivationFunction.cpp b/paddle/gserver/activations/ActivationFunction.cpp
index f5a41b66bf..57c890e488 100644
--- a/paddle/gserver/activations/ActivationFunction.cpp
+++ b/paddle/gserver/activations/ActivationFunction.cpp
@@ -24,7 +24,7 @@ limitations under the License. */
#include "paddle/utils/ClassRegistrar.h"
#include "paddle/utils/Logging.h"
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
#include "MKLDNNActivation.h"
#endif
@@ -490,7 +490,7 @@ Error __must_check backward(Argument& act) {
END_DEFINE_ACTIVATION(log)
ActivationFunction* ActivationFunction::create(const std::string& type) {
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
if (!type.empty() && type.compare(0, 7, "mkldnn_") == 0) {
return MKLDNNActivation::create(type);
}
diff --git a/paddle/gserver/gradientmachines/NeuralNetwork.cpp b/paddle/gserver/gradientmachines/NeuralNetwork.cpp
index be112b4123..68bf37d59d 100644
--- a/paddle/gserver/gradientmachines/NeuralNetwork.cpp
+++ b/paddle/gserver/gradientmachines/NeuralNetwork.cpp
@@ -20,7 +20,7 @@ limitations under the License. */
#include "paddle/utils/Logging.h"
#include "paddle/utils/Stat.h"
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
#include "paddle/gserver/layers/MKLDNNLayer.h"
#endif
@@ -307,7 +307,7 @@ void NeuralNetwork::backward(const UpdateCallback& callback) {
}
void NeuralNetwork::finish() {
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
FOR_EACH_R(layer, layers_) {
     MKLDNNLayerPtr dnnLayer = std::dynamic_pointer_cast<MKLDNNLayer>(*layer);
if (dnnLayer) {
diff --git a/paddle/math/Allocator.h b/paddle/math/Allocator.h
index 94ef561f06..17563bf5e1 100644
--- a/paddle/math/Allocator.h
+++ b/paddle/math/Allocator.h
@@ -48,7 +48,7 @@ public:
*/
virtual void* alloc(size_t size) {
void* ptr;
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
// refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
// memory alignment
CHECK_EQ(posix_memalign(&ptr, 4096ul, size), 0);
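
Both CPU allocators touched by this patch use the same 4096-byte-aligned allocation when MKL-DNN is enabled (see also `system_allocator.cc` below). A minimal sketch of the pattern, with a plain assert standing in for Paddle's `CHECK_EQ`/`PADDLE_ENFORCE_EQ`:

```cpp
#include <cassert>
#include <cstdlib>

// Allocate `size` bytes on a 4096-byte boundary, as MKL-DNN recommends
// for its memory primitives; posix_memalign returns 0 on success.
void* alloc_aligned_4k(size_t size) {
  void* ptr = nullptr;
  int rc = posix_memalign(&ptr, 4096ul, size);
  assert(rc == 0);
  (void)rc;
  return ptr;  // release with free(ptr)
}
```
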
diff --git a/paddle/math/MathFunctions.cpp b/paddle/math/MathFunctions.cpp
index ba86eacbb5..28ab54b450 100644
--- a/paddle/math/MathFunctions.cpp
+++ b/paddle/math/MathFunctions.cpp
@@ -206,7 +206,7 @@ double dotProduct(const int n, const double* x, const double* y) {
}
#endif
-#if defined(PADDLE_USE_MKLML)
+#if defined(PADDLE_WITH_MKLML)
template <>
void vExp(const int n, const float* a, float* r) {
diff --git a/paddle/math/MathFunctions.h b/paddle/math/MathFunctions.h
index f6e77029bd..29fe36e3a4 100644
--- a/paddle/math/MathFunctions.h
+++ b/paddle/math/MathFunctions.h
@@ -15,7 +15,7 @@ limitations under the License. */
#ifndef MATHFUNCTIONS_H_
#define MATHFUNCTIONS_H_
-#ifdef PADDLE_USE_MKLML
+#ifdef PADDLE_WITH_MKLML
 #include <mkl_cblas.h>
 #include <mkl_lapacke.h>
 #include <mkl_vml_functions.h>
diff --git a/paddle/math/Matrix.cpp b/paddle/math/Matrix.cpp
index ebbbdfab1d..1ec4336cab 100644
--- a/paddle/math/Matrix.cpp
+++ b/paddle/math/Matrix.cpp
@@ -28,6 +28,7 @@ limitations under the License. */
#include "hl_top_k.h"
#include "paddle/utils/Logging.h"
+#include "NEONFunctions.h"
#include "paddle/function/GemmFunctor.h"
#include "paddle/utils/ThreadLocal.h"
@@ -4165,16 +4166,36 @@ void CpuMatrix::print(std::ostream& os) const {
void CpuMatrix::paramReluForward(Matrix& data, Matrix& W) {
real* input = data.getData();
real* w = W.getData();
+ real* output = data_;
size_t numElements = data.getWidth();
size_t numSamples = data.getHeight();
size_t paraSize = W.getHeight() * W.getWidth();
CHECK(!(numElements % paraSize)); // this check from ParameterReluLayer::init
+
size_t partial_sum = numElements / paraSize;
+ if (paraSize == numElements) {
+ for (size_t n = 0; n < numSamples * numElements; ++n) {
+ output[n] = input[n] > 0 ? input[n] : input[n] * w[n % numElements];
+ }
+ return;
+ }
+
+#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+ for (size_t n = 0; n < numSamples; ++n) {
+ for (size_t i = 0; i < paraSize; i++) {
+ neon::prelu(
+ input + i * partial_sum, w[i], output + i * partial_sum, partial_sum);
+ }
+ input = input + numElements;
+ output = output + numElements;
+ }
+#else
for (size_t n = 0, k = 0; n < numSamples; ++n) {
for (size_t i = 0; i < numElements; ++i, ++k) {
- data_[k] = input[k] > 0 ? input[k] : input[k] * w[i / partial_sum];
+ output[k] = input[k] > 0 ? input[k] : input[k] * w[i / partial_sum];
}
}
+#endif
}
void CpuMatrix::paramReluBackwardW(Matrix& oGrad, Matrix& data) {
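
In `paramReluForward`, each of the `paraSize` weights covers `partial_sum = numElements / paraSize` consecutive elements of a sample; the new `paraSize == numElements` branch is the per-element special case, and the NEON path vectorizes the shared-weight case. A scalar reference of the shared-weight layout (a hypothetical free function on plain arrays, standing in for `Matrix`):

```cpp
#include <cstddef>

// Scalar reference for paramReluForward: each of the paraSize weights
// covers `partialSum = numElements / paraSize` consecutive elements.
void param_relu_ref(const float* in, const float* w, float* out,
                    size_t numSamples, size_t numElements, size_t paraSize) {
  const size_t partialSum = numElements / paraSize;
  for (size_t n = 0, k = 0; n < numSamples; ++n) {
    for (size_t i = 0; i < numElements; ++i, ++k) {
      const float slope = w[i / partialSum];
      out[k] = in[k] > 0 ? in[k] : in[k] * slope;
    }
  }
}
```
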
diff --git a/paddle/math/NEONFunctions.cpp b/paddle/math/NEONFunctions.cpp
index 3bf47901f1..0f83149422 100644
--- a/paddle/math/NEONFunctions.cpp
+++ b/paddle/math/NEONFunctions.cpp
@@ -49,6 +49,46 @@ void relu(const float* a, float* b, int len) {
}
}
+// b[i] = a[i] > 0.0f ? a[i] : a[i] * w
+void prelu(const float* a, float w, float* b, int len) {
+ int offset = len % 16;
+ float32x4_t ma0, ma1, ma2, ma3;
+
+ float32x4_t zero = vdupq_n_f32(0.f);
+ float32x4_t vw = vdupq_n_f32(w);
+
+ for (int k = 0; k < len / 16; k++, a += 16, b += 16) {
+ ma0 = vld1q_f32(a);
+ ma1 = vld1q_f32(a + 4);
+ ma2 = vld1q_f32(a + 8);
+ ma3 = vld1q_f32(a + 12);
+
+ uint32x4_t flag0 = vcgtq_f32(ma0, zero);
+ uint32x4_t flag1 = vcgtq_f32(ma1, zero);
+ uint32x4_t flag2 = vcgtq_f32(ma2, zero);
+ uint32x4_t flag3 = vcgtq_f32(ma3, zero);
+
+ float32x4_t mul0 = vmulq_f32(ma0, vw);
+ float32x4_t mul1 = vmulq_f32(ma1, vw);
+ float32x4_t mul2 = vmulq_f32(ma2, vw);
+ float32x4_t mul3 = vmulq_f32(ma3, vw);
+
+ ma0 = vbslq_f32(flag0, ma0, mul0);
+ ma1 = vbslq_f32(flag1, ma1, mul1);
+ ma2 = vbslq_f32(flag2, ma2, mul2);
+ ma3 = vbslq_f32(flag3, ma3, mul3);
+
+ vst1q_f32(b, ma0);
+ vst1q_f32(b + 4, ma1);
+ vst1q_f32(b + 8, ma2);
+ vst1q_f32(b + 12, ma3);
+ }
+
+ for (int i = 0; i < offset; i++) {
+ b[i] = a[i] > 0.0f ? a[i] : a[i] * w;
+ }
+}
+
} // namespace neon
} // namespace paddle
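
The vector loop processes 16 floats per iteration: `vcgtq_f32` builds a greater-than-zero mask and `vbslq_f32` selects per lane between the original value and the scaled one, while the scalar tail covers the remaining `len % 16` elements. A scalar reference like the following is a natural way to sanity-check the vector path; the test helper is hypothetical, not part of the patch:

```cpp
#include <vector>

// Scalar reference mirroring neon::prelu, useful for unit-testing the
// vectorized path against on ARM builds.
void prelu_ref(const float* a, float w, float* b, int len) {
  for (int i = 0; i < len; ++i) b[i] = a[i] > 0.0f ? a[i] : a[i] * w;
}

int main() {
  std::vector<float> in(35), out(35), ref(35);
  for (int i = 0; i < 35; ++i) in[i] = (i % 2 ? -1.0f : 1.0f) * i;
  prelu_ref(in.data(), 0.25f, ref.data(), 35);
  // On ARM: paddle::neon::prelu(in.data(), 0.25f, out.data(), 35);
  // len = 35 = 2 * 16 + 3 exercises both the vector loop and the scalar tail.
}
```
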
diff --git a/paddle/math/NEONFunctions.h b/paddle/math/NEONFunctions.h
index 69085e3335..d67b2f47a8 100644
--- a/paddle/math/NEONFunctions.h
+++ b/paddle/math/NEONFunctions.h
@@ -18,6 +18,7 @@ namespace paddle {
namespace neon {
void relu(const float* a, float* b, int len);
+void prelu(const float* a, float w, float* b, int len);
} // namespace neon
} // namespace paddle
diff --git a/paddle/math/float16.h b/paddle/math/float16.h
index f805cad08b..76ad3a0123 100644
--- a/paddle/math/float16.h
+++ b/paddle/math/float16.h
@@ -101,7 +101,7 @@ public:
half tmp = __float2half(val);
   x = *reinterpret_cast<uint16_t*>(&tmp);
-#elif defined(PADDLE_NEON)
+#elif defined(PADDLE_WITH_NATIVE_FP16)
float32x4_t tmp = vld1q_dup_f32(&val);
float16_t res = vget_lane_f16(vcvt_f16_f32(tmp), 0);
   x = *reinterpret_cast<uint16_t*>(&res);
@@ -252,7 +252,7 @@ public:
   half tmp = *reinterpret_cast<const half*>(this);
return __half2float(tmp);
-#elif defined(PADDLE_NEON)
+#elif defined(PADDLE_WITH_NATIVE_FP16)
   float16x4_t res = vld1_dup_f16(reinterpret_cast<const float16_t*>(this));
return vgetq_lane_f32(vcvt_f32_f16(res), 0);
diff --git a/paddle/math/tests/CMakeLists.txt b/paddle/math/tests/CMakeLists.txt
index 215bac1271..dcd2a34583 100644
--- a/paddle/math/tests/CMakeLists.txt
+++ b/paddle/math/tests/CMakeLists.txt
@@ -34,4 +34,4 @@ add_simple_unittest(test_FPException)
add_simple_unittest(test_GpuProfiler)
add_simple_unittest(test_BaseMatrix)
add_simple_unittest(test_Matrix)
-cc_test(test_float16 SRCS test_float16.cpp)
+add_simple_unittest(test_float16)
diff --git a/paddle/memory/detail/system_allocator.cc b/paddle/memory/detail/system_allocator.cc
index b543b767e8..6a815a1b57 100644
--- a/paddle/memory/detail/system_allocator.cc
+++ b/paddle/memory/detail/system_allocator.cc
@@ -43,7 +43,7 @@ void* CPUAllocator::Alloc(size_t& index, size_t size) {
void* p;
-#ifdef PADDLE_USE_MKLDNN
+#ifdef PADDLE_WITH_MKLDNN
// refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
// memory alignment
PADDLE_ENFORCE_EQ(posix_memalign(&p, 4096ul, size), 0);
diff --git a/paddle/operators/CMakeLists.txt b/paddle/operators/CMakeLists.txt
index 38b89b9eb1..5aaaf99332 100644
--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@@ -138,7 +138,7 @@ function(op_library TARGET)
if ("${TARGET}" STREQUAL "nccl_op")
set(pybind_flag 1)
     # It's enough to add just one operator to pybind
- file(APPEND ${pybind_file} "USE_GPU_ONLY_OP(ncclAllReduce);\n")
+ file(APPEND ${pybind_file} "USE_CUDA_ONLY_OP(ncclAllReduce);\n")
endif()
# reduce_op contains several operators
diff --git a/paddle/operators/accuracy_op.cc b/paddle/operators/accuracy_op.cc
index 2785a8c6fb..76da21c472 100644
--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -57,7 +57,7 @@ class AccuracyOp : public framework::OperatorWithKernel {
const framework::ExecutionContext &ctx) const override {
return framework::OpKernelType(
         framework::ToDataType(ctx.Input<Tensor>("Out")->type()),
- ctx.device_context());
+ ctx.GetPlace());
}
};
diff --git a/paddle/operators/accuracy_op.cu b/paddle/operators/accuracy_op.cu
index d2dcab4e54..539a935302 100644
--- a/paddle/operators/accuracy_op.cu
+++ b/paddle/operators/accuracy_op.cu
@@ -104,5 +104,6 @@ class AccuracyOpCUDAKernel : public framework::OpKernel {
 // FIXME(typhoonzero): the types of T are for inference data.
// label data is always int64
-REGISTER_OP_GPU_KERNEL(accuracy, paddle::operators::AccuracyOpCUDAKernel<float>,
-                       paddle::operators::AccuracyOpCUDAKernel<double>);
+REGISTER_OP_CUDA_KERNEL(accuracy,
+                        paddle::operators::AccuracyOpCUDAKernel<float>,
+                        paddle::operators::AccuracyOpCUDAKernel<double>);
diff --git a/paddle/operators/accuracy_op.h b/paddle/operators/accuracy_op.h
index d060e6eddd..04104a695f 100644
--- a/paddle/operators/accuracy_op.h
+++ b/paddle/operators/accuracy_op.h
@@ -21,7 +21,7 @@ namespace operators {
using Tensor = framework::Tensor;
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AccuracyKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
diff --git a/paddle/operators/activation_op.cc b/paddle/operators/activation_op.cc
index 7f3118f176..63490f0ec9 100644
--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -611,16 +611,17 @@ REGISTER_OP(hard_sigmoid, ops::ActivationOp, ops::HardSigmoidOpMaker,
REGISTER_OP(swish, ops::ActivationOp, ops::SwishOpMaker, swish_grad,
ops::ActivationOpGrad);
-#define REGISTER_ACTIVATION_CPU_KERNEL(act_type, functor, grad_functor)  \
-  REGISTER_OP_CPU_KERNEL(                                                \
-      act_type,                                                          \
-      ops::ActivationKernel<paddle::platform::CPUPlace,                  \
-                            ops::functor<float>>,                        \
-      ops::ActivationKernel<paddle::platform::CPUPlace,                  \
-                            ops::functor<double>>);                      \
-  REGISTER_OP_CPU_KERNEL(                                                \
-      act_type##_grad,                                                   \
-      ops::ActivationGradKernel<paddle::platform::CPUPlace,              \
-                                ops::grad_functor<float>>,               \
-      ops::ActivationGradKernel<paddle::platform::CPUPlace,              \
-                                ops::grad_functor<double>>);
+#define REGISTER_ACTIVATION_CPU_KERNEL(act_type, functor, grad_functor)  \
+  REGISTER_OP_CPU_KERNEL(                                                \
+      act_type,                                                          \
+      ops::ActivationKernel<paddle::platform::CPUDeviceContext,          \
+                            ops::functor<float>>,                        \
+      ops::ActivationKernel<paddle::platform::CPUDeviceContext,          \
+                            ops::functor<double>>);                      \
+  REGISTER_OP_CPU_KERNEL(                                                \
+      act_type##_grad,                                                   \
+      ops::ActivationGradKernel<paddle::platform::CPUDeviceContext,      \
+                                ops::grad_functor<float>>,               \
+      ops::ActivationGradKernel<paddle::platform::CPUDeviceContext,      \
+                                ops::grad_functor<double>>);
FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CPU_KERNEL);
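
`FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CPU_KERNEL)` is an X-macro: the list of activations is written once, and each entry is fed to whatever registration macro the caller supplies. A self-contained miniature of the pattern (names are illustrative, not Paddle's):

```cpp
#include <cstdio>

// The list is defined once; REGISTER is whatever macro the caller supplies.
#define FOR_EACH_ACTIVATION(REGISTER) \
  REGISTER(relu)                      \
  REGISTER(tanh)                      \
  REGISTER(sigmoid)

// One possible "registration": print the op name and its gradient's name,
// mirroring how act_type and act_type##_grad are both registered above.
#define PRINT_KERNEL(act_type) \
  std::printf(#act_type " / " #act_type "_grad\n");

int main() {
  FOR_EACH_ACTIVATION(PRINT_KERNEL)
}
```
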
diff --git a/paddle/operators/activation_op.cu b/paddle/operators/activation_op.cu
index 97737857ab..856d3fc35d 100644
--- a/paddle/operators/activation_op.cu
+++ b/paddle/operators/activation_op.cu
@@ -17,16 +17,17 @@
namespace ops = paddle::operators;
-#define REGISTER_ACTIVATION_GPU_KERNEL(act_type, functor, grad_functor)  \
-  REGISTER_OP_GPU_KERNEL(                                                \
-      act_type,                                                          \
-      ops::ActivationKernel<paddle::platform::GPUPlace,                  \
-                            ops::functor<float>>,                        \
-      ops::ActivationKernel<paddle::platform::GPUPlace,                  \
-                            ops::functor<double>>);                      \
-  REGISTER_OP_GPU_KERNEL(                                                \
-      act_type##_grad,                                                   \
-      ops::ActivationGradKernel<paddle::platform::GPUPlace,              \
-                                ops::grad_functor<float>>,               \
-      ops::ActivationGradKernel<paddle::platform::GPUPlace,              \
-                                ops::grad_functor<double>>);
+#define REGISTER_ACTIVATION_CUDA_KERNEL(act_type, functor, grad_functor)  \
+  REGISTER_OP_CUDA_KERNEL(                                                \
+      act_type,                                                           \
+      ops::ActivationKernel<paddle::platform::CUDADeviceContext,          \
+                            ops::functor<float>>,                         \
+      ops::ActivationKernel<paddle::platform::CUDADeviceContext,          \
+                            ops::functor<double>>);                       \
+  REGISTER_OP_CUDA_KERNEL(                                                \
+      act_type##_grad,                                                    \
+      ops::ActivationGradKernel<paddle::platform::CUDADeviceContext,      \
+                                ops::grad_functor<float>>,                \
+      ops::ActivationGradKernel<paddle::platform::CUDADeviceContext,      \
+                                ops::grad_functor<double>>);
-FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_GPU_KERNEL);
+FOR_EACH_KERNEL_FUNCTOR(REGISTER_ACTIVATION_CUDA_KERNEL);
diff --git a/paddle/operators/activation_op.h b/paddle/operators/activation_op.h
index ac0e0a3b01..75eefca8b8 100644
--- a/paddle/operators/activation_op.h
+++ b/paddle/operators/activation_op.h
@@ -19,7 +19,7 @@
namespace paddle {
namespace operators {
-template <typename Place, typename T, typename Functor>
+template <typename DeviceContext, typename T, typename Functor>
 class ActivationKernel
     : public framework::OpKernel<T> {
public:
@@ -32,18 +32,19 @@ class ActivationKernel
     auto x = framework::EigenVector<T>::Flatten(*X);
     auto y = framework::EigenVector<T>::Flatten(*Y);
-    auto place = context.GetEigenDevice<Place>();
+    auto* place =
+        context.template device_context<DeviceContext>().eigen_device();
Functor functor;
auto attrs = functor.GetAttrs();
for (auto& attr : attrs) {
       *attr.second = context.Attr<float>(attr.first);
}
- functor(place, x, y);
+ functor(*place, x, y);
}
};
-template <typename Place, typename T, typename Functor>
+template <typename DeviceContext, typename T, typename Functor>
 class ActivationGradKernel
     : public framework::OpKernel<T> {
public:
@@ -59,13 +60,14 @@ class ActivationGradKernel
     auto x = framework::EigenVector<T>::Flatten(*X);
     auto y = framework::EigenVector<T>::Flatten(*Y);
     auto dx = framework::EigenVector<T>::Flatten(*dX);
-    auto place = context.GetEigenDevice<Place>();
+    auto* place =
+        context.template device_context<DeviceContext>().eigen_device();
Functor functor;
auto attrs = functor.GetAttrs();
for (auto& attr : attrs) {
       *attr.second = context.Attr<float>(attr.first);
}
- functor(place, x, y, dy, dx);
+ functor(*place, x, y, dy, dx);
}
};
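
The recurring pattern in these kernel changes: instead of asking the execution context for an Eigen device via a `Place` tag (`context.GetEigenDevice<Place>()`), a kernel is now templated on a concrete `DeviceContext` type and pulls the Eigen device out of it. A self-contained miniature of the idiom, with stand-in context types (illustrative only, not Paddle's real classes):

```cpp
#include <cstdio>

// Stand-ins for Paddle's device contexts.
struct CPUDeviceContext  { const char* eigen_device() const { return "cpu";  } };
struct CUDADeviceContext { const char* eigen_device() const { return "cuda"; } };

// One kernel body, parameterized on the context type, mirrors the
// Place -> DeviceContext refactor in this patch.
template <typename DeviceContext>
void run(const DeviceContext& ctx) {
  auto* device = ctx.eigen_device();  // was: context.GetEigenDevice<Place>()
  std::printf("running on %s\n", device);
}

int main() {
  run(CPUDeviceContext{});
  run(CUDADeviceContext{});
}
```
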
diff --git a/paddle/operators/adadelta_op.cc b/paddle/operators/adadelta_op.cc
index 16a7794d5b..507811e7b5 100644
--- a/paddle/operators/adadelta_op.cc
+++ b/paddle/operators/adadelta_op.cc
@@ -92,12 +92,12 @@ for gradient descent.
Adadelta updates are as follows:
-$$avgSquaredGradOut = \rho * avgSquaredGrad + (1 - \rho) * grad * grad \break
-paramUpdate = - $\sqrt{((avgSquaredUpdate + \epsilon) /
- (avgSquaredGrad_out + \epsilon))}$ * grad \break
-avgSquaredUpdateOut = \rho * avgSquaredUpdate + (1 - \rho) *
- {(paramUpdate)}^2 \break
-paramOut = param + paramUpdate$$
+$$
+avg\_squared\_grad\_out = \rho * avg\_squared\_grad + (1 - \rho) * grad * grad \\
+param\_update = - \sqrt{\frac{avg\_squared\_update + \epsilon}{avg\_squared\_grad\_out + \epsilon}} * grad \\
+avg\_squared\_update\_out = \rho * avg\_squared\_update + (1 - \rho) * {param\_update}^2 \\
+param\_out = param + param\_update
+$$
)DOC");
}
@@ -109,5 +109,5 @@ paramOut = param + paramUpdate$$
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(adadelta, ops::AdadeltaOp, ops::AdadeltaOpMaker);
REGISTER_OP_CPU_KERNEL(
-    adadelta, ops::AdadeltaOpKernel<paddle::platform::CPUPlace, float>,
-    ops::AdadeltaOpKernel<paddle::platform::CPUPlace, double>);
+    adadelta, ops::AdadeltaOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::AdadeltaOpKernel<paddle::platform::CPUDeviceContext, double>);
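
To make the corrected doc formulas concrete, a minimal scalar sketch of one Adadelta step (plain C++ standing in for the Eigen expressions in the kernel):

```cpp
#include <cmath>
#include <cstdio>

// One scalar Adadelta step, following the doc formulas above.
void adadelta_step(float grad, float rho, float eps, float& param,
                   float& avg_sq_grad, float& avg_sq_update) {
  avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad;
  float update =
      -std::sqrt((avg_sq_update + eps) / (avg_sq_grad + eps)) * grad;
  avg_sq_update = rho * avg_sq_update + (1 - rho) * update * update;
  param += update;
}

int main() {
  float param = 1.0f, asg = 0.0f, asu = 0.0f;
  adadelta_step(/*grad=*/0.5f, /*rho=*/0.95f, /*eps=*/1e-6f, param, asg, asu);
  std::printf("param after one step: %f\n", param);
}
```
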
diff --git a/paddle/operators/adadelta_op.cu b/paddle/operators/adadelta_op.cu
index 9fb6185207..eee2d0a2f5 100644
--- a/paddle/operators/adadelta_op.cu
+++ b/paddle/operators/adadelta_op.cu
@@ -16,6 +16,6 @@
#include "paddle/operators/adadelta_op.h"
namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(
-    adadelta, ops::AdadeltaOpKernel<paddle::platform::GPUPlace, float>,
-    ops::AdadeltaOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_CUDA_KERNEL(
+    adadelta, ops::AdadeltaOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::AdadeltaOpKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/operators/adadelta_op.h b/paddle/operators/adadelta_op.h
index a8c5f0c8aa..819d0845db 100644
--- a/paddle/operators/adadelta_op.h
+++ b/paddle/operators/adadelta_op.h
@@ -19,7 +19,7 @@ limitations under the License. */
namespace paddle {
namespace operators {
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AdadeltaOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
@@ -51,7 +51,7 @@ class AdadeltaOpKernel : public framework::OpKernel {
         framework::EigenVector<T>::Flatten(*avg_squared_grad_out_tensor);
     auto avg_squared_update_out =
         framework::EigenVector<T>::Flatten(*avg_squared_update_out_tensor);
-    auto place = ctx.GetEigenDevice<Place>();
+    auto& place = *ctx.template device_context<DeviceContext>().eigen_device();
avg_squared_grad_out.device(place) =
rho * avg_squared_grad + (1 - rho) * grad.square();
diff --git a/paddle/operators/adagrad_op.cc b/paddle/operators/adagrad_op.cc
index d6686e3ef3..5d00716316 100644
--- a/paddle/operators/adagrad_op.cc
+++ b/paddle/operators/adagrad_op.cc
@@ -80,8 +80,8 @@ Adaptive Gradient Algorithm (Adagrad).
The update is done as follows:
-$$momentOut = moment + grad * grad \break
-paramOut = param - learningRate * grad / ($\sqrt{momentOut}$ + \epsilon) \break
+$$moment\_out = moment + grad * grad \\
+param\_out = param - \frac{learning\_rate * grad}{\sqrt{moment\_out} + \epsilon}
$$
The original paper(http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
@@ -100,8 +100,8 @@ size_t FindPos(const std::vector& rows, int64_t value) {
} // namespace
 template <typename T>
-struct SparseAdagradFunctor<platform::CPUPlace, T> {
-  void operator()(const platform::DeviceContext& context,
+struct SparseAdagradFunctor<platform::CPUDeviceContext, T> {
+  void operator()(const platform::CPUDeviceContext& context,
const framework::SelectedRows& grad,
const framework::Tensor& learning_rate, T epsilon,
framework::Tensor* moment, framework::Tensor* param) {
@@ -120,7 +120,7 @@ struct SparseAdagradFunctor {
             {static_cast<int64_t>(merge_rows.size()), grad_width}),
context.GetPlace());
-    math::SetConstant<platform::CPUPlace, T> constant_functor;
+    math::SetConstant<platform::CPUDeviceContext, T> constant_functor;
constant_functor(context, grad_merge->mutable_value(), 0.0);
     auto* grad_merge_data = grad_merge->mutable_value()->data<T>();
@@ -144,9 +144,9 @@ struct SparseAdagradFunctor {
auto gs =
         framework::EigenVector<T>::Flatten(*(grad_square->mutable_value()));
     auto gm = framework::EigenVector<T>::Flatten(grad_merge->value());
-    gs.device(*context.GetEigenDevice<platform::CPUPlace>()) = gm * gm;
+    gs.device(*context.eigen_device()) = gm * gm;
-    math::SelectedRowsAddToTensor<platform::CPUPlace, T> functor;
+    math::SelectedRowsAddToTensor<platform::CPUDeviceContext, T> functor;
functor(context, *grad_square, moment);
// 3. update parameter
@@ -164,13 +164,13 @@ struct SparseAdagradFunctor {
}
};
-template struct SparseAdagradFunctor<platform::CPUPlace, float>;
-template struct SparseAdagradFunctor<platform::CPUPlace, double>;
+template struct SparseAdagradFunctor<platform::CPUDeviceContext, float>;
+template struct SparseAdagradFunctor<platform::CPUDeviceContext, double>;
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(adagrad, ops::AdagradOp, ops::AdagradOpMaker);
REGISTER_OP_CPU_KERNEL(
-    adagrad, ops::AdagradOpKernel<paddle::platform::CPUPlace, float>,
-    ops::AdagradOpKernel<paddle::platform::CPUPlace, double>);
+    adagrad, ops::AdagradOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::AdagradOpKernel<paddle::platform::CPUDeviceContext, double>);
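
The sparse path of `SparseAdagradFunctor` first merges duplicate rows of the `SelectedRows` gradient before squaring, so each parameter row receives exactly one accumulated update. A toy sketch of that row-merge step (plain standard containers standing in for `SelectedRows`, scalar values standing in for row vectors):

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Merge duplicate rows of a sparse gradient: rows[i] is the parameter row
// that values[i] contributes to.
std::map<int64_t, float> merge_rows(const std::vector<int64_t>& rows,
                                    const std::vector<float>& values) {
  std::map<int64_t, float> merged;
  for (size_t i = 0; i < rows.size(); ++i) merged[rows[i]] += values[i];
  return merged;
}

int main() {
  auto m = merge_rows({0, 2, 0}, {1.f, 2.f, 3.f});
  for (auto& kv : m)
    std::printf("row %lld -> %f\n", static_cast<long long>(kv.first), kv.second);
  // row 0 -> 4.0, row 2 -> 2.0: each row appears once before squaring.
}
```
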
diff --git a/paddle/operators/adagrad_op.cu b/paddle/operators/adagrad_op.cu
index 1c870214b2..585b2d9289 100644
--- a/paddle/operators/adagrad_op.cu
+++ b/paddle/operators/adagrad_op.cu
@@ -72,8 +72,8 @@ __global__ void SparseAdagradFunctorKernel(const T* grad, const int64_t* rows,
} // namespace
 template <typename T>
-struct SparseAdagradFunctor<platform::GPUPlace, T> {
-  void operator()(const platform::DeviceContext& context,
+struct SparseAdagradFunctor<platform::CUDADeviceContext, T> {
+  void operator()(const platform::CUDADeviceContext& context,
const framework::SelectedRows& grad,
const framework::Tensor& learning_rate, T epsilon,
framework::Tensor* moment, framework::Tensor* param) {
@@ -92,7 +92,7 @@ struct SparseAdagradFunctor {
             {static_cast<int64_t>(merge_rows.size()), grad_width}),
context.GetPlace());
-    math::SetConstant<platform::GPUPlace, T> constant_functor;
+    math::SetConstant<platform::CUDADeviceContext, T> constant_functor;
constant_functor(context, grad_merge->mutable_value(), 0.0);
     auto* grad_merge_data = grad_merge->mutable_value()->data<T>();
@@ -119,9 +119,9 @@ struct SparseAdagradFunctor {
auto gs =
         framework::EigenVector<T>::Flatten(*(grad_square->mutable_value()));
     auto gm = framework::EigenVector<T>::Flatten(grad_merge->value());
-    gs.device(*context.GetEigenDevice<platform::GPUPlace>()) = gm * gm;
+    gs.device(*context.eigen_device()) = gm * gm;
-    math::SelectedRowsAddToTensor<platform::GPUPlace, T> functor;
+    math::SelectedRowsAddToTensor<platform::CUDADeviceContext, T> functor;
functor(context, *grad_square, moment);
// 3. update parameter
@@ -139,13 +139,13 @@ struct SparseAdagradFunctor {
}
};
-template struct SparseAdagradFunctor<platform::GPUPlace, float>;
-template struct SparseAdagradFunctor<platform::GPUPlace, double>;
+template struct SparseAdagradFunctor<platform::CUDADeviceContext, float>;
+template struct SparseAdagradFunctor<platform::CUDADeviceContext, double>;
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(
-    adagrad, ops::AdagradOpKernel<paddle::platform::GPUPlace, float>,
-    ops::AdagradOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_CUDA_KERNEL(
+    adagrad, ops::AdagradOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::AdagradOpKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/operators/adagrad_op.h b/paddle/operators/adagrad_op.h
index 4d4a6434c7..0d77dbcbac 100644
--- a/paddle/operators/adagrad_op.h
+++ b/paddle/operators/adagrad_op.h
@@ -19,15 +19,15 @@ limitations under the License. */
namespace paddle {
namespace operators {
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
struct SparseAdagradFunctor {
- void operator()(const platform::DeviceContext& context,
+ void operator()(const DeviceContext& context,
const framework::SelectedRows& grad,
const framework::Tensor& learning_rate, T epsilon,
framework::Tensor* moment, framework::Tensor* param);
};
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AdagradOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
@@ -52,11 +52,11 @@ class AdagradOpKernel : public framework::OpKernel {
       auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
       auto moment_out = framework::EigenVector<T>::Flatten(*moment_out_tensor);
-      auto place = ctx.GetEigenDevice<Place>();
+      auto* place = ctx.template device_context<DeviceContext>().eigen_device();
- moment_out.device(place) = moment + grad * grad;
+ moment_out.device(*place) = moment + grad * grad;
       Eigen::DSizes<int, 1> m_dsize(moment_out_tensor->numel());
- param_out.device(place) =
+ param_out.device(*place) =
param - lr.broadcast(m_dsize) * grad / (moment_out.sqrt() + epsilon);
     } else if (grad_var->IsType<framework::SelectedRows>()) {
       auto* param_tensor = ctx.Input<framework::Tensor>("Param");
@@ -65,8 +65,9 @@ class AdagradOpKernel : public framework::OpKernel {
       auto* moment_tensor = ctx.Input<framework::Tensor>("Moment");
PADDLE_ENFORCE_EQ(moment_tensor, moment_out_tensor);
-      SparseAdagradFunctor<Place, T> functor;
-      functor(ctx.device_context(), *ctx.Input<framework::SelectedRows>("Grad"),
+      SparseAdagradFunctor<DeviceContext, T> functor;
+      functor(ctx.template device_context<DeviceContext>(),
+              *ctx.Input<framework::SelectedRows>("Grad"),
               *ctx.Input<framework::Tensor>("LearningRate"), epsilon,
moment_out_tensor, param_out_tensor);
} else {
diff --git a/paddle/operators/adam_op.cc b/paddle/operators/adam_op.cc
index 03faa2a7c5..cf6ef6dd53 100644
--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -112,11 +112,13 @@ adaptive estimates of lower-order moments.
Adam updates:
-$$moment_1_{out} = \beta_1 * moment_1 + (1 - \beta_1) * grad \break
-moment_2_{out} = \beta_2 * moment_2 + (1 - \beta_2) * grad * grad \break
-learningRate = learningRate *
- $\sqrt{(1 - \beta_2_{pow})}$ / (1 - \beta_1_{pow}) \break
-paramOut = param - learningRate * moment_1/ ($\sqrt{(moment_2)} + \epsilon)$$
+$$
+moment\_1\_out = \beta_1 * moment\_1 + (1 - \beta_1) * grad \\
+moment\_2\_out = \beta_2 * moment\_2 + (1 - \beta_2) * grad * grad \\
+learning\_rate = learning\_rate *
+ \frac{\sqrt{1 - \beta_{2\_pow}}}{1 - \beta_{1\_pow}} \\
+param\_out = param - learning\_rate * \frac{moment\_1}{\sqrt{moment\_2} + \epsilon}
+$$
)DOC");
}
@@ -126,6 +128,6 @@ paramOut = param - learningRate * moment_1/ ($\sqrt{(moment_2)} + \epsilon)$$
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(adam, ops::AdamOp, ops::AdamOpMaker);
-REGISTER_OP_CPU_KERNEL(adam,
-                       ops::AdamOpKernel<paddle::platform::CPUPlace, float>,
-                       ops::AdamOpKernel<paddle::platform::CPUPlace, double>);
+REGISTER_OP_CPU_KERNEL(
+    adam, ops::AdamOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::AdamOpKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/operators/adam_op.cu b/paddle/operators/adam_op.cu
index 6e34f7818c..c135b37378 100644
--- a/paddle/operators/adam_op.cu
+++ b/paddle/operators/adam_op.cu
@@ -16,6 +16,6 @@
#include "paddle/operators/adam_op.h"
namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(adam,
-                       ops::AdamOpKernel<paddle::platform::GPUPlace, float>,
-                       ops::AdamOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_CUDA_KERNEL(
+    adam, ops::AdamOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::AdamOpKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/operators/adam_op.h b/paddle/operators/adam_op.h
index 7f7fa1da1c..45157842a6 100644
--- a/paddle/operators/adam_op.h
+++ b/paddle/operators/adam_op.h
@@ -19,7 +19,7 @@ limitations under the License. */
namespace paddle {
namespace operators {
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AdamOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
@@ -52,17 +52,17 @@ class AdamOpKernel : public framework::OpKernel {
     auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
     auto moment1_out = framework::EigenVector<T>::Flatten(*moment1_out_tensor);
     auto moment2_out = framework::EigenVector<T>::Flatten(*moment2_out_tensor);
-    auto place = ctx.GetEigenDevice<Place>();
+    auto* place = ctx.template device_context<DeviceContext>().eigen_device();
- moment1_out.device(place) = beta1 * moment1 + (1 - beta1) * grad;
- moment2_out.device(place) = beta2 * moment2 + (1 - beta2) * grad.square();
+ moment1_out.device(*place) = beta1 * moment1 + (1 - beta1) * grad;
+ moment2_out.device(*place) = beta2 * moment2 + (1 - beta2) * grad.square();
// All of these are tensors of 1 element
auto lr_t = lr * (1 - beta2_pow).sqrt() / (1 - beta1_pow);
// Eigen does not support automatic broadcast
// Get dimensions of moment vector to broadcast lr_t
     Eigen::DSizes<int, 1> m_dsize(moment1_out_tensor->numel());
- param_out.device(place) =
+ param_out.device(*place) =
param -
lr_t.broadcast(m_dsize) *
(moment1_out / (moment2_out.sqrt() + epsilon));
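
As with Adadelta above, a scalar sketch of the Adam update this kernel implements (plain C++ mirroring the Eigen expressions; the bias-corrected learning rate matches the doc formula):

```cpp
#include <cmath>
#include <cstdio>

// One scalar Adam step, following the formulas in AdamOpMaker's doc string.
void adam_step(float grad, float lr, float beta1, float beta2, float eps,
               float beta1_pow, float beta2_pow, float& param, float& m1,
               float& m2) {
  m1 = beta1 * m1 + (1 - beta1) * grad;
  m2 = beta2 * m2 + (1 - beta2) * grad * grad;
  float lr_t = lr * std::sqrt(1 - beta2_pow) / (1 - beta1_pow);
  param -= lr_t * m1 / (std::sqrt(m2) + eps);
}

int main() {
  float param = 1.0f, m1 = 0.0f, m2 = 0.0f;
  // beta1_pow / beta2_pow are beta1^t and beta2^t; here t = 1.
  adam_step(0.5f, 1e-3f, 0.9f, 0.999f, 1e-8f, 0.9f, 0.999f, param, m1, m2);
  std::printf("param: %f m1: %f m2: %f\n", param, m1, m2);
}
```
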
diff --git a/paddle/operators/adamax_op.cc b/paddle/operators/adamax_op.cc
index 867ddd9790..49ce497bb7 100644
--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -108,10 +108,10 @@ Adam algorithm based on the infinity norm.
Adamax updates:
$$
- momentOut = \beta_{1} * moment + (1 - \beta_{1}) * grad \\
- infNormOut = max(\beta_{2} * infNorm + \epsilon, |grad|) \\
- learningRate = \frac{learningRate}{1 - \beta_{1}^{Beta1Pow}} \\
- paramOut = param - learningRate * \frac{momentOut}{infNormOut}
+moment\_out = \beta_1 * moment + (1 - \beta_1) * grad \\
+inf\_norm\_out = max(\beta_2 * inf\_norm + \epsilon, |grad|) \\
+learning\_rate = \frac{learning\_rate}{1 - \beta_{1\_pow}} \\
+param\_out = param - learning\_rate * \frac{moment\_out}{inf\_norm\_out}
$$
The original paper does not have an epsilon attribute.
@@ -127,6 +127,6 @@ division by 0 error.
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(adamax, ops::AdamaxOp, ops::AdamaxOpMaker);
-REGISTER_OP_CPU_KERNEL(adamax,
-                       ops::AdamaxOpKernel<paddle::platform::CPUPlace, float>,
-                       ops::AdamaxOpKernel<paddle::platform::CPUPlace, double>);
+REGISTER_OP_CPU_KERNEL(
+    adamax, ops::AdamaxOpKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::AdamaxOpKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/operators/adamax_op.cu b/paddle/operators/adamax_op.cu
index 057ef39025..2d143905c4 100644
--- a/paddle/operators/adamax_op.cu
+++ b/paddle/operators/adamax_op.cu
@@ -16,6 +16,6 @@
#include "paddle/operators/adamax_op.h"
namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(adamax,
-                       ops::AdamaxOpKernel<paddle::platform::GPUPlace, float>,
-                       ops::AdamaxOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_CUDA_KERNEL(
+    adamax, ops::AdamaxOpKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::AdamaxOpKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/operators/adamax_op.h b/paddle/operators/adamax_op.h
index bf36ed7860..172c179c5f 100644
--- a/paddle/operators/adamax_op.h
+++ b/paddle/operators/adamax_op.h
@@ -19,7 +19,7 @@ limitations under the License. */
namespace paddle {
namespace operators {
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AdamaxOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
@@ -51,14 +51,14 @@ class AdamaxOpKernel : public framework::OpKernel {
     auto moment_out = framework::EigenVector<T>::Flatten(*moment_out_tensor);
     auto inf_norm_out =
         framework::EigenVector<T>::Flatten(*inf_norm_out_tensor);
-    auto place = ctx.GetEigenDevice<Place>();
+    auto* place = ctx.template device_context<DeviceContext>().eigen_device();
- moment_out.device(place) = beta1 * moment + (1 - beta1) * grad;
- inf_norm_out.device(place) =
+ moment_out.device(*place) = beta1 * moment + (1 - beta1) * grad;
+ inf_norm_out.device(*place) =
grad.abs().cwiseMax((beta2 * inf_norm) + epsilon);
auto lr_t = lr / (1 - beta1_pow);
     Eigen::DSizes<int, 1> m_dsize(moment_out_tensor->numel());
- param_out.device(place) =
+ param_out.device(*place) =
param - lr_t.broadcast(m_dsize) * (moment_out / inf_norm_out);
}
};
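
Adamax replaces Adam's second-moment estimate with a running infinity norm; a scalar sketch of the update, mirroring the kernel's Eigen expressions and the corrected doc formulas:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// One scalar Adamax step: the running infinity norm replaces Adam's
// second moment, and epsilon guards against division by zero.
void adamax_step(float grad, float lr, float beta1, float beta2, float eps,
                 float beta1_pow, float& param, float& moment,
                 float& inf_norm) {
  moment = beta1 * moment + (1 - beta1) * grad;
  inf_norm = std::max(beta2 * inf_norm + eps, std::fabs(grad));
  float lr_t = lr / (1 - beta1_pow);
  param -= lr_t * moment / inf_norm;
}

int main() {
  float param = 1.0f, moment = 0.0f, inf_norm = 0.0f;
  adamax_step(0.5f, 1e-3f, 0.9f, 0.999f, 1e-8f, 0.9f, param, moment, inf_norm);
  std::printf("param: %f\n", param);
}
```
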
diff --git a/paddle/operators/auc_op.h b/paddle/operators/auc_op.h
index e5ac57b038..b80509e2a9 100644
--- a/paddle/operators/auc_op.h
+++ b/paddle/operators/auc_op.h
@@ -25,7 +25,7 @@ template
 using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
-template <typename Place, typename T>
+template <typename DeviceContext, typename T>
 class AucKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
diff --git a/paddle/operators/batch_norm_op.cc b/paddle/operators/batch_norm_op.cc
index ac97bd83ab..94a972b7ab 100644
--- a/paddle/operators/batch_norm_op.cc
+++ b/paddle/operators/batch_norm_op.cc
@@ -135,7 +135,8 @@ The required data format for this layer is one of the following:
};
 template <typename T>
-class BatchNormKernel<platform::CPUPlace, T> : public framework::OpKernel<T> {
+class BatchNormKernel<platform::CPUDeviceContext, T>
+    : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &ctx) const override {
     const float epsilon = ctx.Attr<float>("epsilon");
@@ -318,12 +319,12 @@ class BatchNormGradOp : public framework::OperatorWithKernel {
PADDLE_THROW("can't find Y@GRAD");
}
return framework::OpKernelType(framework::ToDataType(t->type()),
- ctx.device_context());
+ ctx.GetPlace());
}
};
 template <typename T>