commit 62ee11415d
(One file diff suppressed because it is too large.)
CMakeLists.txt
@@ -1,5 +1,3 @@
 if(WITH_AVX)
-    cc_library(activation_functions SRCS hl_cpu_functions.cc hl_avx_functions.cc)
-else()
-    cc_library(activation_functions SRCS hl_cpu_functions.cc)
+    cc_library(activation_functions SRCS avx_functions.cc)
 endif()
activation_functions.h (new file)
@@ -0,0 +1,170 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once
#include <math.h>
#include "paddle/platform/hostdevice.h"

#ifdef __AVX__
#include <immintrin.h>
#endif

namespace paddle {
namespace operators {
namespace math {
namespace detail {

#define SIGMOID_THRESHOLD_MIN -40.0
#define SIGMOID_THRESHOLD_MAX 13.0
#define EXP_MAX_INPUT 40.0

namespace forward {

template <typename T>
DEVICE T Identity(const T a) {
  return a;
}

template <typename T>
DEVICE T Relu(const T a) {
  return a > static_cast<T>(0.0) ? a : static_cast<T>(0.0);
}

template <typename T>
DEVICE T Sigmoid(const T a) {
  const T min = SIGMOID_THRESHOLD_MIN;
  const T max = SIGMOID_THRESHOLD_MAX;
  T tmp = (a < min) ? min : ((a > max) ? max : a);
  return static_cast<T>(1.0) / (static_cast<T>(1.0) + exp(-tmp));
}

template <typename T>
DEVICE T Tanh(const T a) {
  T tmp = -2.0 * a;
  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
  return (2.0 / (1.0 + exp(tmp))) - 1.0;
}

}  // namespace forward

namespace backward {

template <typename T>
DEVICE T Identity(const T a, const T b) {
  return a;
}

template <typename T>
DEVICE T Relu(const T a, const T b) {
  return a * (b > 0.0 ? 1.0 : 0.0);
}

template <typename T>
DEVICE T Sigmoid(const T a, const T b) {
  return a * b * (1.0 - b);
}

template <typename T>
DEVICE T Tanh(const T a, const T b) {
  return a * (1.0 - b * b);
}

}  // namespace backward

template <typename T>
struct Active {
  typedef T (*Act)(T);
  typedef T (*ActGrad)(T, T);
};

static DEVICE Active<float>::Act kActFloat[] = {
    &forward::Sigmoid<float>, &forward::Relu<float>, &forward::Tanh<float>,
    &forward::Identity<float>};

static DEVICE Active<float>::ActGrad kActGradFloat[] = {
    &backward::Sigmoid<float>, &backward::Relu<float>, &backward::Tanh<float>,
    &backward::Identity<float>};

static DEVICE Active<double>::Act kActDouble[] = {
    &forward::Sigmoid<double>, &forward::Relu<double>, &forward::Tanh<double>,
    &forward::Identity<double>};

static DEVICE Active<double>::ActGrad kActGradDouble[] = {
    &backward::Sigmoid<double>, &backward::Relu<double>,
    &backward::Tanh<double>, &backward::Identity<double>};

namespace forward {
inline DEVICE float activation(float a, int index) {
  return kActFloat[index](a);
}

inline DEVICE double activation(double a, int index) {
  return kActDouble[index](a);
}
}  // namespace forward

namespace backward {
inline DEVICE float activation(float a, float b, int index) {
  return kActGradFloat[index](a, b);
}

inline DEVICE double activation(double a, double b, int index) {
  return kActGradDouble[index](a, b);
}
}  // namespace backward

#ifdef __AVX__
namespace forward {
namespace avx {
__m256 Relu(const __m256 a);
__m256 Sigmoid(const __m256 a);
__m256 Tanh(const __m256 a);
__m256 Identity(const __m256 a);
}  // namespace avx
}  // namespace forward

namespace backward {
namespace avx {
__m256 Relu(const __m256 a, const __m256 b);
__m256 Sigmoid(const __m256 a, const __m256 b);
__m256 Tanh(const __m256 a, const __m256 b);
__m256 Identity(const __m256 a, const __m256 b);
}  // namespace avx
}  // namespace backward

static Active<__m256>::Act kActAvx[] = {
    &forward::avx::Sigmoid, &forward::avx::Relu, &forward::avx::Tanh,
    &forward::avx::Identity};

static Active<__m256>::ActGrad kActGradAvx[] = {
    &backward::avx::Sigmoid, &backward::avx::Relu, &backward::avx::Tanh,
    &backward::avx::Identity};

namespace forward {
inline __m256 activation(__m256 a, int index) { return kActAvx[index](a); }
}  // namespace forward

namespace backward {
inline __m256 activation(__m256 a, __m256 b, int index) {
  return kActGradAvx[index](a, b);
}
}  // namespace backward

#endif

}  // namespace detail
}  // namespace math
}  // namespace operators
}  // namespace paddle
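For orientation, here is a minimal host-side sketch (not part of the commit) of how the new dispatch is meant to be called. It assumes the header above lands at paddle/operators/math/detail/activation_functions.h, that DEVICE (from paddle/platform/hostdevice.h) expands to nothing in a host-only build, and that callers map an activation-mode enum onto the slot order of the kActFloat table.

// Illustrative only -- not part of this commit.
#include <cstdio>
#include "paddle/operators/math/detail/activation_functions.h"

int main() {
  namespace detail = paddle::operators::math::detail;
  // Indices follow the order of kActFloat above:
  // 0 = Sigmoid, 1 = Relu, 2 = Tanh, 3 = Identity.
  float y = detail::forward::activation(0.5f, 0);       // Sigmoid(0.5)
  // Backward overloads take the upstream gradient and the forward output.
  float dx = detail::backward::activation(1.0f, y, 0);  // y * (1 - y)
  std::printf("sigmoid(0.5) = %f, grad = %f\n", y, dx);
  return 0;
}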
hl_activation_functions.h (deleted)
@@ -1,188 +0,0 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifndef HL_ACTIVATION_FUNCTIONS_H_
#define HL_ACTIVATION_FUNCTIONS_H_

#include "hl_functions.h"
#include "paddle/operators/math/lstm_compute.h"

/**
 * Active functions: sigmoid, relu, tanh and linear.
 */
#define FLOAT_ACTIVE_FUNCTION                                   \
  {                                                             \
    hppl::typef::sigmoid, hppl::typef::relu, hppl::typef::tanh, \
        hppl::typef::linear                                     \
  }

#define DOUBLE_ACTIVE_FUNCTION                                  \
  {                                                             \
    hppl::typed::sigmoid, hppl::typed::relu, hppl::typed::tanh, \
        hppl::typed::linear                                     \
  }

#define AVX_ACTIVE_FUNCTION \
  { hppl::sigmoid, hppl::relu, hppl::tanh, hppl::linear }

namespace hppl {

using activation_mode_t = paddle::operators::math::activation_mode_t;

/**
 * Hppl supports sigmoid, relu, tanh, linear active functions
 * for neural networks' forward and backward activation.
 */
template <class T>
class Active {
 public:
  typedef T (*forward)(T);
  typedef T (*backward)(T, T);
};

template <typename T>
struct ForwardActType;

template <>
struct ForwardActType<float> {
  using type = Active<float>::forward;
};

template <>
struct ForwardActType<double> {
  using type = Active<double>::forward;
};

template <typename T>
struct BackwardActType;

template <>
struct BackwardActType<float> {
  using type = Active<float>::backward;
};

template <>
struct BackwardActType<double> {
  using type = Active<double>::backward;
};

#ifdef __NVCC__
namespace gpu {
static __device__ Active<float>::forward forward[] = FLOAT_ACTIVE_FUNCTION;
static __device__ Active<float>::backward backward[] = FLOAT_ACTIVE_FUNCTION;

static __device__ Active<double>::forward forward_d[] = DOUBLE_ACTIVE_FUNCTION;
static __device__ Active<double>::backward backward_d[] =
    DOUBLE_ACTIVE_FUNCTION;

template <typename T>
struct ForwardAct {
  __device__ typename ForwardActType<T>::type operator()(
      activation_mode_t type);
};

template <>
struct ForwardAct<float> {
  __device__ ForwardActType<float>::type operator()(activation_mode_t type) {
    return forward[type];
  }
};

template <>
struct ForwardAct<double> {
  __device__ ForwardActType<double>::type operator()(activation_mode_t type) {
    return forward_d[type];
  }
};

template <typename T>
struct BackwardAct {
  __device__ typename BackwardActType<T>::type operator()(
      activation_mode_t type);
};

template <>
struct BackwardAct<float> {
  __device__ BackwardActType<float>::type operator()(activation_mode_t type) {
    return backward[type];
  }
};

template <>
struct BackwardAct<double> {
  __device__ BackwardActType<double>::type operator()(activation_mode_t type) {
    return backward_d[type];
  }
};

}  // namespace gpu
#else
namespace cpu {
static Active<float>::forward forward[] = FLOAT_ACTIVE_FUNCTION;
static Active<float>::backward backward[] = FLOAT_ACTIVE_FUNCTION;

static Active<double>::forward forward_d[] = DOUBLE_ACTIVE_FUNCTION;
static Active<double>::backward backward_d[] = DOUBLE_ACTIVE_FUNCTION;

template <typename T>
struct ForwardAct {
  typename ForwardActType<T>::type operator()(activation_mode_t type);
};

template <>
struct ForwardAct<float> {
  ForwardActType<float>::type operator()(activation_mode_t type) {
    return forward[type];
  }
};

template <>
struct ForwardAct<double> {
  ForwardActType<double>::type operator()(activation_mode_t type) {
    return forward_d[type];
  }
};

template <typename T>
struct BackwardAct {
  typename BackwardActType<T>::type operator()(activation_mode_t type);
};

template <>
struct BackwardAct<float> {
  BackwardActType<float>::type operator()(activation_mode_t type) {
    return backward[type];
  }
};

template <>
struct BackwardAct<double> {
  BackwardActType<double>::type operator()(activation_mode_t type) {
    return backward_d[type];
  }
};

}  // namespace cpu

#ifdef __AVX__
namespace avx {
static Active<__m256>::forward forward[] = AVX_ACTIVE_FUNCTION;
static Active<__m256>::backward backward[] = AVX_ACTIVE_FUNCTION;
}  // namespace avx
#endif
#endif

}  // namespace hppl

#endif  // HL_ACTIVATION_FUNCTIONS_H_
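For contrast, a sketch (not part of the commit) of how the removed hppl dispatch was typically used on the CPU path. It assumes a non-NVCC build, so the hppl::cpu tables are compiled, and an activation_mode_t value supplied by the caller that indexes the FLOAT_ACTIVE_FUNCTION slot order.

// Illustrative only -- not part of this commit.
#include "hl_activation_functions.h"

float apply_forward(float x, hppl::activation_mode_t mode) {
  // Look up the function pointer for this mode, then call it.
  hppl::ForwardActType<float>::type fn = hppl::cpu::ForwardAct<float>()(mode);
  return fn(x);
}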
hl_avx_functions.h (deleted)
@@ -1,32 +0,0 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifndef HL_AVX_FUNCTIONS_H_
#define HL_AVX_FUNCTIONS_H_

#include <immintrin.h>

namespace hppl {
__m256 relu(const __m256 a);
__m256 sigmoid(const __m256 a);
__m256 tanh(const __m256 a);
__m256 linear(const __m256 a);

__m256 relu(const __m256 a, const __m256 b);
__m256 sigmoid(const __m256 a, const __m256 b);
__m256 tanh(const __m256 a, const __m256 b);
__m256 linear(const __m256 a, const __m256 b);
}  // namespace hppl

#endif  // HL_AVX_FUNCTIONS_H_
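A short sketch (not part of the commit) of the calling convention these AVX declarations implied; the implementations lived in the removed AVX source file, so this only shows how a caller would drive one lane-wide activation. Assumes an AVX-capable compiler and target.

// Illustrative only -- not part of this commit.
#include <immintrin.h>
#include "hl_avx_functions.h"

void relu8(const float* in, float* out) {
  __m256 v = _mm256_loadu_ps(in);  // load 8 floats (unaligned)
  __m256 r = hppl::relu(v);        // vectorized max(v, 0)
  _mm256_storeu_ps(out, r);        // store 8 results
}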
hl_cpu_functions.cc (deleted)
@@ -1,89 +0,0 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include <math.h>
#include "hl_functions.h"

namespace hppl {
namespace typef {

float relu(const float a) {
  return a > static_cast<float>(0.0) ? a : static_cast<float>(0.0);
}

float sigmoid(const float a) {
  const float min = SIGMOID_THRESHOLD_MIN;
  const float max = SIGMOID_THRESHOLD_MAX;
  float tmp = (a < min) ? min : ((a > max) ? max : a);
  return static_cast<float>(1.0) / (static_cast<float>(1.0) + exp(-tmp));
}

float tanh(const float a) {
  float tmp = -2.0 * a;
  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
  return (2.0 / (1.0 + exp(tmp))) - 1.0;
}

float linear(const float a) { return a; }

float relu(const float a, const float b) { return a * (b > 0.0 ? 1.0 : 0.0); }

float sigmoid(const float a, const float b) {
  return a * b * (static_cast<float>(1) - b);
}

float tanh(const float a, const float b) {
  return a * (static_cast<float>(1) - b * b);
}

float linear(const float a, const float b) { return a; }

}  // namespace typef

namespace typed {
double relu(const double a) {
  return a > static_cast<double>(0.0) ? a : static_cast<double>(0.0);
}

double sigmoid(const double a) {
  const double min = SIGMOID_THRESHOLD_MIN;
  const double max = SIGMOID_THRESHOLD_MAX;
  double tmp = (a < min) ? min : ((a > max) ? max : a);
  return static_cast<double>(1.0) / (static_cast<double>(1.0) + exp(-tmp));
}

double tanh(const double a) {
  double tmp = -2.0 * a;
  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
  return (2.0 / (1.0 + exp(tmp))) - 1.0;
}

double linear(const double a) { return a; }

double relu(const double a, const double b) {
  return a * (b > 0.0 ? 1.0 : 0.0);
}

double sigmoid(const double a, const double b) {
  return a * b * (static_cast<double>(1) - b);
}

double tanh(const double a, const double b) {
  return a * (static_cast<double>(1) - b * b);
}

double linear(const double a, const double b) { return a; }

}  // namespace typed
}  // namespace hppl
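The clamp constants these functions rely on are not arbitrary: sigmoid saturates well before |a| = 40, and expf overflows float range just past 88, so clamping keeps exp() safely representable. A standalone check of the behaviour (illustrative only, not part of the commit):

// Illustrative only -- standalone check of the sigmoid clamp above.
#include <cmath>
#include <cstdio>

int main() {
  // Mirrors hppl::typef::sigmoid: inputs below -40 are clamped, so the
  // result bottoms out at 1/(1+e^40), about 4.2e-18, rather than feeding
  // an extreme value into exp().
  const float min = -40.0f, max = 13.0f;
  float a = -1000.0f;
  float tmp = (a < min) ? min : ((a > max) ? max : a);
  std::printf("sigmoid(%g) clamps to sigmoid(%g) = %g\n", a, tmp,
              1.0f / (1.0f + std::exp(-tmp)));
  return 0;
}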
hl_functions.h (deleted)
@@ -1,71 +0,0 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifndef HL_FUNCTIONS_H_
#define HL_FUNCTIONS_H_

/**
 * sigmoid threshold minimum
 */
#define SIGMOID_THRESHOLD_MIN -40.0

/**
 * sigmoid threshold maximum
 */
#define SIGMOID_THRESHOLD_MAX 13.0

/**
 * The maximum input value for exp, used to avoid overflow.
 * Currently only used by the tanh function.
 */
#define EXP_MAX_INPUT 40.0

#ifndef __NVCC__
namespace hppl {
namespace typef {
float relu(const float a);
float sigmoid(const float a);
float tanh(const float a);
float linear(const float a);

float relu(const float a, const float b);
float sigmoid(const float a, const float b);
float tanh(const float a, const float b);
float linear(const float a, const float b);

}  // namespace typef

namespace typed {
double relu(const double a);
double sigmoid(const double a);
double tanh(const double a);
double linear(const double a);

double relu(const double a, const double b);
double sigmoid(const double a, const double b);
double tanh(const double a, const double b);
double linear(const double a, const double b);
}  // namespace typed

}  // namespace hppl

#ifdef __AVX__
#include "hl_avx_functions.h"
#endif

#else
#include "hl_gpu_functions.h"
#endif

#endif  // HL_FUNCTIONS_H_
hl_gpu_functions.h (deleted)
@@ -1,93 +0,0 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#ifndef HL_GPU_FUNCTIONS_CUH_
#define HL_GPU_FUNCTIONS_CUH_

#include "hl_base.h"

namespace hppl {
namespace typef {

__device__ static float relu(const float a) { return a > 0.0f ? a : 0.0f; }

__device__ static float sigmoid(const float a) {
  const float min = SIGMOID_THRESHOLD_MIN;
  const float max = SIGMOID_THRESHOLD_MAX;
  float tmp = (a < min) ? min : ((a > max) ? max : a);
  return __fdividef(1.0f, 1.0f + __expf(-tmp));
}

__device__ static float tanh(const float a) {
  // Use the clamped exponent, matching the CPU implementation:
  // tanh(a) = 2 / (1 + exp(-2a)) - 1.
  float tmp = -2.0 * a;
  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
  return __fdividef(2.0f, (1.0f + __expf(tmp))) - 1.0f;
}

__device__ static float linear(const float a) { return a; }

__device__ static float relu(const float a, const float b) {
  return a * (b > 0.0f ? 1.0f : 0.0f);
}

__device__ static float sigmoid(const float a, const float b) {
  return a * b * (1.0f - b);
}

__device__ static float tanh(const float a, const float b) {
  return a * (1.0f - b * b);
}

__device__ static float linear(const float a, const float b) { return a; }

}  // namespace typef

namespace typed {

__device__ static double relu(const double a) { return a > 0.0 ? a : 0.0; }

__device__ static double sigmoid(const double a) {
  const double min = SIGMOID_THRESHOLD_MIN;
  const double max = SIGMOID_THRESHOLD_MAX;
  double tmp = (a < min) ? min : ((a > max) ? max : a);
  return 1.0 / (1.0 + exp(-tmp));
}

__device__ static double tanh(const double a) {
  double tmp = -2.0 * a;
  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
  return (2.0 / (1.0 + exp(tmp))) - 1.0;
}

__device__ static double linear(const double a) { return a; }

__device__ static double relu(const double a, const double b) {
  return a * (b > 0.0 ? 1.0 : 0.0);
}

__device__ static double sigmoid(const double a, const double b) {
  return a * b * (1 - b);
}

__device__ static double tanh(const double a, const double b) {
  return a * (1.0 - b * b);
}

__device__ static double linear(const double a, const double b) { return a; }

}  // namespace typed

}  // namespace hppl

#endif  // HL_GPU_FUNCTIONS_CUH_