Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into develop

7 years ago · 4e228021bb
parent a0acfc6a62 8401039feb
commit 4e228021bb
39 changed files with 1812 additions and 327 deletions
--- a/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
@ -0,0 +1,99 @@
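+# Build PaddlePaddle for the iOS Platform
+Cross-compiling the PaddlePaddle library for iOS must be done on macOS. This document describes how to cross-compile the PaddlePaddle library for iOS from source on macOS.
+
+## Prepare the Cross-Compiling Environment
+Apple provides a complete cross-compiling toolchain and integrated development environment for iOS development. Install Xcode from the App Store, or download it from the official site: [Xcode](https://developer.apple.com/cn/xcode/). After installation, run `xcodebuild -version` on the command line to verify that the installation succeeded.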
+
+```bash
+$ xcodebuild -version
+Xcode 9.0
+Build version 9A235
+```
+
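+## Configure the Cross-Compiling Arguments
+
+PaddlePaddle provides a toolchain configuration file for cross-compiling, [cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake), which supplies default compiler and build-argument settings.
+
+When cross-compiling the iOS version of the PaddlePaddle library, the following arguments must be set:
+
+- `CMAKE_SYSTEM_NAME`: the target platform for CMake. It must be set to `iOS`. With `CMAKE_SYSTEM_NAME=iOS`, PaddlePaddle's CMake system automatically builds all third-party dependencies and forces the values of some PaddlePaddle arguments (`WITH_C_API=ON`, `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, `WITH_RDMA=OFF`).
+- `WITH_C_API`: whether to build the C-API inference library. It must be set to `ON`, because only C-API inference is supported on iOS.
+- `WITH_SWIG_PY`: must be set to `OFF`, because training or inference through swig is not supported on iOS.
+
+Optional arguments for the iOS platform:
+
+- `IOS_PLATFORM`: can be set to `OS` or `SIMULATOR`. The default is `OS`.
+  - `OS`: build for physical devices with the `arm` architecture, such as iPhone or iPad.
+  - `SIMULATOR`: build for the simulator platform with the `x86` architecture.
+- `IOS_ARCH`: the target architecture. The architectures available for each `IOS_PLATFORM` are listed in the table below:
+
+   | IOS_PLATFORM | IOS_ARCH                       |
+   |--------------|--------------------------------|
+   | OS           | armv7, armv7s, arm64 (default) |
+   | SIMULATOR    | i386, x86_64 (default)         |
+
+- `IOS_DEPLOYMENT_TARGET`: the minimum iOS deployment version. The default is `7.0`.
+- `IOS_ENABLE_BITCODE`: whether to enable [Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3). It can be set to `ON` or `OFF`. The default is `ON`.
+- `IOS_USE_VECLIB_FOR_BLAS`: whether to use the [vecLib](https://developer.apple.com/documentation/accelerate/veclib) framework for BLAS matrix computation. It can be set to `ON` or `OFF`. The default is `OFF`.
+- `IOS_DEVELOPMENT_ROOT`: the `Developer` directory. It can be set explicitly to `/path/to/platform/Developer`. If it is not set, PaddlePaddle selects the `Developer` directory of the `Xcode` `platform` that matches `IOS_PLATFORM`.
+- `IOS_SDK_ROOT`: the root directory of the `SDK` to use. It can be set explicitly to `/path/to/platform/Developer/SDKs/SDK`. If it is not set, PaddlePaddle selects the latest `SDK` version under `IOS_DEVELOPMENT_ROOT`.
+
+Other arguments:
+
+- `USE_EIGEN_FOR_BLAS`: whether to use the Eigen library for matrix computation. It takes effect only when `IOS_USE_VECLIB_FOR_BLAS=OFF`. It can be set to `ON` or `OFF`. The default is `OFF`.
+- `HOST_C/CXX_COMPILER`: the C/C++ compiler of the host machine. The default is the value of the environment variable `CC/CXX`; if `CC/CXX` is not set, the `cc/c++` compiler is used.
+
+Commonly used cmake configurations are as follows: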
+
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=OS \
+      -DIOS_ARCH="arm64" \
+      -DIOS_ENABLE_BITCODE=ON \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=SIMULATOR \
+      -DIOS_ARCH="x86_64" \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+
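+You can set other build arguments as needed. For example, to minimize the size of the generated library, set `CMAKE_BUILD_TYPE` to `MinSizeRel`; for the fastest execution speed, set `CMAKE_BUILD_TYPE` to `Release`. You can also influence the PaddlePaddle build by setting `CMAKE_C/CXX_FLAGS` manually.
+
+**Performance tips**: for the fastest computation speed, the following CMake settings are recommended:
+
+- Set `CMAKE_BUILD_TYPE` to `Release`.
+- Set `IOS_USE_VECLIB_FOR_BLAS=ON` to use the BLAS functions provided by the `vecLib` framework for matrix computation.
+
+## Build and Install
+
+After the CMake configuration is done, run the following commands. PaddlePaddle will automatically download and build all third-party dependencies, then build and install the PaddlePaddle inference library.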
+
+```
+$ make
+$ make install
+```
+
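+Note: if you have previously built PaddlePaddle for another platform in the source directory, first remove the `third_party` and `build` directories with `rm -rf`, so that all third-party dependencies and the PaddlePaddle code are rebuilt for the new CMake configuration.
+
+After the install command finishes, the `your/path/to/install` directory contains the following:
+
+- The `include` directory, which contains all C-API header files.
+- The `lib` directory, which contains the PaddlePaddle C-API static library.
+- The `third_party` directory, which contains all third-party libraries that PaddlePaddle depends on.
+
+Note: it is recommended to install the PaddlePaddle libraries for different architectures into different directories, and then use the `lipo` tool to merge the static libraries into one fat library that supports multiple architectures.
+
+At this point the PaddlePaddle library is installed. The merged fat library can be used in deep-learning iOS apps; see the C-API documentation for how to call it.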
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
@ -59,4 +59,4 @@ make install

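 Note: if you have previously built PaddlePaddle for another platform in the source directory, first remove the `third_party` and `build` directories with `rm -rf`, so that all third-party dependencies and the PaddlePaddle code are rebuilt for the new CMake configuration.

-After the install command finishes,, the `your/path/to/install` directory contains the `include` and `lib` directories: `include` holds the C-API header files, and `lib` holds a Raspberry Pi version of the library.
+After the install command finishes, the `your/path/to/install` directory contains the `include` and `lib` directories: `include` holds the C-API header files, and `lib` holds a Raspberry Pi version of the library.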
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
@ -44,7 +44,7 @@ cmake -DCMAKE_SYSTEM_NAME=RPi \
      ..
 ```

-To build the inference library, please set the argument WITH_API to ON: `WITH_C_API=ON`.
+To build the inference library, please set the argument WITH\_C\_API to ON: `WITH_C_API=ON`.

 You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For performance optimization, you may use `CMAKE_BUILD_TYPE=Release`.

--- a/paddle/framework/lod_tensor_test.cu
+++ b/paddle/framework/lod_tensor_test.cu
@ -36,8 +36,8 @@ TEST(LoDTensor, LoDInGPU) {
  lod_tensor.mutable_data<float>(place);

  lod_tensor.set_lod(src_lod);
-  CHECK_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
-  CHECK_EQ(lod_tensor.lod_element(0, 4).first, 8UL);
+  EXPECT_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
+  EXPECT_EQ(lod_tensor.lod_element(0, 4).first, 8UL);

  auto lod = lod_tensor.lod();

@ -45,6 +45,6 @@ TEST(LoDTensor, LoDInGPU) {
  cudaDeviceSynchronize();

  for (size_t i = 0; i < src_lod[0].size(); ++i) {
-    CHECK_EQ(lod[0].data()[i], src_lod[0].data()[i] * 2);
+    EXPECT_EQ(lod[0].data()[i], src_lod[0].data()[i] * 2);
  }
-}
+}
--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@ -37,32 +37,32 @@ ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
 std::string OperatorBase::Input(const std::string& name) const {
  auto& ins = Inputs(name);
  PADDLE_ENFORCE_LE(ins.size(), 1UL,
-                    "Op %s input %s should contain only one variable", type_,
-                    name);
+                    "Operator %s's input %s should contain only one variable.",
+                    type_, name);
  return ins.empty() ? kEmptyVarName : ins[0];
 }

 const std::vector<std::string>& OperatorBase::Inputs(
    const std::string& name) const {
  auto it = inputs_.find(name);
-  PADDLE_ENFORCE(it != inputs_.end(), "Op %s do not have input %s", type_,
-                 name);
+  PADDLE_ENFORCE(it != inputs_.end(), "Operator %s does not have the input %s.",
+                 type_, name);
  return it->second;
 }

 std::string OperatorBase::Output(const std::string& name) const {
  auto& outs = Outputs(name);
  PADDLE_ENFORCE_LE(outs.size(), 1UL,
-                    "Op %s output %s should contain only one variable", type_,
-                    name);
+                    "Operator %s's output %s should contain only one variable.",
+                    type_, name);
  return outs.empty() ? kEmptyVarName : outs[0];
 }

 const std::vector<std::string>& OperatorBase::Outputs(
    const std::string& name) const {
  auto it = outputs_.find(name);
-  PADDLE_ENFORCE(it != outputs_.end(), "Op %s does not have output called %s",
-                 type_, name);
+  PADDLE_ENFORCE(it != outputs_.end(),
+                 "Operator %s does not have an output called %s.", type_, name);
  return it->second;
 }
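
The lookup behavior that the reworded messages in `operator.cc` describe can be illustrated outside of Paddle. The following is a minimal sketch, assuming a plain `std::map` in place of `OperatorBase::inputs_`. The names `VarNameMap`, `AllInputs`, `SingleInput`, and the `"@EMPTY@"` placeholder are hypothetical and are not Paddle's API; exceptions stand in for `PADDLE_ENFORCE`.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins; not Paddle's actual types or constants.
using VarNameMap = std::map<std::string, std::vector<std::string>>;
const char kEmptyVarNameSketch[] = "@EMPTY@";

// Returns all variables bound to slot `name`; throws if the slot is missing.
inline const std::vector<std::string>& AllInputs(const VarNameMap& inputs,
                                                 const std::string& name) {
  auto it = inputs.find(name);
  if (it == inputs.end()) {
    throw std::out_of_range("Operator does not have the input " + name + ".");
  }
  return it->second;
}

// Returns the single variable bound to slot `name`, or a placeholder when the
// slot exists but is empty; throws if the slot holds more than one variable.
inline std::string SingleInput(const VarNameMap& inputs,
                               const std::string& name) {
  const std::vector<std::string>& ins = AllInputs(inputs, name);
  if (ins.size() > 1) {
    throw std::length_error("Input " + name +
                            " should contain only one variable.");
  }
  return ins.empty() ? std::string(kEmptyVarNameSketch) : ins[0];
}
```

In this sketch a missing slot and an over-full slot raise different exception types, which mirrors the two distinct checks (`PADDLE_ENFORCE` and `PADDLE_ENFORCE_LE`) in the hunk above.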

--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@ -427,7 +427,8 @@ class OperatorWithKernel : public OperatorBase {
            int tmp = static_cast<int>(ToDataType(t->type()));
            VLOG(3) << "Input " << ipt_name << " with data_type " << tmp;
            PADDLE_ENFORCE(tmp == data_type || data_type == -1,
-                           "DataType of Paddle Op %s must be same.", Type());
+                           "DataType of Paddle Op %s must be the same.",
+                           Type());
            data_type = tmp;
          }
        }
--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@ -118,10 +118,12 @@ class Tensor {
                             const platform::DeviceContext& ctx);

  /**
-   * @brief   Return the slice of the tensor.
+   * @brief  Return a sub-tensor of the given tensor.
   *
-   * @param[in] begin_idx   The begin index of the slice.
-   * @param[in] end_idx     The end index of the slice.
+   * @param[in] begin_idx   The index of the start row (inclusive) to slice.
+   *                        Indices start from 0.
+   * @param[in] end_idx     The index of the end row (exclusive) to slice.
+   *                        Indices start from 0.
   */
  inline Tensor Slice(const int& begin_idx, const int& end_idx) const;

--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@ -112,9 +112,10 @@ inline void* Tensor::mutable_data(platform::Place place, std::type_index type) {
  if (holder_ != nullptr) {
    holder_->set_type(type);
  }
-  PADDLE_ENFORCE_GT(numel(), 0,
-                    "Tensor's numel must be larger than zero to call "
-                    "Tensor::mutable_data. Call Tensor::set_dim first.");
+  PADDLE_ENFORCE_GT(
+      numel(), 0,
+      "When calling this method, the Tensor's numel must be larger than zero. "
+      "Please check that Tensor::Resize has been called first.");
  int64_t size = numel() * SizeOfType(type);
  /* some versions of boost::variant don't have operator!= */
  if (holder_ == nullptr || !(holder_->place() == place) ||
@ -229,10 +230,12 @@ inline void Tensor::CopyFromVector(const std::vector<T>& src,

 inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
  check_memory_size();
-  PADDLE_ENFORCE_GE(begin_idx, 0, "Slice begin index is less than zero.");
-  PADDLE_ENFORCE_LE(end_idx, dims_[0], "Slice end index is out of bound.");
-  PADDLE_ENFORCE_LT(begin_idx, end_idx,
-                    "Begin index must be less than end index.");
+  PADDLE_ENFORCE_GE(begin_idx, 0,
+                    "The start row index must be greater than or equal to 0.");
+  PADDLE_ENFORCE_LE(end_idx, dims_[0], "The end row index is out of bound.");
+  PADDLE_ENFORCE_LT(
+      begin_idx, end_idx,
+      "The start row index must be less than the end row index.");

  if (dims_[0] == 1) {
    return *this;
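
The half-open row range `[begin_idx, end_idx)` checked in `Tensor::Slice` can be illustrated with a small standalone sketch. It assumes a row-major matrix stored in a `std::vector<float>`; `RowMatrix` and `SliceRows` are hypothetical names, not Paddle's `Tensor` API, and exceptions stand in for the `PADDLE_ENFORCE_*` macros.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a row-major 2-D tensor; not Paddle's Tensor class.
struct RowMatrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;  // rows * cols elements
};

// Returns a copy of rows [begin_idx, end_idx), applying the same three checks
// as the diff: begin_idx >= 0, end_idx <= rows, begin_idx < end_idx.
inline RowMatrix SliceRows(const RowMatrix& m, int begin_idx, int end_idx) {
  if (begin_idx < 0) {
    throw std::out_of_range("start row index must be >= 0");
  }
  if (end_idx > static_cast<int>(m.rows)) {
    throw std::out_of_range("end row index is out of bound");
  }
  if (begin_idx >= end_idx) {
    throw std::invalid_argument("start row index must be < end row index");
  }
  const std::ptrdiff_t cols = static_cast<std::ptrdiff_t>(m.cols);
  RowMatrix out;
  out.rows = static_cast<std::size_t>(end_idx - begin_idx);
  out.cols = m.cols;
  // Copy the contiguous block of rows; the end row itself is excluded.
  out.data.assign(m.data.begin() + begin_idx * cols,
                  m.data.begin() + end_idx * cols);
  return out;
}
```

For a 3 x 2 matrix, `SliceRows(m, 1, 3)` returns the last two rows. This sketch copies the rows for simplicity; the real method may instead return a view that shares the underlying buffer.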
--- a/paddle/gserver/layers/CRFLayer.cpp
+++ b/paddle/gserver/layers/CRFLayer.cpp
@ -101,8 +101,10 @@ void CRFLayer::backward(const UpdateCallback& callback) {
                              : real(1.0f);
    instanceWeight *= coeff_;

-    MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
-    grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    if (output.grad) {
+      MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
+      grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    }
    if (needWGrad) {
      weight_->getWGrad()->add(
          *crfs_[i].getWGrad(), real(1.0f), instanceWeight);
--- a/paddle/gserver/layers/LinearChainCRF.cpp
+++ b/paddle/gserver/layers/LinearChainCRF.cpp
@ -102,7 +102,6 @@ real LinearChainCRF::forward(real* x, int* s, int length) {
 }

 void LinearChainCRF::backward(real* x, int* s, int length, bool needWGrad) {
-  MatrixPtr matX = Matrix::create(x, length, numClasses_);
  Matrix::resizeOrCreate(matGrad_, length, numClasses_);
  Matrix::resizeOrCreate(beta_, length, numClasses_);
  real* b = b_->getData();
--- a/paddle/gserver/layers/SequenceReshapeLayer.cpp
+++ b/paddle/gserver/layers/SequenceReshapeLayer.cpp
@ -70,11 +70,23 @@ void SequenceReshapeLayer::forward(PassType passType) {
  size_t outDim = getSize();

  size_t numSequences = input.getNumSequences();
-  auto startPositions = input.sequenceStartPositions->getVector(false);
-  const int* starts = startPositions->getData();

-  CHECK_EQ(starts[numSequences], input.getBatchSize());
-  CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  // By default, treat each instance as a sequence of length 1.
+  IVectorPtr seqStarts;
+  IVector::resizeOrCreate(seqStarts, input.getBatchSize() + 1, false);
+  int* startsData = seqStarts->getData();
+  for (int i = 0; i < input.getBatchSize() + 1; i++) {
+    startsData[i] = i;
+  }
+  const int* starts = startsData;
+
+  // If sequence information is present, use its start positions instead.
+  if (input.sequenceStartPositions) {
+    auto startPositions = input.sequenceStartPositions->getVector(false);
+    starts = startPositions->getData();
+    CHECK_EQ(starts[numSequences], input.getBatchSize());
+    CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  }

  for (size_t seqID = 0; seqID < numSequences; seqID++) {
    size_t inNumIns = starts[seqID + 1] - starts[seqID];
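
The fallback added to `SequenceReshapeLayer::forward` builds start offsets `0, 1, ..., batchSize`, so each instance becomes its own sequence. A minimal sketch of that idea follows, assuming plain `std::vector<int>` offsets; `DefaultSequenceStarts` and `SequenceLengths` are hypothetical helper names, not part of Paddle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the default in the diff: with no sequence
// information, instance i is its own sequence, so starts = {0, 1, ..., n}.
inline std::vector<int> DefaultSequenceStarts(int batch_size) {
  std::vector<int> starts(static_cast<std::size_t>(batch_size) + 1);
  for (int i = 0; i <= batch_size; ++i) {
    starts[static_cast<std::size_t>(i)] = i;
  }
  return starts;
}

// Length of each sequence given its start offsets; a vector of n + 1 offsets
// describes n sequences, matching starts[seqID + 1] - starts[seqID].
inline std::vector<int> SequenceLengths(const std::vector<int>& starts) {
  std::vector<int> lengths;
  for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
    lengths.push_back(starts[i + 1] - starts[i]);
  }
  return lengths;
}
```

With the default offsets every length is 1, while explicit offsets such as `{0, 2, 5}` describe two sequences of lengths 2 and 3.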
--- a/paddle/memory/detail/system_allocator.cc
+++ b/paddle/memory/detail/system_allocator.cc
@ -41,7 +41,16 @@ void* CPUAllocator::Alloc(size_t& index, size_t size) {

  index = 0;  // unlock memory

-  void* p = malloc(size);
+  void* p;
+
+#ifdef PADDLE_USE_MKLDNN
+  // refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
+  // memory alignment
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 4096ul, size), 0);
+#else
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 32ul, size), 0);
+#endif
+  PADDLE_ENFORCE(p, "Failed to allocate CPU memory: size = %d.", size);

  if (p != nullptr) {
    if (FLAGS_use_pinned_memory) {
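
The switch from `malloc` to `posix_memalign` in `CPUAllocator::Alloc` can be tried in isolation. The sketch below assumes a POSIX system; `AlignedAlloc` and `IsAligned` are hypothetical helper names, and returning `nullptr` on failure replaces the `PADDLE_ENFORCE_EQ` check used in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocates `size` bytes aligned to `alignment`, which
// must be a power of two and a multiple of sizeof(void*). Returns nullptr if
// posix_memalign reports an error. Release the memory with free().
inline void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  void* p = nullptr;
  if (posix_memalign(&p, alignment, size) != 0) {
    return nullptr;
  }
  return p;
}

// True if the address of `p` is a multiple of `alignment`.
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A call such as `AlignedAlloc(32, 100)` mirrors the non-MKLDNN branch of the diff, and `AlignedAlloc(4096, size)` mirrors the `PADDLE_USE_MKLDNN` branch.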
--- a/paddle/operators/cross_entropy_op.cc
+++ b/paddle/operators/cross_entropy_op.cc
@ -28,8 +28,9 @@ class CrossEntropyOp : public framework::OperatorWithKernel {

    auto x_dims = ctx->GetInputDim("X");
    auto label_dims = ctx->GetInputDim("Label");
-    PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank should be 2.");
-    PADDLE_ENFORCE_EQ(label_dims.size(), 2, "Input(Label)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(x_dims.size(), 2UL, "Input(X)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(label_dims.size(), 2UL,
+                      "Input(Label)'s rank should be 2.");
    PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0],
                      "The 1st dimension of Input(X) and Input(Label) should "
                      "be equal.");
@ -38,8 +39,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
                        "If Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
-      PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "If Attr(soft_label) == false, the 2nd dimension of "
+      PADDLE_ENFORCE_EQ(label_dims[1], 1UL,
+                        "If Attr(softLabel) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }

@ -48,7 +49,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
  }

 protected:
-  // CrossEntropy's data type just determined by "X"
+  // Explicitly set that the data type of the output of the cross_entropy
+  // operator is determined by its input "X".
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("X")->type());
--- a/paddle/operators/linear_chain_crf_op.cc
+++ b/paddle/operators/linear_chain_crf_op.cc
--- a/paddle/operators/linear_chain_crf_op.cu
+++ b/paddle/operators/linear_chain_crf_op.cu
@ -0,0 +1,26 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/linear_chain_crf_op.h"
+
+namespace ops = paddle::operators;
+
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf_grad,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, double>);
--- a/paddle/operators/linear_chain_crf_op.h
+++ b/paddle/operators/linear_chain_crf_op.h
--- a/paddle/operators/lstm_op.cc
+++ b/paddle/operators/lstm_op.cc
@ -21,7 +21,6 @@ class LSTMOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Input"),
                   "Input(Input) of LSTM should not be null.");
@ -29,9 +28,13 @@ class LSTMOp : public framework::OperatorWithKernel {
                   "Output(Hidden) of LSTM should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Cell"),
                   "Output(Cell) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("BatchGate"),
+                   "Output(BatchGate) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("BatchCellPreAct"),
+                   "Output(BatchCellPreAct) of LSTM should not be null.");

-    auto x_dims = ctx->GetInputDim("Input");
-    PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank must be 2.");
+    auto in_dims = ctx->GetInputDim("Input");
+    PADDLE_ENFORCE_EQ(in_dims.size(), 2, "Input(Input)'s rank must be 2.");

    if (ctx->HasInput("H0")) {
      PADDLE_ENFORCE(ctx->HasInput("C0"),
@ -44,7 +47,7 @@ class LSTMOp : public framework::OperatorWithKernel {
                     "should be the same.");
    }

-    int frame_size = x_dims[1] / 4;
+    int frame_size = in_dims[1] / 4;
    auto w_dims = ctx->GetInputDim("Weight");
    PADDLE_ENFORCE_EQ(w_dims.size(), 2,
                      "The rank of Input(Weight) should be 2.");
@ -71,12 +74,21 @@ class LSTMOp : public framework::OperatorWithKernel {
                        "4 * %d if disable peepholes connection",
                        frame_size);
    }
-    ctx->SetOutputDim("Hidden", {x_dims[0], frame_size});
-    ctx->SetOutputDim("Cell", {x_dims[0], frame_size});
-    ctx->SetOutputDim("BatchGate", x_dims);
+    framework::DDim out_dims({in_dims[0], frame_size});
+    ctx->SetOutputDim("Hidden", out_dims);
+    ctx->SetOutputDim("Cell", out_dims);
+    ctx->SetOutputDim("BatchGate", in_dims);
+    ctx->SetOutputDim("BatchCellPreAct", out_dims);
    ctx->ShareLoD("Input", "Hidden");
    ctx->ShareLoD("Input", "Cell");
  }
+
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<framework::LoDTensor>("Input")->type());
+  }
 };

 class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
@ -86,16 +98,18 @@ class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
    AddInput("Input",
             "(LoDTensor) the first input is a LodTensor, which support "
             "variable-time length input sequence. The underlying tensor in "
-             "this LoDTensor is a matrix with shape (T X 4D), where, T is the "
+             "this LoDTensor is a matrix with shape (T X 4D), where T is the "
             "total time steps in this mini-batch, D is the hidden size.");
    AddInput("H0",
             "(Tensor, optional) the initial hidden state is an optional "
             "input. This is a tensor with shape (N x D), where N is the "
-             "batch size, D is the hidden size.");
+             "batch size, D is the hidden size.")
+        .AsDispensable();
    AddInput("C0",
             "(Tensor, optional) the initial cell state is an optional "
             "input. This is a tensor with shape (N x D), where N is the "
-             "batch size. `H0` and `C0` can be NULL but only at the same time");
+             "batch size. `H0` and `C0` can be NULL but only at the same time")
+        .AsDispensable();
    AddInput("Weight",
             "(Tensor) the learnable hidden-hidden weights."
             " - The shape is (D x 4D), where D is the hidden size. "
@ -109,22 +123,27 @@ class LSTMOpMaker : public framework::OpProtoAndCheckerMaker {
             " - Bias = {b_c, b_i, b_f, b_o}."
             "2. `usePeepholes = True` "
             " - The shape is (1 x 7D). "
-             " - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.");
+             " - Bias = {b_c, b_i, b_f, b_o, W_ic, W_fc, W_oc}.")
+        .AsDispensable();
+    AddOutput("Hidden",
+              "(LoDTensor) the hidden state of LSTM operator. "
+              "The shape is (T x D), with the same LoD as `Input`.");
+    AddOutput("Cell",
+              "(LoDTensor) the cell state of LSTM operator. "
+              "The shape is (T x D), with the same LoD as `Input`.");
     AddOutput("BatchGate",
               "(LoDTensor) This LoDTensor contains input gate, forget gate "
               "and output gate after the nonlinear computation. This "
               "LoDTensor has the same shape with the reorganized input, which "
-              "was also be called batch input. The LoD size is 2. The first "
+              "is also called batch input. The LoD size is 2. The first "
              "LoD is the batch offsets and the second LoD contains the "
              "indexes, which denote the position of reorganized sequence "
              "in the raw input.")
        .AsIntermediate();
-    AddOutput("Hidden",
-              "(LoDTensor) the hidden state lod tensor of LSTM operator. "
-              "The shape and lod is the same with the `Input`.");
-    AddOutput("Cell",
-              "(LoDTensor) the cell state lod tensor of LSTM operator. "
-              "The shape and lod is the same with the `Input`.");
+    AddOutput("BatchCellPreAct",
+              "(LoDTensor) This LoDTensor is obtained in the forward pass and "
+              "used in the backward pass.")
+        .AsIntermediate();
     AddAttr<bool>("usePeepholes",
                   "(bool, default: True) "
                  "whether to enable diagonal/peephole connections.")
@ -202,15 +221,37 @@ class LSTMGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Hidden")),
-                   "Input(Hidden@GRAD) should not be null");
-    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Cell")),
-                   "Input(Cell@GRAD) should not be null");
-    ctx->SetOutputDim(framework::GradVarName("Weight"),
-                      ctx->GetInputDim("Weight"));
-    ctx->SetOutputDim(framework::GradVarName("Bias"), ctx->GetInputDim("Bias"));
+    PADDLE_ENFORCE(ctx->HasInput("Input"),
+                   "Input(Input) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Hidden"),
+                   "Input(Hidden) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Cell"),
+                   "Input(Cell) of LSTM should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasInput("BatchGate"),
+                   "Input(BatchGate) of LSTM should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("BatchCellPreAct"),
+                   "Input(BatchCellPreAct) of LSTM should not be null.");
+
+    auto in_g_name = framework::GradVarName("Input");
+    if (ctx->HasOutput(in_g_name))
+      ctx->SetOutputDim(in_g_name, ctx->GetInputDim("Input"));
+
+    auto w_g_name = framework::GradVarName("Weight");
+    if (ctx->HasOutput(w_g_name))
+      ctx->SetOutputDim(w_g_name, ctx->GetInputDim("Weight"));
+
+    auto b_g_name = framework::GradVarName("Bias");
+    if (ctx->HasOutput(b_g_name))
+      ctx->SetOutputDim(b_g_name, ctx->GetInputDim("Bias"));
+  }
+
+ protected:
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<framework::LoDTensor>("Input")->type());
  }
 };
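
The shape inference in `LSTMOp::InferShape` maps a `(T x 4D)` input to `(T x D)` outputs for `Hidden`, `Cell`, and `BatchCellPreAct`, while `BatchGate` keeps the input shape. A minimal sketch of that arithmetic follows, assuming a two-element `std::array` in place of `framework::DDim`. `Shape2D`, `LstmShapes`, and `InferLstmShapes` are hypothetical names, and the divisibility check is an added assumption that the diff itself does not enforce.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for framework::DDim with rank 2.
using Shape2D = std::array<std::int64_t, 2>;

struct LstmShapes {
  Shape2D hidden;
  Shape2D cell;
  Shape2D batch_gate;
  Shape2D batch_cell_pre_act;
};

// The input is (T x 4D): T total time steps, D hidden size (frame_size).
inline LstmShapes InferLstmShapes(const Shape2D& in_dims) {
  // Assumption of this sketch: reject widths that are not a multiple of 4.
  if (in_dims[1] % 4 != 0) {
    throw std::invalid_argument("input width must be a multiple of 4");
  }
  const std::int64_t frame_size = in_dims[1] / 4;
  const Shape2D out_dims = {in_dims[0], frame_size};
  // Hidden, Cell and BatchCellPreAct are (T x D); BatchGate stays (T x 4D).
  return LstmShapes{out_dims, out_dims, in_dims, out_dims};
}
```

For an input of shape `(10 x 32)`, the sketch yields `(10 x 8)` for the three `(T x D)` outputs and `(10 x 32)` for `BatchGate`.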

--- a/paddle/operators/lstm_op.h
+++ b/paddle/operators/lstm_op.h
--- a/paddle/operators/math/detail/lstm_cpu_kernel.h
+++ b/paddle/operators/math/detail/lstm_cpu_kernel.h
@ -26,10 +26,7 @@ namespace detail {

 template <class T, class Op>
 void naive_lstm_forward_one_sequence(Op op, LstmMetaValue<T> value,
-                                     int frameSize,
-                                     activation_mode_t active_node,
-                                     activation_mode_t active_gate,
-                                     activation_mode_t active_state) {
+                                     int frameSize) {
  T rValueIn;
  T rValueIg;
  T rValueFg;
@ -60,10 +57,8 @@ void naive_lstm_forward_one_sequence(Op op, LstmMetaValue<T> value,
      rPrevState = value.prevStateValue[i];
    }

-    hppl::cpu::ForwardAct<T> act;
    op(rValueIn, rValueIg, rValueFg, rValueOg, rPrevState, rState, rStateAtv,
-       rOut, rCheckI, rCheckF, rCheckO, act(active_node), act(active_gate),
-       act(active_state));
+       rOut, rCheckI, rCheckF, rCheckO);

    valueIn[i] = rValueIn;
    valueIg[i] = rValueIg;
@ -77,10 +72,7 @@ void naive_lstm_forward_one_sequence(Op op, LstmMetaValue<T> value,

 template <class T, class Op>
 void naive_lstm_backward_one_sequence(Op op, LstmMetaValue<T> value,
-                                      LstmMetaGrad<T> grad, int frameSize,
-                                      activation_mode_t active_node,
-                                      activation_mode_t active_gate,
-                                      activation_mode_t active_state) {
+                                      LstmMetaGrad<T> grad, int frameSize) {
  T rValueIn;
  T rValueIg;
  T rValueFg;
@ -127,11 +119,10 @@ void naive_lstm_backward_one_sequence(Op op, LstmMetaValue<T> value,
      rPrevState = value.prevStateValue[i];
    }

-    hppl::cpu::BackwardAct<T> act;
    op(rValueIn, rValueIg, rValueFg, rValueOg, rGradIn, rGradIg, rGradFg,
       rGradOg, rPrevState, rPrevStateGrad, rState, rStateGrad, rStateAtv,
       rOutputGrad, rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad,
-       rCheckOGrad, act(active_node), act(active_gate), act(active_state));
+       rCheckOGrad);

    gradIn[i] = rGradIn;
    gradIg[i] = rGradIg;
@ -283,8 +274,7 @@ void cpu_lstm_forward(Op op, LstmMetaValue<T> value, int frameSize,
    avx_lstm_forward_one_sequence<T>(op, value, frameSize, active_node,
                                     active_gate, active_state);
  } else {
-    naive_lstm_forward_one_sequence<T>(op, value, frameSize, active_node,
-                                       active_gate, active_state);
+    naive_lstm_forward_one_sequence<T>(op, value, frameSize);
  }
 }

@ -297,8 +287,7 @@ void cpu_lstm_backward(Op op, LstmMetaValue<T> value, LstmMetaGrad<T> grad,
    avx_lstm_backward_one_sequence<T>(op, value, grad, frameSize, active_node,
                                      active_gate, active_state);
  } else {
-    naive_lstm_backward_one_sequence<T>(op, value, grad, frameSize, active_node,
-                                        active_gate, active_state);
+    naive_lstm_backward_one_sequence<T>(op, value, grad, frameSize);
  }
 }

--- a/paddle/operators/math/detail/lstm_gpu_kernel.h
+++ b/paddle/operators/math/detail/lstm_gpu_kernel.h
@ -32,9 +32,7 @@ namespace detail {
 */
 template <class T, class Op, bool isBatch>
 __global__ void KeLstmForward(Op op, LstmMetaValue<T> value, int frameSize,
-                              int batchSize, activation_mode_t active_node,
-                              activation_mode_t active_gate,
-                              activation_mode_t active_state) {
+                              int batchSize) {
  const int frameIdx = blockIdx.x * blockDim.x + threadIdx.x;
  if (frameIdx >= frameSize) return;

@ -70,10 +68,8 @@ __global__ void KeLstmForward(Op op, LstmMetaValue<T> value, int frameSize,
    rPrevState = value.prevStateValue[frameIdx];
  }

-  hppl::gpu::ForwardAct<T> act;
  op(rValueIn, rValueIg, rValueFg, rValueOg, rPrevState, rState, rStateAtv,
-     rOut, rCheckI, rCheckF, rCheckO, act(active_node), act(active_gate),
-     act(active_state));
+     rOut, rCheckI, rCheckF, rCheckO);

  value.gateValue[frameIdx] = rValueIn;
  value.gateValue[frameIdx + frameSize] = rValueIg;
@ -92,9 +88,7 @@ __global__ void KeLstmForward(Op op, LstmMetaValue<T> value, int frameSize,
 template <class T, class Op, bool isBatch>
 __global__ void KeLstmBackward(Op op, LstmMetaValue<T> value,
                               LstmMetaGrad<T> grad, int frameSize,
-                               int batchSize, activation_mode_t active_node,
-                               activation_mode_t active_gate,
-                               activation_mode_t active_state) {
+                               int batchSize) {
  const int frameIdx = blockIdx.x * blockDim.x + threadIdx.x;
  if (frameIdx >= frameSize) return;

@ -145,11 +139,9 @@ __global__ void KeLstmBackward(Op op, LstmMetaValue<T> value,
    rPrevState = value.prevStateValue[frameIdx];
  }

-  hppl::gpu::BackwardAct<T> act;
  op(rValueIn, rValueIg, rValueFg, rValueOg, rGradIn, rGradIg, rGradFg, rGradOg,
     rPrevState, rPrevStateGrad, rState, rStateGrad, rStateAtv, rOutputGrad,
-     rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad, rCheckOGrad,
-     act(active_node), act(active_gate), act(active_state));
+     rCheckI, rCheckF, rCheckO, rCheckIGrad, rCheckFGrad, rCheckOGrad);

  grad.gateGrad[frameIdx] = rGradIn;
  grad.gateGrad[frameIdx + frameSize] = rGradIg;
@ -205,13 +197,11 @@ void gpu_lstm_forward(const platform::DeviceContext& context, Op op,
  if (batchSize == 1) {
    KeLstmForward<T, Op,
                  /* isBatch= */ false><<<grid, threads, 0, stream>>>(
-        op, value, frameSize, batchSize, active_node, active_gate,
-        active_state);
+        op, value, frameSize, batchSize);
  } else {
    KeLstmForward<T, Op,
                  /* isBatch= */ true><<<grid, threads, 0, stream>>>(
-        op, value, frameSize, batchSize, active_node, active_gate,
-        active_state);
+        op, value, frameSize, batchSize);
  }
 }

@ -240,13 +230,11 @@ void gpu_lstm_backward(const platform::DeviceContext& context, Op op,
  if (batchSize == 1) {
    KeLstmBackward<T, Op,
                   /* isBatch= */ false><<<grid, threads, 0, stream>>>(
-        op, value, grad, frameSize, batchSize, active_node, active_gate,
-        active_state);
+        op, value, grad, frameSize, batchSize);
  } else {
    KeLstmBackward<T, Op,
                   /* isBatch= */ true><<<grid, threads, 0, stream>>>(
-        op, value, grad, frameSize, batchSize, active_node, active_gate,
-        active_state);
+        op, value, grad, frameSize, batchSize);
  }
 }

--- a/paddle/operators/math/detail/lstm_kernel.h
+++ b/paddle/operators/math/detail/lstm_kernel.h
@ -24,15 +24,29 @@ namespace detail {

 namespace forward {

+template <typename T>
+DEVICE inline T sigmoid(const T a) {
+  const T min = SIGMOID_THRESHOLD_MIN;
+  const T max = SIGMOID_THRESHOLD_MAX;
+  T tmp = (a < min) ? min : ((a > max) ? max : a);
+  return static_cast<T>(1.0) / (static_cast<T>(1.0) + exp(-tmp));
+}
+
+template <typename T>
+DEVICE inline T tanh(const T a) {
+  T tmp = -2.0 * a;
+  tmp = (tmp > EXP_MAX_INPUT) ? EXP_MAX_INPUT : tmp;
+  return (2.0 / (1.0 + exp(tmp))) - 1.0;
+}
+
 template <class T>
 class lstm {
 public:
  HOSTDEVICE void operator()(T &valueIn, T &valueIg, T &valueFg, T &valueOg,
                             T &prevState, T &state, T &stateAtv, T &output,
-                             T &checkI, T &checkF, T &checkO,
-                             typename hppl::ForwardActType<T>::type actInput,
-                             typename hppl::ForwardActType<T>::type actGate,
-                             typename hppl::ForwardActType<T>::type actState) {
+                             T &checkI, T &checkF, T &checkO) {
+#if 0
+    // TODO(qingqing) support activations specified by users
    valueIn = actInput(valueIn);
    valueIg = actGate(valueIg + prevState * checkI);
    valueFg = actGate(valueFg + prevState * checkF);
@ -40,6 +54,15 @@ class lstm {
    valueOg = actGate(valueOg + state * checkO);
    stateAtv = actState(state);
    output = valueOg * stateAtv;
+#else
+    valueIn = tanh<T>(valueIn);
+    valueIg = sigmoid<T>(valueIg + prevState * checkI);
+    valueFg = sigmoid<T>(valueFg + prevState * checkF);
+    state = valueIn * valueIg + prevState * valueFg;
+    valueOg = sigmoid<T>(valueOg + state * checkO);
+    stateAtv = tanh<T>(state);
+    output = valueOg * stateAtv;
+#endif
  }
 #ifndef __NVCC__
 #ifndef __AVX__  // If not compiled with AVX instructs. Disable AVX by default
@ -72,6 +95,16 @@ class lstm {

 namespace backward {

+template <typename T>
+DEVICE inline T sigmoid(const T a, const T b) {
+  return a * b * (1.0 - b);
+}
+
+template <typename T>
+DEVICE inline T tanh(const T a, const T b) {
+  return a * (1.0 - b * b);
+}
+
 template <class T>
 class lstm {
 public:
@ -80,10 +113,9 @@ class lstm {
                             T &prevState, T &prevStateGrad, T &state,
                             T &stateGrad, T &stateAtv, T &outputGrad,
                             T &checkI, T &checkF, T &checkO, T &checkIGrad,
-                             T &checkFGrad, T &checkOGrad,
-                             typename hppl::BackwardActType<T>::type actInput,
-                             typename hppl::BackwardActType<T>::type actGate,
-                             typename hppl::BackwardActType<T>::type actState) {
+                             T &checkFGrad, T &checkOGrad) {
+#if 0
+    // TODO(qingqing) support activations specified by users
    gradOg = actGate(outputGrad * stateAtv, valueOg);
    stateGrad += actState(outputGrad * valueOg, stateAtv) + gradOg * checkO;
    gradIn = actInput(stateGrad * valueIg, valueIn);
@ -93,6 +125,17 @@ class lstm {
    checkIGrad = gradIg * prevState;
    checkFGrad = gradFg * prevState;
    checkOGrad = gradOg * state;
+#else
+    gradOg = sigmoid<T>(outputGrad * stateAtv, valueOg);
+    stateGrad += tanh<T>(outputGrad * valueOg, stateAtv) + gradOg * checkO;
+    gradIn = tanh<T>(stateGrad * valueIg, valueIn);
+    gradIg = sigmoid<T>(stateGrad * valueIn, valueIg);
+    gradFg = sigmoid<T>(stateGrad * prevState, valueFg);
+    prevStateGrad = gradIg * checkI + gradFg * checkF + stateGrad * valueFg;
+    checkIGrad = gradIg * prevState;
+    checkFGrad = gradFg * prevState;
+    checkOGrad = gradOg * state;
+#endif
  }
 #ifndef __NVCC__
 #ifndef __AVX__  // If not compiled with AVX instructs. Disable AVX by default
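
Note on the hunks above: the hard-coded activations replace the `actInput`/`actGate`/`actState` function pointers. The forward helpers clamp their input before calling `exp`, and the backward helpers take the upstream gradient and the forward *output* (not the input). The following host-only sketch restates that pairing so it can be checked numerically. The three threshold values are assumptions, since the diff references `SIGMOID_THRESHOLD_MIN`, `SIGMOID_THRESHOLD_MAX` and `EXP_MAX_INPUT` without showing their definitions, and the function names are illustrative rather than taken from the commit.

```cpp
#include <cmath>

// Assumed values: the diff uses these macros but does not define them.
constexpr double kSigmoidThresholdMin = -40.0;
constexpr double kSigmoidThresholdMax = 13.0;
constexpr double kExpMaxInput = 40.0;

// Forward sigmoid with the input clamped to [min, max] before exp().
template <typename T>
T SigmoidForward(T a) {
  const T lo = static_cast<T>(kSigmoidThresholdMin);
  const T hi = static_cast<T>(kSigmoidThresholdMax);
  const T x = (a < lo) ? lo : ((a > hi) ? hi : a);
  return T(1) / (T(1) + std::exp(-x));
}

// Forward tanh written as 2 / (1 + exp(-2a)) - 1, with the exponent capped
// so that exp() cannot overflow for very negative inputs.
template <typename T>
T TanhForward(T a) {
  T e = T(-2) * a;
  const T cap = static_cast<T>(kExpMaxInput);
  e = (e > cap) ? cap : e;
  return T(2) / (T(1) + std::exp(e)) - T(1);
}

// Backward sigmoid: grad is the upstream gradient, out is the forward output.
template <typename T>
T SigmoidBackward(T grad, T out) {
  return grad * out * (T(1) - out);
}

// Backward tanh: d/dx tanh(x) = 1 - tanh(x)^2, expressed via the output.
template <typename T>
T TanhBackward(T grad, T out) {
  return grad * (T(1) - out * out);
}
```

Comparing `SigmoidBackward(1.0, SigmoidForward(x))` with a central finite difference of `SigmoidForward` around `x` is a quick way to confirm that the backward helpers expect the activation output as their second argument.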
--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
@ -211,6 +211,26 @@ void batched_gemm<platform::CPUPlace, double>(
 }
 #endif

+template <>
+void gemv<platform::CPUPlace, float>(const platform::DeviceContext& context,
+                                     const bool trans_a, const int M,
+                                     const int N, const float alpha,
+                                     const float* A, const float* B,
+                                     const float beta, float* C) {
+  CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+  cblas_sgemv(CblasRowMajor, transA, M, N, alpha, A, N, B, 1, beta, C, 1);
+}
+
+template <>
+void gemv<platform::CPUPlace, double>(const platform::DeviceContext& context,
+                                      const bool trans_a, const int M,
+                                      const int N, const double alpha,
+                                      const double* A, const double* B,
+                                      const double beta, double* C) {
+  CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
+  cblas_dgemv(CblasRowMajor, transA, M, N, alpha, A, N, B, 1, beta, C, 1);
+}
+
 template struct SetConstant<platform::CPUPlace, float>;

 }  // namespace math
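
Note on the CPU `gemv` specializations above: both forward to CBLAS with `CblasRowMajor` and a leading dimension of `N`, so `A` is read as an `M x N` row-major matrix and the call computes `C = alpha * op(A) * B + beta * C`. A plain-loop reference of those semantics, useful for comparing against the BLAS path, might look like the sketch below. `NaiveGemv` is a hypothetical name, not part of the commit.

```cpp
// Reference row-major GEMV: c = alpha * op(A) * b + beta * c, where A is an
// M x N row-major matrix and op(A) is A or A^T depending on trans_a.
template <typename T>
void NaiveGemv(bool trans_a, int M, int N, T alpha, const T* A, const T* b,
               T beta, T* c) {
  const int out_len = trans_a ? N : M;  // length of c
  const int in_len = trans_a ? M : N;   // length of b
  for (int i = 0; i < out_len; ++i) {
    T sum = T(0);
    for (int j = 0; j < in_len; ++j) {
      // Row-major element (r, col) lives at A[r * N + col].
      const T a = trans_a ? A[j * N + i] : A[i * N + j];
      sum += a * b[j];
    }
    // As in BLAS, c is not read when beta is zero.
    c[i] = alpha * sum + (beta == T(0) ? T(0) : beta * c[i]);
  }
}
```

The index expressions are the same ones the `GemvTest` loops in `math_function_test.cc` use for the expected values: `data_a[i * n + j]` without transpose and `data_a[j * n + i]` with it.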
--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
@ -203,6 +203,33 @@ void batched_gemm<platform::GPUPlace, double>(
      &beta, C, ldc, strideC, batchCount));
 }

+template <>
+void gemv<platform::GPUPlace, float>(const platform::DeviceContext& context,
+                                     const bool trans_a, const int M,
+                                     const int N, const float alpha,
+                                     const float* A, const float* B,
+                                     const float beta, float* C) {
+  cublasOperation_t cuTransA = (trans_a == false) ? CUBLAS_OP_T : CUBLAS_OP_N;
+
+  PADDLE_ENFORCE(platform::dynload::cublasSgemv(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransA, N, M, &alpha, A, N, B, 1, &beta, C, 1));
+}
+
+template <>
+void gemv<platform::GPUPlace, double>(const platform::DeviceContext& context,
+                                      const bool trans_a, const int M,
+                                      const int N, const double alpha,
+                                      const double* A, const double* B,
+                                      const double beta, double* C) {
+  cublasOperation_t cuTransA = (trans_a == false) ? CUBLAS_OP_T : CUBLAS_OP_N;
+  PADDLE_ENFORCE(platform::dynload::cublasDgemv(
+      reinterpret_cast<const platform::CUDADeviceContext&>(context)
+          .cublas_handle(),
+      cuTransA, N, M, &alpha, A, N, B, 1, &beta, C, 1));
+}
+
 template struct SetConstant<platform::GPUPlace, float>;

 }  // namespace math
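
Note on the GPU `gemv` specializations above: the flag looks inverted (`trans_a == false` maps to `CUBLAS_OP_T`) and the dimensions are passed as `N, M`. This is consistent with cuBLAS being column-major: an `M x N` row-major buffer, read as column-major with leading dimension `N`, is the `N x M` matrix `A^T`. The host-only sketch below emulates a column-major gemv and applies the same flip and swap. It does not call cuBLAS, and both function names are hypothetical.

```cpp
// Column-major GEMV in the BLAS argument order:
// y = alpha * op(A) * x + beta * y, where A is rows x cols, stored
// column-major with leading dimension lda.
template <typename T>
void ColMajorGemv(bool trans, int rows, int cols, T alpha, const T* A,
                  int lda, const T* x, T beta, T* y) {
  const int out_len = trans ? cols : rows;
  const int in_len = trans ? rows : cols;
  for (int i = 0; i < out_len; ++i) {
    T sum = T(0);
    for (int j = 0; j < in_len; ++j) {
      // Column-major element (r, c) lives at A[r + c * lda].
      const T a = trans ? A[j + i * lda] : A[i + j * lda];
      sum += a * x[j];
    }
    y[i] = alpha * sum + (beta == T(0) ? T(0) : beta * y[i]);
  }
}

// Row-major entry point using the same mapping as the diff: flip the
// transpose flag and pass (N, M) with leading dimension N.
template <typename T>
void RowMajorGemvViaColMajor(bool trans_a, int M, int N, T alpha, const T* A,
                             const T* b, T beta, T* c) {
  const bool col_trans = !trans_a;  // plays the role of CUBLAS_OP_T
  ColMajorGemv<T>(col_trans, N, M, alpha, A, N, b, beta, c);
}
```

With this mapping the wrapper returns the same values as a direct row-major loop, which is why the GPU path needs no explicit transpose of `A`.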
--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
@ -93,6 +93,11 @@ void batched_gemm(const platform::DeviceContext& context,
                  const T* A, const T* B, const T beta, T* C,
                  const int batchCount, const int strideA, const int strideB);

+template <typename Place, typename T>
+void gemv(const platform::DeviceContext& context, const bool trans_a,
+          const int M, const int N, const T alpha, const T* A, const T* B,
+          const T beta, T* C);
+
 template <typename Place, typename T>
 struct SetConstant {
  void operator()(const platform::DeviceContext& context,
--- a/paddle/operators/math/math_function_test.cc
+++ b/paddle/operators/math/math_function_test.cc
@ -89,3 +89,53 @@ TEST(math_function, zero) {
  EXPECT_EQ(t[2], 1);
  EXPECT_EQ(t[3], 1);
 }
+
+template <typename T>
+void GemvTest(int m, int n, bool trans) {
+  paddle::framework::Tensor mat_a;
+  paddle::framework::Tensor vec_b;
+  paddle::framework::Tensor vec_c;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  int b_num = trans ? m : n;
+  int c_num = trans ? n : m;
+
+  T* data_a = mat_a.mutable_data<T>({m, n}, *cpu_place);
+  T* data_b = vec_b.mutable_data<T>({b_num}, *cpu_place);
+  T* data_c = vec_c.mutable_data<T>({c_num}, *cpu_place);
+  for (int i = 0; i < mat_a.numel(); ++i) {
+    data_a[i] = static_cast<T>(i);
+  }
+  for (int i = 0; i < vec_b.numel(); ++i) {
+    data_b[i] = static_cast<T>(i);
+  }
+
+  paddle::platform::CPUDeviceContext context(*cpu_place);
+  paddle::operators::math::gemv<paddle::platform::CPUPlace, T>(
+      context, trans, static_cast<int>(m), static_cast<int>(n), 1., data_a,
+      data_b, 0., data_c);
+
+  if (!trans) {
+    for (int i = 0; i < m; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < n; ++j) {
+        sum += data_a[i * n + j] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  } else {
+    for (int i = 0; i < n; ++i) {
+      T sum = 0.0;
+      for (int j = 0; j < m; ++j) {
+        sum += data_a[j * n + i] * data_b[j];
+      }
+      ASSERT_FLOAT_EQ(data_c[i], sum);
+    }
+  }
+}
+
+TEST(math_function, gemv) {
+  GemvTest<float>(3, 13, false);
+  GemvTest<double>(4, 5, false);
+  GemvTest<float>(12, 7, true);
+  GemvTest<double>(7, 9, true);
+}