remove conflict

7 years ago · df59889984
parent e1e3859e88 99130c6e94
commit df59889984
167 changed files with 3283 additions and 1160 deletions
--- a/README.md
+++ b/README.md
@ -51,19 +51,19 @@ Please refer to our [release announcement](https://github.com/PaddlePaddle/Paddl
 - **Connected to Products**

    In addition, PaddlePaddle is also designed to be easily deployable. At Baidu,
-    PaddlePaddle has been deployed into products or service with a vast number
+    PaddlePaddle has been deployed into products and services with a vast number
    of users, including ad click-through rate (CTR) prediction, large-scale image
    classification, optical character recognition(OCR), search ranking, computer
    virus detection, recommendation, etc. It is widely utilized in products at
-    Baidu and it has achieved a significant impact. We hope you can also exploit
-    the capability of PaddlePaddle to make a huge impact for your product.
+    Baidu and it has achieved a significant impact. We hope you can also explore
+    the capability of PaddlePaddle to make an impact on your product.

 ## Installation

 It is recommended to check out the
 [Docker installation guide](http://doc.paddlepaddle.org/develop/doc/getstarted/build_and_install/docker_install_en.html)
 before looking into the
-[build from source guide](http://doc.paddlepaddle.org/develop/doc/getstarted/build_and_install/build_from_source_en.html)
+[build from source guide](http://doc.paddlepaddle.org/develop/doc/getstarted/build_and_install/build_from_source_en.html).

 ## Documentation

@ -72,7 +72,7 @@ We provide [English](http://doc.paddlepaddle.org/develop/doc/) and

 - [Deep Learning 101](http://book.paddlepaddle.org/index.html)

-  You might want to start from this online interactive book that can run in Jupyter Notebook.
+  You might want to start from this online interactive book that can run in a Jupyter Notebook.

 - [Distributed Training](http://doc.paddlepaddle.org/develop/doc/howto/usage/cluster/cluster_train_en.html)

--- a/benchmark/paddle/image/run_mkldnn.sh
+++ b/benchmark/paddle/image/run_mkldnn.sh
@ -1,10 +1,9 @@
 set -e

-unset OMP_NUM_THREADS MKL_NUM_THREADS
-export OMP_DYNAMIC="FALSE"
-export KMP_AFFINITY="granularity=fine,compact,0,0"
-
 function train() {
+  unset OMP_NUM_THREADS MKL_NUM_THREADS
+  export OMP_DYNAMIC="FALSE"
+  export KMP_AFFINITY="granularity=fine,compact,0,0"
  topology=$1
  bs=$2
  use_mkldnn=$3
--- a/doc/design/refactorization.md
+++ b/doc/design/refactorization.md
--- a/doc/design/tensor_array.md
+++ b/doc/design/tensor_array.md
@ -0,0 +1,73 @@
+# Design for TensorArray
+TensorArray as a new concept is borrowed from TensorFlow, 
+it is meant to be used with dynamic iteration primitives such as `while_loop` and `map_fn`.
+
+This concept can be used to support our new design of dynamic operations, and help to refactor some existing variant-sentence-related layers, 
+such as `RecurrentGradientMachine`.
+
+In [our design for dynamic RNN](https://github.com/PaddlePaddle/Paddle/pull/4401), 
+`TensorArray` is used to segment inputs and store states in all time steps.
+By providing some methods similar to a C++ array,
+the definition of some state-based dynamic models such as RNN could be more natural and highly flexible.
+
+## Dynamic-Related Methods
+Some basic methods should be proposed as follows:
+
+### stack()
+Pack the values in a `TensorArray` into a tensor with rank one higher than each tensor in `values`.
+### unstack(axis=0)
+Unpacks the given dimension of a rank-`R` tensor into rank-`(R-1)` tensors.
+### concat()
+Return the values in the `TensorArray` as a concatenated Tensor.
+### write(index, value, data_shared=true)
+Write value into index of the TensorArray.
+### read(index)
+Read the value at location `index` in the `TensorArray`.
+### size()
+Return the number of values.
+
+## LoDTensor-related Supports
+The `RecurrentGradientMachine` in Paddle serves as a flexible RNN layer; it takes variant length sequences as input, 
+because each step of RNN could only take a tensor-represented batch of data as input, 
+some preprocess should be taken on the inputs such as sorting the sentences by their length in descending order and cut each word and pack to new batches.
+
+Such cut-like operations can be embedded into `TensorArray` as general methods called `unpack` and `pack`.
+
+With these two methods, a variant-sentence-RNN can be implemented like
+
+```c++
+// input is the varient-length data
+LodTensor sentence_input(xxx);
+TensorArray ta;
+Tensor indice_map;
+Tensor boot_state = xxx; // to initialize rnn's first state
+TensorArray::unpack(input, 1/*level*/, true/*sort_by_length*/, &ta, &indice_map);
+TessorArray step_outputs;
+TensorArray states;
+
+for (int step = 0; step = ta.size(); step++) {
+  auto state = states.read(step);
+  // rnnstep is a function which acts like a step of RNN
+  auto step_input = ta.read(step);
+  auto step_output = rnnstep(step_input, state);
+  step_outputs.write(step_output, true/*data_shared*/);
+}
+
+// rnn_output is the final output of an rnn
+LoDTensor rnn_output = ta.pack(ta, indice_map);
+```
+the code above shows that by embedding the LoDTensor-related preprocess operations into `TensorArray`,
+the implementation of a RNN that supports varient-length sentences is far more concise than `RecurrentGradientMachine` because the latter mixes all the codes together, hard to read and extend.
+
+
+some details are as follows.
+
+### unpack(level, sort_by_length)
+Split LodTensor in some `level` and generate batches, if set `sort_by_length`, will sort by length.
+
+Returns:
+
+- a new `TensorArray`, whose values are LodTensors and represents batches of data.
+- an int32 Tensor, which stores the map from the new batch's indices to original LoDTensor
+### pack(level, indices_map)
+Recover the original LoD-arranged LoDTensor with the values in a `TensorArray` and `level` and `indices_map`.
--- a/doc/howto/dev/new_op_en.md
+++ b/doc/howto/dev/new_op_en.md
@ -182,7 +182,7 @@ Note that **different devices (CPU, GPU)share an Op definition; whether or not t

 `MulOp`'s CPU and GPU share the same `Kernel`. A non-sharing  `OpKernel` example can be seen in [`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43).

-To ease the writing of `OpKernel` compute, and for reusing code cross-device, `Eigen unsupported Tensor` module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md).
+To ease the writing of `OpKernel` compute, and for reusing code cross-device, [`Eigen-unsupported Tensor`](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md?fileviewer=file-view-default) module is used to implement `Compute` interface. To learn about how the Eigen library is used in PaddlePaddle, please see [usage document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/use_eigen_cn.md).


 This concludes the forward implementation of an operator. Next its operation and kernel need to be registered in a `.cc` file.
--- a/doc/howto/dev/use_eigen_en.md
+++ b/doc/howto/dev/use_eigen_en.md
@ -0,0 +1,146 @@
+## How to use Eigen in Paddle
+
+Essentially, a neural network is a compute graph. T data needed for the computation is stored in `Tensor`s and its computation procedure is described by `Operator`s. An `Operator` calls the `Compute` interface in its corresponding `OpKernel` and operates on the `Tensor`.
+
+
+### Eigen Tensor Module
+
+The Eigen Tensor module supports powerful element-wise computation. In addition, a piece of code written using it can be run on both the CPU and the GPU.
+
+Note that Eigen Tensor is still being actively developed, so its tests are not completely covered and its documentation may be sparse.
+
+For details on Eigen Tensor module, please see [doc 1](https://github.com/RLovelett/eigen/blob/master/unsupported/Eigen/CXX11/src/Tensor/README.md) and [doc 2](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md).
+
+
+### paddle::framework::Tensor
+
+Paddle Tensor's is defined in the framework directory with the following interface:
+
+```cpp
+class Tensor {
+ public:
+  /*! Return a pointer to mutable memory block. */
+  template <typename T>
+  inline T* data();
+
+  /**
+   * @brief   Return a pointer to mutable memory block.
+   * @note    If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(platform::Place place);
+
+  /**
+   * @brief     Return a pointer to mutable memory block.
+   *
+   * @param[in] dims    The dimensions of the memory block.
+   * @param[in] place   The place of the memory block.
+   *
+   * @note      If not exist, then allocation.
+   */
+  template <typename T>
+  inline T* mutable_data(DDim dims, platform::Place place);
+
+  /*! Resize the dimensions of the memory block. */
+  inline Tensor& Resize(const DDim& dims);
+
+  /*! Return the dimensions of the memory block. */
+  inline const DDim& dims() const;
+
+ private:
+  /*! holds the memory block if allocated. */
+  std::shared_ptr<Placeholder> holder_;
+
+  /*! points to dimensions of memory block. */
+  DDim dim_;
+};
+```
+
+`Placeholder` is used to delay memory allocation; that is, we can first define a tensor, using `Resize` to configure its shape, and then call `mutuable_data` to allocate the actual memory.
+
+```cpp
+paddle::framework::Tensor t;
+paddle::platform::CPUPlace place;
+// set size first
+t.Resize({2, 3});
+// allocate memory on CPU later
+t.mutable_data(place);
+```
+
+### paddle::framework::Tensor Usage
+`AddOp` demonstrates Tensor's usage.
+
+- InferShape
+
+When computing a neural network's compute graph, first call every `Operator`'s `InferShape` method, and use `Resize` to configure the size of the output tensor.
+
+```cpp
+void InferShape(const framework::InferShapeContext &ctx) const override {
+  PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
+                    ctx.Input<Tensor>("Y")->dims(),
+                    "Two input of Add Op's dimension must be same.");
+  ctx.Output<Tensor>("Out")->Resize(ctx.Input<Tensor>("X")->dims());
+}
+```
+
+
+- Run
+
+```cpp
+void Compute(const framework::ExecutionContext& context) const override {
+  auto* input0 = context.Input<Tensor>("X");
+  auto* input1 = context.Input<Tensor>("Y");
+  auto* output = context.Output<Tensor>("Out");
+
+  output->mutable_data<T>(context.GetPlace());
+
+  auto x = EigenVector<T>::Flatten(*input0);
+  auto y = EigenVector<T>::Flatten(*input1);
+  auto z = EigenVector<T>::Flatten(*output);
+
+  auto place = context.GetEigenDevice<Place>();
+
+  z.device(place) = x + y;
+}
+```
+
+
+### paddle::framework::Tensor到EigenTensor的转换
+
+As shown above, in actual computation, we need to transform the input and output `Tensor`s into formats Eigen supports. We show some functions in [eigen.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/eigen.h) to implement the transformation from `paddle::framework::Tensor`to `EigenTensor/EigenMatrix/EigenVector/EigenScalar`.
+
+Using EigenTensor as an example:
+
+```cpp
+Tensor t;
+float* p = t.mutable_data<float>(make_ddim({1, 2, 3}), platform::CPUPlace());
+for (int i = 0; i < 1 * 2 * 3; i++) {
+  p[i] = static_cast<float>(i);
+}
+
+EigenTensor<float, 3>::Type et = EigenTensor<float, 3>::From(t);
+```
+
+`From` is an interfacing method provided by the EigenTensor template, which implements the transformation from a `paddle::framework::Tensor` object to an EigenTensor. Since `rank` is a template parameter, it needs to be explicitly specified at the time of the transformation.
+
+In Eigen, tensors with different ranks are different types, with `Vector` bring a rank-1 instance. Note that `EigenVector<T>::From` uses a transformation from an 1-dimensional Paddle tensor to a 1-dimensional Eigen tensor while `EigenVector<T>::Flatten` reshapes a paddle tensor and flattens it into a 1-dimensional Eigen tensor. Both resulting tensors are still typed EigenVector.
+
+For more transformations, see the [unit tests](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/eigen_test.cc) in the `eigen_test.cc` file.
+
+
+
+### Implementing Computation
+
+While computing, the device interface is needed from the EigenTensors on the left hand side of the assignments. Note that the computation between EigenTensors only changes the data originally inthe Tensor and does not change all the shape information associated with the Tensor.
+
+```cpp
+auto x = EigenVector<T>::Flatten(*input0);
+auto y = EigenVector<T>::Flatten(*input1);
+auto z = EigenVector<T>::Flatten(*output);
+auto place = context.GetEigenDevice<Place>();
+z.device(place) = x + y;
+```
+
+In this code segment, input0/input1/output can be Tensors of arbitrary dimension. We are calling Flatten from EigenVector, transforming a tensor of any dimension into a 1-dimensional EigenVector. After completing computation, input0/input1/output will retain the same shape information, and they can be resized using the `Resize` interface.
+
+Because the Eigen Tensor module is under-documented, please refer to `OpKernel`'s computation code in TensorFlow's [kernel module documentation](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels).
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@ -19,13 +19,14 @@ cc_test(scope_test SRCS scope_test.cc DEPS scope)
 proto_library(framework_proto SRCS framework.proto)

 cc_library(attribute SRCS attribute.cc DEPS framework_proto)
+cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute)
 cc_library(op_proto_maker SRCS op_proto_maker.cc DEPS framework_proto attribute)
 cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker)
 cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto)
 cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope)
 cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry)

-cc_library(grad_op_builder SRCS grad_op_builder.cc DEPS operator)
+cc_library(grad_op_builder SRCS grad_op_builder.cc DEPS operator proto_desc)
 cc_library(op_registry SRCS op_registry.cc DEPS grad_op_builder op_proto_maker op_info)
 cc_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry)
 cc_test(grad_op_builder_test SRCS grad_op_builder_test.cc DEPS grad_op_builder op_registry add_op)
--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@ -0,0 +1,89 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/framework/block_desc.h"
+#include "paddle/framework/program_desc.h"
+
+namespace paddle {
+namespace framework {
+
+VarDescBind *BlockDescBind::NewVar(const std::string &name) {
+  need_update_ = true;
+  auto it = vars_.find(name);
+  PADDLE_ENFORCE(it == vars_.end(), "Duplicated variable %s", name);
+  auto var = new VarDescBind(name);
+  vars_[name].reset(var);
+  return var;
+}
+
+VarDescBind *BlockDescBind::Var(const std::string &name) const {
+  auto it = vars_.find(name);
+  PADDLE_ENFORCE(it != vars_.end(),
+                 "Can not find variable %s in current block.", name);
+  return it->second.get();
+}
+
+std::vector<VarDescBind *> BlockDescBind::AllVars() const {
+  std::vector<VarDescBind *> res;
+  for (const auto &p : vars_) {
+    res.push_back(p.second.get());
+  }
+  return res;
+}
+
+OpDescBind *BlockDescBind::AppendOp() {
+  need_update_ = true;
+  ops_.emplace_back(new OpDescBind());
+  return ops_.back().get();
+}
+
+OpDescBind *BlockDescBind::PrependOp() {
+  need_update_ = true;
+  ops_.emplace_front(new OpDescBind());
+  return ops_.front().get();
+}
+
+std::vector<OpDescBind *> BlockDescBind::AllOps() const {
+  std::vector<OpDescBind *> res;
+  for (const auto &op : ops_) {
+    res.push_back(op.get());
+  }
+  return res;
+}
+
+void BlockDescBind::Sync() {
+  if (need_update_) {
+    auto &op_field = *this->desc_->mutable_ops();
+    op_field.Clear();
+    op_field.Reserve(static_cast<int>(ops_.size()));
+    for (auto &op_desc : ops_) {
+      op_field.AddAllocated(op_desc->Proto());
+    }
+    need_update_ = false;
+  }
+}
+
+BlockDescBind *BlockDescBind::ParentBlock() const {
+  if (this->desc_->parent_idx() == -1) {
+    return nullptr;
+  }
+  return prog_->Block(static_cast<size_t>(this->desc_->parent_idx()));
+}
+
+void OpDescBind::SetBlockAttr(const std::string &name, BlockDescBind &block) {
+  BlockDesc *desc = block.RawPtr();
+  this->attrs_[name] = desc;
+}
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@ -0,0 +1,71 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#include <deque>
+#include <unordered_map>
+#include <vector>
+#include "paddle/framework/op_desc.h"
+#include "paddle/framework/var_desc.h"
+
+namespace paddle {
+namespace framework {
+
+class ProgramDescBind;
+
+// Each Protobuf Message, we provide a XXXBind class. In that class, we optimize
+// read/write speed. Only when we want the protobuf message, the local changes
+// will be synchronized (by `Sync` method).
+
+class BlockDescBind {
+ public:
+  BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
+      : prog_(prog), desc_(desc), need_update_(false) {}
+
+  BlockDescBind(const BlockDescBind &o) = delete;
+  BlockDescBind &operator=(const BlockDescBind &o) = delete;
+
+  int32_t ID() const { return desc_->idx(); }
+
+  int32_t Parent() const { return desc_->parent_idx(); }
+
+  VarDescBind *NewVar(const std::string &name_bytes);
+
+  VarDescBind *Var(const std::string &name_bytes) const;
+
+  std::vector<VarDescBind *> AllVars() const;
+
+  BlockDescBind *ParentBlock() const;
+
+  OpDescBind *AppendOp();
+
+  OpDescBind *PrependOp();
+
+  std::vector<OpDescBind *> AllOps() const;
+
+  void Sync();
+
+  BlockDesc *RawPtr() { return desc_; }
+
+ private:
+  ProgramDescBind *prog_;  // not_own
+  BlockDesc *desc_;        // not_own
+  bool need_update_;
+
+  std::deque<std::unique_ptr<OpDescBind>> ops_;
+  std::unordered_map<std::string, std::unique_ptr<VarDescBind>> vars_;
+};
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/data_type.h
+++ b/paddle/framework/data_type.h
@ -0,0 +1,36 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+#include <typeindex>
+#include "paddle/framework/framework.pb.h"
+
+namespace paddle {
+namespace framework {
+
+inline DataType ToDataType(std::type_index type) {
+  if (typeid(float).hash_code() == type.hash_code()) {
+    return DataType::FP32;
+  } else if (typeid(double).hash_code() == type.hash_code()) {
+    return DataType::FP64;
+  } else if (typeid(int).hash_code() == type.hash_code()) {
+    return DataType::INT32;
+  } else {
+    PADDLE_THROW("Not supported");
+    return static_cast<DataType>(-1);
+  }
+}
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/grad_op_builder.cc
+++ b/paddle/framework/grad_op_builder.cc
@ -54,5 +54,44 @@ OperatorBase* BuildGradOp(const OperatorBase* op) {
  return grad_info.Creator()(info.grad_op_type_, inputs, outputs, op->Attrs());
 }

+static void TransOpDescArg(const OpDescBind* src_op, const OpArgType& src_type,
+                           bool is_grad, OpDescBind* dst_op,
+                           const OpArgType& dst_type) {
+  PADDLE_ENFORCE(dst_op != nullptr,
+                 "Protobuf desc of gradient op must be initialized first.");
+  const auto& proto = OpInfoMap::Instance().Get(src_op->Type()).Proto();
+  const auto& src_arg_list =
+      src_type == OpArgType::IN ? proto.inputs() : proto.outputs();
+  for (const auto& arg : src_arg_list) {
+    if (arg.not_in_gradient() && !is_grad) continue;
+    const std::string src_name = arg.name();
+    std::vector<std::string> vars = src_type == OpArgType::IN
+                                        ? src_op->Input(src_name)
+                                        : src_op->Output(src_name);
+    if (is_grad) {
+      for (std::string& var : vars) {
+        var = GradVarName(var);
+      }
+    }
+    std::string dst_name = is_grad ? GradVarName(src_name) : src_name;
+    dst_type == OpArgType::IN ? dst_op->SetInput(dst_name, vars)
+                              : dst_op->SetOutput(dst_name, vars);
+  }
+}
+
+void CompleteGradOpDesc(const OpDescBind* forw_op, OpDescBind* grad_op) {
+  auto& info = OpInfoMap::Instance().Get(forw_op->Type());
+  PADDLE_ENFORCE(info.HasGradientOp());
+
+  grad_op->SetType(info.grad_op_type_);
+
+  TransOpDescArg(forw_op, OpArgType::IN, false, grad_op, OpArgType::IN);
+  TransOpDescArg(forw_op, OpArgType::OUT, false, grad_op, OpArgType::IN);
+  TransOpDescArg(forw_op, OpArgType::OUT, true, grad_op, OpArgType::IN);
+  TransOpDescArg(forw_op, OpArgType::IN, true, grad_op, OpArgType::OUT);
+
+  grad_op->SetAttrMap(forw_op->GetAttrMap());
+}
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/grad_op_builder.h
+++ b/paddle/framework/grad_op_builder.h
@ -14,6 +14,7 @@ limitations under the License. */

 #pragma once

+#include "paddle/framework/op_desc.h"
 #include "paddle/framework/operator.h"

 namespace paddle {
@ -21,5 +22,7 @@ namespace framework {

 OperatorBase* BuildGradOp(const OperatorBase* op);

+void CompleteGradOpDesc(const OpDescBind* forw_op, OpDescBind* grad_op);
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/grad_op_builder_test.cc
+++ b/paddle/framework/grad_op_builder_test.cc
@ -120,3 +120,82 @@ TEST(GradOpBuilder, IOIgnoredInGradient) {
            std::vector<std::string>(
                {f::GradVarName("in3_1"), f::GradVarName("in3_2")}));
 }
+
+TEST(GradOpDescBuilder, MutiInOut) {
+  f::OpDescBind *forw_op = new f::OpDescBind();
+  forw_op->SetType("mult_io");
+  forw_op->SetInput("In1", {"in1"});
+  forw_op->SetInput("In2_mult", {"in2_1", "in2_2", "in2_3"});
+  forw_op->SetInput("In3", {"in3"});
+  forw_op->SetOutput("Out1", {"out1"});
+  forw_op->SetOutput("Out2_mult", {"out2_1", "out2_2"});
+
+  f::OpDescBind *grad_op = new f::OpDescBind();
+  f::CompleteGradOpDesc(forw_op, grad_op);
+
+  EXPECT_EQ(grad_op->Type(), "mult_io_grad");
+  ASSERT_EQ(grad_op->InputNames().size(), 3UL + 2UL + 2UL);
+  EXPECT_EQ(grad_op->Input("In1"), std::vector<std::string>({"in1"}));
+  EXPECT_EQ(grad_op->Input("In2_mult"),
+            std::vector<std::string>({"in2_1", "in2_2", "in2_3"}));
+  EXPECT_EQ(grad_op->Input("In3"), std::vector<std::string>({"in3"}));
+  EXPECT_EQ(grad_op->Input("Out1"), std::vector<std::string>({"out1"}));
+  EXPECT_EQ(grad_op->Input("Out2_mult"),
+            std::vector<std::string>({"out2_1", "out2_2"}));
+  EXPECT_EQ(grad_op->Input(f::GradVarName("Out1")),
+            std::vector<std::string>({f::GradVarName("out1")}));
+  EXPECT_EQ(grad_op->Input(f::GradVarName("Out2_mult")),
+            std::vector<std::string>(
+                {f::GradVarName("out2_1"), f::GradVarName("out2_2")}));
+
+  ASSERT_EQ(grad_op->OutputNames().size(), 3UL);
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In1")),
+            std::vector<std::string>({f::GradVarName("in1")}));
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In2_mult")),
+            std::vector<std::string>({f::GradVarName("in2_1"),
+                                      f::GradVarName("in2_2"),
+                                      f::GradVarName("in2_3")}));
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In3")),
+            std::vector<std::string>({f::GradVarName("in3")}));
+  delete forw_op;
+  delete grad_op;
+}
+
+TEST(GradOpDescBuilder, IOIgnoredInGradient) {
+  f::OpDescBind *forw_op = new f::OpDescBind();
+  forw_op->SetType("io_ignored");
+  forw_op->SetInput("In1", {"in1"});
+  forw_op->SetInput("In2_mult", {"in2_1", "in2_2"});
+  forw_op->SetInput("In3_mult", {"in3_1", "in3_2"});
+  forw_op->SetOutput("Out1_mult", {"out1_1", "out1_2"});
+  forw_op->SetOutput("Out2", {"out2"});
+
+  f::OpDescBind *grad_op = new f::OpDescBind();
+  f::CompleteGradOpDesc(forw_op, grad_op);
+
+  EXPECT_EQ(grad_op->Type(), "io_ignored_grad");
+  // 'In2' and 'Out2' are ignored in gradient calculating
+  ASSERT_EQ(grad_op->InputNames().size(), 2UL + 1UL + 2UL);
+  EXPECT_EQ(grad_op->Input("In1"), std::vector<std::string>({"in1"}));
+  EXPECT_EQ(grad_op->Input("In3_mult"),
+            std::vector<std::string>({"in3_1", "in3_2"}));
+  EXPECT_EQ(grad_op->Input("Out1_mult"),
+            std::vector<std::string>({"out1_1", "out1_2"}));
+  EXPECT_EQ(grad_op->Input(f::GradVarName("Out1_mult")),
+            std::vector<std::string>(
+                {f::GradVarName("out1_1"), f::GradVarName("out1_2")}));
+  EXPECT_EQ(grad_op->Input(f::GradVarName("Out2")),
+            std::vector<std::string>({f::GradVarName("out2")}));
+
+  ASSERT_EQ(grad_op->OutputNames().size(), 3UL);
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In1")),
+            std::vector<std::string>({f::GradVarName("in1")}));
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In2_mult")),
+            std::vector<std::string>(
+                {f::GradVarName("in2_1"), f::GradVarName("in2_2")}));
+  EXPECT_EQ(grad_op->Output(f::GradVarName("In3_mult")),
+            std::vector<std::string>(
+                {f::GradVarName("in3_1"), f::GradVarName("in3_2")}));
+  delete forw_op;
+  delete grad_op;
+}
--- a/paddle/framework/lod_tensor.cc
+++ b/paddle/framework/lod_tensor.cc
@ -72,6 +72,22 @@ bool operator==(const LoD& a, const LoD& b) {
  return true;
 }

+size_t LoDTensor::NumElements(size_t level, size_t idx) const {
+  PADDLE_ENFORCE_LT(level, NumLevels());
+  PADDLE_ENFORCE_LT(idx, NumElements(level));
+  // the last level of LoD, just return number of records in Tensor
+  if (level == NumLevels() - 1) {
+    return lod_[level][idx + 1] - lod_[level][idx];
+  }
+  // high level of LoD, and there is another lower level, return number of
+  // lower-level elements
+  auto tmp = SliceInLevel(lod_, level, idx, idx + 1);
+  PADDLE_ENFORCE_GE(tmp.size(), 2);
+  // there is a 0 as a placeholder stored in LoD, so the number of elements
+  // equals lod.size() - 1
+  return tmp[1].size() - 1;
+}
+
 void LoDTensor::ShrinkLevels(size_t level_begin, size_t level_end) {
  auto new_lod = framework::SliceLevels(lod_, level_begin, level_end);
  lod_ = new_lod;
--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@ -38,6 +38,18 @@ using Vector = thrust::host_vector<
    T, thrust::system::cuda::experimental::pinned_allocator<T>>;
 #endif

+/*
+ * 3-level LoD stores
+ *
+ * 0 10 20
+ * 0 5 10 15 20
+ * 0 2 5 7 10 12 15 20
+ *
+ * - in a level, each element indicates offset in the underlying Tensor
+ * - the first element should be 0 and that indicates that this sequence start
+ * from 0
+ * - each sequence's begin and end(no-inclusive) is level[id, id+1]
+ */
 using LoD = std::vector<Vector<size_t>>;

 LoD SliceLevels(const LoD& in, size_t level_begin, size_t level_end);
@ -65,11 +77,8 @@ class LoDTensor : public Tensor {
   * Get a element from LoD.
   */
  size_t lod_element(size_t level, size_t elem) const {
-    PADDLE_ENFORCE(level < NumLevels(), "level [%d] out of range [%d]", level,
-                   NumLevels());
-    PADDLE_ENFORCE(elem < NumElements(level),
-                   "element begin [%d] out of range [%d]", elem,
-                   NumElements(level));
+    PADDLE_ENFORCE_LT(level, NumLevels());
+    PADDLE_ENFORCE_LT(elem, NumElements(level));
    return (lod_)[level][elem];
  }

@ -82,12 +91,23 @@ class LoDTensor : public Tensor {
   * Number of elements in a level.
   */
  size_t NumElements(size_t level = 0) const {
-    PADDLE_ENFORCE(level < NumLevels(), "level [%d] out of range [%d]", level,
-                   NumLevels());
+    PADDLE_ENFORCE_LT(level, NumLevels());
    // the last offset is the end of last element
    return (lod_)[level].size() - 1;
  }

+  /*
+   * Number of lower-level elements.
+   * For example, a 2-level lod-tensor
+   *
+   * 0-th level   |   |
+   * 1-th level   ||  |||
+   *
+   * NumElements(0, 0) get 2
+   * NumElements(0, 1) get 3
+   */
+  size_t NumElements(size_t level, size_t idx) const;
+
  /*
   * Shrink levels[level_begin:level_end]
   */
--- a/paddle/framework/lod_tensor_test.cc
+++ b/paddle/framework/lod_tensor_test.cc
@ -56,6 +56,12 @@ TEST_F(LoDTensorTester, NumElements) {
  ASSERT_EQ(lod_tensor_.NumElements(2), 8UL);
 }

+TEST_F(LoDTensorTester, NumElements2) {
+  ASSERT_EQ(lod_tensor_.NumElements(0, 0), 2UL);
+  ASSERT_EQ(lod_tensor_.NumElements(0, 1), 2UL);
+  ASSERT_EQ(lod_tensor_.NumElements(1, 1), 2UL);
+}
+
 TEST_F(LoDTensorTester, ShrinkLevels) {
  // slice 1 level
  for (size_t level = 0; level < 3UL; ++level) {
@ -65,7 +71,7 @@ TEST_F(LoDTensorTester, ShrinkLevels) {
    ASSERT_EQ(new_lod_tensor.NumElements(0), lod_tensor_.NumElements(level));
    ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
  }
-  // slice 2 level
+  // shrink 2 level
  for (size_t level = 0; level < 2UL; ++level) {
    LoDTensor new_lod_tensor = lod_tensor_;
    new_lod_tensor.ShrinkLevels(level, level + 2);
--- a/paddle/framework/lod_tensor_test.cu
+++ b/paddle/framework/lod_tensor_test.cu
@ -36,8 +36,8 @@ TEST(LoDTensor, LoDInGPU) {
  lod_tensor.mutable_data<float>(place);

  lod_tensor.set_lod(src_lod);
-  CHECK_EQ(lod_tensor.lod_element(0, 2), 4);
-  CHECK_EQ(lod_tensor.lod_element(0, 4), 8);
+  CHECK_EQ(lod_tensor.lod_element(0, 2), 4UL);
+  CHECK_EQ(lod_tensor.lod_element(0, 4), 8UL);

  auto lod = lod_tensor.lod();

--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@ -0,0 +1,144 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/framework/op_desc.h"
+#include "paddle/framework/block_desc.h"
+
+namespace paddle {
+namespace framework {
+
+OpDesc *OpDescBind::Proto() {
+  Sync();
+  return &op_desc_;
+}
+
+const std::vector<std::string> &OpDescBind::Input(
+    const std::string &name) const {
+  auto it = inputs_.find(name);
+  PADDLE_ENFORCE(it != inputs_.end(), "Input %s cannot be found in Op %s", name,
+                 Type());
+  return it->second;
+}
+
+std::vector<std::string> OpDescBind::InputNames() const {
+  std::vector<std::string> retv;
+  retv.reserve(this->inputs_.size());
+  for (auto &ipt : this->inputs_) {
+    retv.push_back(ipt.first);
+  }
+  return retv;
+}
+
+void OpDescBind::SetInput(const std::string &param_name,
+                          const std::vector<std::string> &args) {
+  need_update_ = true;
+  inputs_[param_name] = args;
+}
+
+const std::vector<std::string> &OpDescBind::Output(
+    const std::string &name) const {
+  auto it = outputs_.find(name);
+  PADDLE_ENFORCE(it != outputs_.end(), "Output %s cannot be found in Op %s",
+                 name, Type());
+  return it->second;
+}
+
+std::vector<std::string> OpDescBind::OutputNames() const {
+  std::vector<std::string> retv;
+  retv.reserve(this->outputs_.size());
+  for (auto &ipt : this->outputs_) {
+    retv.push_back(ipt.first);
+  }
+  return retv;
+}
+
+void OpDescBind::SetOutput(const std::string &param_name,
+                           const std::vector<std::string> &args) {
+  need_update_ = true;
+  this->outputs_[param_name] = args;
+}
+
+AttrType OpDescBind::GetAttrType(const std::string &name) const {
+  auto it = attrs_.find(name);
+  PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name);
+  return static_cast<AttrType>(it->second.which() - 1);
+}
+
+std::vector<std::string> OpDescBind::AttrNames() const {
+  std::vector<std::string> retv;
+  retv.reserve(attrs_.size());
+  for (auto &attr : attrs_) {
+    retv.push_back(attr.first);
+  }
+  return retv;
+}
+
+void OpDescBind::SetAttr(const std::string &name, const Attribute &v) {
+  this->attrs_[name] = v;
+  need_update_ = true;
+}
+
+void OpDescBind::SetAttrMap(
+    const std::unordered_map<std::string, Attribute> &attr_map) {
+  attrs_ = attr_map;
+  need_update_ = true;
+}
+
+Attribute OpDescBind::GetAttr(const std::string &name) const {
+  auto it = attrs_.find(name);
+  PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name);
+  return it->second;
+}
+
+int OpDescBind::GetBlockAttr(const std::string &name) const {
+  auto it = attrs_.find(name);
+  PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name);
+  return boost::get<BlockDesc *>(it->second)->idx();
+}
+
+const std::unordered_map<std::string, Attribute> &OpDescBind::GetAttrMap()
+    const {
+  return attrs_;
+}
+
+void OpDescBind::Sync() {
+  if (need_update_) {
+    this->op_desc_.mutable_inputs()->Clear();
+    for (auto &ipt : inputs_) {
+      auto *input = op_desc_.add_inputs();
+      input->set_parameter(ipt.first);
+      VectorToRepeated(ipt.second, input->mutable_arguments());
+    }
+
+    this->op_desc_.mutable_outputs()->Clear();
+    for (auto &opt : outputs_) {
+      auto *output = op_desc_.add_outputs();
+      output->set_parameter(opt.first);
+      VectorToRepeated(opt.second, output->mutable_arguments());
+    }
+
+    this->op_desc_.mutable_attrs()->Clear();
+    for (auto &attr : attrs_) {
+      auto *attr_desc = op_desc_.add_attrs();
+      attr_desc->set_name(attr.first);
+      attr_desc->set_type(
+          static_cast<framework::AttrType>(attr.second.which() - 1));
+      boost::apply_visitor(SetAttrDescVisitor(attr_desc), attr.second);
+    }
+
+    need_update_ = false;
+  }
+}
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/op_desc.h
+++ b/paddle/framework/op_desc.h
@ -0,0 +1,112 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#include <unordered_map>
+#include <vector>
+#include "paddle/framework/attribute.h"
+#include "paddle/framework/var_desc.h"
+
+namespace paddle {
+namespace framework {
+
+class BlockDescBind;
+
+class OpDescBind {
+ public:
+  OpDesc *Proto();
+
+  std::string Type() const { return op_desc_.type(); }
+
+  void SetType(const std::string &type) { op_desc_.set_type(type); }
+
+  const std::vector<std::string> &Input(const std::string &name) const;
+
+  std::vector<std::string> InputNames() const;
+
+  void SetInput(const std::string &param_name,
+                const std::vector<std::string> &args);
+
+  const std::vector<std::string> &Output(const std::string &name) const;
+
+  std::vector<std::string> OutputNames() const;
+
+  void SetOutput(const std::string &param_name,
+                 const std::vector<std::string> &args);
+
+  std::string DebugString() { return this->Proto()->DebugString(); }
+
+  bool HasAttr(const std::string &name) const {
+    return attrs_.find(name) != attrs_.end();
+  }
+
+  AttrType GetAttrType(const std::string &name) const;
+
+  std::vector<std::string> AttrNames() const;
+
+  void SetAttr(const std::string &name, const Attribute &v);
+
+  void SetBlockAttr(const std::string &name, BlockDescBind &block);
+
+  // Only be used in C++
+  void SetAttrMap(const std::unordered_map<std::string, Attribute> &attr_map);
+
+  Attribute GetAttr(const std::string &name) const;
+
+  int GetBlockAttr(const std::string &name) const;
+
+  // Only be used in C++
+  const std::unordered_map<std::string, Attribute> &GetAttrMap() const;
+
+ private:
+  struct SetAttrDescVisitor : public boost::static_visitor<void> {
+    explicit SetAttrDescVisitor(OpDesc::Attr *attr) : attr_(attr) {}
+    mutable OpDesc::Attr *attr_;
+    void operator()(int v) const { attr_->set_i(v); }
+    void operator()(float v) const { attr_->set_f(v); }
+    void operator()(const std::string &v) const { attr_->set_s(v); }
+    void operator()(bool b) const { attr_->set_b(b); }
+
+    void operator()(const std::vector<int> &v) const {
+      VectorToRepeated(v, attr_->mutable_ints());
+    }
+    void operator()(const std::vector<float> &v) const {
+      VectorToRepeated(v, attr_->mutable_floats());
+    }
+    void operator()(const std::vector<std::string> &v) const {
+      VectorToRepeated(v, attr_->mutable_strings());
+    }
+    void operator()(const std::vector<bool> &v) const {
+      VectorToRepeated(v, attr_->mutable_bools());
+    }
+    void operator()(BlockDesc *desc) const {
+      attr_->set_block_idx(desc->idx());
+    }
+    void operator()(boost::blank) const { PADDLE_THROW("Unexpected branch"); }
+  };
+
+  void Sync();
+
+  OpDesc op_desc_;
+  std::unordered_map<std::string, std::vector<std::string>> inputs_;
+  std::unordered_map<std::string, std::vector<std::string>> outputs_;
+  std::unordered_map<std::string, Attribute> attrs_;
+
+  // need_update_ indicate there some local changes not be synchronized. If
+  // local changes should be synchronized, need_update_ should be set to true.
+  bool need_update_{false};
+};
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@ -100,13 +100,39 @@ class OpRegistrar : public Registrar {
  }
 };

-template <typename PlaceType, typename KernelType>
+template <typename PlaceType, bool at_end, size_t I, typename... KernelType>
+struct OpKernelRegistrarFunctor;
+
+template <typename PlaceType, size_t I, typename... KernelTypes>
+struct OpKernelRegistrarFunctor<PlaceType, false, I, KernelTypes...> {
+  using KERNEL_TYPE =
+      typename std::tuple_element<I, std::tuple<KernelTypes...>>::type;
+
+  void operator()(const char* op_type) const {
+    using T = typename KERNEL_TYPE::ELEMENT_TYPE;
+    OperatorWithKernel::OpKernelKey key(ToDataType(std::type_index(typeid(T))),
+                                        PlaceType());
+    OperatorWithKernel::AllOpKernels()[op_type][key].reset(new KERNEL_TYPE);
+
+    constexpr auto size = std::tuple_size<std::tuple<KernelTypes...>>::value;
+    OpKernelRegistrarFunctor<PlaceType, I + 1 == size, I + 1, KernelTypes...>
+        func;
+    func(op_type);
+  }
+};
+
+template <typename PlaceType, size_t I, typename... KernelType>
+struct OpKernelRegistrarFunctor<PlaceType, true, I, KernelType...> {
+  void operator()(const char* op_type) const {}
+};
+
+// User can register many kernel in one place. The data type could be different.
+template <typename PlaceType, typename... KernelType>
 class OpKernelRegistrar : public Registrar {
 public:
  explicit OpKernelRegistrar(const char* op_type) {
-    OperatorWithKernel::OpKernelKey key;
-    key.place_ = PlaceType();
-    OperatorWithKernel::AllOpKernels()[op_type][key].reset(new KernelType);
+    OpKernelRegistrarFunctor<PlaceType, false, 0, KernelType...> func;
+    func(op_type);
  }
 };

--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@ -10,7 +10,6 @@ class CosineOp : public OperatorBase {
  using OperatorBase::OperatorBase;
  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const override {}
-  void InferShape(const Scope& scope) const override {}
 };

 class CosineOpProtoAndCheckerMaker : public OpProtoAndCheckerMaker {
@ -29,7 +28,6 @@ class CosineOpProtoAndCheckerMaker : public OpProtoAndCheckerMaker {
 class MyTestOp : public OperatorBase {
 public:
  using OperatorBase::OperatorBase;
-  void InferShape(const Scope& scope) const override {}
  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const override {}
 };
--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@ -22,14 +22,14 @@ namespace framework {
 template <>
 Eigen::DefaultDevice& ExecutionContext::GetEigenDevice<
    platform::CPUPlace, Eigen::DefaultDevice>() const {
-  return *device_context_.get_eigen_device<Eigen::DefaultDevice>();
+  return *device_context_.GetEigenDevice<platform::CPUPlace>();
 }

 #ifndef PADDLE_ONLY_CPU
 template <>
 Eigen::GpuDevice&
 ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
-  return *device_context_.get_eigen_device<Eigen::GpuDevice>();
+  return *device_context_.GetEigenDevice<platform::GPUPlace>();
 }
 #endif

--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@ -15,12 +15,14 @@ limitations under the License. */
 #pragma once

 #include <algorithm>
+#include <atomic>
 #include <string>
 #include <unordered_map>
 #include <vector>

 #include "op_info.h"
 #include "paddle/framework/attribute.h"
+#include "paddle/framework/data_type.h"
 #include "paddle/framework/framework.pb.h"
 #include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/scope.h"
@ -82,10 +84,6 @@ class OperatorBase {

  virtual std::string DebugString() const;

-  /// InferShape infer the size of Variables used by this Operator with
-  /// information inside scope
-  virtual void InferShape(const Scope& scope) const = 0;
-
  /// Net will call this function to Run an op.
  virtual void Run(const Scope& scope,
                   const platform::DeviceContext& dev_ctx) const = 0;
@ -163,7 +161,6 @@ class OperatorBase {
 class NOP : public OperatorBase {
 public:
  using OperatorBase::OperatorBase;
-  void InferShape(const Scope& scope) const override {}
  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const override {}
  std::unique_ptr<OperatorBase> Clone() const override {
@ -299,21 +296,6 @@ template <>
 std::vector<Tensor*> InferShapeContext::MultiOutput<Tensor>(
    const std::string& name) const;

-template <typename T>
-struct EigenDeviceConverter;
-
-template <>
-struct EigenDeviceConverter<platform::CPUPlace> {
-  using EigenDeviceType = Eigen::DefaultDevice;
-};
-
-#ifndef PADDLE_ONLY_CPU
-template <>
-struct EigenDeviceConverter<platform::GPUPlace> {
-  using EigenDeviceType = Eigen::GpuDevice;
-};
-#endif
-
 class ExecutionContext : public InferShapeContext {
 public:
  ExecutionContext(const OperatorBase& op, const Scope& scope,
@ -321,8 +303,8 @@ class ExecutionContext : public InferShapeContext {
      : InferShapeContext(op, scope), device_context_(device_context) {}

  template <typename PlaceType,
-            typename DeviceType =
-                typename EigenDeviceConverter<PlaceType>::EigenDeviceType>
+            typename DeviceType = typename platform::EigenDeviceConverter<
+                PlaceType>::EigenDeviceType>
  DeviceType& GetEigenDevice() const;

  platform::Place GetPlace() const { return device_context_.GetPlace(); }
@ -407,7 +389,7 @@ class RuntimeInferShapeContext : public InferShapeContextBase {
  const Scope& scope_;
 };

-class OpKernel {
+class OpKernelBase {
 public:
  /**
   * ExecutionContext is the only parameter of Kernel Run function.
@ -418,48 +400,61 @@ class OpKernel {

  virtual void Compute(const ExecutionContext& context) const = 0;

-  virtual ~OpKernel() {}
+  virtual ~OpKernelBase() = default;
+};
+
+template <typename T>
+class OpKernel : public OpKernelBase {
+ public:
+  using ELEMENT_TYPE = T;
 };

 class OperatorWithKernel : public OperatorBase {
 public:
  struct OpKernelKey {
    platform::Place place_;
+    DataType data_type_;

-    OpKernelKey() = default;
-    explicit OpKernelKey(const platform::DeviceContext& dev_ctx) {
-      place_ = dev_ctx.GetPlace();
-    }
+    OpKernelKey(DataType data_type, platform::Place place)
+        : place_(place), data_type_(data_type) {}
+
+    OpKernelKey(DataType data_type, const platform::DeviceContext& dev_ctx)
+        : place_(dev_ctx.GetPlace()), data_type_(data_type) {}

    bool operator==(const OpKernelKey& o) const {
-      return platform::places_are_same_class(place_, o.place_);
+      return platform::places_are_same_class(place_, o.place_) &&
+             data_type_ == o.data_type_;
    }
  };

  struct OpKernelHash {
-    std::hash<bool> hash_;
+    std::hash<int> hash_;
    size_t operator()(const OpKernelKey& key) const {
-      return hash_(platform::is_gpu_place(key.place_));
+      int place = key.place_.which();
+      int data_type = static_cast<int>(key.data_type_);
+      int pre_hash = data_type << NUM_PLACE_TYPE_LIMIT_IN_BIT |
+                     (place & ((1 << NUM_PLACE_TYPE_LIMIT_IN_BIT) - 1));
+      return hash_(pre_hash);
    }
  };

  using OpKernelMap =
-      std::unordered_map<OpKernelKey, std::unique_ptr<OpKernel>, OpKernelHash>;
+      std::unordered_map<OpKernelKey, std::unique_ptr<OpKernelBase>,
+                         OpKernelHash>;

  OperatorWithKernel(const std::string& type, const VariableNameMap& inputs,
                     const VariableNameMap& outputs, const AttributeMap& attrs)
      : OperatorBase(type, inputs, outputs, attrs) {}

-  // runtime infershape
-  void InferShape(const Scope& scope) const override {
-    auto c = RuntimeInferShapeContext(*this, scope);
-    InferShape(&c);
-  }
-
  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const final {
-    auto& opKernel = AllOpKernels().at(type_).at(OpKernelKey(dev_ctx));
-    opKernel->Compute(ExecutionContext(*this, scope, dev_ctx));
+    RuntimeInferShapeContext infer_shape_ctx(*this, scope);
+    this->InferShape(&infer_shape_ctx);
+
+    ExecutionContext ctx(*this, scope, dev_ctx);
+    auto& opKernel = AllOpKernels().at(type_).at(
+        OpKernelKey(IndicateDataType(ctx), dev_ctx));
+    opKernel->Compute(ctx);
  }

  static std::unordered_map<std::string /* op_type */, OpKernelMap>&
@ -469,13 +464,43 @@ class OperatorWithKernel : public OperatorBase {
  }

  bool SupportGPU() const override {
-    OperatorWithKernel::OpKernelKey key;
-    key.place_ = platform::GPUPlace();
-    return OperatorWithKernel::AllOpKernels().at(type_).count(key) != 0;
+    auto& op_kernels = OperatorWithKernel::AllOpKernels().at(type_);
+    return std::any_of(op_kernels.begin(), op_kernels.end(),
+                       [](OpKernelMap::const_reference kern_pair) {
+                         return platform::is_gpu_place(kern_pair.first.place_);
+                       });
  }

 protected:
  virtual void InferShape(InferShapeContextBase* ctx) const = 0;
+
+  // indicate kernel DataType by input data. Defaultly all input data must be
+  // same.
+  virtual DataType IndicateDataType(const ExecutionContext& ctx) const {
+    auto& scope = ctx.scope();
+    int data_type = -1;
+    for (auto& input : this->inputs_) {
+      for (auto& ipt_name : input.second) {
+        auto* var = scope.FindVar(ipt_name);
+        if (var != nullptr) {
+          const Tensor* t = nullptr;
+          if (var->IsType<Tensor>()) {
+            t = &var->Get<Tensor>();
+          } else if (var->IsType<LoDTensor>()) {
+            t = &var->Get<LoDTensor>();
+          }
+          if (t != nullptr) {
+            int tmp = static_cast<int>(ToDataType(t->type()));
+            PADDLE_ENFORCE(tmp == data_type || data_type == -1,
+                           "DataType of Paddle Op must be same.");
+            data_type = tmp;
+          }
+        }
+      }
+    }
+    PADDLE_ENFORCE(data_type != -1, "DataType should be indicated by input");
+    return static_cast<DataType>(data_type);
+  }
 };

 }  // namespace framework
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@ -27,7 +27,6 @@ class OpWithoutKernelTest : public OperatorBase {
  OpWithoutKernelTest(const std::string& type, const VariableNameMap& inputs,
                      const VariableNameMap& outputs, const AttributeMap& attrs)
      : OperatorBase(type, inputs, outputs, attrs), x(1) {}
-  void InferShape(const Scope& scope) const override {}
  void Run(const Scope& scope,
           const platform::DeviceContext& dev_ctx) const override {
    ++op_run_num;
@ -87,7 +86,6 @@ TEST(OperatorBase, all) {
  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  scope.NewVar("OUT1");
  ASSERT_EQ(paddle::framework::op_run_num, 0);
-  op->InferShape(scope);
  op->Run(scope, device_context);
  ASSERT_EQ(paddle::framework::op_run_num, 1);
 }
@ -116,10 +114,13 @@ class OpWithKernelTest : public OperatorWithKernel {

 protected:
  void InferShape(framework::InferShapeContextBase* ctx) const override {}
+  DataType IndicateDataType(const ExecutionContext& ctx) const override {
+    return DataType::FP32;
+  }
 };

 template <typename T1, typename T2>
-class CPUKernelTest : public OpKernel {
+class CPUKernelTest : public OpKernel<float> {
 public:
  void Compute(const ExecutionContext& ctx) const {
    std::cout << "this is cpu kernel" << std::endl;
@ -146,7 +147,7 @@ class OpKernelTestMultiInputsProtoAndCheckerMaker
  }
 };

-class CPUKernalMultiInputsTest : public OpKernel {
+class CPUKernalMultiInputsTest : public OpKernel<float> {
 public:
  void Compute(const ExecutionContext& ctx) const {
    auto xs = ctx.op().Inputs("xs");
@ -255,7 +256,6 @@ class OperatorClone : public paddle::framework::OperatorBase {
                const paddle::framework::VariableNameMap& outputs,
                const paddle::framework::AttributeMap& attrs)
      : OperatorBase(type, inputs, outputs, attrs) {}
-  void InferShape(const paddle::framework::Scope& scope) const override {}
  void Run(const paddle::framework::Scope& scope,
           const paddle::platform::DeviceContext& dev_ctx) const override {}
 };
--- a/paddle/framework/program_desc.cc
+++ b/paddle/framework/program_desc.cc
@ -0,0 +1,60 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/framework/program_desc.h"
+#include "paddle/framework/block_desc.h"
+
+namespace paddle {
+namespace framework {
+
+using ProgDescMap =
+    std::unordered_map<ProgramDesc *, std::unique_ptr<ProgramDescBind>>;
+static ProgDescMap *g_bind_map = nullptr;
+
+ProgramDescBind &ProgramDescBind::Instance(ProgramDesc *prog) {
+  if (g_bind_map == nullptr) {
+    g_bind_map = new ProgDescMap();
+  }
+  auto &map = *g_bind_map;
+  auto &ptr = map[prog];
+
+  if (ptr == nullptr) {
+    ptr.reset(new ProgramDescBind(prog));
+  }
+  return *ptr;
+}
+
+BlockDescBind *ProgramDescBind::AppendBlock(const BlockDescBind &parent) {
+  auto *b = prog_->add_blocks();
+  b->set_parent_idx(parent.ID());
+  b->set_idx(prog_->blocks_size() - 1);
+  blocks_.emplace_back(new BlockDescBind(this, b));
+  return blocks_.back().get();
+}
+
+ProgramDesc *ProgramDescBind::Proto() {
+  for (auto &block : blocks_) {
+    block->Sync();
+  }
+  return prog_;
+}
+
+ProgramDescBind::ProgramDescBind(ProgramDesc *prog) {
+  prog_ = prog;
+  for (auto &block : *prog->mutable_blocks()) {
+    blocks_.emplace_back(new BlockDescBind(this, &block));
+  }
+}
+}  // namespace framework
+}  // namespace paddle
--- a/Show More
+++ b/Show More