Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dev_add_axis

8 years ago · 823bdd670f
parent f2a66ffabb 6eec2b70d1
commit 823bdd670f
44 changed files with 950 additions and 410 deletions
--- a/doc/howto/dev/new_op_cn.md
+++ b/doc/howto/dev/new_op_cn.md
--- a/doc/howto/dev/use_eigen_cn.md
+++ b/doc/howto/dev/use_eigen_cn.md
@ -0,0 +1,146 @@
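 ## How to use Eigen in Paddle
 A neural network is essentially a computation graph. The data needed for computation is stored in `Tensor`s, and the computation itself is described by `Operator`s. At execution time, an `Operator` calls the `Compute` interface of its corresponding `OpKernel` to operate on the `Tensor`s.
 ### The Eigen Tensor module
 The Eigen Tensor module provides strong support for element-wise computation, and a single piece of code can run on both CPU and GPU. However, Eigen Tensor is still under development, so it may not be thoroughly tested and its documentation is sparse.
 For a detailed introduction to the Eigen Tensor module, see [document 1](https://github.com/RLovelett/eigen/blob/master/unsupported/Eigen/CXX11/src/Tensor/README.md) and [document 2](https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/README.md).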
 ### paddle::framework::Tensor
 Paddle's Tensor is defined in the framework directory. Its main interface is as follows:
 ```cpp
 class Tensor {
 public:
  /*! Return a pointer to mutable memory block. */
  template <typename T>
  inline T* data();
  /**
   * @brief   Return a pointer to mutable memory block.
   * @note    If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(platform::Place place);
  /**
   * @brief     Return a pointer to mutable memory block.
   *
   * @param[in] dims    The dimensions of the memory block.
   * @param[in] place   The place of the memory block.
   *
   * @note      If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(DDim dims, platform::Place place);
  /*! Resize the dimensions of the memory block. */
  inline Tensor& Resize(const DDim& dims);
  /*! Return the dimensions of the memory block. */
  inline const DDim& dims() const;
 private:  
  /*! holds the memory block if allocated. */
  std::shared_ptr<Placeholder> holder_;
  /*! points to dimensions of memory block. */
  DDim dim_;
 };
 ```
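 The role of `Placeholder` is to defer memory allocation: we can first define a Tensor, then set its size with the `Resize` interface, and finally call the `mutable_data` interface to allocate the actual memory.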
 ```cpp
 paddle::framework::Tensor t;
 paddle::platform::CPUPlace place;
 // set size first
 t.Resize({2, 3});
 // allocate memory on CPU later
 t.mutable_data<float>(place);
 ```
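The "Resize first, allocate later" pattern can be sketched without any Paddle dependency. In the block below, `LazyTensor` is a hypothetical stand-in written for illustration only; it is not Paddle's `Tensor`, it ignores `Place`, and it uses a `std::vector` of bytes where the real class uses `Placeholder`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for paddle::framework::Tensor; it only illustrates
// the deferred-allocation pattern and is not the real class.
class LazyTensor {
 public:
  // Record the shape only; no memory is allocated here.
  LazyTensor& Resize(std::vector<int64_t> dims) {
    dims_ = std::move(dims);
    return *this;
  }

  // Number of elements implied by the current shape.
  int64_t numel() const {
    return std::accumulate(dims_.begin(), dims_.end(), int64_t{1},
                           std::multiplies<int64_t>());
  }

  bool allocated() const { return holder_ != nullptr; }

  // Allocate on first use, or when the existing block is too small.
  template <typename T>
  T* mutable_data() {
    assert(!dims_.empty() && "call Resize before mutable_data");
    const std::size_t bytes = static_cast<std::size_t>(numel()) * sizeof(T);
    if (!holder_ || holder_->size() < bytes) {
      holder_ = std::make_shared<std::vector<unsigned char>>(bytes);
    }
    return reinterpret_cast<T*>(holder_->data());
  }

 private:
  std::shared_ptr<std::vector<unsigned char>> holder_;  // owns the bytes
  std::vector<int64_t> dims_;                            // shape only
};
```

Calling `Resize({2, 3})` leaves `allocated()` false; the first `mutable_data<float>()` call allocates room for six floats, and a second call with the same shape returns the same pointer.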
 ### paddle::framework::Tensor usage example
 The following uses AddOp as an example to show how Tensor is used:
 - InferShape
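 When running a neural network's computation graph, we first call each `Operator`'s `InferShape` interface, which sets the size of the output Tensor based on the sizes of the input Tensors. This is where `Resize` gets called.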
 ```cpp
 void InferShape(const framework::InferShapeContext &ctx) const override {
  PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
                    ctx.Input<Tensor>("Y")->dims(),
                    "Two input of Add Op's dimension must be same.");
  ctx.Output<Tensor>("Out")->Resize(ctx.Input<Tensor>("X")->dims());
 }
 ```
 - Run
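 An `Operator`'s `Run` interface eventually calls the `Compute` interface of the corresponding `OpKernel`. Memory is actually allocated at this point, through a call to `mutable_data`.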
 ```cpp
 void Compute(const framework::ExecutionContext& context) const override {
  auto* input0 = context.Input<Tensor>("X");
  auto* input1 = context.Input<Tensor>("Y");
  auto* output = context.Output<Tensor>("Out");
  output->mutable_data<T>(context.GetPlace());
  auto x = EigenVector<T>::Flatten(*input0);
  auto y = EigenVector<T>::Flatten(*input1);
  auto z = EigenVector<T>::Flatten(*output);
  auto place = context.GetEigenDevice<Place>();
  z.device(place) = x + y;
 }
 ```
 ### Converting paddle::framework::Tensor to EigenTensor
 As the previous section shows, the actual computation requires first converting the input and output Tensors into a format that Eigen supports. [eigen.h](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/eigen.h) provides several global functions that convert paddle::framework::Tensor to EigenTensor/EigenMatrix/EigenVector/EigenScalar.
 Taking EigenTensor as an example:
 ```cpp
 Tensor t;
 float* p = t.mutable_data<float>(make_ddim({1, 2, 3}), platform::CPUPlace());
 for (int i = 0; i < 1 * 2 * 3; i++) {
  p[i] = static_cast<float>(i);
 }
 EigenTensor<float, 3>::Type et = EigenTensor<float, 3>::From(t);
 ```
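 `From` is an interface provided by the EigenTensor template that converts a paddle::framework::Tensor to an EigenTensor. Because the rank of the Tensor is a template parameter, it must be specified explicitly in the conversion.
 In Eigen, Tensors of different rank are different types, and a Vector is a Tensor of rank 1. Note in particular that `EigenVector<T>::From` converts a one-dimensional Paddle Tensor into a one-dimensional Eigen Tensor, represented here as an EigenVector, whereas `EigenVector<T>::Flatten` reshapes a Paddle Tensor of any rank, flattening it into a one-dimensional Eigen Tensor whose type is still EigenVector.
 For more conversion methods, see the [unit tests](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/eigen_test.cc) in eigen_test.cc.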
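Because the rank is fixed at compile time on the Eigen side, a runtime shape has to be checked against it during conversion. The following is a minimal sketch of that idea using only the standard library. `ToStaticDims` and `ToFlatDims` are hypothetical names chosen for illustration; they mimic the rank check of `EigenDim<D>::From` and the collapse-to-rank-1 behavior of `Flatten`, and are not Paddle or Eigen APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// From-style conversion: the rank D is a template parameter, so the runtime
// shape must have exactly D entries or the conversion is rejected.
template <std::size_t D>
std::array<int64_t, D> ToStaticDims(const std::vector<int64_t>& dims) {
  if (dims.size() != D) {
    throw std::invalid_argument("D must match the runtime rank");
  }
  std::array<int64_t, D> out{};
  for (std::size_t d = 0; d < D; ++d) out[d] = dims[d];
  return out;
}

// Flatten-style conversion: a shape of any rank collapses to rank 1, whose
// single extent is the total number of elements.
inline std::array<int64_t, 1> ToFlatDims(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (int64_t d : dims) n *= d;
  return {n};
}
```

With the shape `{1, 2, 3}`, `ToStaticDims<3>` succeeds, `ToStaticDims<2>` throws, and `ToFlatDims` yields a single extent of 6.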
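 ### Implementing the computation
 To perform the computation, the EigenTensor on the left-hand side of the assignment must call the `device` interface. Note that operations between EigenTensors only change the data in the original Tensors; they do not change the original Tensors' shape information.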
 ```cpp
 auto x = EigenVector<T>::Flatten(*input0);
 auto y = EigenVector<T>::Flatten(*input1);
 auto z = EigenVector<T>::Flatten(*output);
 auto place = context.GetEigenDevice<Place>();
 z.device(place) = x + y;
 ```
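 In this code, input0/input1/output can be Tensors of any dimension. We call EigenVector's `Flatten` interface to turn a Tensor of any dimension into a one-dimensional EigenVector. After the computation finishes, the original shape information of input0/input1/output is unchanged. To change the shape of an original Tensor, call the `Resize` interface.
 Because documentation for the Eigen Tensor module is sparse, we can refer to the computation code of the relevant `OpKernel`s under TensorFlow's [kernels](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/kernels) module.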
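The point that flattening and element-wise computation leave the stored shape untouched can be illustrated with a small standalone sketch. `MiniTensor`, `FlatView`, `Flatten`, and `AddFlat` are hypothetical names for illustration; the plain loop stands in for the Eigen expression `z.device(place) = x + y` and makes no attempt at device dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal tensor: a shape plus a contiguous buffer.
struct MiniTensor {
  std::vector<int64_t> dims;
  std::vector<float> data;
};

// Non-owning 1-D view over a tensor's buffer, playing the role of the
// EigenVector returned by Flatten.
struct FlatView {
  float* ptr;
  std::size_t size;
};

// Build the view; the tensor's dims are read-only from here on.
inline FlatView Flatten(MiniTensor& t) {
  return {t.data.data(), t.data.size()};
}

// z = x + y over the flattened views; only data is written, never shapes.
inline void AddFlat(FlatView x, FlatView y, FlatView z) {
  assert(x.size == y.size && y.size == z.size);
  for (std::size_t i = 0; i < z.size; ++i) {
    z.ptr[i] = x.ptr[i] + y.ptr[i];
  }
}
```

After `AddFlat` runs on three 2x3 tensors, the output buffer holds the sums while all three `dims` vectors still read `{2, 3}`.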
--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@ -43,6 +43,10 @@ template <>
 AttrType AttrTypeID<std::vector<std::string>>() {
  return STRINGS;
 }
 template <>
 AttrType AttrTypeID<std::vector<std::pair<int, int>>>() {
  return INT_PAIRS;
 }
 Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
  switch (attr_desc.type()) {
@ -76,6 +80,14 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
      }
      return val;
    }
    case paddle::framework::AttrType::INT_PAIRS: {
      std::vector<std::pair<int, int>> val(attr_desc.int_pairs_size());
      for (int i = 0; i < attr_desc.int_pairs_size(); ++i) {
        val[i].first = attr_desc.int_pairs(i).first();
        val[i].second = attr_desc.int_pairs(i).second();
      }
      return val;
    }
  }
  PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");
  return boost::blank();
--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@ -28,7 +28,8 @@ namespace paddle {
 namespace framework {
 typedef boost::variant<boost::blank, int, float, std::string, std::vector<int>,
-                       std::vector<float>, std::vector<std::string>>
+                       std::vector<float>, std::vector<std::string>,
                       std::vector<std::pair<int, int>>>
    Attribute;
 typedef std::unordered_map<std::string, Attribute> AttributeMap;
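The new `INT_PAIRS` alternative can be exercised in isolation with a sketch like the one below. It is written under stated assumptions: `std::variant` (C++17) stands in for `boost::variant`, `IntPairMsg` is a hypothetical stand-in for the generated protobuf `IntPair` message, and `FromIntPairMsgs` is an illustrative helper, not a function from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

using IntPairs = std::vector<std::pair<int, int>>;
// std::variant is used here in place of boost::variant (assumption).
using Attr = std::variant<std::monostate, int, float, std::string, IntPairs>;
using AttrMap = std::unordered_map<std::string, Attr>;

// Hypothetical stand-in for the generated IntPair protobuf message.
struct IntPairMsg {
  int first;
  int second;
};

// Mirrors the INT_PAIRS case of GetAttrValue: copy each message into a
// std::pair and return the vector wrapped in the variant.
inline Attr FromIntPairMsgs(const std::vector<IntPairMsg>& msgs) {
  IntPairs val(msgs.size());
  for (std::size_t i = 0; i < msgs.size(); ++i) {
    val[i].first = msgs[i].first;
    val[i].second = msgs[i].second;
  }
  return val;
}
```

A caller would store the result in an `AttrMap` and read it back with `std::get<IntPairs>`, which throws `std::bad_variant_access` if the attribute holds a different alternative.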
--- a/paddle/framework/ddim.cc
+++ b/paddle/framework/ddim.cc
@ -21,16 +21,16 @@ namespace framework {
 /// @cond HIDDEN
 template <int i>
-Dim<i> make_dim(const int* d) {
+Dim<i> make_dim(const int64_t* d) {
  return Dim<i>(*d, make_dim<i - 1>(d + 1));
 }
 template <>
-Dim<1> make_dim<1>(const int* d) {
+Dim<1> make_dim<1>(const int64_t* d) {
  return Dim<1>(*d);
 }
-void make_ddim(DDim& ddim, const int* dims, int n) {
+void make_ddim(DDim& ddim, const int64_t* dims, int n) {
  switch (n) {
    case 1:
      ddim = make_dim<1>(dims);
@ -67,13 +67,13 @@ void make_ddim(DDim& ddim, const int* dims, int n) {
 /// @endcond
-DDim make_ddim(std::initializer_list<int> dims) {
+DDim make_ddim(std::initializer_list<int64_t> dims) {
  DDim result(make_dim(0));
  make_ddim(result, dims.begin(), dims.size());
  return result;
 }
-DDim make_ddim(const std::vector<int>& dims) {
+DDim make_ddim(const std::vector<int64_t>& dims) {
  DDim result(make_dim(0));
  make_ddim(result, &dims[0], dims.size());
  return result;
@ -81,12 +81,12 @@ DDim make_ddim(const std::vector<int>& dims) {
 /// @cond HIDDEN
 // XXX For some reason, putting this in an anonymous namespace causes errors
-class DynamicMutableIndexer : public boost::static_visitor<int&> {
+class DynamicMutableIndexer : public boost::static_visitor<int64_t&> {
 public:
  explicit DynamicMutableIndexer(int idx) : idx_(idx) {}
  template <int D>
-  int& operator()(Dim<D>& dim) const {
+  int64_t& operator()(Dim<D>& dim) const {
    return dim[idx_];
  }
@ -94,12 +94,12 @@ class DynamicMutableIndexer : public boost::static_visitor<int&> {
  int idx_;
 };
-class DynamicConstIndexer : public boost::static_visitor<int> {
+class DynamicConstIndexer : public boost::static_visitor<int64_t> {
 public:
  explicit DynamicConstIndexer(int idx) : idx_(idx) {}
  template <int D>
-  int operator()(const Dim<D>& dim) const {
+  int64_t operator()(const Dim<D>& dim) const {
    return dim[idx_];
  }
@ -109,22 +109,22 @@ class DynamicConstIndexer : public boost::static_visitor<int> {
 /// @endcond
-int& DDim::operator[](int idx) {
+int64_t& DDim::operator[](int idx) {
  return boost::apply_visitor(DynamicMutableIndexer(idx), var);
 }
-int DDim::operator[](int idx) const {
+int64_t DDim::operator[](int idx) const {
  return boost::apply_visitor(DynamicConstIndexer(idx), var);
 }
-ssize_t DDim::size() const { return arity(*this); }
+int64_t DDim::size() const { return arity(*this); }
 bool DDim::operator==(DDim d) const {
  if (var.which() != d.getVar().which()) {
    return false;
  } else {
-    std::vector<int> v1 = vectorize(*this);
+    std::vector<int64_t> v1 = vectorize(*this);
-    std::vector<int> v2 = vectorize(d);
+    std::vector<int64_t> v2 = vectorize(d);
    for (unsigned int i = 0; i < v1.size(); i++) {
      if (v1[i] != v2[i]) {
@ -139,10 +139,10 @@ bool DDim::operator==(DDim d) const {
 bool DDim::operator!=(DDim d) const { return !(*this == d); }
 DDim DDim::operator+(DDim d) const {
-  std::vector<int> v1 = vectorize(*this);
+  std::vector<int64_t> v1 = vectorize(*this);
-  std::vector<int> v2 = vectorize(d);
+  std::vector<int64_t> v2 = vectorize(d);
-  std::vector<int> v3;
+  std::vector<int64_t> v3;
  assert(v1.size() == v2.size());
@ -154,10 +154,10 @@ DDim DDim::operator+(DDim d) const {
 }
 DDim DDim::operator*(DDim d) const {
-  std::vector<int> v1 = vectorize(*this);
+  std::vector<int64_t> v1 = vectorize(*this);
-  std::vector<int> v2 = vectorize(d);
+  std::vector<int64_t> v2 = vectorize(d);
-  std::vector<int> v3;
+  std::vector<int64_t> v3;
  assert(v1.size() == v2.size());
@ -168,15 +168,15 @@ DDim DDim::operator*(DDim d) const {
  return make_ddim(v3);
 }
-int get(const DDim& ddim, int idx) { return ddim[idx]; }
+int64_t get(const DDim& ddim, int idx) { return ddim[idx]; }
 void set(DDim& ddim, int idx, int value) { ddim[idx] = value; }
 /// @cond HIDDEN
 struct VectorizeVisitor : public boost::static_visitor<> {
-  std::vector<int>& vector;
+  std::vector<int64_t>& vector;
-  explicit VectorizeVisitor(std::vector<int>& v) : vector(v) {}
+  explicit VectorizeVisitor(std::vector<int64_t>& v) : vector(v) {}
  template <typename T>
  void operator()(const T& t) {
@ -188,19 +188,31 @@ struct VectorizeVisitor : public boost::static_visitor<> {
 };
 /// @endcond
-std::vector<int> vectorize(const DDim& ddim) {
+std::vector<int64_t> vectorize(const DDim& ddim) {
-  std::vector<int> result;
+  std::vector<int64_t> result;
  VectorizeVisitor visitor(result);
  boost::apply_visitor(visitor, ddim);
  return result;
 }
 struct ProductVisitor : public boost::static_visitor<int64_t> {
  template <int D>
  int64_t operator()(const Dim<D>& dim) {
    return product(dim);
  }
 };
 int64_t product(const DDim& ddim) {
  ProductVisitor visitor;
  return boost::apply_visitor(visitor, ddim);
 }
 struct SliceVectorizeVisitor : public boost::static_visitor<> {
-  std::vector<int>& vector;
+  std::vector<int64_t>& vector;
  int begin;
  int end;
-  SliceVectorizeVisitor(std::vector<int>& v, int b, int e)
+  SliceVectorizeVisitor(std::vector<int64_t>& v, int b, int e)
      : vector(v), begin(b), end(e) {
    PADDLE_ENFORCE(begin < end,
                   "Begin index must be less than end index in ddim slice.");
@ -228,25 +240,13 @@ struct SliceVectorizeVisitor : public boost::static_visitor<> {
 };
 DDim slice_ddim(const DDim& dim, int begin, int end) {
-  std::vector<int> vec;
+  std::vector<int64_t> vec;
  vec.reserve(end - begin);
  SliceVectorizeVisitor visitor(vec, begin, end);
  boost::apply_visitor(visitor, dim);
  return make_ddim(vec);
 }
 struct ProductVisitor : public boost::static_visitor<ssize_t> {
  template <int D>
  ssize_t operator()(const Dim<D>& dim) {
    return product(dim);
  }
 };
 ssize_t product(const DDim& ddim) {
  ProductVisitor visitor;
  return boost::apply_visitor(visitor, ddim);
 }
 /// \cond HIDDEN
 struct ArityVisitor : boost::static_visitor<int> {
@ -280,7 +280,7 @@ std::ostream& operator<<(std::ostream& os, const DDim& ddim) {
  return os;
 }
-DDim::DDim(std::initializer_list<int> init_list) {
+DDim::DDim(std::initializer_list<int64_t> init_list) {
  *this = make_ddim(init_list);
 }
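The commit does not state why dimensions were widened from `int` to `int64_t`. One plausible reason, offered here as an assumption, is that element counts of large shapes do not fit in 32 bits. The sketch below uses hypothetical helpers `ShapeProduct` and `OverflowsInt32` to show the effect; they are not part of ddim.cc.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Element count of a shape, accumulated in 64 bits.
inline int64_t ShapeProduct(const std::vector<int64_t>& dims) {
  int64_t n = 1;
  for (int64_t d : dims) n *= d;
  return n;
}

// True when the element count would not fit in a 32-bit signed int.
inline bool OverflowsInt32(const std::vector<int64_t>& dims) {
  return ShapeProduct(dims) > std::numeric_limits<int32_t>::max();
}
```

A 65536 x 65536 shape has 2^32 elements, which `ShapeProduct` represents exactly and `OverflowsInt32` flags, whereas a 2 x 3 x 4 shape stays well within 32 bits.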
--- a/paddle/framework/ddim.h
+++ b/paddle/framework/ddim.h
@ -40,7 +40,7 @@ struct DDim {
  template <int D>
  explicit DDim(const Dim<D>& in) : var(in) {}
-  /*implicit*/ DDim(std::initializer_list<int> init_list);
+  /*implicit*/ DDim(std::initializer_list<int64_t> init_list);
  template <int D>
  DDim& operator=(const Dim<D>& in) {
@ -48,8 +48,8 @@ struct DDim {
    return *this;
  }
-  int& operator[](int idx);
+  int64_t& operator[](int idx);
-  int operator[](int idx) const;
+  int64_t operator[](int idx) const;
  template <typename Visitor>
  typename Visitor::result_type apply_visitor(Visitor& visitor) {
@ -71,15 +71,15 @@ struct DDim {
  DDim operator*(DDim d) const;
-  ssize_t size() const;
+  int64_t size() const;
 };
 /**
- * \brief Make a DDim from std::vector<int>
+ * \brief Make a DDim from std::vector<int64_t>
 *
 * \param dims An vector of ints. Must be sized between [1, 9]
 */
-DDim make_ddim(const std::vector<int>& dims);
+DDim make_ddim(const std::vector<int64_t>& dims);
 /**
 * \brief Make a DDim from an initializer list
@ -87,14 +87,14 @@ DDim make_ddim(const std::vector<int>& dims);
 * \param dims An initializer list of ints. Must be sized between [1, 9]
 *
 */
-DDim make_ddim(std::initializer_list<int> dims);
+DDim make_ddim(std::initializer_list<int64_t> dims);
-int get(const DDim& dim, int idx);
+int64_t get(const DDim& dim, int idx);
 void set(DDim& dim, int idx, int val);
-std::vector<int> vectorize(const DDim& ddim);
+std::vector<int64_t> vectorize(const DDim& ddim);
-ssize_t product(const DDim& ddim);
+int64_t product(const DDim& ddim);
 /**
 * \brief Slice a ddim
--- a/paddle/framework/ddim_test.cc
+++ b/paddle/framework/ddim_test.cc
@ -12,7 +12,7 @@ TEST(DDim, Equality) {
  EXPECT_EQ(ddim[2], 5);
  // construct a DDim from a vector
-  std::vector<int> vec({9, 1, 5});
+  std::vector<int64_t> vec({9, 1, 5});
  paddle::framework::DDim vddim = paddle::framework::make_ddim(vec);
  EXPECT_EQ(ddim[0], 9);
  EXPECT_EQ(ddim[1], 1);
@ -25,7 +25,7 @@ TEST(DDim, Equality) {
  EXPECT_EQ(paddle::framework::get(ddim, 0), 6);
  // vectorize a DDim
-  std::vector<int> res_vec = paddle::framework::vectorize(vddim);
+  std::vector<int64_t> res_vec = paddle::framework::vectorize(vddim);
  EXPECT_EQ(res_vec[0], 9);
  EXPECT_EQ(res_vec[1], 1);
  EXPECT_EQ(res_vec[2], 5);
--- a/paddle/framework/dim.h
+++ b/paddle/framework/dim.h
@ -17,13 +17,13 @@ struct Dim {
  static constexpr int dimensions = i;
  template <typename... Args>
-  HOSTDEVICE Dim(int _head, Args... _tail) : head(_head), tail(_tail...) {
+  HOSTDEVICE Dim(int64_t _head, Args... _tail) : head(_head), tail(_tail...) {
    static_assert(sizeof...(_tail) == i - 1,
                  "Dim initialized with the wrong number of parameters");
  }
  HOSTDEVICE
-  Dim(int _head, const Dim<i - 1>& _tail) : head(_head), tail(_tail) {}
+  Dim(int64_t _head, const Dim<i - 1>& _tail) : head(_head), tail(_tail) {}
  HOSTDEVICE
  Dim() : head(0), tail() {}
@ -31,12 +31,12 @@ struct Dim {
  /** Construct a Dim from a linear index and size.  Uses Fortran order
   * indexing. */
  HOSTDEVICE
-  Dim(int idx, const Dim<i>& size)
+  Dim(int64_t idx, const Dim<i>& size)
      : head(idx % size.head), tail(idx / size.head, size.tail) {}
  /** Construct a Dim with each dimension set to the given index */
  HOSTDEVICE
-  Dim(int idx) : head(idx), tail(idx) {}
+  Dim(int64_t idx) : head(idx), tail(idx) {}
  HOSTDEVICE
  bool operator==(const Dim<i>& o) const {
@ -47,13 +47,13 @@ struct Dim {
  bool operator!=(const Dim<i>& o) const { return !(*this == o); }
  HOSTDEVICE
-  int& operator[](int idx);
+  int64_t& operator[](int idx);
  HOSTDEVICE
-  int operator[](int idx) const;
+  int64_t operator[](int idx) const;
  HOST std::string to_string() const;
-  int head;
+  int64_t head;
  Dim<i - 1> tail;
 };
@ -63,7 +63,7 @@ struct Dim<1> {
  static constexpr int dimensions = 1;
  HOSTDEVICE
-  Dim(int _head) : head(_head) {}
+  Dim(int64_t _head) : head(_head) {}
  HOSTDEVICE
  Dim() : head(0) {}
@ -86,11 +86,11 @@ struct Dim<1> {
  bool operator!=(const Dim<1>& o) const { return !(*this == o); }
  HOSTDEVICE
-  int& operator[](int idx);
+  int64_t& operator[](int idx);
  HOSTDEVICE
-  int operator[](int idx) const;
+  int64_t operator[](int idx) const;
-  int head;
+  int64_t head;
 };
 namespace {
@ -100,12 +100,12 @@ template <int i>
 struct DimGetter {
  // Return a copy if Dim is const
  template <typename D>
-  HOSTDEVICE static int impl(const D& d) {
+  HOSTDEVICE static int64_t impl(const D& d) {
    return DimGetter<i - 1>::impl(d.tail);
  }
  // Return a reference if Dim is mutable
  template <typename D>
-  HOSTDEVICE static int& impl(D& d) {
+  HOSTDEVICE static int64_t& impl(D& d) {
    return DimGetter<i - 1>::impl(d.tail);
  }
 };
@ -115,18 +115,18 @@ template <>
 struct DimGetter<0> {
  // Return a copy if Dim is const
  template <typename D>
-  HOSTDEVICE static int impl(const D& d) {
+  HOSTDEVICE static int64_t impl(const D& d) {
    return d.head;
  }
  // Return a reference if Dim is mutable
  template <typename D>
-  HOSTDEVICE static int& impl(D& d) {
+  HOSTDEVICE static int64_t& impl(D& d) {
    return d.head;
  }
 };
 template <int D>
-HOSTDEVICE int& indexer(Dim<D>& dim, int idx) {
+HOSTDEVICE int64_t& indexer(Dim<D>& dim, int idx) {
 #ifndef __CUDA_ARCH__
  if (idx < 0) {
    throw std::invalid_argument("Tried to access a negative dimension");
@ -141,7 +141,7 @@ HOSTDEVICE int& indexer(Dim<D>& dim, int idx) {
 }
 template <>
-HOSTDEVICE int& indexer<1>(Dim<1>& dim, int idx) {
+HOSTDEVICE int64_t& indexer<1>(Dim<1>& dim, int idx) {
 #ifndef __CUDA_ARCH__
  if (idx != 0) {
    throw std::invalid_argument("Invalid index");
@ -153,7 +153,7 @@ HOSTDEVICE int& indexer<1>(Dim<1>& dim, int idx) {
 }
 template <int D>
-HOSTDEVICE int indexer(const Dim<D>& dim, int idx) {
+HOSTDEVICE int64_t indexer(const Dim<D>& dim, int idx) {
 #ifndef __CUDA_ARCH__
  if (idx < 0) {
    throw std::invalid_argument("Tried to access a negative dimension");
@ -168,7 +168,7 @@ HOSTDEVICE int indexer(const Dim<D>& dim, int idx) {
 }
 template <>
-HOSTDEVICE int indexer<1>(const Dim<1>& dim, int idx) {
+HOSTDEVICE int64_t indexer<1>(const Dim<1>& dim, int idx) {
 #ifndef __CUDA_ARCH__
  if (idx != 0) {
    throw std::invalid_argument("Invalid index");
@ -182,73 +182,76 @@ HOSTDEVICE int indexer<1>(const Dim<1>& dim, int idx) {
 }  // namespace
 // Static access to constant Dim
 template <int i, int l>
-HOSTDEVICE int get(const Dim<l>& d) {
+HOSTDEVICE int64_t get(const Dim<l>& d) {
  return DimGetter<i>::impl(d);
 }
 // Static access to mutable Dim
 template <int i, int l>
-HOSTDEVICE int& get(Dim<l>& d) {
+HOSTDEVICE int64_t& get(Dim<l>& d) {
  return DimGetter<i>::impl(d);
 }
 // Dynamic access to constant Dim
 template <int l>
-HOSTDEVICE int Dim<l>::operator[](int i) const {
+HOSTDEVICE int64_t Dim<l>::operator[](int i) const {
  return indexer(*this, i);
 }
 // Dynamic access to mutable Dim
 template <int l>
-HOSTDEVICE int& Dim<l>::operator[](int i) {
+HOSTDEVICE int64_t& Dim<l>::operator[](int i) {
  return indexer(*this, i);
 }
 // Dynamic access to constant Dim
-inline HOSTDEVICE int Dim<1>::operator[](int i) const {
+inline HOSTDEVICE int64_t Dim<1>::operator[](int i) const {
  return indexer(*this, i);
 }
 // Dynamic access to mutable Dim
-inline HOSTDEVICE int& Dim<1>::operator[](int i) { return indexer(*this, i); }
+inline HOSTDEVICE int64_t& Dim<1>::operator[](int i) {
  return indexer(*this, i);
 }
 // Dynamic access to constant Dim
 // without std::enable_if will try to instantiate this on get<0>(d)
 template <int l>
-HOSTDEVICE typename std::enable_if<(l > 0), int>::type get(const Dim<l>& d,
+HOSTDEVICE typename std::enable_if<(l > 0), int64_t>::type get(const Dim<l>& d,
-                                                           int i) {
+                                                               int i) {
  return d[i];
 }
 // Dynamic access to mutable Dim
 template <int l>
-HOSTDEVICE typename std::enable_if<(l > 0), int&>::type get(Dim<l>& d, int i) {
+HOSTDEVICE typename std::enable_if<(l > 0), int64_t&>::type get(Dim<l>& d,
                                                                int i) {
  return d[i];
 }
 // Dot product of two dims
 template <int i>
-HOSTDEVICE int linearize(const Dim<i>& a, const Dim<i>& b) {
+HOSTDEVICE int64_t linearize(const Dim<i>& a, const Dim<i>& b) {
  return a.head * b.head + linearize(a.tail, b.tail);
 }
 // Base case dot product of two Dims
 // Notice it is inline because it is no longer a template
 template <>
-HOSTDEVICE inline int linearize(const Dim<1>& a, const Dim<1>& b) {
+HOSTDEVICE inline int64_t linearize(const Dim<1>& a, const Dim<1>& b) {
  return a.head * b.head;
 }
 // Product of a Dim
 template <int i>
-HOSTDEVICE int product(const Dim<i>& a, int prod = 1) {
+HOSTDEVICE int64_t product(const Dim<i>& a, int prod = 1) {
  return prod * a.head * product(a.tail);
 }
 // Base case product of a Dim
 // Notice it is inline because it is no longer a template
 template <>
-HOSTDEVICE inline int product(const Dim<1>& a, int prod) {
+HOSTDEVICE inline int64_t product(const Dim<1>& a, int prod) {
  return prod * a.head;
 }
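The head/tail recursion that `Dim<i>` relies on can be seen in a trimmed-down form. `MiniDim` and `DimProduct` below are hypothetical names for illustration; they keep only the `int64_t head` plus recursive `tail` layout and the product recursion, and omit `HOSTDEVICE`, indexing, and everything else in dim.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced version of the head/tail layout used by Dim<i>.
template <int N>
struct MiniDim {
  int64_t head;          // extent of the first dimension
  MiniDim<N - 1> tail;   // remaining N - 1 dimensions
};

// Base case of the layout: a single extent and no tail.
template <>
struct MiniDim<1> {
  int64_t head;
};

// Base case of the product: a 1-D shape has `head` elements.
inline int64_t DimProduct(const MiniDim<1>& d) { return d.head; }

// Recursive case: multiply the head by the product of the tail.
template <int N>
int64_t DimProduct(const MiniDim<N>& d) {
  return d.head * DimProduct(d.tail);
}
```

`MiniDim<3> d{2, {3, {4}}}` nests three extents, and `DimProduct(d)` unwinds the recursion at compile-time-known depth to return 24 as an `int64_t`.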
--- a/paddle/framework/dim_test.cu
+++ b/paddle/framework/dim_test.cu
@ -8,7 +8,7 @@ __global__ void test(paddle::framework::Dim<2>* o) {
  o[0] = paddle::framework::make_dim(5, 6);
 }
-__global__ void dyn_idx_gpu(int* o) {
+__global__ void dyn_idx_gpu(int64_t* o) {
  auto d = paddle::framework::make_dim(5, 6);
  o[0] = d[1];
 }
@ -47,9 +47,9 @@ TEST(Dim, Equality) {
  EXPECT_EQ(b[1], 11);
  // dynamic access on GPU
-  thrust::device_vector<int> r(1);
+  thrust::device_vector<int64_t> r(1);
  dyn_idx_gpu<<<1, 1>>>(thrust::raw_pointer_cast(r.data()));
-  int res = r[0];
+  int64_t res = r[0];
  EXPECT_EQ(res, 6);
  // ex_prefix_mul
--- a/paddle/framework/eigen.h
+++ b/paddle/framework/eigen.h
@ -28,7 +28,7 @@ struct EigenDim {
  static Type From(const DDim& dims) {
    PADDLE_ENFORCE(arity(dims) == D, "D must match arity(DDim)");
    Type ret;
-    for (int d = 0; d < arity(dims); d++) {
+    for (int64_t d = 0; d < arity(dims); d++) {
      ret[d] = dims[d];
    }
    return ret;
--- a/paddle/framework/framework.proto
+++ b/paddle/framework/framework.proto
@ -22,8 +22,14 @@ enum AttrType {
  INTS = 3;
  FLOATS = 4;
  STRINGS = 5;
  INT_PAIRS = 6;
 }
 message IntPair {
  required int32 first = 1;
  required int32 second = 2;
 };
 // OpDesc describes an instance of a C++ framework::OperatorBase
 // derived class type.
 message OpDesc {
@ -37,6 +43,7 @@ message OpDesc {
    repeated int32 ints = 6;
    repeated float floats = 7;
    repeated string strings = 8;
    repeated IntPair int_pairs = 9;
  };
  message Var {
--- a/paddle/framework/grad_op_builder_test.cc
+++ b/paddle/framework/grad_op_builder_test.cc
@ -3,7 +3,7 @@
 #include "paddle/framework/op_registry.h"
 #include "paddle/framework/operator.h"
-USE_OP(add_two);
+USE_OP(add);
 namespace paddle {
 namespace framework {
@ -41,7 +41,7 @@ namespace f = paddle::framework;
 TEST(GradOpBuilder, AddTwo) {
  std::shared_ptr<f::OperatorBase> add_op(f::OpRegistry::CreateOp(
-      "add_two", {{"X", {"x"}}, {"Y", {"y"}}}, {{"Out", {"out"}}}, {}));
+      "add", {{"X", {"x"}}, {"Y", {"y"}}}, {{"Out", {"out"}}}, {}));
  std::shared_ptr<f::OperatorBase> grad_add_op =
      f::OpRegistry::CreateGradOp(*add_op);
  EXPECT_EQ(grad_add_op->Inputs().size(), 4UL);
--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@ -174,36 +174,4 @@ TEST(OpRegistry, CustomChecker) {
  op->Run(scope, dev_ctx);
  int test_attr = op->GetAttr<int>("test_attr");
  ASSERT_EQ(test_attr, 4);
-}
+}
 class TestAttrProtoMaker : public pd::OpProtoAndCheckerMaker {
 public:
  TestAttrProtoMaker(pd::OpProto* proto, pd::OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddAttr<float>("scale", "scale of test op");
    AddAttr<float>("scale", "scale of test op");
  }
 };
 TEST(ProtoMaker, DuplicatedAttr) {
  pd::OpProto op_proto;
  pd::OpAttrChecker op_checker;
  auto proto_maker = TestAttrProtoMaker(&op_proto, &op_checker);
  ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
 }
 class TestInOutProtoMaker : public pd::OpProtoAndCheckerMaker {
 public:
  TestInOutProtoMaker(pd::OpProto* proto, pd::OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("input", "input of test op");
    AddInput("input", "input of test op");
  }
 };
 TEST(ProtoMaker, DuplicatedInOut) {
  pd::OpProto op_proto;
  pd::OpAttrChecker op_checker;
  auto proto_maker = TestInOutProtoMaker(&op_proto, &op_checker);
  ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
 }
--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@ -263,4 +263,38 @@ TEST(Operator, Clone) {
  OperatorClone a("ABC", {}, {}, {});
  auto b = a.Clone();
  ASSERT_EQ(a.Type(), b->Type());
 }
 class TestAttrProtoMaker : public paddle::framework::OpProtoAndCheckerMaker {
 public:
  TestAttrProtoMaker(paddle::framework::OpProto* proto,
                     paddle::framework::OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddAttr<float>("scale", "scale of test op");
    AddAttr<float>("scale", "scale of test op");
  }
 };
 TEST(ProtoMaker, DuplicatedAttr) {
  paddle::framework::OpProto op_proto;
  paddle::framework::OpAttrChecker op_checker;
  auto proto_maker = TestAttrProtoMaker(&op_proto, &op_checker);
  ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
 }
 class TestInOutProtoMaker : public paddle::framework::OpProtoAndCheckerMaker {
 public:
  TestInOutProtoMaker(paddle::framework::OpProto* proto,
                      paddle::framework::OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("input", "input of test op");
    AddInput("input", "input of test op");
  }
 };
 TEST(ProtoMaker, DuplicatedInOut) {
  paddle::framework::OpProto op_proto;
  paddle::framework::OpAttrChecker op_checker;
  auto proto_maker = TestInOutProtoMaker(&op_proto, &op_checker);
  ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
 }
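The two tests above only require that `Validate()` throws when a name is registered twice. One way such a check could work is sketched below. `MiniProtoMaker` is a hypothetical class, it throws `std::runtime_error` rather than `paddle::platform::EnforceNotMet`, and treating inputs and attributes as one shared name set is an assumption of this sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical maker that records names and rejects duplicates on Validate().
class MiniProtoMaker {
 public:
  void AddInput(const std::string& name) { names_.push_back(name); }
  void AddAttr(const std::string& name) { names_.push_back(name); }

  // Throws if any name was registered more than once.
  void Validate() const {
    std::unordered_set<std::string> seen;
    for (const auto& n : names_) {
      // insert().second is false when the name is already present.
      if (!seen.insert(n).second) {
        throw std::runtime_error("duplicated name: " + n);
      }
    }
  }

 private:
  std::vector<std::string> names_;  // every registered input/attr name
};
```

Registering `"scale"` twice and then calling `Validate()` throws, which is the behavior the `DuplicatedAttr` and `DuplicatedInOut` tests assert with `ASSERT_THROW`.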
--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@ -58,7 +58,7 @@ inline T* Tensor::mutable_data(platform::Place place) {
                    "Tensor's numel must be larger than zero to call "
                    "Tensor::mutable_data. Call Tensor::set_dim first.");
  /* some versions of boost::variant don't have operator!= */
-  size_t size = product(dims_) * sizeof(T);
+  int64_t size = product(dims_) * sizeof(T);
  if (holder_ == nullptr || !(holder_->place() == place) ||
      holder_->size() < size + offset_) {
    if (platform::is_cpu_place(place)) {
@ -131,7 +131,7 @@ inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
  PADDLE_ENFORCE_LT(begin_idx, end_idx,
                    "Begin index must be less than end index.");
  PADDLE_ENFORCE_NE(dims_[0], 1, "Can not slice a tensor with dims_[0] = 1.");
-  int base = product(dims_) / dims_[0];
+  size_t base = product(dims_) / dims_[0];
  Tensor dst;
  dst.holder_ = holder_;
  DDim dst_dims = dims_;
--- a/paddle/gserver/layers/Conv3DLayer.cpp
+++ b/paddle/gserver/layers/Conv3DLayer.cpp
@ -83,8 +83,8 @@ void Conv3DLayer::forward(PassType passType) {
  int outWidth = getSize();
  resetOutput(batchSize, outWidth);
  REGISTER_TIMER_INFO("FwdConv3D", getName().c_str());
  for (size_t i = 0; i != inputLayers_.size(); ++i) {
    REGISTER_TIMER_INFO("FwdConv3D", getName().c_str());
    const MatrixPtr &inMat = getInputValue(i);
    const MatrixPtr &outMat = getOutputValue();
    int M = M_[i];
@ -120,7 +120,6 @@ void Conv3DLayer::forward(PassType passType) {
    }
  }
  if (nullptr != this->biasParameter_) {
    REGISTER_TIMER_INFO("FwBiasTimer", getName().c_str());
    this->addBias();
  }
  forwardActivation();
@ -134,15 +133,14 @@ void Conv3DLayer::backward(const UpdateCallback &callback) {
    biases_->getParameterPtr()->incUpdate(callback);
  }
  REGISTER_TIMER_INFO("BwdConv3D", getName().c_str());
  for (size_t i = 0; i != inputLayers_.size(); ++i) {
    REGISTER_TIMER_INFO("BwdConv3D", getName().c_str());
    if (weights_[i]->getWGrad()) {
      bpropWeights(i);
    }
    if (getInputGrad(i)) {
      bpropData(i);
    }
    REGISTER_TIMER_INFO("WeightUpdate", getName().c_str());
    weights_[i]->getParameterPtr()->incUpdate(callback);
  }
 }
--- a/paddle/gserver/layers/DeConv3DLayer.cpp
+++ b/paddle/gserver/layers/DeConv3DLayer.cpp
@ -84,8 +84,8 @@ void DeConv3DLayer::forward(PassType passType) {
  resetOutput(batchSize, outWidth);
  const MatrixPtr outMat = getOutputValue();
  REGISTER_TIMER_INFO("FwdDeConv3D", getName().c_str());
  for (size_t i = 0; i != inputLayers_.size(); ++i) {
    REGISTER_TIMER_INFO("FwdDeConv3D", getName().c_str());
    const MatrixPtr &inMat = getInputValue(i);
    int M = M_[i];
    int N = N_[i];
@ -120,7 +120,6 @@ void DeConv3DLayer::forward(PassType passType) {
    }
  }
  if (nullptr != this->biasParameter_) {
    REGISTER_TIMER_INFO("FwBiasTimer", getName().c_str());
    this->addBias();
  }
  forwardActivation();
@ -133,12 +132,12 @@ void DeConv3DLayer::backward(const UpdateCallback &callback) {
    bpropBiases();
    biases_->getParameterPtr()->incUpdate(callback);
  }
  REGISTER_TIMER_INFO("BwdDeConv3D", getName().c_str());
  for (size_t i = 0; i < inputLayers_.size(); ++i) {
    if (weights_[i]->getWGrad() || this->needGradient_) {
      int M = M_[i];
      int N = N_[i];
      int K = K_[i];
      REGISTER_TIMER_INFO("BwdDeConv3D", getName().c_str());
      Matrix::resizeOrCreate(colBuf_, K * groups_[i], N, false, useGpu_);
      const MatrixPtr &inMat = getInputValue(i);
      for (int n = 0; n < batchSize; ++n) {
@ -182,7 +181,6 @@ void DeConv3DLayer::backward(const UpdateCallback &callback) {
          }
        }
      }
      REGISTER_TIMER_INFO("WeightUpdate", getName().c_str());
      weights_[i]->getParameterPtr()->incUpdate(callback);
    }
  }
--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@ -14,27 +14,31 @@ function(op_library TARGET)
    cmake_parse_arguments(op_library "${options}" "${oneValueArgs}"
            "${multiValueArgs}" ${ARGN})
-    foreach(src ${op_library_SRCS})
+    list(LENGTH op_library_SRCS op_library_SRCS_len)
-        if (${src} MATCHES ".*\\.cu$")
+    if (${op_library_SRCS_len} EQUAL 0)
-            list(APPEND cu_srcs ${src})
+        if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cc)
-        elseif(${src} MATCHES ".*\\.cc$")
+            list(APPEND cc_srcs ${TARGET}.cc)
            list(APPEND cc_srcs ${src})
        else()
            message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu")
        endif()
-    endforeach()
+        if (EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${TARGET}.cu)
            list(APPEND cu_srcs ${TARGET}.cu)
        endif()
    else()
        foreach(src ${op_library_SRCS})
            if (${src} MATCHES ".*\\.cu$")
                list(APPEND cu_srcs ${src})
            elseif(${src} MATCHES ".*\\.cc$")
                list(APPEND cc_srcs ${src})
            else()
                message(FATAL_ERROR "${TARGET} Source file ${src} should only be .cc or .cu")
            endif()
        endforeach()
    endif()
    list(LENGTH cc_srcs cc_srcs_len)
    if (${cc_srcs_len} EQUAL 0)
        message(FATAL_ERROR "The op library ${TARGET} should contains at least one .cc file")
    endif()
    list(LENGTH cu_srcs cu_srcs_len)
    list(LENGTH op_library_DEPS dep_len)
    if (${cu_srcs_len} EQUAL 0 AND ${dep_len} EQUAL 0)
        message(WARNING "The op library ${TARGET} does not support GPU!")
    endif()
    if (WITH_GPU)
        nv_library(${TARGET} SRCS ${cc_srcs} ${cu_srcs} DEPS ${op_library_DEPS}
                ${op_common_deps})
@ -46,22 +50,22 @@ endfunction()
 add_subdirectory(math)
-list(REMOVE_ITEM GENERAL_OPS
+set(DEPS_OPS
-     net_op
+    identity_op
-     minus_op
+    minus_op
-     mul_op
+    mul_op
-     recurrent_op
+    recurrent_op
-     scale_op)
+    scale_op)
-
+op_library(identity_op DEPS scale_op)
-op_library(net_op SRCS net_op.cc)
+op_library(minus_op DEPS scale_op)
-op_library(minus_op SRCS minus_op.cc minus_op.cu DEPS scale_op)
+op_library(mul_op DEPS math_function)
 op_library(mul_op SRCS mul_op.cc mul_op.cu DEPS math_function)
 op_library(recurrent_op SRCS recurrent_op.cc rnn/recurrent_op_utils.cc 
  DEPS framework_proto tensor operator net_op)
-op_library(scale_op SRCS scale_op.cc scale_op.cu DEPS net_op)
+op_library(scale_op DEPS net_op)
 list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS})
 foreach(src ${GENERAL_OPS})
-    op_library(${src} SRCS ${src}.cc ${src}.cu)
+    op_library(${src})
 endforeach()
 set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library")
--- a/paddle/operators/add_op.cc
+++ b/paddle/operators/add_op.cc
@ -57,7 +57,6 @@ class AddOpGrad : public framework::OperatorWithKernel {
 }  // namespace paddle
 namespace ops = paddle::operators;
-REGISTER_OP(add_two, ops::AddOp, ops::AddOpMaker, add_two_grad, ops::AddOpGrad);
-REGISTER_OP_CPU_KERNEL(add_two,
-                       ops::AddKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP(add, ops::AddOp, ops::AddOpMaker, add_grad, ops::AddOpGrad);
+REGISTER_OP_CPU_KERNEL(add, ops::AddKernel<paddle::platform::CPUPlace, float>);
--- a/paddle/operators/add_op.cu
+++ b/paddle/operators/add_op.cu
@ -12,10 +12,7 @@
   See the License for the specific language governing permissions and
   limitations under the License. */
 #define EIGEN_USE_GPU
 #include "paddle/framework/op_registry.h"
 #include "paddle/operators/add_op.h"
 namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(add_two,
-                       ops::AddKernel<paddle::platform::GPUPlace, float>);
+REGISTER_OP_GPU_KERNEL(add, ops::AddKernel<paddle::platform::GPUPlace, float>);
--- a/paddle/operators/cos_sim_op.cc
+++ b/paddle/operators/cos_sim_op.cc
@ -0,0 +1,107 @@
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
 #include "paddle/operators/cos_sim_op.h"
 namespace paddle {
 namespace operators {
 using framework::Tensor;
 class CosSimOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) must not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"), "Input(Y) must not be null.");
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
                      ctx.Input<Tensor>("Y")->dims(),
                      "Dimensions of Input(X) and Input(Y) must be the same.");
    auto dims = ctx.Input<Tensor>("X")->dims();
    ctx.Output<Tensor>("Out")->Resize({dims[0], 1});
    ctx.Output<Tensor>("XNorm")->Resize({dims[0], 1});
    ctx.Output<Tensor>("YNorm")->Resize({dims[0], 1});
  }
 };
 class CosSimOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  CosSimOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "The first input of cos_sim op.");
    AddInput("Y", "The second input of cos_sim op.");
    AddOutput("Out", "The output of cos_sim op.");
    AddOutput("XNorm", "Row norm of the first input.").AsIntermediate();
    AddOutput("YNorm", "Row norm of the second input.").AsIntermediate();
    AddComment(R"DOC(
 Cosine Similarity Operator.
 The equation is: Out = X^T * Y / (sqrt(X^T * X) * sqrt(Y^T * Y))
 )DOC");
  }
 };
 class CosSimOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) must not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"), "Input(Y) must not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("XNorm"),
                            "Input(XNorm) must not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("YNorm"),
                            "Input(YNorm) must not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar(framework::GradVarName("Out")),
                            "Input(Out@GRAD) must not be null.");
    auto x_dims = ctx.Input<Tensor>("X")->dims();
    auto y_dims = ctx.Input<Tensor>("Y")->dims();
    auto xnorm_dims = ctx.Input<Tensor>("XNorm")->dims();
    auto ynorm_dims = ctx.Input<Tensor>("YNorm")->dims();
    auto out_dims = ctx.Input<Tensor>(framework::GradVarName("Out"))->dims();
    PADDLE_ENFORCE_EQ(x_dims, y_dims,
                      "Dimensions of Input(X) and Input(Y) must be the same.");
    PADDLE_ENFORCE_EQ(xnorm_dims[0], x_dims[0],
                      "1st dimension of XNorm must equal that of Input(X).");
    PADDLE_ENFORCE_EQ(xnorm_dims[1], 1, "2nd dimension of XNorm must be one.");
    PADDLE_ENFORCE_EQ(ynorm_dims[0], y_dims[0],
                      "1st dimension of YNorm must equal that of Input(Y).");
    PADDLE_ENFORCE_EQ(ynorm_dims[1], 1, "2nd dimension of YNorm must be one.");
    PADDLE_ENFORCE_EQ(out_dims[0], x_dims[0],
                      "1st dimension of Out@GRAD must equal that of Input(X)");
    PADDLE_ENFORCE_EQ(out_dims[1], 1, "2nd dimension of Out@GRAD must be one.");
    auto *x_grad = ctx.Output<Tensor>(framework::GradVarName("X"));
    auto *y_grad = ctx.Output<Tensor>(framework::GradVarName("Y"));
    if (x_grad) x_grad->Resize(x_dims);
    if (y_grad) y_grad->Resize(y_dims);
  }
 };
 }  // namespace operators
 }  // namespace paddle
 namespace ops = paddle::operators;
 REGISTER_OP(cos_sim, ops::CosSimOp, ops::CosSimOpMaker, cos_sim_grad,
            ops::CosSimOpGrad);
 REGISTER_OP_CPU_KERNEL(cos_sim,
                       ops::CosSimKernel<paddle::platform::CPUPlace, float>);
 REGISTER_OP_CPU_KERNEL(
    cos_sim_grad, ops::CosSimGradKernel<paddle::platform::CPUPlace, float>);
--- a/paddle/operators/cos_sim_op.cu
+++ b/paddle/operators/cos_sim_op.cu
@ -13,8 +13,10 @@
   limitations under the License. */
 #define EIGEN_USE_GPU
-#include "paddle/operators/gather_op.h"
+#include "paddle/operators/cos_sim_op.h"
 namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(gather,
+REGISTER_OP_GPU_KERNEL(cos_sim,
-                       ops::GatherOpKernel<paddle::platform::GPUPlace, float>);
+                       ops::CosSimKernel<paddle::platform::GPUPlace, float>);
 REGISTER_OP_GPU_KERNEL(
    cos_sim_grad, ops::CosSimGradKernel<paddle::platform::GPUPlace, float>);
--- a/paddle/operators/cos_sim_op.h
+++ b/paddle/operators/cos_sim_op.h
@ -0,0 +1,107 @@
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
 #pragma once
 #include "paddle/framework/eigen.h"
 #include "paddle/framework/op_registry.h"
 namespace paddle {
 namespace operators {
 using Tensor = framework::Tensor;
 template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
 template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
 template <typename Place, typename T>
 class CosSimKernel : public framework::OpKernel {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* input_x = context.Input<Tensor>("X");
    auto* input_y = context.Input<Tensor>("Y");
    auto* output_z = context.Output<Tensor>("Out");
    auto* output_x_norm = context.Output<Tensor>("XNorm");
    auto* output_y_norm = context.Output<Tensor>("YNorm");
    output_z->mutable_data<T>(context.GetPlace());
    output_x_norm->mutable_data<T>(context.GetPlace());
    output_y_norm->mutable_data<T>(context.GetPlace());
    auto dims = input_x->dims();
    int size = static_cast<int>(framework::product(dims));
    auto new_dims = framework::make_ddim({dims[0], size / dims[0]});
    auto x = EigenMatrix<T>::From(*input_x, new_dims);
    auto y = EigenMatrix<T>::From(*input_y, new_dims);
    auto z = EigenVector<T>::Flatten(*output_z);
    auto x_norm = EigenVector<T>::Flatten(*output_x_norm);
    auto y_norm = EigenVector<T>::Flatten(*output_y_norm);
    auto place = context.GetEigenDevice<Place>();
    auto xy = (x * y).sum(Eigen::array<int, 1>({{1}}));
    x_norm.device(place) = x.square().sum(Eigen::array<int, 1>({{1}})).sqrt();
    y_norm.device(place) = y.square().sum(Eigen::array<int, 1>({{1}})).sqrt();
    z.device(place) = xy / x_norm / y_norm;
  }
 };
 template <typename Place, typename T>
 class CosSimGradKernel : public framework::OpKernel {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* input_x = context.Input<Tensor>("X");
    auto* input_y = context.Input<Tensor>("Y");
    auto* input_z = context.Input<Tensor>("Out");
    auto* input_x_norm = context.Input<Tensor>("XNorm");
    auto* input_y_norm = context.Input<Tensor>("YNorm");
    auto* output_grad_x = context.Output<Tensor>(framework::GradVarName("X"));
    auto* output_grad_y = context.Output<Tensor>(framework::GradVarName("Y"));
    auto* input_grad_z = context.Input<Tensor>(framework::GradVarName("Out"));
    auto dims = input_x->dims();
    int size = static_cast<int>(framework::product(dims));
    auto new_dims = framework::make_ddim({dims[0], size / dims[0]});
    auto x = EigenMatrix<T>::From(*input_x, new_dims);
    auto y = EigenMatrix<T>::From(*input_y, new_dims);
    auto z = EigenMatrix<T>::From(*input_z);
    auto x_norm = EigenMatrix<T>::From(*input_x_norm);
    auto y_norm = EigenMatrix<T>::From(*input_y_norm);
    auto dz = EigenMatrix<T>::From(*input_grad_z);
    Eigen::DSizes<int, 2> bcast(1, new_dims[1]);
    auto z_bcast = z.broadcast(bcast);
    auto dz_bcast = dz.broadcast(bcast);
    auto place = context.GetEigenDevice<Place>();
    auto x_snorm_bcast = x_norm.square().eval().broadcast(bcast);
    auto y_snorm_bcast = y_norm.square().eval().broadcast(bcast);
    auto norm_prod_bcast = (x_norm * y_norm).eval().broadcast(bcast);
    if (output_grad_x) {
      output_grad_x->mutable_data<T>(context.GetPlace());
      auto dx = EigenMatrix<T>::From(*output_grad_x, new_dims);
      dx.device(place) =
          dz_bcast * (y / norm_prod_bcast - z_bcast * x / x_snorm_bcast);
    }
    if (output_grad_y) {
      output_grad_y->mutable_data<T>(context.GetPlace());
      auto dy = EigenMatrix<T>::From(*output_grad_y, new_dims);
      dy.device(place) =
          dz_bcast * (x / norm_prod_bcast - z_bcast * y / y_snorm_bcast);
    }
  }
 };
 }  // namespace operators
 }  // namespace paddle
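
The Eigen expressions in `CosSimKernel` and `CosSimGradKernel` above can be harder to follow than the math they encode. The block below is a minimal plain-C++ sketch of the same row-wise formulas, assuming row-major `[rows, cols]` buffers. The names `CosSimResult`, `CosSimForward` and `CosSimGradX` are hypothetical and are not part of Paddle; the sketch ignores devices, `Tensor`, and zero-norm rows.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three forward outputs (Out, XNorm, YNorm).
struct CosSimResult {
  std::vector<float> out;     // cosine similarity, one value per row
  std::vector<float> x_norm;  // L2 norm of each row of x
  std::vector<float> y_norm;  // L2 norm of each row of y
};

// Forward pass: out[i] = <x_i, y_i> / (|x_i| * |y_i|) for each row i.
// Assumes no row has zero norm (the sketch does not guard against it).
CosSimResult CosSimForward(const std::vector<float>& x,
                           const std::vector<float>& y,
                           std::size_t rows, std::size_t cols) {
  CosSimResult r{std::vector<float>(rows), std::vector<float>(rows),
                 std::vector<float>(rows)};
  for (std::size_t i = 0; i < rows; ++i) {
    float xy = 0.f, xx = 0.f, yy = 0.f;
    for (std::size_t j = 0; j < cols; ++j) {
      const float a = x[i * cols + j];
      const float b = y[i * cols + j];
      xy += a * b;
      xx += a * a;
      yy += b * b;
    }
    r.x_norm[i] = std::sqrt(xx);
    r.y_norm[i] = std::sqrt(yy);
    r.out[i] = xy / r.x_norm[i] / r.y_norm[i];
  }
  return r;
}

// Backward pass with respect to x, mirroring the expression in the diff:
//   dx = dz * (y / (|x| * |y|) - z * x / |x|^2)
// where z is the forward output and dz the incoming gradient (one per row).
std::vector<float> CosSimGradX(const std::vector<float>& x,
                               const std::vector<float>& y,
                               const std::vector<float>& z,
                               const std::vector<float>& x_norm,
                               const std::vector<float>& y_norm,
                               const std::vector<float>& dz,
                               std::size_t rows, std::size_t cols) {
  std::vector<float> dx(rows * cols);
  for (std::size_t i = 0; i < rows; ++i) {
    const float norm_prod = x_norm[i] * y_norm[i];
    const float x_snorm = x_norm[i] * x_norm[i];
    for (std::size_t j = 0; j < cols; ++j) {
      const std::size_t k = i * cols + j;
      dx[k] = dz[i] * (y[k] / norm_prod - z[i] * x[k] / x_snorm);
    }
  }
  return dx;
}
```

Because the formula is symmetric, the gradient with respect to `y` in this sketch can be obtained by swapping the roles of the inputs and norms: `CosSimGradX(y, x, z, y_norm, x_norm, dz, rows, cols)`.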
--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@ -31,8 +31,8 @@ class CPUGaussianRandomKernel : public framework::OpKernel {
    }
    engine.seed(seed);
    std::normal_distribution<T> dist(mean, std);
-    ssize_t size = framework::product(tensor->dims());
+    int64_t size = framework::product(tensor->dims());
-    for (ssize_t i = 0; i < size; ++i) {
+    for (int64_t i = 0; i < size; ++i) {
      data[i] = dist(engine);
    }
  }
@ -46,9 +46,14 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
  void InferShape(const framework::InferShapeContext& context) const override {
    auto* tensor = context.Output<framework::Tensor>("Out");
    auto dims = GetAttr<std::vector<int>>("dims");
    std::vector<int64_t> temp;
    temp.reserve(dims.size());
    for (auto dim : dims) {
      temp.push_back(static_cast<int64_t>(dim));
    }
    PADDLE_ENFORCE(dims.size() > 0UL,
                   "dims can be one int or array. dims must be set.");
-    tensor->Resize(framework::make_ddim(dims));
+    tensor->Resize(framework::make_ddim(temp));
  }
 };
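
The hunks above switch the element count to `int64_t` and widen the `dims` attribute from `std::vector<int>` to `std::vector<int64_t>` before calling `make_ddim`. The commit does not state the motivation; one plausible reading is that the product of dimensions can exceed the 32-bit range even when each dimension fits in an `int`. The sketch below illustrates that pattern in isolation. `WidenDims` and `NumElements` are hypothetical helpers, with `NumElements` standing in for `framework::product`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: copy int dimensions into a 64-bit vector,
// the same widening loop the diff performs before make_ddim.
std::vector<std::int64_t> WidenDims(const std::vector<int>& dims) {
  std::vector<std::int64_t> wide;
  wide.reserve(dims.size());
  for (int d : dims) {
    wide.push_back(static_cast<std::int64_t>(d));
  }
  return wide;
}

// Hypothetical stand-in for framework::product: multiply all dimensions
// using 64-bit arithmetic so large shapes do not overflow a 32-bit int.
std::int64_t NumElements(const std::vector<std::int64_t>& dims) {
  std::int64_t n = 1;
  for (std::int64_t d : dims) {
    n *= d;
  }
  return n;
}
```

For example, a shape of `{65536, 65536}` has 2^32 elements, which does not fit in a 32-bit signed integer but is represented exactly by `NumElements(WidenDims({65536, 65536}))`.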
--- a/paddle/operators/identity_op.cc
+++ b/paddle/operators/identity_op.cc
@ -0,0 +1,54 @@
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
 #include "paddle/operators/net_op.h"
 #include "paddle/operators/scale_op.h"
 namespace paddle {
 namespace operators {
 // identity is an alias of scale op. This is also an example of creating an
 // alias operator.
 template <typename AttrType>
 class IdentityOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  IdentityOpMaker(framework::OpProto *proto,
                  framework::OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "input tensor of identity op");
    AddOutput("Out", "output tensor of identity op");
    AddComment("identity operator. Just an alias of scale op with scale = 1.0");
  }
 };
 template <typename AttrType>
 class IdentityOp : public NetOp {
 public:
  IdentityOp(const std::string &type, const framework::VariableNameMap &inputs,
             const framework::VariableNameMap &outputs,
             const framework::AttributeMap &attrs)
      : NetOp(type, inputs, outputs, attrs) {
    AppendOp(framework::OpRegistry::CreateOp(
        "scale", {{"X", {Input("X")}}}, {{"Out", {Output("Out")}}},
        {{"scale", static_cast<AttrType>(1)}}));
  }
 };
 }  // namespace operators
 }  // namespace paddle
 namespace ops = paddle::operators;
 REGISTER_OP_WITHOUT_GRADIENT(identity, ops::IdentityOp<float>,
                             ops::IdentityOpMaker<float>);