From 9d569c5a38582cbf9022578c046f89a88697c493 Mon Sep 17 00:00:00 2001
From: fengjiayi
Date: Thu, 3 Aug 2017 17:57:00 -0700
Subject: [PATCH 01/64] Update Backward.md

Add the "Backward Operator Registry" section

---
 paddle/framework/backward.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md
index 74c001b06a..61f308b469 100644
--- a/paddle/framework/backward.md
+++ b/paddle/framework/backward.md
@@ -1,8 +1,28 @@
-## Operator/expression 's Backward
+# Operator/expression's Backward
 
-### Motivation
+## Motivation
 
 In a neural network, the backpropagation algorithm follows the chain rule, so we need to compound the fundamental gradient operators/expressions together according to the chain rule. Every forward network needs a backward network to construct the full computation lineage; the operator/expression's Backward feature generates the backward pass with respect to the forward pass.
+
+## Backward Operator Registry
+
+A backward network is built up with several backward operators. Backward operators take the forward operators' inputs, outputs and output gradients, and then calculate the input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use a registry mechanism to save these correspondences, which is quite similar to the operator registry itself.
+
+For example, we have an `add_two_op`, which is registered by the following code:
+
+```cpp
+REGISTER_OP(add_two, AddTwoOp, AddTwoOpMaker);
+```
+
+`add_two` is the operator's type. `AddTwoOp` and `AddTwoOpMaker` are the operator class and the operator maker class respectively.
+
+Assume that we also have the backward operator of `add_two_op`, which calculates the gradients of `add_two_op`'s inputs. Then we register it in the following way:
+
+```cpp
+REGISTER_GRADIENT_OP(add_two, add_two_grad, AddTwoGradOp);
+```
+
+`add_two_grad` is the type of the backward operator, and `AddTwoGradOp` is its class name.
 
 ### Implement : gradient operator registry
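The registry described above amounts to a global table that maps a forward operator type to a factory for its gradient operator; `REGISTER_GRADIENT_OP(add_two, add_two_grad, AddTwoGradOp)` fills in one such entry. Below is a minimal sketch of that idea, using hypothetical names (`GradOpRegistry`, `OpCreator`) rather than the real macro internals:

```cpp
#include <functional>
#include <map>
#include <string>

class OperatorBase;  // forward declaration; only pointers are used here
using OpCreator = std::function<OperatorBase*()>;

// Global hash map from a forward op type ("add_two") to a creator for its
// gradient operator (AddTwoGradOp). REGISTER_GRADIENT_OP would expand to a
// static object whose constructor performs one Register() call.
class GradOpRegistry {
public:
  static std::map<std::string, OpCreator>& Creators() {
    static std::map<std::string, OpCreator> creators;
    return creators;
  }
  static void Register(const std::string& fwd_type, OpCreator creator) {
    Creators()[fwd_type] = std::move(creator);
  }
};
```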
From 7304006b7121c844d071227a6c2d24245a06e32e Mon Sep 17 00:00:00 2001
From: fengjiayi
Date: Tue, 8 Aug 2017 16:38:27 -0700
Subject: [PATCH 02/64] Update backward.md

---
 paddle/framework/backward.md | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md
index 61f308b469..c717c2f30b 100644
--- a/paddle/framework/backward.md
+++ b/paddle/framework/backward.md
@@ -24,20 +24,31 @@ REGISTER_GRADIENT_OP(add_two, add_two_grad, AddTwoGradOp);
 
 `add_two_grad` is the type of the backward operator, and `AddTwoGradOp` is its class name.
 
-### Implement : gradient operator registry
+## Backward Operator Creating
 
-|                        | forward operator | backward operator                |
-| ---------------------- | ---------------- | -------------------------------- |
-| **Operator::inputs_**  | Inputs           | Inputs, Outputs, OutputGradients |
-| **Operator::outputs_** | Outputs          | InputGradients                   |
+### Usage
 
-Inputs/Outputs means the input/output of the operator, InputGradients/OutputGradients is the gradient respect to forward opeartor. Forward operator and Backward operator are isomorphic, save their corresponding needs into member attribute.
+Given a certain forward operator, we can get its corresponding backward operator by calling:
 
-We use a global hash map record the gradient operators available, follow the philosophy of minimum core, make operator pluggable unit. Each gradient is an operator and it needs to regist itself.
+```cpp
+OperatorBase* bwd_op = BuildGradOp(fwd_op);  // fwd_op is a const OperatorBase*
+```
 
+The function `BuildGradOp` will sequentially execute the following processes:
+
+1. Get the `type_` of the given forward operator, and then create the corresponding backward operator.
+
+2. Copy all the attributes of the forward operator except `input_format` and `output_format` (if present), for their elements differ between forward and backward operators.
+
+3. Copy the forward operator's `inputs_` and `outputs_` to the backward operator's `inputs_`, add the forward inputs' gradient variables into the backward `outputs_`, and add the forward outputs' gradient variables into the backward `inputs_`.
+
+4. Build the backward operator's `input_format`, `output_format` (if necessary) and `in_out_idxs_` according to its `inputs_` and `outputs_` just created.
+
+## Backward Network Building
 
-grad_op_builder(fengjiayi)
+A backward network is a series of backward operators. The main idea of building a backward network is to create the backward operators in an inverted sequence and put them together.
 
-### Implement : Backward network
+In our design, the network itself is also a kind of operator, so the operators contained in a big network may themselves be small networks. Given a forward network, it generates the backward network. We only care about the gradients—`OutputGradients` and `InputGradients`.

From e6db484d154c041c1cf6650743bcf27dd2549b77 Mon Sep 17 00:00:00 2001
From: Luo Tao
Date: Mon, 14 Aug 2017 15:51:00 +0800
Subject: [PATCH 03/64] make clear that current huber_cost is for two-classification

---
 paddle/gserver/layers/CostLayer.cpp           | 29 ++++++++++---------
 paddle/gserver/layers/CostLayer.h             | 18 +++++-------
 paddle/gserver/tests/test_LayerGrad.cpp       |  2 +-
 python/paddle/trainer/config_parser.py        |  2 +-
 .../paddle/trainer_config_helpers/layers.py   | 27 ++++++++++++-----
 .../protostr/test_cost_layers.protostr        | 10 +++----
 .../tests/configs/test_cost_layers.py         |  2 +-
 7 files changed, 50 insertions(+), 40 deletions(-)

diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp
index 6bfdea3c6e..138c86a6d6 100644
--- a/paddle/gserver/layers/CostLayer.cpp
+++ b/paddle/gserver/layers/CostLayer.cpp
@@ -575,10 +575,10 @@ void MultiBinaryLabelCrossEntropy::backwardImp(Matrix& output,
   }
 }
 
 //
 // Huber loss for robust 2-classes classification
 //
-REGISTER_LAYER(huber, HuberTwoClass);
+REGISTER_LAYER(huber, HuberTwoClassification);
 
-bool HuberTwoClass::init(const LayerMap& layerMap,
-                         const ParameterMap& parameterMap) {
+bool HuberTwoClassification::init(const LayerMap& layerMap,
+                                  const ParameterMap& parameterMap) {
   CostLayer::init(layerMap, parameterMap);
   if (useGpu_) {
     tmpCpuInput_.reserve(inputLayers_.size());
@@ -589,7 +589,9 @@ bool HuberTwoClass::init(const LayerMap& layerMap,
   return true;
 }
 
-void HuberTwoClass::forwardImp(Matrix& output, Argument& label, Matrix& cost) {
+void HuberTwoClassification::forwardImp(Matrix& output,
+                                        Argument& label,
+                                        Matrix& cost) {
   if (useGpu_) {
     for (size_t i = 0; i < inputLayers_.size(); i++) {
       tmpCpuInput_[i].resizeAndCopyFrom(
@@ -600,10 +602,11 @@ void HuberTwoClass::forwardImp(Matrix& output, Argument& label, Matrix& cost) {
   forwardImpIn(output, label, cost);
 }
 
-void HuberTwoClass::forwardImpIn(Matrix& output,
-                                 Argument& label,
-                                 Matrix& target) {
+void HuberTwoClassification::forwardImpIn(Matrix& output,
+                                          Argument& label,
+                                          Matrix& target) {
   size_t numSamples = target.getHeight();
+  CHECK(label.ids);
   CHECK_EQ((*label.ids).getSize(),
numSamples); CHECK_EQ(output.getHeight(), numSamples); CHECK_EQ(output.getWidth(), (size_t)1); @@ -624,9 +627,9 @@ void HuberTwoClass::forwardImpIn(Matrix& output, target.copyFrom(cost.data(), numSamples); } -void HuberTwoClass::backwardImp(Matrix& outputValue, - Argument& label, - Matrix& outputGrad) { +void HuberTwoClassification::backwardImp(Matrix& outputValue, + Argument& label, + Matrix& outputGrad) { if (useGpu_) { backwardImpIn( *tmpCpuInput_[0].value, tmpCpuInput_[1], *tmpCpuInput_[0].grad); @@ -636,9 +639,9 @@ void HuberTwoClass::backwardImp(Matrix& outputValue, } } -void HuberTwoClass::backwardImpIn(Matrix& output, - Argument& label, - Matrix& outputG) { +void HuberTwoClassification::backwardImpIn(Matrix& output, + Argument& label, + Matrix& outputG) { size_t numSamples = output.getHeight(); real* out = output.getData(); real* grad = outputG.getData(); diff --git a/paddle/gserver/layers/CostLayer.h b/paddle/gserver/layers/CostLayer.h index 14c0b33ec1..77427b7a08 100644 --- a/paddle/gserver/layers/CostLayer.h +++ b/paddle/gserver/layers/CostLayer.h @@ -307,21 +307,17 @@ public: /** * Huber loss for robust 2-classes classification. * - * For label={0, 1}, let y=2*label-1. Given output f, the loss is: - * \f[ - * Loss = - * \left\{\begin{matrix} - * 4 * y * f & \textit{if} \ \ y* f < -1 \\ - * (1 - y * f)^2 & \textit{if} \ \ -1 < y * f < 1 \\ - * 0 & \textit{otherwise} - * \end{matrix}\right. - * \f] + * For label={0, 1}, let y=2*label-1. Given output f(x), the loss is: + * Loss = 4 * y * f, if y* f < -1 \\ + * Loss = (1 - y * f)^2, if -1 < y * f < 1 \\ + * Loss = 0, otherwise */ -class HuberTwoClass : public CostLayer { +class HuberTwoClassification : public CostLayer { std::vector tmpCpuInput_; public: - explicit HuberTwoClass(const LayerConfig& config) : CostLayer(config) {} + explicit HuberTwoClassification(const LayerConfig& config) + : CostLayer(config) {} bool init(const LayerMap& layerMap, const ParameterMap& parameterMap) override; diff --git a/paddle/gserver/tests/test_LayerGrad.cpp b/paddle/gserver/tests/test_LayerGrad.cpp index 0f312b6ca5..6d60250f6d 100644 --- a/paddle/gserver/tests/test_LayerGrad.cpp +++ b/paddle/gserver/tests/test_LayerGrad.cpp @@ -830,7 +830,7 @@ TEST(Layer, square_error_weighted) { TEST(Layer, huber_two_class) { TestConfig config; - config.layerConfig.set_type("huber"); + config.layerConfig.set_type("huber_classification"); config.biasSize = 0; config.inputDefs.push_back({INPUT_DATA, "layer_0", 1, 0}); diff --git a/python/paddle/trainer/config_parser.py b/python/paddle/trainer/config_parser.py index da99e5bd53..248da9417f 100644 --- a/python/paddle/trainer/config_parser.py +++ b/python/paddle/trainer/config_parser.py @@ -2255,7 +2255,7 @@ define_cost('PnpairValidation', 'pnpair-validation') define_cost('SumOfSquaresCostLayer', 'square_error') define_cost('MultiBinaryLabelCrossEntropy', 'multi_binary_label_cross_entropy') define_cost('SoftBinaryClassCrossEntropy', 'soft_binary_class_cross_entropy') -define_cost('HuberTwoClass', 'huber') +define_cost('HuberTwoClassification', 'huber_classification') define_cost('SumCost', 'sum_cost') define_cost('SmoothL1Cost', 'smooth_l1') diff --git a/python/paddle/trainer_config_helpers/layers.py b/python/paddle/trainer_config_helpers/layers.py index 1bc55c8696..20d96efe15 100755 --- a/python/paddle/trainer_config_helpers/layers.py +++ b/python/paddle/trainer_config_helpers/layers.py @@ -108,7 +108,7 @@ __all__ = [ 'sum_cost', 'rank_cost', 'lambda_cost', - 'huber_cost', + 'huber_classification_cost', 
'block_expand_layer', 'maxout_layer', 'out_prod_layer', @@ -216,7 +216,7 @@ class LayerType(object): RANK_COST = 'rank-cost' LAMBDA_COST = 'lambda_cost' - HUBER = 'huber' + HUBER_CLASSIFICATION = 'huber_classification' CROSS_ENTROPY = 'multi-class-cross-entropy' CROSS_ENTROPY_WITH_SELFNORM = 'multi_class_cross_entropy_with_selfnorm' SOFT_BIN_CLASS_CROSS_ENTROPY = 'soft_binary_class_cross_entropy' @@ -5605,16 +5605,26 @@ def sum_cost(input, name=None, layer_attr=None): @wrap_name_default() @layer_support() -def huber_cost(input, label, name=None, coeff=1.0, layer_attr=None): +def huber_classification_cost(input, + label, + name=None, + coeff=1.0, + layer_attr=None): """ - A loss layer for huber loss. + For classification purposes, a variant of the Huber loss called modified Huber + is sometimes used. Given a prediction f(x) (a real-valued classifier score) and + a true binary class label :math:`y\in \left \{-1, 1 \right \}`, the modified Huber + loss is defined as: + + .. math: + loss = \max \left ( 0, 1-yf(x) \right )^2, yf(x)\geq 1 + loss = -4yf(x), \text{otherwise} The example usage is: .. code-block:: python - cost = huber_cost(input=input_layer, - label=label_layer) + cost = huber_classification_cost(input=input_layer, label=label_layer) :param input: The first input layer. :type input: LayerOutput. @@ -5634,11 +5644,12 @@ def huber_cost(input, label, name=None, coeff=1.0, layer_attr=None): assert input.size == 1 Layer( name=name, - type=LayerType.HUBER, + type=LayerType.HUBER_CLASSIFICATION, inputs=[input.name, label.name], coeff=coeff, **ExtraLayerAttribute.to_kwargs(layer_attr)) - return LayerOutput(name, LayerType.HUBER, parents=[input, label], size=1) + return LayerOutput( + name, LayerType.HUBER_CLASSIFICATION, parents=[input, label], size=1) @wrap_name_default() diff --git a/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr b/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr index 05847344be..a64e5ea0dd 100644 --- a/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr +++ b/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr @@ -180,8 +180,8 @@ layers { active_type: "" } layers { - name: "__huber_cost_0__" - type: "huber" + name: "__huber_classification_cost_0__" + type: "huber_classification" size: 1 active_type: "" inputs { @@ -300,7 +300,7 @@ output_layer_names: "__rank_cost_0__" output_layer_names: "__lambda_cost_0__" output_layer_names: "__cross_entropy_0__" output_layer_names: "__cross_entropy_with_selfnorm_0__" -output_layer_names: "__huber_cost_0__" +output_layer_names: "__huber_classification_cost_0__" output_layer_names: "__multi_binary_label_cross_entropy_0__" output_layer_names: "__sum_cost_0__" output_layer_names: "__nce_layer_0__" @@ -326,7 +326,7 @@ sub_models { layer_names: "__cross_entropy_with_selfnorm_0__" layer_names: "huber_probs" layer_names: "huber_label" - layer_names: "__huber_cost_0__" + layer_names: "__huber_classification_cost_0__" layer_names: "__multi_binary_label_cross_entropy_0__" layer_names: "__sum_cost_0__" layer_names: "__nce_layer_0__" @@ -349,7 +349,7 @@ sub_models { output_layer_names: "__lambda_cost_0__" output_layer_names: "__cross_entropy_0__" output_layer_names: "__cross_entropy_with_selfnorm_0__" - output_layer_names: "__huber_cost_0__" + output_layer_names: "__huber_classification_cost_0__" output_layer_names: "__multi_binary_label_cross_entropy_0__" output_layer_names: "__sum_cost_0__" 
output_layer_names: "__nce_layer_0__" diff --git a/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py b/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py index d2a3b702a1..98bf026d60 100644 --- a/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py +++ b/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py @@ -33,7 +33,7 @@ outputs( input=probs, label=xe_label), cross_entropy_with_selfnorm( input=probs, label=xe_label), - huber_cost( + huber_classification_cost( input=data_layer( name='huber_probs', size=1), label=data_layer( From 27a99bfb1446171969da0219a6125a79c39eb582 Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Thu, 17 Aug 2017 18:10:37 +0800 Subject: [PATCH 04/64] Add base class for huber_regression_cost and huber_classification_cost --- doc/api/v2/config/layer.rst | 6 +-- paddle/gserver/layers/CostLayer.cpp | 55 ++++++++++++---------------- paddle/gserver/layers/CostLayer.h | 27 ++++++++++---- python/paddle/v2/tests/test_layer.py | 2 +- 4 files changed, 46 insertions(+), 44 deletions(-) diff --git a/doc/api/v2/config/layer.rst b/doc/api/v2/config/layer.rst index cb330ea5e1..22a6b2ab84 100644 --- a/doc/api/v2/config/layer.rst +++ b/doc/api/v2/config/layer.rst @@ -409,9 +409,9 @@ multi_binary_label_cross_entropy_cost .. autoclass:: paddle.v2.layer.multi_binary_label_cross_entropy_cost :noindex: -huber_cost ----------- -.. autoclass:: paddle.v2.layer.huber_cost +huber_classification_cost +------------------------- +.. autoclass:: paddle.v2.layer.huber_classification_cost :noindex: lambda_cost diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp index 138c86a6d6..69cf393225 100644 --- a/paddle/gserver/layers/CostLayer.cpp +++ b/paddle/gserver/layers/CostLayer.cpp @@ -572,13 +572,8 @@ void MultiBinaryLabelCrossEntropy::backwardImp(Matrix& output, } } -// -// Huber loss for robust 2-classes classification -// -REGISTER_LAYER(huber, HuberTwoClassification); - -bool HuberTwoClassification::init(const LayerMap& layerMap, - const ParameterMap& parameterMap) { +bool HuberCost::init(const LayerMap& layerMap, + const ParameterMap& parameterMap) { CostLayer::init(layerMap, parameterMap); if (useGpu_) { tmpCpuInput_.reserve(inputLayers_.size()); @@ -589,9 +584,7 @@ bool HuberTwoClassification::init(const LayerMap& layerMap, return true; } -void HuberTwoClassification::forwardImp(Matrix& output, - Argument& label, - Matrix& cost) { +void HuberCost::forwardImp(Matrix& output, Argument& label, Matrix& cost) { if (useGpu_) { for (size_t i = 0; i < inputLayers_.size(); i++) { tmpCpuInput_[i].resizeAndCopyFrom( @@ -599,12 +592,22 @@ void HuberTwoClassification::forwardImp(Matrix& output, } hl_stream_synchronize(HPPL_STREAM_DEFAULT); } - forwardImpIn(output, label, cost); } -void HuberTwoClassification::forwardImpIn(Matrix& output, - Argument& label, - Matrix& target) { +// +// Huber loss for robust 2-classes classification +// +REGISTER_LAYER(huber_classification, HuberTwoClassification); + +bool HuberTwoClassification::init(const LayerMap& layerMap, + const ParameterMap& parameterMap) { + return HuberCost::init(layerMap, parameterMap); +} + +void HuberTwoClassification::forwardImp(Matrix& output, + Argument& label, + Matrix& target) { + HuberCost::forwardImp(output, label, target); size_t numSamples = target.getHeight(); CHECK(label.ids); CHECK_EQ((*label.ids).getSize(), numSamples); @@ -627,25 +630,13 @@ void HuberTwoClassification::forwardImpIn(Matrix& output, target.copyFrom(cost.data(), 
numSamples); } -void HuberTwoClassification::backwardImp(Matrix& outputValue, +void HuberTwoClassification::backwardImp(Matrix& output, Argument& label, - Matrix& outputGrad) { - if (useGpu_) { - backwardImpIn( - *tmpCpuInput_[0].value, tmpCpuInput_[1], *tmpCpuInput_[0].grad); - outputGrad.copyFrom(*tmpCpuInput_[0].grad); - } else { - backwardImpIn(outputValue, label, outputGrad); - } -} - -void HuberTwoClassification::backwardImpIn(Matrix& output, - Argument& label, - Matrix& outputG) { + Matrix& outputG) { size_t numSamples = output.getHeight(); - real* out = output.getData(); - real* grad = outputG.getData(); - int* lbl = (*label.ids).getData(); + real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); + int* lbl = useGpu_ ? tmpCpuInput_[1].ids->getData() : (*label.ids).getData(); + real* grad = useGpu_ ? tmpCpuInput_[0].grad->getData() : outputG.getData(); for (size_t i = 0; i < numSamples; ++i) { int y = 2 * lbl[i] - 1; if (y * out[i] < -1) @@ -653,8 +644,8 @@ void HuberTwoClassification::backwardImpIn(Matrix& output, else if (y * out[i] < 1) grad[i] += -2 * (1 - y * out[i]) * y; } + if (useGpu_) outputG.copyFrom(grad, numSamples); } - /** * This cost layer compute the sum of its input as loss. * \f[ diff --git a/paddle/gserver/layers/CostLayer.h b/paddle/gserver/layers/CostLayer.h index 77427b7a08..c006dc8110 100644 --- a/paddle/gserver/layers/CostLayer.h +++ b/paddle/gserver/layers/CostLayer.h @@ -304,6 +304,23 @@ public: Matrix& outputGrad) override; }; +/* + * A base layer for HuberRegressionLoss and HuberTwoClassification. + */ +class HuberCost : public CostLayer { +public: + std::vector tmpCpuInput_; + + explicit HuberCost(const LayerConfig& config) : CostLayer(config) {} + + bool init(const LayerMap& layerMap, + const ParameterMap& parameterMap) override; + + void forwardImp(Matrix& output, Argument& label, Matrix& cost) override; + + void backwardImp(Matrix& outputValue, Argument& label, Matrix& outputGrad) {} +}; + /** * Huber loss for robust 2-classes classification. 
* @@ -312,25 +329,19 @@ public: * Loss = (1 - y * f)^2, if -1 < y * f < 1 \\ * Loss = 0, otherwise */ -class HuberTwoClassification : public CostLayer { - std::vector tmpCpuInput_; - +class HuberTwoClassification : public HuberCost { public: explicit HuberTwoClassification(const LayerConfig& config) - : CostLayer(config) {} + : HuberCost(config) {} bool init(const LayerMap& layerMap, const ParameterMap& parameterMap) override; void forwardImp(Matrix& output, Argument& label, Matrix& cost) override; - void forwardImpIn(Matrix& output, Argument& label, Matrix& cost); - void backwardImp(Matrix& outputValue, Argument& label, Matrix& outputGrad) override; - - void backwardImpIn(Matrix& outputValue, Argument& label, Matrix& outputGrad); }; typedef std::shared_ptr CostLayerPtr; diff --git a/python/paddle/v2/tests/test_layer.py b/python/paddle/v2/tests/test_layer.py index f2097e195f..7373a55ce6 100644 --- a/python/paddle/v2/tests/test_layer.py +++ b/python/paddle/v2/tests/test_layer.py @@ -141,7 +141,7 @@ class CostLayerTest(unittest.TestCase): cost8 = layer.rank_cost(left=score, right=score, label=score) cost9 = layer.lambda_cost(input=inference, score=score) cost10 = layer.sum_cost(input=inference) - cost11 = layer.huber_cost(input=score, label=label) + cost11 = layer.huber_classification_cost(input=score, label=label) print layer.parse_network([cost1, cost2]) print layer.parse_network([cost3, cost4]) From 4d8992c3bc64a835aa6a1e6e12678594d3f117b5 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Fri, 18 Aug 2017 09:58:41 +0800 Subject: [PATCH 05/64] check format before set header format --- paddle/parameter/Parameter.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/paddle/parameter/Parameter.h b/paddle/parameter/Parameter.h index e31cbc3dee..08a426eb74 100644 --- a/paddle/parameter/Parameter.h +++ b/paddle/parameter/Parameter.h @@ -278,7 +278,11 @@ public: /** * @brief Set the format in header. */ - void setHeaderFormat(int32_t fmt) { headerFormat_ = fmt; } + void setHeaderFormat(int32_t fmt) { + CHECK(isHeaderFormatSupported(fmt)) << "Unsupported format version: " + << fmt; + headerFormat_ = fmt; + } /** * @brief Parameter Update Hook. 
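The `CHECK` added to `setHeaderFormat` above turns an unsupported header version into an immediate failure instead of letting it be stored silently. A minimal sketch of the whitelist lookup such a check relies on; the enum values below are illustrative assumptions, not the real definitions:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed format tags for illustration; the real constants live in the
// paddle/parameter headers (PARAM_FORMAT_MKLDNN_OI appears in a later patch).
enum { PARAM_FORMAT_ORIGINAL = 0, PARAM_FORMAT_MKLDNN_OI = 1 };

// Whitelist of header format versions this build knows how to load.
static bool isHeaderFormatSupported(int32_t fmt) {
  static const std::vector<int32_t> kSupported = {PARAM_FORMAT_ORIGINAL,
                                                  PARAM_FORMAT_MKLDNN_OI};
  return std::find(kSupported.begin(), kSupported.end(), fmt) !=
         kSupported.end();
}
```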
From 462b9b1d20942dca35dbe532248e53cdeccea6b2 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Fri, 18 Aug 2017 10:13:06 +0800 Subject: [PATCH 06/64] update mkldnn tag v0.10 --- cmake/external/mkldnn.cmake | 2 +- cmake/external/mklml.cmake | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/cmake/external/mkldnn.cmake b/cmake/external/mkldnn.cmake index 25c6b4ef52..9686df0021 100644 --- a/cmake/external/mkldnn.cmake +++ b/cmake/external/mkldnn.cmake @@ -51,7 +51,7 @@ ExternalProject_Add( ${EXTERNAL_PROJECT_LOG_ARGS} DEPENDS ${MKLDNN_DEPENDS} GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git" - GIT_TAG "v0.9" + GIT_TAG "v0.10" PREFIX ${MKLDNN_SOURCES_DIR} UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${MKLDNN_INSTALL_DIR} diff --git a/cmake/external/mklml.cmake b/cmake/external/mklml.cmake index e9fd3d4bed..51fafb9479 100644 --- a/cmake/external/mklml.cmake +++ b/cmake/external/mklml.cmake @@ -28,7 +28,7 @@ INCLUDE(ExternalProject) SET(MKLML_PROJECT "extern_mklml") SET(MKLML_VER "mklml_lnx_2018.0.20170720") -SET(MKLML_URL "https://github.com/01org/mkl-dnn/releases/download/v0.9/${MKLML_VER}.tgz") +SET(MKLML_URL "https://github.com/01org/mkl-dnn/releases/download/v0.10/${MKLML_VER}.tgz") SET(MKLML_SOURCE_DIR "${THIRD_PARTY_PATH}/mklml") SET(MKLML_DOWNLOAD_DIR "${MKLML_SOURCE_DIR}/src/${MKLML_PROJECT}") SET(MKLML_DST_DIR "mklml") From 62e6dac402ca63b402b5dfd1d7649cba1e258d41 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Fri, 18 Aug 2017 14:30:09 +0800 Subject: [PATCH 07/64] add MKLDNNMatrix files --- paddle/gserver/layers/MKLDNNLayer.h | 1 + paddle/math/CMakeLists.txt | 15 ++++++++++ paddle/math/MKLDNNMatrix.cpp | 19 ++++++++++++ paddle/math/MKLDNNMatrix.h | 45 +++++++++++++++++++++++++++++ 4 files changed, 80 insertions(+) create mode 100644 paddle/math/MKLDNNMatrix.cpp create mode 100644 paddle/math/MKLDNNMatrix.h diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index 63e29f447e..9533027fa6 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -18,6 +18,7 @@ limitations under the License. */ #include "Layer.h" #include "MKLDNNBase.h" #include "mkldnn.hpp" +#include "paddle/math/MKLDNNMatrix.h" DECLARE_bool(use_mkldnn); DECLARE_bool(use_mkldnn_wgt); diff --git a/paddle/math/CMakeLists.txt b/paddle/math/CMakeLists.txt index bf28092e82..ad6de18c81 100644 --- a/paddle/math/CMakeLists.txt +++ b/paddle/math/CMakeLists.txt @@ -14,6 +14,21 @@ # file(GLOB MATH_HEADERS . *.h) file(GLOB MATH_SOURCES . *.cpp) + +message(STATUS "----------MATH_HEADERS:${MATH_HEADERS}") +message(STATUS "----------MATH_SOURCES:${MATH_SOURCES}") +if(NOT WITH_MKLDNN) + file(GLOB_RECURSE DNN_HEADER RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.h") + file(GLOB_RECURSE DNN_SOURCES RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.cpp") + message(STATUS "----------DNN_HEADER:${DNN_HEADER}") + message(STATUS "----------DNN_SOURCES:${DNN_SOURCES}") + list(REMOVE_ITEM MATH_HEADERS ${DNN_HEADER}) + list(REMOVE_ITEM MATH_SOURCES ${DNN_SOURCES}) + message(STATUS "Skip compiling with MKLDNNMatrix") +else() + message(STATUS "Compile with MKLDNNMatrix") +endif() + set(MATH_SOURCES "${PADDLE_SOURCE_DIR}/paddle/math/BaseMatrix.cu" "${PADDLE_SOURCE_DIR}/paddle/math/TrainingAlgorithmOp.cu" diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp new file mode 100644 index 0000000000..df8e72d78b --- /dev/null +++ b/paddle/math/MKLDNNMatrix.cpp @@ -0,0 +1,19 @@ +/* Copyright (c) 2017 PaddlePaddle Authors. 
All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "MKLDNNMatrix.h" + +using namespace mkldnn; // NOLINT + +namespace paddle {} // namespace paddle diff --git a/paddle/math/MKLDNNMatrix.h b/paddle/math/MKLDNNMatrix.h new file mode 100644 index 0000000000..91ef56f2c3 --- /dev/null +++ b/paddle/math/MKLDNNMatrix.h @@ -0,0 +1,45 @@ +/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +//#include "Matrix.h" +#include "Vector.h" + +#include "mkldnn.hpp" +#include "paddle/parameter/Parameter.h" + +namespace paddle { + +static const std::map PARAM_FOARMAT_MAP = + {{mkldnn::memory::format::oi, PARAM_FORMAT_MKLDNN_OI}}; + +class MKLDNNMatrix; +typedef std::shared_ptr MKLDNNMatrixPtr; + +/** + * @brief MKLDNN Matrix. + * + */ +class MKLDNNMatrix : public CpuVector { +public: + explicit MKLDNNMatrix(size_t size, int fmt) : CpuVector(size), fmt_(fmt) {} + + ~MKLDNNMatrix() {} + +protected: + int fmt_; +}; + +} // namespace paddle From 3065cb26258e1a7a014c6e367747214615832c3a Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Fri, 18 Aug 2017 17:43:06 +0800 Subject: [PATCH 08/64] add huber_regression_cost --- doc/api/v2/config/layer.rst | 5 ++ paddle/gserver/layers/CostLayer.cpp | 55 +++++++++++++++++++ paddle/gserver/layers/CostLayer.h | 24 ++++++++ paddle/gserver/tests/test_LayerGrad.cpp | 20 ++++++- proto/ModelConfig.proto | 3 + python/paddle/trainer/config_parser.py | 11 ++++ .../paddle/trainer_config_helpers/layers.py | 53 ++++++++++++++++++ .../protostr/test_cost_layers.protostr | 17 ++++++ .../tests/configs/test_cost_layers.py | 2 + python/paddle/v2/tests/test_layer.py | 5 +- 10 files changed, 192 insertions(+), 3 deletions(-) diff --git a/doc/api/v2/config/layer.rst b/doc/api/v2/config/layer.rst index 22a6b2ab84..9a5901616f 100644 --- a/doc/api/v2/config/layer.rst +++ b/doc/api/v2/config/layer.rst @@ -409,6 +409,11 @@ multi_binary_label_cross_entropy_cost .. autoclass:: paddle.v2.layer.multi_binary_label_cross_entropy_cost :noindex: +huber_regression_cost +------------------------- +.. autoclass:: paddle.v2.layer.huber_regression_cost + :noindex: + huber_classification_cost ------------------------- .. 
autoclass:: paddle.v2.layer.huber_classification_cost diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp index 69cf393225..91a742422e 100644 --- a/paddle/gserver/layers/CostLayer.cpp +++ b/paddle/gserver/layers/CostLayer.cpp @@ -594,6 +594,61 @@ void HuberCost::forwardImp(Matrix& output, Argument& label, Matrix& cost) { } } +// +// Huber loss for robust regression. +// +REGISTER_LAYER(huber_regression, HuberRegressionLoss); + +bool HuberRegressionLoss::init(const LayerMap& layerMap, + const ParameterMap& parameterMap) { + HuberCost::init(layerMap, parameterMap); + delta_ = config_.delta(); + return true; +} + +void HuberRegressionLoss::forwardImp(Matrix& output, + Argument& label, + Matrix& target) { + HuberCost::forwardImp(output, label, target); + size_t numSamples = target.getHeight(); + CHECK(label.value); + CHECK_EQ((*label.value).getHeight(), numSamples); + CHECK_EQ(output.getHeight(), numSamples); + CHECK_EQ(output.getWidth(), (*label.value).getWidth()); + CHECK_EQ(target.getWidth(), (size_t)1); + + real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); + real* lbl = + useGpu_ ? tmpCpuInput_[1].value->getData() : (*label.value).getData(); + std::vector cost(numSamples); + for (size_t i = 0; i < numSamples; ++i) { + real a = std::abs(lbl[i] - out[i]); + if (a <= delta_) + cost[i] = a * a / 2; + else + cost[i] = delta_ * (a - delta_ / 2); + } + target.copyFrom(cost.data(), numSamples); +} + +void HuberRegressionLoss::backwardImp(Matrix& output, + Argument& label, + Matrix& outputG) { + size_t numSamples = output.getHeight(); + real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); + real* lbl = + useGpu_ ? tmpCpuInput_[1].value->getData() : (*label.value).getData(); + real* grad = useGpu_ ? tmpCpuInput_[0].grad->getData() : outputG.getData(); + for (size_t i = 0; i < numSamples; ++i) { + real a = lbl[i] - out[i]; + if (std::abs(a) <= delta_) + grad[i] += -a; + else + grad[i] += a > 0 ? delta_ : -delta_; + } + if (useGpu_) outputG.copyFrom(grad, numSamples); +} + // // Huber loss for robust 2-classes classification // diff --git a/paddle/gserver/layers/CostLayer.h b/paddle/gserver/layers/CostLayer.h index c006dc8110..0ce72ef40a 100644 --- a/paddle/gserver/layers/CostLayer.h +++ b/paddle/gserver/layers/CostLayer.h @@ -321,6 +321,30 @@ public: void backwardImp(Matrix& outputValue, Argument& label, Matrix& outputGrad) {} }; +/** + * Huber loss for robust regression. + * + * Given output f(x), label y and delta, the loss is: + * Loss = 0.5 * (1 - y * f)^2, if abs(y - f) <= delta \\ + * Loss = delta * abs(y - f) - 0.5 * delta^2, otherwise + */ +class HuberRegressionLoss : public HuberCost { +public: + explicit HuberRegressionLoss(const LayerConfig& config) : HuberCost(config) {} + + bool init(const LayerMap& layerMap, + const ParameterMap& parameterMap) override; + + void forwardImp(Matrix& output, Argument& label, Matrix& cost) override; + + void backwardImp(Matrix& outputValue, + Argument& label, + Matrix& outputGrad) override; + +protected: + real delta_; +}; + /** * Huber loss for robust 2-classes classification. 
* diff --git a/paddle/gserver/tests/test_LayerGrad.cpp b/paddle/gserver/tests/test_LayerGrad.cpp index 6d60250f6d..c522b20f0e 100644 --- a/paddle/gserver/tests/test_LayerGrad.cpp +++ b/paddle/gserver/tests/test_LayerGrad.cpp @@ -828,6 +828,24 @@ TEST(Layer, square_error_weighted) { } } +TEST(Layer, huber_regression_loss) { + TestConfig config; + config.layerConfig.set_type("huber_regression"); + config.biasSize = 0; + + config.inputDefs.push_back({INPUT_DATA, "layer_0", 10, 0}); + config.inputDefs.push_back({INPUT_DATA_TARGET, "layer_1", 10, 0}); + config.layerConfig.add_inputs(); + config.layerConfig.add_inputs(); + + for (auto useGpu : {false, true}) { + for (auto delta : {1, 3, 5}) { + config.layerConfig.set_delta(delta); + testLayerGrad(config, "huber_regression", 100, /* trans */ false, useGpu); + } + } +} + TEST(Layer, huber_two_class) { TestConfig config; config.layerConfig.set_type("huber_classification"); @@ -839,7 +857,7 @@ TEST(Layer, huber_two_class) { config.layerConfig.add_inputs(); for (auto useGpu : {false, true}) { - testLayerGrad(config, "huber", 100, /* trans */ false, useGpu); + testLayerGrad(config, "huber_two_class", 100, /* trans */ false, useGpu); } } diff --git a/proto/ModelConfig.proto b/proto/ModelConfig.proto index 4f3d5bf3f6..e19e0f85f3 100644 --- a/proto/ModelConfig.proto +++ b/proto/ModelConfig.proto @@ -496,6 +496,9 @@ message LayerConfig { optional int32 axis = 54 [ default = 2 ]; repeated uint32 offset = 55; repeated uint32 shape = 56; + + // for HuberRegressionLoss + optional double delta = 57 [ default = 1.0 ]; } message EvaluatorConfig { diff --git a/python/paddle/trainer/config_parser.py b/python/paddle/trainer/config_parser.py index 248da9417f..a3ca3f2510 100644 --- a/python/paddle/trainer/config_parser.py +++ b/python/paddle/trainer/config_parser.py @@ -2317,6 +2317,17 @@ class LambdaCost(LayerBase): self.config.max_sort_size = max_sort_size +@config_layer('huber_regression') +class HuberRegressionLoss(LayerBase): + def __init__(self, name, inputs, delta=1., coeff=1., device=None): + super(HuberRegressionLoss, self).__init__( + name, 'huber_regression', 1, inputs=inputs, device=device) + config_assert( + len(self.inputs) == 2, 'HuberRegression must have 2 inputs') + self.config.delta = delta + self.config.coeff = coeff + + @config_layer('nce') class NCELayer(LayerBase): def __init__(self, diff --git a/python/paddle/trainer_config_helpers/layers.py b/python/paddle/trainer_config_helpers/layers.py index 20d96efe15..d61c94dc82 100755 --- a/python/paddle/trainer_config_helpers/layers.py +++ b/python/paddle/trainer_config_helpers/layers.py @@ -108,6 +108,7 @@ __all__ = [ 'sum_cost', 'rank_cost', 'lambda_cost', + 'huber_regression_cost', 'huber_classification_cost', 'block_expand_layer', 'maxout_layer', @@ -216,6 +217,7 @@ class LayerType(object): RANK_COST = 'rank-cost' LAMBDA_COST = 'lambda_cost' + HUBER_REGRESSION = 'huber_regression' HUBER_CLASSIFICATION = 'huber_classification' CROSS_ENTROPY = 'multi-class-cross-entropy' CROSS_ENTROPY_WITH_SELFNORM = 'multi_class_cross_entropy_with_selfnorm' @@ -5603,6 +5605,57 @@ def sum_cost(input, name=None, layer_attr=None): return LayerOutput(name, LayerType.SUM_COST, parents=[input], size=1) +@wrap_name_default() +@layer_support() +def huber_regression_cost(input, + label, + name=None, + delta=1.0, + coeff=1.0, + layer_attr=None): + """ + In statistics, the Huber loss is a loss function used in robust regression, + that is less sensitive to outliers in data than the squared error loss. 
+ Given a prediction f(x), a label y and :math:`\delta`, the loss function + is defined as: + + .. math: + loss = 0.5*\left ( y-f(x) \right )^2, \left | y-f(x) \right |\leq \delta + loss = \delta \left | y-f(x) \right |-0.5\delta ^2, otherwise + + The example usage is: + + .. code-block:: python + + cost = huber_regression_cost(input=input_layer, label=label_layer) + + :param input: The first input layer. + :type input: LayerOutput. + :param label: The input label. + :type input: LayerOutput. + :param name: The name of this layers. It is not necessary. + :type name: None|basestring. + :param delta: The difference between the observed and predicted values. + :type delta: float. + :param coeff: The coefficient affects the gradient in the backward. + :type coeff: float. + :param layer_attr: Extra Layer Attribute. + :type layer_attr: ExtraLayerAttribute + :return: LayerOutput object. + :rtype: LayerOutput. + """ + assert isinstance(input, LayerOutput) + Layer( + name=name, + type=LayerType.HUBER_REGRESSION, + inputs=[input.name, label.name], + delta=delta, + coeff=coeff, + **ExtraLayerAttribute.to_kwargs(layer_attr)) + return LayerOutput( + name, LayerType.HUBER_REGRESSION, parents=[input, label], size=1) + + @wrap_name_default() @layer_support() def huber_classification_cost(input, diff --git a/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr b/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr index a64e5ea0dd..55ab464ddf 100644 --- a/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr +++ b/python/paddle/trainer_config_helpers/tests/configs/protostr/test_cost_layers.protostr @@ -167,6 +167,20 @@ layers { softmax_selfnorm_alpha: 0.1 coeff: 1.0 } +layers { + name: "__huber_regression_cost_0__" + type: "huber_regression" + size: 1 + active_type: "" + inputs { + input_layer_name: "input" + } + inputs { + input_layer_name: "labels" + } + coeff: 1.0 + delta: 1.0 +} layers { name: "huber_probs" type: "data" @@ -300,6 +314,7 @@ output_layer_names: "__rank_cost_0__" output_layer_names: "__lambda_cost_0__" output_layer_names: "__cross_entropy_0__" output_layer_names: "__cross_entropy_with_selfnorm_0__" +output_layer_names: "__huber_regression_cost_0__" output_layer_names: "__huber_classification_cost_0__" output_layer_names: "__multi_binary_label_cross_entropy_0__" output_layer_names: "__sum_cost_0__" @@ -324,6 +339,7 @@ sub_models { layer_names: "__lambda_cost_0__" layer_names: "__cross_entropy_0__" layer_names: "__cross_entropy_with_selfnorm_0__" + layer_names: "__huber_regression_cost_0__" layer_names: "huber_probs" layer_names: "huber_label" layer_names: "__huber_classification_cost_0__" @@ -349,6 +365,7 @@ sub_models { output_layer_names: "__lambda_cost_0__" output_layer_names: "__cross_entropy_0__" output_layer_names: "__cross_entropy_with_selfnorm_0__" + output_layer_names: "__huber_regression_cost_0__" output_layer_names: "__huber_classification_cost_0__" output_layer_names: "__multi_binary_label_cross_entropy_0__" output_layer_names: "__sum_cost_0__" diff --git a/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py b/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py index 98bf026d60..7ce375c708 100644 --- a/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py +++ b/python/paddle/trainer_config_helpers/tests/configs/test_cost_layers.py @@ -33,6 +33,8 @@ outputs( input=probs, label=xe_label), cross_entropy_with_selfnorm( input=probs, 
label=xe_label), + huber_regression_cost( + input=seq_in, label=labels), huber_classification_cost( input=data_layer( name='huber_probs', size=1), diff --git a/python/paddle/v2/tests/test_layer.py b/python/paddle/v2/tests/test_layer.py index 7373a55ce6..783a0ca85d 100644 --- a/python/paddle/v2/tests/test_layer.py +++ b/python/paddle/v2/tests/test_layer.py @@ -141,12 +141,13 @@ class CostLayerTest(unittest.TestCase): cost8 = layer.rank_cost(left=score, right=score, label=score) cost9 = layer.lambda_cost(input=inference, score=score) cost10 = layer.sum_cost(input=inference) - cost11 = layer.huber_classification_cost(input=score, label=label) + cost11 = layer.huber_regression_cost(input=score, label=label) + cost12 = layer.huber_classification_cost(input=score, label=label) print layer.parse_network([cost1, cost2]) print layer.parse_network([cost3, cost4]) print layer.parse_network([cost5, cost6]) - print layer.parse_network([cost7, cost8, cost9, cost10, cost11]) + print layer.parse_network([cost7, cost8, cost9, cost10, cost11, cost12]) crf = layer.crf(input=inference, label=label) crf_decoding = layer.crf_decoding(input=inference, size=3) From 4bffbd30f0dbc2a2bbff4aa8108867fceecc260a Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Mon, 21 Aug 2017 16:44:30 +0800 Subject: [PATCH 09/64] use MKLDNNMatrix in fc forward --- paddle/gserver/layers/Layer.cpp | 2 +- paddle/gserver/layers/Layer.h | 20 +++++++- paddle/gserver/layers/MKLDNNFcLayer.cpp | 63 ++++++++++++++++--------- paddle/gserver/layers/MKLDNNLayer.h | 25 +++++++--- paddle/math/CMakeLists.txt | 4 -- paddle/math/MKLDNNMatrix.cpp | 29 +++++++++++- paddle/math/MKLDNNMatrix.h | 43 +++++++++++++---- 7 files changed, 143 insertions(+), 43 deletions(-) diff --git a/paddle/gserver/layers/Layer.cpp b/paddle/gserver/layers/Layer.cpp index d5621412ca..2bc20eee6c 100644 --- a/paddle/gserver/layers/Layer.cpp +++ b/paddle/gserver/layers/Layer.cpp @@ -41,7 +41,7 @@ namespace paddle { Layer::Layer(const LayerConfig& config, bool useGpu) : config_(config), useGpu_(useGpu), - deviceId_(-1), + deviceId_(CPU_DEVICE), needSequenceInfo_(true) {} bool Layer::init(const LayerMap& layerMap, const ParameterMap& parameterMap) { diff --git a/paddle/gserver/layers/Layer.h b/paddle/gserver/layers/Layer.h index 0ed482889d..ec4d093e0c 100644 --- a/paddle/gserver/layers/Layer.h +++ b/paddle/gserver/layers/Layer.h @@ -59,7 +59,12 @@ protected: LayerConfig config_; /// whether to use GPU bool useGpu_; - /// Device Id. CPU is -1, and GPU is 0, 1, 2 ... + /// Paddle device ID, MKLDNN is -2, CPU is -1 + enum PADDLE_DEVICE_ID { + MKLDNN_DEVICE = -2, + CPU_DEVICE = -1, + }; + /// Device Id. MKLDNN is -2, CPU is -1, and GPU is 0, 1, 2 ... 
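+  /// MKLDNN layers set this to MKLDNN_DEVICE during init so that
+  /// getOutput(deviceId) can route outputs between device types.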
int deviceId_; /// Input layers std::vector inputLayers_; @@ -321,6 +326,19 @@ public: if (deviceId == getDeviceId()) { return output_; } else { + bool CPU2MKLDNN = + getDeviceId() == CPU_DEVICE && deviceId == MKLDNN_DEVICE; + bool MKLDNN2CPU = + getDeviceId() == MKLDNN_DEVICE && deviceId == CPU_DEVICE; + if (CPU2MKLDNN) { + // TODO: do something + return output_; + } else if (MKLDNN2CPU) { + // TODO: do something + return output_; + } + + // TODO: handle mkldnn device or add mkldnn device to other for (size_t i = 0; i < outputOtherDevice_.size(); i++) { if (outputOtherDevice_[i].deviceId == deviceId) { return outputOtherDevice_[i]; diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index d201fac65e..fac0390eee 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -135,33 +135,51 @@ void MKLDNNFcLayer::reshape() { void MKLDNNFcLayer::resetFwd() { bool hasBias = biases_ && biases_->getW(); - real* iData = getInputValue(0)->getData(); - real* oData = getOutputValue()->getData(); - real* wData = weight_->getW()->getData(); - real* bData = hasBias ? biases_->getW()->getData() : NULL; + const MatrixPtr& in = getInputValue(0); + const MatrixPtr& wgt = weight_->getW(); + const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr; + const MatrixPtr& out = output_.value; + + if (getPrev(0)->getDeviceId() == MKLDNN_DEVICE) { + inVal_ = std::dynamic_pointer_cast(in); + CHECK(inVal_) << "Input should be MKLDNNMatrix"; + // TODO: change input nchw to nc if available + // inVal_->downSpatial() + } else { + inVal_ = MKLDNNMatrix::create( + in, + hasSpatial_ ? memory::dims{bs_, ic_, ih_, iw_} : memory::dims{bs_, ic_}, + hasSpatial_ ? format::nchw : format::nc, + engine_); + } - // TODO(TJ): below create should be covered in MkldnnMatrix - // create memory desc - memory::desc iMD = hasSpatial_ ? createMD({bs_, ic_, ih_, iw_}, format::nchw) - : createMD({bs_, ic_}, format::nc); - memory::desc wMD = hasSpatial_ ? createMD({oc_, ic_, ih_, iw_}, format::oihw) - : createMD({oc_, ic_}, format::oi); - memory::desc bMD = bData != NULL ? createMD({oc_}, format::x) - : createMD({}, format::format_undef); - memory::desc oMD = createMD({bs_, oc_}, format::nc); + wgtVal_ = MKLDNNMatrix::create( + wgt, + hasSpatial_ ? memory::dims{oc_, ic_, ih_, iw_} : memory::dims{oc_, ic_}, + hasSpatial_ ? format::oihw : format::oi, + engine_); - // create memory primitive desc and memory self - inVal_.reset(new memory(memory::primitive_desc(iMD, engine_), iData)); - wgtVal_.reset(new memory(memory::primitive_desc(wMD, engine_), wData)); - outVal_.reset(new memory(memory::primitive_desc(oMD, engine_), oData)); + biasVal_ = + hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr; + + outVal_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_); + + // change original output to mkldnn output + output_.value = std::dynamic_pointer_cast(outVal_); + // create forward handle prop_kind pk = prop_kind::forward; - fc_fwd::desc fwdDesc = bData != NULL ? fc_fwd::desc(pk, iMD, wMD, bMD, oMD) - : fc_fwd::desc(pk, iMD, wMD, oMD); + fc_fwd::desc fwdDesc = + hasBias ? 
fc_fwd::desc(pk, + inVal_->getMD(), + wgtVal_->getMD(), + biasVal_->getMD(), + outVal_->getMD()) + : fc_fwd::desc( + pk, inVal_->getMD(), wgtVal_->getMD(), outVal_->getMD()); fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_); - if (bData != NULL) { - biasVal_.reset(new memory(memory::primitive_desc(bMD, engine_), bData)); + if (hasBias) { fwd_.reset(new fc_fwd(fwdPD, *inVal_, *wgtVal_, *biasVal_, *outVal_)); } else { fwd_.reset(new fc_fwd(fwdPD, *inVal_, *wgtVal_, *outVal_)); @@ -197,7 +215,8 @@ void MKLDNNFcLayer::resetBwd() { // update data inVal_->set_data_handle(iData); } else { - inVal_.reset(new memory(memory::primitive_desc(iMD, engine_), iData)); + LOG(FATAL) << "Should not be empty"; + // inVal_.reset(new memory(memory::primitive_desc(iMD, engine_), iData)); } // create memory primitive desc and memory self diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index 9533027fa6..b44095befb 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -21,7 +21,6 @@ limitations under the License. */ #include "paddle/math/MKLDNNMatrix.h" DECLARE_bool(use_mkldnn); -DECLARE_bool(use_mkldnn_wgt); namespace paddle { @@ -54,13 +53,14 @@ protected: std::vector pipelineBwd_; // TODO(TJ): change below memory as MKLDNNMatrixPtr type - std::shared_ptr inVal_; + // MKLDNNMatrixPtr ; + MKLDNNMatrixPtr inVal_; std::shared_ptr inGrad_; - std::shared_ptr outVal_; + MKLDNNMatrixPtr outVal_; std::shared_ptr outGrad_; - std::shared_ptr wgtVal_; + MKLDNNMatrixPtr wgtVal_; std::shared_ptr wgtGrad_; - std::shared_ptr biasVal_; + MKLDNNMatrixPtr biasVal_; std::shared_ptr biasGrad_; public: @@ -94,7 +94,7 @@ public: stream_.reset(new MKLDNNStream()); engine_ = CPUEngine::Instance().getEngine(); - // TODO(TJ): deivecId + setDeviceID(MKLDNN_DEVICE); return true; } @@ -128,6 +128,19 @@ public: // TODO(TJ): isFmtSuppoted(fmt) return mkldnn::memory::desc(dims, type, fmt); } + + void resetMKLDNNOutput(size_t height, size_t width) { + Layer::resetOutput(height, width); + // get valu and grad, use mkldnn matrix instaed + // output_.value; + } + +protected: + void setDeviceID(int id) { + deviceId_ = id; + output_.deviceId = id; + // TODO: handle mkldnn device or add mkldnn device to other + } }; } // namespace paddle diff --git a/paddle/math/CMakeLists.txt b/paddle/math/CMakeLists.txt index ad6de18c81..8afe6b509d 100644 --- a/paddle/math/CMakeLists.txt +++ b/paddle/math/CMakeLists.txt @@ -15,13 +15,9 @@ file(GLOB MATH_HEADERS . *.h) file(GLOB MATH_SOURCES . *.cpp) -message(STATUS "----------MATH_HEADERS:${MATH_HEADERS}") -message(STATUS "----------MATH_SOURCES:${MATH_SOURCES}") if(NOT WITH_MKLDNN) file(GLOB_RECURSE DNN_HEADER RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.h") file(GLOB_RECURSE DNN_SOURCES RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.cpp") - message(STATUS "----------DNN_HEADER:${DNN_HEADER}") - message(STATUS "----------DNN_SOURCES:${DNN_SOURCES}") list(REMOVE_ITEM MATH_HEADERS ${DNN_HEADER}) list(REMOVE_ITEM MATH_SOURCES ${DNN_SOURCES}) message(STATUS "Skip compiling with MKLDNNMatrix") diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp index df8e72d78b..44fc54278c 100644 --- a/paddle/math/MKLDNNMatrix.cpp +++ b/paddle/math/MKLDNNMatrix.cpp @@ -16,4 +16,31 @@ limitations under the License. 
*/ using namespace mkldnn; // NOLINT -namespace paddle {} // namespace paddle +namespace paddle { + +MKLDNNMatrixPtr MKLDNNMatrix::create(const MatrixPtr& m, + memory::dims dims, + memory::format fmt, + engine& eg, + mkldnn::memory::data_type dtype) { + CpuMatrixPtr cpuM = std::dynamic_pointer_cast(m); + CHECK(cpuM) << "Only support create from CPU matrix yet"; + + size_t ndims = dims.size(); + CHECK(ndims > 0) << "Input dims should not be empty"; + size_t cnt = 1; + for (size_t i = 0; i < ndims; ++i) { + cnt *= dims[i]; + } + CHECK_EQ(cnt, m->getElementCnt()) << "Count size does not match"; + + size_t width = m->getWidth(); + size_t height = m->getHeight(); + real* data = m->getData(); + + memory::desc md = memory::desc(dims, dtype, fmt); + memory::primitive_desc pd = memory::primitive_desc(md, eg); + return std::make_shared(data, height, width, pd); +} + +} // namespace paddle diff --git a/paddle/math/MKLDNNMatrix.h b/paddle/math/MKLDNNMatrix.h index 91ef56f2c3..73eb50d2a0 100644 --- a/paddle/math/MKLDNNMatrix.h +++ b/paddle/math/MKLDNNMatrix.h @@ -14,9 +14,8 @@ limitations under the License. */ #pragma once -//#include "Matrix.h" -#include "Vector.h" - +#include +#include "Matrix.h" #include "mkldnn.hpp" #include "paddle/parameter/Parameter.h" @@ -32,14 +31,42 @@ typedef std::shared_ptr MKLDNNMatrixPtr; * @brief MKLDNN Matrix. * */ -class MKLDNNMatrix : public CpuVector { +class MKLDNNMatrix : public CpuMatrix, public mkldnn::memory { public: - explicit MKLDNNMatrix(size_t size, int fmt) : CpuVector(size), fmt_(fmt) {} + MKLDNNMatrix(real* data, + size_t height, + size_t width, + mkldnn::memory::primitive_desc pd) + : CpuMatrix(data, height, width, false), mkldnn::memory(pd, data) {} - ~MKLDNNMatrix() {} + MKLDNNMatrix(size_t height, size_t width, mkldnn::memory::primitive_desc pd) + : CpuMatrix(height, width, false), mkldnn::memory(pd) { + set_data_handle(CpuMatrix::getData()); + } + + static MKLDNNMatrixPtr create( + const MatrixPtr& m, + mkldnn::memory::dims dims, + mkldnn::memory::format fmt, + mkldnn::engine& eg, + mkldnn::memory::data_type dtype = mkldnn::memory::data_type::f32); + + /** + * Get primitive descriptor + */ + mkldnn::memory::primitive_desc getPD() { return this->get_primitive_desc(); } -protected: - int fmt_; + /** + * Get memory descriptor + */ + mkldnn::memory::desc getMD() { return getPD().desc(); } + + /** + * Get format + */ + int getFormat() { return getMD().data.format; } + + ~MKLDNNMatrix() {} }; } // namespace paddle From 73ab2d4678418d42dd5a03d5f8531b49eab2e7ce Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Mon, 21 Aug 2017 17:57:06 +0800 Subject: [PATCH 10/64] fix backward error of huber_regression_cost --- paddle/gserver/layers/CostLayer.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp index 91a742422e..7f648070f2 100644 --- a/paddle/gserver/layers/CostLayer.cpp +++ b/paddle/gserver/layers/CostLayer.cpp @@ -644,7 +644,7 @@ void HuberRegressionLoss::backwardImp(Matrix& output, if (std::abs(a) <= delta_) grad[i] += -a; else - grad[i] += a > 0 ? delta_ : -delta_; + grad[i] += a > 0 ? 
-delta_ : delta_; } if (useGpu_) outputG.copyFrom(grad, numSamples); } From 4eecd0c2d531f66e64eebff88a99488275143207 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Tue, 22 Aug 2017 14:18:16 +0800 Subject: [PATCH 11/64] use MKLDNNMatrix in fc backward --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 77 ++++++++++++------------- paddle/gserver/layers/MKLDNNLayer.h | 59 ++++++++++++++----- paddle/math/MKLDNNMatrix.h | 33 +++++++++-- 3 files changed, 110 insertions(+), 59 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index fac0390eee..5463104469 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -158,10 +158,8 @@ void MKLDNNFcLayer::resetFwd() { hasSpatial_ ? memory::dims{oc_, ic_, ih_, iw_} : memory::dims{oc_, ic_}, hasSpatial_ ? format::oihw : format::oi, engine_); - biasVal_ = hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr; - outVal_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_); // change original output to mkldnn output @@ -193,46 +191,41 @@ void MKLDNNFcLayer::resetBwd() { return; } needResetBwd_ = false; - bool hasBias = biases_ && biases_->getWGrad(); - real* iData = getInputValue(0)->getData(); - real* iDiff = getInputGrad(0) != nullptr ? getInputGrad(0)->getData() : NULL; - real* oDiff = getOutputGrad()->getData(); - real* wDiff = weight_->getWGrad()->getData(); - real* bDiff = hasBias ? biases_->getWGrad()->getData() : NULL; /// backward weight - // create memory desc for backward memory - memory::desc iMD = hasSpatial_ ? createMD({bs_, ic_, ih_, iw_}, format::nchw) - : createMD({bs_, ic_}, format::nc); - memory::desc wMD = hasSpatial_ ? createMD({oc_, ic_, ih_, iw_}, format::oihw) - : createMD({oc_, ic_}, format::oi); - memory::desc oMD = createMD({bs_, oc_}, format::nc); - memory::desc bMD = bDiff != NULL ? createMD({oc_}, format::x) - : createMD({}, format::format_undef); - - if (inVal_) { - // update data - inVal_->set_data_handle(iData); - } else { - LOG(FATAL) << "Should not be empty"; - // inVal_.reset(new memory(memory::primitive_desc(iMD, engine_), iData)); - } - - // create memory primitive desc and memory self - wgtGrad_.reset(new memory(memory::primitive_desc(wMD, engine_), wDiff)); - outGrad_.reset(new memory(memory::primitive_desc(oMD, engine_), oDiff)); + CHECK(inVal_) << "Should have input value"; + const MatrixPtr& wgt = weight_->getWGrad(); + const MatrixPtr& bias = hasBias ? biases_->getWGrad() : nullptr; + const MatrixPtr& out = output_.grad; + + wgtGrad_ = MKLDNNMatrix::create( + wgt, wgtVal_->getDims(), wgtVal_->getFormat(), engine_); + biasGrad_ = + hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr; - fc_fwd::desc fwdDesc = fc_fwd::desc(prop_kind::forward, iMD, wMD, oMD); + outGrad_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_); + // change original output to mkldnn output + // TODO: right? + output_.grad = std::dynamic_pointer_cast(outGrad_); + + // create memory primitive desc + fc_fwd::desc fwdDesc = fc_fwd::desc(prop_kind::forward, + inVal_->getMD(), + wgtGrad_->getMD(), + outGrad_->getMD()); fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_); - fc_bwdWgt::desc bwdWgtDesc = bDiff != NULL - ? fc_bwdWgt::desc(iMD, wMD, bMD, oMD) - : fc_bwdWgt::desc(iMD, wMD, oMD); + fc_bwdWgt::desc bwdWgtDesc = + hasBias ? 
fc_bwdWgt::desc(inVal_->getMD(), + wgtGrad_->getMD(), + biasGrad_->getMD(), + outGrad_->getMD()) + : fc_bwdWgt::desc( + inVal_->getMD(), wgtGrad_->getMD(), outGrad_->getMD()); fc_bwdWgt::primitive_desc bwdWgtPD = fc_bwdWgt::primitive_desc(bwdWgtDesc, engine_, fwdPD); - if (bDiff != NULL) { - biasGrad_.reset(new memory(memory::primitive_desc(bMD, engine_), bDiff)); + if (hasBias) { bwdWgt_.reset( new fc_bwdWgt(bwdWgtPD, *inVal_, *outGrad_, *wgtGrad_, *biasGrad_)); } else { @@ -242,13 +235,19 @@ void MKLDNNFcLayer::resetBwd() { pipelineBwd_.push_back(*bwdWgt_); /// backward data - if (iDiff == NULL) { + const MatrixPtr& in = getInputGrad(0); + if (in == nullptr) { return; } - fc_bwdData::desc bwdDataDesc = fc_bwdData::desc(iMD, wMD, oMD); + fc_bwdData::desc bwdDataDesc = + fc_bwdData::desc(inVal_->getMD(), wgtGrad_->getMD(), outGrad_->getMD()); fc_bwdData::primitive_desc bwdDataPD = fc_bwdData::primitive_desc(bwdDataDesc, engine_, fwdPD); - inGrad_.reset(new memory(memory::primitive_desc(iMD, engine_), iDiff)); + + // TODO: check right, just from ingrad? + inGrad_ = + MKLDNNMatrix::create(in, inVal_->getDims(), inVal_->getFormat(), engine_); + CHECK(wgtVal_) << "Should have weight memory"; bwdData_.reset(new fc_bwdData(bwdDataPD, *outGrad_, *wgtVal_, *inGrad_)); pipelineBwd_.push_back(*bwdData_); @@ -264,7 +263,7 @@ void MKLDNNFcLayer::forward(PassType passType) { // update input data // since it might be changed if this is after data layer real* iData = getInputValue(0)->getData(); - inVal_->set_data_handle(iData); + inVal_->updateData(iData); // just submit forward pipeline stream_->submit(pipelineFwd_); @@ -288,7 +287,7 @@ void MKLDNNFcLayer::backward(const UpdateCallback& callback) { // update diff real* oDiff = getOutputGrad()->getData(); - outGrad_->set_data_handle(oDiff); + outGrad_->updateData(oDiff); // just sumbmit backward pipeline stream_->submit(pipelineBwd_); diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index b44095befb..fbd62d9aaa 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -52,16 +52,15 @@ protected: std::vector pipelineFwd_; std::vector pipelineBwd_; - // TODO(TJ): change below memory as MKLDNNMatrixPtr type - // MKLDNNMatrixPtr ; + // MKLDNNMatrixPtr MKLDNNMatrixPtr inVal_; - std::shared_ptr inGrad_; + MKLDNNMatrixPtr inGrad_; MKLDNNMatrixPtr outVal_; - std::shared_ptr outGrad_; + MKLDNNMatrixPtr outGrad_; MKLDNNMatrixPtr wgtVal_; - std::shared_ptr wgtGrad_; + MKLDNNMatrixPtr wgtGrad_; MKLDNNMatrixPtr biasVal_; - std::shared_ptr biasGrad_; + MKLDNNMatrixPtr biasGrad_; public: explicit MKLDNNLayer(const LayerConfig& config) @@ -84,17 +83,24 @@ public: virtual bool init(const LayerMap& layerMap, const ParameterMap& parameterMap) { + CHECK(FLAGS_use_mkldnn) << "MkldnnLayers only support use_mkldnn." + << "Please set WITH_MKLDNN=ON " + << "and set use_mkldnn=True"; + if (useGpu_ == true) { + LOG(WARNING) << "Do not support GPU yet, will change to useGpu = false"; + useGpu_ = false; + } + + // set device id before Layer::init + setDevice(MKLDNN_DEVICE); + // change param device to MKLDNN device + setParamsDevice(MKLDNN_DEVICE, parameterMap); if (!Layer::init(layerMap, parameterMap)) { return false; } - CHECK(FLAGS_use_mkldnn) << "MkldnnLayers only support use_mkldnn." 
- << "Please set WITH_MKLDNN=ON " - << "and set use_mkldnn=True"; stream_.reset(new MKLDNNStream()); engine_ = CPUEngine::Instance().getEngine(); - - setDeviceID(MKLDNN_DEVICE); return true; } @@ -136,10 +142,33 @@ public: } protected: - void setDeviceID(int id) { - deviceId_ = id; - output_.deviceId = id; - // TODO: handle mkldnn device or add mkldnn device to other + /** + * Set deviceId of this layer. + */ + void setDevice(int id) { deviceId_ = id; } + + /** + * Set deviceId of the params used in this layer. + */ + void setParamsDevice(int id, const ParameterMap& parameterMap) { + for (auto& inputConfig : config_.inputs()) { + if (inputConfig.has_input_parameter_name()) { + ParameterPtr parameter; + std::string name = inputConfig.input_parameter_name(); + CHECK(mapGet(name, parameterMap, ¶meter)) + << "Cannot find input parameter " << name << " for layer " + << getName(); + parameter->setDevice(id); + } + } + if (config_.has_bias_parameter_name()) { + ParameterPtr parameter; + std::string name = config_.bias_parameter_name(); + CHECK(mapGet(name, parameterMap, ¶meter)) + << "Cannot find bias parameter " << name << " for layer " + << getName(); + parameter->setDevice(id); + } } }; diff --git a/paddle/math/MKLDNNMatrix.h b/paddle/math/MKLDNNMatrix.h index 73eb50d2a0..54c0a1fdcb 100644 --- a/paddle/math/MKLDNNMatrix.h +++ b/paddle/math/MKLDNNMatrix.h @@ -44,6 +44,8 @@ public: set_data_handle(CpuMatrix::getData()); } + ~MKLDNNMatrix() {} + static MKLDNNMatrixPtr create( const MatrixPtr& m, mkldnn::memory::dims dims, @@ -52,21 +54,42 @@ public: mkldnn::memory::data_type dtype = mkldnn::memory::data_type::f32); /** - * Get primitive descriptor + * Get primitive descriptor. */ mkldnn::memory::primitive_desc getPD() { return this->get_primitive_desc(); } /** - * Get memory descriptor + * Get memory descriptor. */ mkldnn::memory::desc getMD() { return getPD().desc(); } /** - * Get format + * Get dims. */ - int getFormat() { return getMD().data.format; } + mkldnn::memory::dims getDims() { + mkldnn::memory::dims dst; + int* src = getMD().data.dims; + int ndims = getMD().data.ndims; + dst.resize(ndims); + for (int i = 0; i < ndims; ++i) { + dst[i] = src[i]; + } + return dst; + } - ~MKLDNNMatrix() {} + /** + * Get format. + */ + mkldnn::memory::format getFormat() { + return (mkldnn::memory::format)(getMD().data.format); + } + + /** + * Update the memory data handle. + * Caution: This will not check the buffer size of the data, + * it should be coverd by user. + */ + void updateData(void* data) { set_data_handle(data); } }; } // namespace paddle From 48d87e5e912ad084ccc63dae8649f90a3f0989ba Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Wed, 23 Aug 2017 16:47:51 +0800 Subject: [PATCH 12/64] pass test, support input CPU device --- paddle/gserver/layers/Layer.h | 35 +++++--- paddle/gserver/layers/MKLDNNFcLayer.cpp | 108 +++++++++++++++--------- paddle/gserver/layers/MKLDNNLayer.h | 81 +++++++++++++++--- paddle/math/Allocator.h | 6 ++ paddle/math/MKLDNNMatrix.cpp | 71 +++++++++++++--- paddle/math/MKLDNNMatrix.h | 49 ++++++++--- 6 files changed, 258 insertions(+), 92 deletions(-) diff --git a/paddle/gserver/layers/Layer.h b/paddle/gserver/layers/Layer.h index ec4d093e0c..edef36194a 100644 --- a/paddle/gserver/layers/Layer.h +++ b/paddle/gserver/layers/Layer.h @@ -82,6 +82,7 @@ protected: Argument output_; /// Several outputs stored on different devices, used in 'parallel_nn' case, /// and record them by deviceId_. + /// Also used in 'use_mkldnn' case. 
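+  /// Outputs prepared for other devices; getOutput(deviceId) searches this
+  /// vector whenever the requested deviceId differs from this layer's own.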
std::vector outputOtherDevice_; /// If there are several outputs, map them by each name. std::map outputMap_; @@ -177,6 +178,13 @@ protected: return inputLayer.getOutput(deviceId_); } + /** + * Get the argument of input layer with deviceId. + */ + const Argument& getInput(size_t inputIndex, int deviceId) const { + return inputLayers_[inputIndex]->getOutput(deviceId); + } + /** * Get the forward-input value. */ @@ -191,6 +199,13 @@ protected: return inputLayer.getOutput(deviceId_).value; } + /** + * Get the forward-input value with deviceId. + */ + const MatrixPtr& getInputValue(int inputIndex, int deviceId) { + return inputLayers_[inputIndex]->getOutput(deviceId).value; + } + /** * Get the forward-input grad. */ @@ -205,6 +220,13 @@ protected: return inputLayer.getOutput(deviceId_).grad; } + /** + * Get the forward-input grad. + */ + const MatrixPtr& getInputGrad(int inputIndex, int deviceId) { + return inputLayers_[inputIndex]->getOutput(deviceId).grad; + } + /** * Get the forward-input label. */ @@ -326,19 +348,6 @@ public: if (deviceId == getDeviceId()) { return output_; } else { - bool CPU2MKLDNN = - getDeviceId() == CPU_DEVICE && deviceId == MKLDNN_DEVICE; - bool MKLDNN2CPU = - getDeviceId() == MKLDNN_DEVICE && deviceId == CPU_DEVICE; - if (CPU2MKLDNN) { - // TODO: do something - return output_; - } else if (MKLDNN2CPU) { - // TODO: do something - return output_; - } - - // TODO: handle mkldnn device or add mkldnn device to other for (size_t i = 0; i < outputOtherDevice_.size(); i++) { if (outputOtherDevice_[i].deviceId == deviceId) { return outputOtherDevice_[i]; diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index 5463104469..a3291e6a8f 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -97,7 +97,7 @@ void MKLDNNFcLayer::convertWeightsToPaddle() { } void MKLDNNFcLayer::reshape() { - const Argument& input = getInput(0); + const Argument& input = getInput(0, getPrev(0)->getDeviceId()); int batchSize = input.getBatchSize(); if (bs_ == batchSize) { return; @@ -135,35 +135,43 @@ void MKLDNNFcLayer::reshape() { void MKLDNNFcLayer::resetFwd() { bool hasBias = biases_ && biases_->getW(); - const MatrixPtr& in = getInputValue(0); const MatrixPtr& wgt = weight_->getW(); const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr; const MatrixPtr& out = output_.value; - if (getPrev(0)->getDeviceId() == MKLDNN_DEVICE) { + if (prevIsMKLDNN()) { + const MatrixPtr& in = getInputValue(0); inVal_ = std::dynamic_pointer_cast(in); CHECK(inVal_) << "Input should be MKLDNNMatrix"; - // TODO: change input nchw to nc if available - // inVal_->downSpatial() } else { + CHECK_EQ(getPrev(0)->getDeviceId(), CPU_DEVICE) << "Only support CPU yet"; + const MatrixPtr& in = getInputValue(0, CPU_DEVICE); inVal_ = MKLDNNMatrix::create( - in, - hasSpatial_ ? memory::dims{bs_, ic_, ih_, iw_} : memory::dims{bs_, ic_}, - hasSpatial_ ? format::nchw : format::nc, - engine_); + in, memory::dims{bs_, ic_, ih_, iw_}, format::nchw, engine_); } - + inVal_->downSpatial(); wgtVal_ = MKLDNNMatrix::create( - wgt, - hasSpatial_ ? memory::dims{oc_, ic_, ih_, iw_} : memory::dims{oc_, ic_}, - hasSpatial_ ? format::oihw : format::oi, - engine_); + wgt, memory::dims{oc_, ic_, ih_, iw_}, format::oihw, engine_); + wgtVal_->downSpatial(); biasVal_ = hasBias ? 
MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr; outVal_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_); - // change original output to mkldnn output + // change original output value to mkldnn output value output_.value = std::dynamic_pointer_cast(outVal_); + if (!nextIsMKLDNN()) { + Argument cpuOutput; + for (size_t i = 0; i < outputOtherDevice_.size(); i++) { + if (outputOtherDevice_[i].deviceId == CPU_DEVICE) { + cpuOutput = outputOtherDevice_[i]; + } + } + cpuOutput.setFrameHeight(output_.getFrameHeight()); + cpuOutput.setFrameWidth(output_.getFrameWidth()); + + // fc cpu output value do not need convert + cpuOutput.value = output_.value; + } // create forward handle prop_kind pk = prop_kind::forward; @@ -176,12 +184,13 @@ void MKLDNNFcLayer::resetFwd() { : fc_fwd::desc( pk, inVal_->getMD(), wgtVal_->getMD(), outVal_->getMD()); fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_); - if (hasBias) { fwd_.reset(new fc_fwd(fwdPD, *inVal_, *wgtVal_, *biasVal_, *outVal_)); } else { fwd_.reset(new fc_fwd(fwdPD, *inVal_, *wgtVal_, *outVal_)); } + printValueFormatFlow(); + pipelineFwd_.clear(); pipelineFwd_.push_back(*fwd_); } @@ -197,17 +206,24 @@ void MKLDNNFcLayer::resetBwd() { CHECK(inVal_) << "Should have input value"; const MatrixPtr& wgt = weight_->getWGrad(); const MatrixPtr& bias = hasBias ? biases_->getWGrad() : nullptr; - const MatrixPtr& out = output_.grad; - wgtGrad_ = MKLDNNMatrix::create( - wgt, wgtVal_->getDims(), wgtVal_->getFormat(), engine_); - biasGrad_ = - hasBias ? MKLDNNMatrix::create(bias, {oc_}, format::x, engine_) : nullptr; + if (nextIsMKLDNN()) { + // can not directly cast outputgrad to mkldnnmatrix, + // since each layer can not write the inputgrad to mkldnn inputgrad. + // So just create from matrix with outputvalue format. + const MatrixPtr& out = getOutput(MKLDNN_DEVICE).grad; + outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD()); + // TODO: maybe need merge topdiffs + } else { + // TODO: merge topdiffs + const MatrixPtr& out = getOutput(CPU_DEVICE).grad; + // fc do not need to convert from cpu device since output always nc + // only need create from cpu device + outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD()); + } - outGrad_ = MKLDNNMatrix::create(out, {bs_, oc_}, format::nc, engine_); - // change original output to mkldnn output - // TODO: right? - output_.grad = std::dynamic_pointer_cast(outGrad_); + wgtGrad_ = MKLDNNMatrix::create(wgt, wgtVal_->getPD()); + biasGrad_ = hasBias ? 
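+      // like wgtGrad_ above, biasGrad_ reuses the primitive descriptor of its
+      // value counterpart, so dims and memory format always match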
MKLDNNMatrix::create(bias, biasVal_->getPD()) : nullptr; // create memory primitive desc fc_fwd::desc fwdDesc = fc_fwd::desc(prop_kind::forward, @@ -235,21 +251,38 @@ void MKLDNNFcLayer::resetBwd() { pipelineBwd_.push_back(*bwdWgt_); /// backward data - const MatrixPtr& in = getInputGrad(0); - if (in == nullptr) { - return; + if (prevIsMKLDNN()) { + const MatrixPtr& in = getInputGrad(0, MKLDNN_DEVICE); + if (in == nullptr) { + return; + } + if (getInput(0, MKLDNN_DEVICE).getAllCount() > 1) { + // TODO: many mkldnn bots + // add sum handle + } else { + inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); + } + } else { + const MatrixPtr& in = getInputGrad(0, CPU_DEVICE); + if (in == nullptr) { + return; + } + if (getInput(0, CPU_DEVICE).getAllCount() > 1) { + // TODO: many bots + // add sum handle + } else { + inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); + } } + fc_bwdData::desc bwdDataDesc = fc_bwdData::desc(inVal_->getMD(), wgtGrad_->getMD(), outGrad_->getMD()); fc_bwdData::primitive_desc bwdDataPD = fc_bwdData::primitive_desc(bwdDataDesc, engine_, fwdPD); - // TODO: check right, just from ingrad? - inGrad_ = - MKLDNNMatrix::create(in, inVal_->getDims(), inVal_->getFormat(), engine_); - CHECK(wgtVal_) << "Should have weight memory"; bwdData_.reset(new fc_bwdData(bwdDataPD, *outGrad_, *wgtVal_, *inGrad_)); + printGradFormatFlow(); pipelineBwd_.push_back(*bwdData_); } @@ -259,11 +292,7 @@ void MKLDNNFcLayer::forward(PassType passType) { { REGISTER_TIMER_INFO("mkldnn_FwdTimer", getName().c_str()); - - // update input data - // since it might be changed if this is after data layer - real* iData = getInputValue(0)->getData(); - inVal_->updateData(iData); + syncInputValue(); // just submit forward pipeline stream_->submit(pipelineFwd_); @@ -285,10 +314,7 @@ void MKLDNNFcLayer::backward(const UpdateCallback& callback) { REGISTER_TIMER_INFO("mkldnn_bwdTimer", getName().c_str()); resetBwd(); - // update diff - real* oDiff = getOutputGrad()->getData(); - outGrad_->updateData(oDiff); - + syncOutputGrad(); // just sumbmit backward pipeline stream_->submit(pipelineBwd_); } diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index fbd62d9aaa..3dd17a36ff 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -125,23 +125,80 @@ public: << ", oh: " << oh_ << ", ow: " << ow_; } - // TODO(TJ): move to MkldnnMatrix - // create memory desc - inline mkldnn::memory::desc createMD( - mkldnn::memory::dims dims, - mkldnn::memory::format fmt, - mkldnn::memory::data_type type = mkldnn::memory::data_type::f32) { - // TODO(TJ): isFmtSuppoted(fmt) - return mkldnn::memory::desc(dims, type, fmt); + /** + * Print the mkldnn memory format flow of value + */ + virtual void printValueFormatFlow() { + if (inVal_ && outVal_) { + VLOG(MKLDNN_FMTS) << "value format flow --- " << inVal_->getFormat() + << " >>> " << outVal_->getFormat(); + } } - void resetMKLDNNOutput(size_t height, size_t width) { - Layer::resetOutput(height, width); - // get valu and grad, use mkldnn matrix instaed - // output_.value; + /** + * Print the mkldnn memory format flow of grad + */ + virtual void printGradFormatFlow() { + if (inGrad_ && outGrad_) { + VLOG(MKLDNN_FMTS) << "grad format flow --- " << inGrad_->getFormat() + << " <<< " << outGrad_->getFormat(); + } } protected: + /** + * If next layer only has MKLDNN type. + * Otherwise, only support otherdevice CPU device. 
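+   * Returns true only when this layer shares no output with another device;
+   * a single CPU_DEVICE output is enough to force a conversion back to CPU.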
+ */ + bool nextIsMKLDNN() { + for (size_t i = 0; i < outputOtherDevice_.size(); i++) { + CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE) + << "Only support other device is CPU yet"; + } + return outputOtherDevice_.size() == 0; + } + + /** + * Is previous layer MKLDNN type. + * Otherwise, only support otherdevice CPU device. + */ + bool prevIsMKLDNN(int index = 0) { + int prevDevice = getPrev(index)->getDeviceId(); + if (prevDevice == MKLDNN_DEVICE) { + return true; + } else { + // do not support GPU yet + CHECK_EQ(prevDevice, CPU_DEVICE) << "Only support CPU yet"; + return false; + } + } + + /** + * Sync input value data + */ + void syncInputValue() { + if (prevIsMKLDNN()) { + return; + } + real* iData = getInputValue(0, CPU_DEVICE)->getData(); + // update input data + // since it might be changed if this is after data layer + inVal_->updateData(iData); + } + + /** + * Sync output grad data + */ + void syncOutputGrad() { + if (nextIsMKLDNN()) { + return; + } + + // update diff + real* oDiff = getOutput(CPU_DEVICE).grad->getData(); + outGrad_->updateData(oDiff); + } + /** * Set deviceId of this layer. */ diff --git a/paddle/math/Allocator.h b/paddle/math/Allocator.h index 666a8b8368..94ef561f06 100644 --- a/paddle/math/Allocator.h +++ b/paddle/math/Allocator.h @@ -48,7 +48,13 @@ public: */ virtual void* alloc(size_t size) { void* ptr; +#ifdef PADDLE_USE_MKLDNN + // refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp + // memory alignment + CHECK_EQ(posix_memalign(&ptr, 4096ul, size), 0); +#else CHECK_EQ(posix_memalign(&ptr, 32ul, size), 0); +#endif CHECK(ptr) << "Fail to allocate CPU memory: size=" << size; return ptr; } diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp index 44fc54278c..24d54ec0f7 100644 --- a/paddle/math/MKLDNNMatrix.cpp +++ b/paddle/math/MKLDNNMatrix.cpp @@ -18,29 +18,74 @@ using namespace mkldnn; // NOLINT namespace paddle { -MKLDNNMatrixPtr MKLDNNMatrix::create(const MatrixPtr& m, - memory::dims dims, - memory::format fmt, - engine& eg, - mkldnn::memory::data_type dtype) { - CpuMatrixPtr cpuM = std::dynamic_pointer_cast(m); - CHECK(cpuM) << "Only support create from CPU matrix yet"; - - size_t ndims = dims.size(); +MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::primitive_desc pd) { + memory::desc md = pd.desc(); + size_t ndims = md.data.ndims; + int* dims = md.data.dims; CHECK(ndims > 0) << "Input dims should not be empty"; - size_t cnt = 1; + size_t cnts = 1; for (size_t i = 0; i < ndims; ++i) { - cnt *= dims[i]; + cnts *= dims[i]; } - CHECK_EQ(cnt, m->getElementCnt()) << "Count size does not match"; + if (m == nullptr) { + size_t height = dims[0]; + size_t width = cnts / dims[0]; + // LOG(INFO) << height << "," << width; + m = Matrix::create(height, width, false, false); + } + + CHECK(m) << " Matrix should not be empty"; + CpuMatrixPtr cpuMatrix = std::dynamic_pointer_cast(m); + CHECK(cpuMatrix) << "Only support create from CPU matrix yet"; + + CHECK_EQ(cnts, m->getElementCnt()) << "Count size does not match"; size_t width = m->getWidth(); size_t height = m->getHeight(); real* data = m->getData(); + return std::make_shared(data, height, width, pd); +} +MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, + memory::dims dims, + memory::format fmt, + engine& eg, + mkldnn::memory::data_type dtype) { memory::desc md = memory::desc(dims, dtype, fmt); memory::primitive_desc pd = memory::primitive_desc(md, eg); - return std::make_shared(data, height, width, pd); + return create(m, pd); +} + +void 
MKLDNNMatrix::downSpatial() { + int fmt = getFormat(); + if (!(fmt == memory::format::nchw || fmt == memory::format::oihw)) { + // only support nchw and oihw yet, later can support more like nhwc, ihwo + return; + } + + memory::dims srcDims = getDims(); + const int H = 2, W = 3; + if (srcDims[H] != 1 || srcDims[W] != 1) { + // can not down spatial + return; + } + + memory::dims dstDims = memory::dims{srcDims[0], srcDims[1]}; + memory::format dstFmt; + switch (fmt) { + case memory::format::nchw: + dstFmt = memory::format::nc; + break; + case memory::format::oihw: + dstFmt = memory::format::oi; + break; + default: + LOG(FATAL) << "unsupported format"; + } + memory::desc md = memory::desc(dstDims, getDtype(), dstFmt); + memory::primitive_desc pd = memory::primitive_desc(md, getEngine()); + void* data = getData(); + memory(pd, data); } } // namespace paddle diff --git a/paddle/math/MKLDNNMatrix.h b/paddle/math/MKLDNNMatrix.h index 54c0a1fdcb..05adc867c2 100644 --- a/paddle/math/MKLDNNMatrix.h +++ b/paddle/math/MKLDNNMatrix.h @@ -39,20 +39,37 @@ public: mkldnn::memory::primitive_desc pd) : CpuMatrix(data, height, width, false), mkldnn::memory(pd, data) {} - MKLDNNMatrix(size_t height, size_t width, mkldnn::memory::primitive_desc pd) - : CpuMatrix(height, width, false), mkldnn::memory(pd) { - set_data_handle(CpuMatrix::getData()); - } - ~MKLDNNMatrix() {} + /** + * Create MKLDNNMatrix from a MatrixPtr and memory primitive_desc + */ + static MKLDNNMatrixPtr create(MatrixPtr m, mkldnn::memory::primitive_desc pd); + + /** + * Create MKLDNNMatrix from a MatrixPtr and memory details info + */ static MKLDNNMatrixPtr create( - const MatrixPtr& m, + MatrixPtr m, mkldnn::memory::dims dims, mkldnn::memory::format fmt, mkldnn::engine& eg, mkldnn::memory::data_type dtype = mkldnn::memory::data_type::f32); +public: + /** + * Dimensionality reduction. + * Change format "nchw --> nc" or "oihw --> oi" if the h and w are both 1 + */ + void downSpatial(); + + /** + * Update the memory data handle. + * Caution: This will not check the buffer size of the data, + * it should be coverd by user. + */ + void updateData(void* data) { set_data_handle(data); } + /** * Get primitive descriptor. */ @@ -64,12 +81,13 @@ public: mkldnn::memory::desc getMD() { return getPD().desc(); } /** - * Get dims. + * Get dimensions. */ mkldnn::memory::dims getDims() { + mkldnn::memory::desc md = getMD(); + const int* src = md.data.dims; + int ndims = md.data.ndims; mkldnn::memory::dims dst; - int* src = getMD().data.dims; - int ndims = getMD().data.ndims; dst.resize(ndims); for (int i = 0; i < ndims; ++i) { dst[i] = src[i]; @@ -85,11 +103,16 @@ public: } /** - * Update the memory data handle. - * Caution: This will not check the buffer size of the data, - * it should be coverd by user. + * Get memory data type. */ - void updateData(void* data) { set_data_handle(data); } + mkldnn::memory::data_type getDtype() { + return (mkldnn::memory::data_type)(getMD().data.data_type); + } + + /** + * Get engine. 
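+   * The engine is read back from the memory primitive descriptor, so the
+   * matrix can be handed to any primitive built on the same engine.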
+ */ + mkldnn::engine getEngine() { return getPD().get_engine(); } }; } // namespace paddle From 780c8d969e0d2d220df19a672c141ff7c44f53d2 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Wed, 23 Aug 2017 17:03:16 +0800 Subject: [PATCH 13/64] make downSpatial work, and remove hasSpatial_ --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 4 ---- paddle/gserver/layers/MKLDNNFcLayer.h | 5 +---- paddle/math/MKLDNNMatrix.cpp | 7 ++++++- 3 files changed, 7 insertions(+), 9 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index a3291e6a8f..a5555c4618 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -111,10 +111,6 @@ void MKLDNNFcLayer::reshape() { if (iw_ == 0) { iw_ = 1; } - hasSpatial_ = true; - if (ih_ == 1 && iw_ == 1) { - hasSpatial_ = false; - } CHECK_EQ(iLayerSize_, inputLayers_[0]->getSize()); ic_ = iLayerSize_ / (ih_ * iw_); CHECK_EQ(size_t(ic_ * ih_ * iw_), iLayerSize_) << "not divisible"; diff --git a/paddle/gserver/layers/MKLDNNFcLayer.h b/paddle/gserver/layers/MKLDNNFcLayer.h index 7954852a23..e2657a8d5e 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.h +++ b/paddle/gserver/layers/MKLDNNFcLayer.h @@ -32,16 +32,13 @@ protected: // if has already init the weight bool hasInitedWgt_; - // if input layer has image size info (ih>1 && iw>1) - bool hasSpatial_; - // fc weight and bias std::unique_ptr weight_; std::unique_ptr biases_; public: explicit MKLDNNFcLayer(const LayerConfig& config) - : MKLDNNLayer(config), hasInitedWgt_(false), hasSpatial_(true) {} + : MKLDNNLayer(config), hasInitedWgt_(false) {} ~MKLDNNFcLayer() {} diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp index 24d54ec0f7..94df9c1550 100644 --- a/paddle/math/MKLDNNMatrix.cpp +++ b/paddle/math/MKLDNNMatrix.cpp @@ -85,7 +85,12 @@ void MKLDNNMatrix::downSpatial() { memory::desc md = memory::desc(dstDims, getDtype(), dstFmt); memory::primitive_desc pd = memory::primitive_desc(md, getEngine()); void* data = getData(); - memory(pd, data); + mkldnn_primitive_t result; + mkldnn::error::wrap_c_api( + mkldnn_primitive_create(&result, pd.get(), nullptr, nullptr), + "could not create a memory primitive"); + reset(result); + set_data_handle(data); } } // namespace paddle From 161a15f055c2cbe1937522a7a11dbdeb31f1a774 Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Thu, 24 Aug 2017 03:11:54 +0000 Subject: [PATCH 14/64] gradient check --- python/paddle/v2/framework/tests/gradient_checker.py | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index c22c6f8831..d7809e52fb 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -86,6 +86,9 @@ def get_numeric_gradient(op, # we only compute gradient of one element each time. # we use a for loop to compute the gradient of every element. for i in xrange(tensor_size): + for var_name in input_values: + tensor_ = local_scope.find_var(var_name).get_tensor() + tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) # get one input element throw it's index i. origin = tensor_to_check.get_float_element(i) @@ -95,6 +98,9 @@ def get_numeric_gradient(op, y_pos = get_output() # plus delta to this element, run op and get the sum of the result tensor. 
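+        # The numeric gradient below is a central difference; as a worked
+        # example, assuming f(x) = x * x at x = 1.0 with delta = 0.005:
+        #     (f(x + delta) - f(x - delta)) / (2 * delta)
+        #         = (1.005**2 - 0.995**2) / 0.01 = 2.0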
+ for var_name in input_values: + tensor_ = local_scope.find_var(var_name).get_tensor() + tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) x_neg = origin - delta tensor_to_check.set_float_element(i, x_neg) y_neg = get_output() From 0dffe68ca9973c5cf7d95029e369330ffcfe0187 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Thu, 24 Aug 2017 23:45:17 +0800 Subject: [PATCH 15/64] Add NeonDepthwiseConvFunction. --- paddle/function/CMakeLists.txt | 2 + paddle/function/DepthwiseConvOpTest.cpp | 5 + paddle/function/Im2Col.h | 92 +++++++++ paddle/function/neon/NeonDepthwiseConv.cpp | 227 +++++++++++++++++++++ paddle/function/neon/NeonDepthwiseConv.h | 25 +++ paddle/function/neon/neon_util.h | 47 +++++ 6 files changed, 398 insertions(+) create mode 100644 paddle/function/neon/NeonDepthwiseConv.cpp create mode 100644 paddle/function/neon/NeonDepthwiseConv.h create mode 100644 paddle/function/neon/neon_util.h diff --git a/paddle/function/CMakeLists.txt b/paddle/function/CMakeLists.txt index c572a9d433..05f808a6a1 100644 --- a/paddle/function/CMakeLists.txt +++ b/paddle/function/CMakeLists.txt @@ -21,6 +21,8 @@ if(USE_NNPACK) endif() endif() +list(APPEND cpp_files neon/NeonDepthwiseConv.cpp) + add_library(paddle_function STATIC ${cpp_files} ${cu_objs}) add_dependencies(paddle_function ${external_project_dependencies}) add_dependencies(paddle_function paddle_proto) diff --git a/paddle/function/DepthwiseConvOpTest.cpp b/paddle/function/DepthwiseConvOpTest.cpp index f44ae0c342..bdace2c372 100644 --- a/paddle/function/DepthwiseConvOpTest.cpp +++ b/paddle/function/DepthwiseConvOpTest.cpp @@ -34,4 +34,9 @@ TEST(DepthwiseConv, BackwardFilter) { } #endif +TEST(DepthwiseConv, Forward) { + DepthwiseConvolution( + "GemmConv-CPU", "NeonDepthwiseConv-CPU", forward); +} + } // namespace paddle diff --git a/paddle/function/Im2Col.h b/paddle/function/Im2Col.h index 48e2e32f92..9b91e223a6 100644 --- a/paddle/function/Im2Col.h +++ b/paddle/function/Im2Col.h @@ -16,6 +16,7 @@ limitations under the License. 
*/ #include "TensorShape.h" #include "TensorType.h" +#include "neon/neon_util.h" namespace paddle { @@ -93,4 +94,95 @@ public: int paddingWidth); }; +template +struct Padding { + static void run(const T* src, + T* dest, + int channels, + int inputHeight, + int inputWidth, + int paddingHeight, + int paddingWidth) { + const int destWidth = inputWidth + 2 * paddingWidth; + for (int c = 0; c < channels; c++) { + if (paddingHeight > 0) { + memset(dest, 0, destWidth * paddingHeight * sizeof(T)); + dest += destWidth * paddingHeight; + } + + for (int i = 0; i < inputHeight; i++) { + // padding head + for (int j = 0; j < paddingWidth; j++) { + *dest++ = T(0); + } + + memcpy(dest, src, inputWidth * sizeof(T)); + dest += inputWidth; + src += inputWidth; + + // padding tail + for (int j = 0; j < paddingWidth; j++) { + *dest++ = T(0); + } + } + + if (paddingHeight > 0) { + memset(dest, 0, destWidth * paddingHeight * sizeof(T)); + dest += destWidth * paddingHeight; + } + } + } +}; + +#if defined(__ARM_NEON__) || defined(__ARM_NEON) +template <> +struct Padding { + static void run(const float* src, + float* dest, + int channels, + int inputHeight, + int inputWidth, + int paddingHeight, + int paddingWidth) { + const int destWidth = inputWidth + 2 * paddingWidth; + for (int c = 0; c < channels; c++) { + if (paddingHeight > 0) { + memset(dest, 0, destWidth * paddingHeight * sizeof(float)); + dest += destWidth * paddingHeight; + } + + for (int i = 0; i < inputHeight; i++) { + // padding head + for (int j = 0; j < paddingWidth; j++) { + *dest++ = float(0); + } + + int step = inputWidth >> 2; + int remain = inputWidth & 3; + for (int s = 0; s < step; s++) { + float32x4_t s0 = vld1q_f32(src); + vst1q_f32(dest, s0); + src += 4; + dest += 4; + } + for (int r = 0; r < remain; r++) { + *dest++ = *src++; + } + + // padding tail + for (int j = 0; j < paddingWidth; j++) { + *dest++ = float(0); + } + } + + if (paddingHeight > 0) { + memset(dest, 0, destWidth * paddingHeight * sizeof(float)); + dest += destWidth * paddingHeight; + } + } + } +}; + +#endif + } // namespace paddle diff --git a/paddle/function/neon/NeonDepthwiseConv.cpp b/paddle/function/neon/NeonDepthwiseConv.cpp new file mode 100644 index 0000000000..16d94c976e --- /dev/null +++ b/paddle/function/neon/NeonDepthwiseConv.cpp @@ -0,0 +1,227 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "neon_util.h" +#include "paddle/function/ConvOp.h" +#include "paddle/function/Im2Col.h" + +namespace paddle { + +namespace neon { + +#if defined(__ARM_NEON__) || defined(__ARM_NEON) + +template +struct DepthwiseConvKernel {}; + +inline float32_t conv3x3(float32x4_t r0, + float32x4_t r1, + float32x4_t r2, + float32x4_t k0, + float32x4_t k1, + float32x4_t k2) { + float32x4_t tmp; + tmp = vmulq_f32(r0, k0); + tmp = vmlaq_f32(tmp, r1, k1); + tmp = vmlaq_f32(tmp, r2, k2); + return vaddvq_f32(tmp); +} + +/** + * Each step calculates four elements of the output. + * First step: + * R0[0, 1, 2, 3...] 
* K[0][0] + * R0[1, 2, 3, 4...] * K[0][1] + * R0[2, 3, 4, 5...] * K[0][2] + * R1[0, 1, 2, 3...] * K[1][0] + * R1[1, 2, 3, 4...] * K[1][1] + * R1[2, 3, 4, 5...] * K[1][2] + * R2[0, 1, 2, 3...] * K[2][0] + * R2[1, 2, 3, 4...] * K[2][1] + * + R2[2, 3, 4, 5...] * K[2][2] + * ------------------------------ + * Output[0, 1, 2, 3] + */ +template <> +struct DepthwiseConvKernel<3, 1> { + static void run(const float* inputData, + const float* filterData, + int inputHeight, + int inputWidth, + int outputChannels, + int outputHeight, + int outputWidth, + int filterMultiplier, + float* outputData) { + const int steps = outputWidth >> 2; + const int remain = outputWidth & 3; + for (int c = 0; c < outputChannels; c++, filterData += 9) { + // Load the filters + float32x4_t k[3]; + k[0] = vld1q_f32(filterData); + k[1] = vld1q_f32(filterData + 3); + k[2] = vld1q_f32(filterData + 6); + k[0] = vsetq_lane_f32(0.f, k[0], 3); + k[1] = vsetq_lane_f32(0.f, k[1], 3); + k[2] = vsetq_lane_f32(0.f, k[2], 3); + + const float* r0 = + inputData + (c / filterMultiplier) * (inputHeight * inputWidth); + const float* r1 = r0 + inputWidth; + const float* r2 = r0 + inputWidth * 2; + float32x4_t input[3][3]; + for (int h = 0; h < outputHeight; h++) { + for (int s = 0; s < steps; s++) { + // Load the inputs + float32x4_t tmp; + input[0][0] = vld1q_f32(r0); + tmp = vld1q_f32(r0 + 4); + input[0][1] = vextq_f32(input[0][0], tmp, 1); + input[0][2] = vextq_f32(input[0][0], tmp, 2); + input[1][0] = vld1q_f32(r1); + tmp = vld1q_f32(r1 + 4); + input[1][1] = vextq_f32(input[1][0], tmp, 1); + input[1][2] = vextq_f32(input[1][0], tmp, 2); + input[2][0] = vld1q_f32(r2); + tmp = vld1q_f32(r2 + 4); + input[2][1] = vextq_f32(input[2][0], tmp, 1); + input[2][2] = vextq_f32(input[2][0], tmp, 2); + + float32x4_t tmp1 = vdupq_n_f32(0.f); + float32x4_t tmp2 = vdupq_n_f32(0.f); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][0], k[0], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][1], k[0], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][2], k[0], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][0], k[1], 0); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][1], k[1], 1); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][2], k[1], 2); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][0], k[2], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][1], k[2], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][2], k[2], 2); + tmp1 = vaddq_f32(tmp1, tmp2); + + vst1q_f32(outputData, tmp1); + r0 += 4; + r1 += 4; + r2 += 4; + outputData += 4; + } + + for (int r = 0; r < remain; r++) { + float32x4_t i0 = vld1q_f32(r0); + float32x4_t i1 = vld1q_f32(r1); + float32x4_t i2 = vld1q_f32(r2); + *outputData = conv3x3(i0, i1, i2, k[0], k[1], k[2]); + r0++; + r1++; + r2++; + outputData++; + } + + r0 += 2; + r1 += 2; + r2 += 2; + } + } + } +}; + +template +class NeonDepthwiseConvFunction : public ConvFunctionBase { +public: + void init(const FuncConfig& config) override { + ConvFunctionBase::init(config); + } + + void check(const BufferArgs& inputs, const BufferArgs& outputs) override { + const TensorShape& input = inputs[0].shape(); + const TensorShape& filter = inputs[1].shape(); + const TensorShape& output = outputs[0].shape(); + checkShape(input, filter, output); + } + + void calc(const BufferArgs& inputs, const BufferArgs& outputs) override { + CHECK_EQ(numInputs_, inputs.size()); + CHECK_EQ(numOutputs_, outputs.size()); + check(inputs, outputs); + + const TensorShape& input = inputs[0].shape(); + const TensorShape& filter = inputs[1].shape(); + const TensorShape& output = outputs[0].shape(); + + size_t batchSize = input[0]; + 
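+    // Shapes follow NCHW; in a depthwise convolution each input channel is
+    // convolved with its own filterMultiplier filters, so outputChannels ==
+    // inputChannels * filterMultiplier (and groups_ == inputChannels).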
size_t inputChannels = input[1]; + size_t inputHeight = input[2]; + size_t inputWidth = input[3]; + size_t filterHeight = getFilterHeight(filter); + size_t filterWidth = getFilterWidth(filter); + size_t outputChannels = output[1]; + size_t outputHeight = output[2]; + size_t outputWidth = output[3]; + size_t filterMultiplier = outputChannels / groups_; + CHECK_EQ(inputChannels, groups_); + + // only support + CHECK_EQ(strideH(), strideW()); + CHECK_EQ(filterHeight, filterWidth); + CHECK_EQ(filterHeight, size_t(3)); + CHECK_LT(strideH(), size_t(3)); + + float* inputData = inputs[0].data(); + float* filterData = inputs[1].data(); + float* outputData = outputs[0].data(); + + // padding the input + float* inputPadding = inputData; + if (paddingH() > 0 || paddingW() > 0) { + int newSize = batchSize * inputChannels * (inputHeight + 2 * paddingH()) * + (inputWidth + 2 * paddingW()); + resizeBuffer(newSize); + inputPadding = reinterpret_cast(memory_->getBuf()); + Padding::run(inputData, + inputPadding, + batchSize * inputChannels, + inputHeight, + inputWidth, + paddingH(), + paddingW()); + + // height and width of padding data + inputHeight += 2 * paddingH(); + inputWidth += 2 * paddingW(); + } + + for (size_t i = 0; i < batchSize; i++) { + DepthwiseConvKernel<3, 1>::run(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); + + inputPadding += inputChannels * inputHeight * inputWidth; + outputData += outputChannels * outputHeight * outputWidth; + } + } +}; + +REGISTER_TYPED_FUNC(NeonDepthwiseConv, CPU, NeonDepthwiseConvFunction); + +#endif + +} // namespace neon +} // namespace paddle diff --git a/paddle/function/neon/NeonDepthwiseConv.h b/paddle/function/neon/NeonDepthwiseConv.h new file mode 100644 index 0000000000..23e4be1921 --- /dev/null +++ b/paddle/function/neon/NeonDepthwiseConv.h @@ -0,0 +1,25 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +namespace paddle { + +namespace neon { + +template +struct DepthwiseConvKernel {}; + +} // namespace neon +} // namespace paddle diff --git a/paddle/function/neon/neon_util.h b/paddle/function/neon/neon_util.h new file mode 100644 index 0000000000..56b3febe2d --- /dev/null +++ b/paddle/function/neon/neon_util.h @@ -0,0 +1,47 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#if defined(__ARM_NEON__) || defined(__ARM_NEON) + +#include + +namespace paddle { + +namespace neon { + +inline float32x4_t vld1q_f32_aligned(const float* p) { + return vld1q_f32( + (const float*)__builtin_assume_aligned(p, sizeof(float32x4_t))); +} + +#ifndef __aarch64__ +inline float32_t vaddvq_f32(float32x4_t a) { + float32x2_t v = vadd_f32(vget_high_f32(a), vget_low_f32(a)); + return vget_lane_f32(vpadd_f32(v, v), 0); +} + +inline float32x4_t vmlaq_laneq_f32(float32x4_t a, + float32x4_t b, + float32x4_t v, + const int lane) { + return vmlaq_n_f32(a, b, vgetq_lane_f32(v, lane)); +} +#endif + +} // namespace neon +} // namespace paddle + +#endif From b7885b087b74a1ab446f8f34d1fd78085d8b4316 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Fri, 25 Aug 2017 00:47:51 +0800 Subject: [PATCH 16/64] Add DepthwiseConvKernel for filter size is 4. --- paddle/function/neon/NeonDepthwiseConv.cpp | 155 +++++++++++++++++++-- 1 file changed, 145 insertions(+), 10 deletions(-) diff --git a/paddle/function/neon/NeonDepthwiseConv.cpp b/paddle/function/neon/NeonDepthwiseConv.cpp index 16d94c976e..c017241c92 100644 --- a/paddle/function/neon/NeonDepthwiseConv.cpp +++ b/paddle/function/neon/NeonDepthwiseConv.cpp @@ -38,6 +38,22 @@ inline float32_t conv3x3(float32x4_t r0, return vaddvq_f32(tmp); } +inline float32_t conv4x4(float32x4_t r0, + float32x4_t r1, + float32x4_t r2, + float32x4_t r3, + float32x4_t k0, + float32x4_t k1, + float32x4_t k2, + float32x4_t k3) { + float32x4_t tmp; + tmp = vmulq_f32(r0, k0); + tmp = vmlaq_f32(tmp, r1, k1); + tmp = vmlaq_f32(tmp, r2, k2); + tmp = vmlaq_f32(tmp, r3, k3); + return vaddvq_f32(tmp); +} + /** * Each step calculates four elements of the output. * First step: @@ -137,6 +153,114 @@ struct DepthwiseConvKernel<3, 1> { } }; +/** + * Each step calculates four elements of the output. 
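+ * Unlike the 3x3 case, each 4-wide filter row fills a NEON q-register
+ * exactly; vextq_f32 builds the shifted input windows, so one iteration
+ * produces Output[0..3] without reloading the filter.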
+ */ +template <> +struct DepthwiseConvKernel<4, 1> { + static void run(const float* inputData, + const float* filterData, + int inputHeight, + int inputWidth, + int outputChannels, + int outputHeight, + int outputWidth, + int filterMultiplier, + float* outputData) { + const int steps = outputWidth >> 2; + const int remain = outputWidth & 3; + for (int c = 0; c < outputChannels; c++, filterData += 16) { + // Load the filters + float32x4_t k[4]; + k[0] = vld1q_f32(filterData); + k[1] = vld1q_f32(filterData + 4); + k[2] = vld1q_f32(filterData + 8); + k[3] = vld1q_f32(filterData + 12); + + const float* r0 = + inputData + (c / filterMultiplier) * (inputHeight * inputWidth); + const float* r1 = r0 + inputWidth; + const float* r2 = r0 + inputWidth * 2; + const float* r3 = r0 + inputWidth * 3; + float32x4_t input[4][4]; + for (int h = 0; h < outputHeight; h++) { + for (int s = 0; s < steps; s++) { + // Load the inputs + float32x4_t tmp; + input[0][0] = vld1q_f32(r0); + tmp = vld1q_f32(r0 + 4); + input[0][1] = vextq_f32(input[0][0], tmp, 1); + input[0][2] = vextq_f32(input[0][0], tmp, 2); + input[0][3] = vextq_f32(input[0][0], tmp, 3); + + input[1][0] = vld1q_f32(r1); + tmp = vld1q_f32(r1 + 4); + input[1][1] = vextq_f32(input[1][0], tmp, 1); + input[1][2] = vextq_f32(input[1][0], tmp, 2); + input[1][3] = vextq_f32(input[1][0], tmp, 3); + + input[2][0] = vld1q_f32(r2); + tmp = vld1q_f32(r2 + 4); + input[2][1] = vextq_f32(input[2][0], tmp, 1); + input[2][2] = vextq_f32(input[2][0], tmp, 2); + input[2][3] = vextq_f32(input[2][0], tmp, 3); + + input[3][0] = vld1q_f32(r3); + tmp = vld1q_f32(r3 + 4); + input[3][1] = vextq_f32(input[3][0], tmp, 1); + input[3][2] = vextq_f32(input[3][0], tmp, 2); + input[3][3] = vextq_f32(input[3][0], tmp, 3); + + float32x4_t tmp1 = vdupq_n_f32(0.f); + float32x4_t tmp2 = vdupq_n_f32(0.f); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][0], k[0], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][1], k[0], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][2], k[0], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][3], k[0], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][0], k[1], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][1], k[1], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][2], k[1], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][3], k[1], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][0], k[2], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][1], k[2], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][2], k[2], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][3], k[2], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[3][0], k[3], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[3][1], k[3], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[3][2], k[3], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[3][3], k[3], 3); + tmp1 = vaddq_f32(tmp1, tmp2); + + vst1q_f32(outputData, tmp1); + r0 += 4; + r1 += 4; + r2 += 4; + r3 += 4; + outputData += 4; + } + + for (int r = 0; r < remain; r++) { + float32x4_t i0 = vld1q_f32(r0); + float32x4_t i1 = vld1q_f32(r1); + float32x4_t i2 = vld1q_f32(r2); + float32x4_t i3 = vld1q_f32(r3); + *outputData = conv4x4(i0, i1, i2, i3, k[0], k[1], k[2], k[3]); + r0++; + r1++; + r2++; + r3++; + outputData++; + } + + r0 += 3; + r1 += 3; + r2 += 3; + r3 += 3; + } + } + } +}; + template class NeonDepthwiseConvFunction : public ConvFunctionBase { public: @@ -175,7 +299,6 @@ public: // only support CHECK_EQ(strideH(), strideW()); CHECK_EQ(filterHeight, filterWidth); - CHECK_EQ(filterHeight, size_t(3)); CHECK_LT(strideH(), size_t(3)); float* inputData = inputs[0].data(); @@ -203,15 +326,27 @@ public: } for (size_t i = 0; i < batchSize; 
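      // pick the kernel specialization matching the filter size (see below)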
i++) { - DepthwiseConvKernel<3, 1>::run(inputPadding, - filterData, - inputHeight, - inputWidth, - outputChannels, - outputHeight, - outputWidth, - filterMultiplier, - outputData); + if (filterWidth == 3) { + DepthwiseConvKernel<3, 1>::run(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); + } else if (filterWidth == 4) { + DepthwiseConvKernel<4, 1>::run(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); + } inputPadding += inputChannels * inputHeight * inputWidth; outputData += outputChannels * outputHeight * outputWidth; From c5e28dd1a0b4cb6f8ba74bc16760dc6cf32ad50e Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Thu, 24 Aug 2017 02:58:41 +0000 Subject: [PATCH 17/64] scatter check in --- paddle/operators/CMakeLists.txt | 1 + paddle/operators/scatter_op.cc | 76 +++++++++++++++++++ paddle/operators/scatter_op.cu | 20 +++++ paddle/operators/scatter_op.h | 60 +++++++++++++++ paddle/pybind/CMakeLists.txt | 1 + paddle/pybind/pybind.cc | 1 + .../paddle/v2/framework/tests/CMakeLists.txt | 1 + .../v2/framework/tests/test_gather_op.py | 3 - .../v2/framework/tests/test_scatter_op.py | 38 ++++++++++ 9 files changed, 198 insertions(+), 3 deletions(-) create mode 100644 paddle/operators/scatter_op.cc create mode 100644 paddle/operators/scatter_op.cu create mode 100644 paddle/operators/scatter_op.h create mode 100644 python/paddle/v2/framework/tests/test_scatter_op.py diff --git a/paddle/operators/CMakeLists.txt b/paddle/operators/CMakeLists.txt index f466dbc79a..f0fd12f1b5 100644 --- a/paddle/operators/CMakeLists.txt +++ b/paddle/operators/CMakeLists.txt @@ -47,6 +47,7 @@ cc_test(gather_test SRCS gather_test.cc DEPS tensor) op_library(gather_op SRCS gather_op.cc gather_op.cu) cc_test(scatter_test SRCS scatter_test.cc DEPS tensor) +op_library(scatter_op SRCS scatter_op.cc scatter_op.cu) cc_library(net_op SRCS net_op.cc DEPS op_registry) cc_test(net_op_test SRCS net_op_test.cc DEPS net_op) diff --git a/paddle/operators/scatter_op.cc b/paddle/operators/scatter_op.cc new file mode 100644 index 0000000000..cf01ef6279 --- /dev/null +++ b/paddle/operators/scatter_op.cc @@ -0,0 +1,76 @@ +/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/
+
+#include "paddle/operators/scatter_op.h"
+#include "paddle/framework/ddim.h"
+
+namespace paddle {
+namespace operators {
+
+class ScatterOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(const framework::InferShapeContext &ctx) const override {
+    framework::DDim output_dims(ctx.Input<Tensor>("Ref")->dims());
+    ctx.Output<Tensor>("Out")->Resize(output_dims);
+  }
+};
+
+class ScatterGradOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(const framework::InferShapeContext &ctx) const override {
+    auto Updates_grad = ctx.Output<Tensor>(framework::GradVarName("Updates"));
+    auto Updates = ctx.Input<Tensor>("Updates");
+    auto Ref_grad = ctx.Output<Tensor>(framework::GradVarName("Ref"));
+    auto Ref = ctx.Input<Tensor>("Ref");
+
+    Ref_grad->Resize(Ref->dims());
+    Updates_grad->Resize(Updates->dims());
+  }
+};
+
+class ScatterOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  ScatterOpMaker(framework::OpProto *proto,
+                 framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Ref", "The source input of scatter op");
+    AddInput("Index",
+             "The index input of scatter op where Ref will be updated");
+    AddInput("Updates", "The update values applied to Ref");
+    AddOutput("Out", "The output of scatter op");
+    AddComment(R"DOC(
+Scatter Operator updates the input Ref along the first axis:
+
+Out = Ref
+Out[Index] = Ref[Index] + Updates
+)DOC");
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(scatter, ops::ScatterOp, ops::ScatterOpMaker, scatter_grad,
+            ops::ScatterGradOp);
+REGISTER_OP_CPU_KERNEL(scatter,
+                       ops::ScatterOpKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    scatter_grad,
+    ops::ScatterGradientOpKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/scatter_op.cu b/paddle/operators/scatter_op.cu
new file mode 100644
index 0000000000..e6a6fa57d9
--- /dev/null
+++ b/paddle/operators/scatter_op.cu
@@ -0,0 +1,20 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/scatter_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(scatter,
+                       ops::ScatterOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/scatter_op.h b/paddle/operators/scatter_op.h
new file mode 100644
index 0000000000..c2db3ae37c
--- /dev/null
+++ b/paddle/operators/scatter_op.h
@@ -0,0 +1,60 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "gather.h"
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+#include "scatter.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+
+template <typename Place, typename T>
+class ScatterOpKernel : public framework::OpKernel {
+ public:
+  void Compute(const framework::ExecutionContext &ctx) const override {
+    auto *Ref = ctx.Input<Tensor>("Ref");
+    auto *Index = ctx.Input<Tensor>("Index");
+    auto *Updates = ctx.Input<Tensor>("Updates");
+    auto *Out = ctx.Output<Tensor>("Out");
+
+    // In place output: Out = Ref, Out[Index] += Updates
+    Out->ShareDataWith<T>(*Ref);
+    // Apply ScatterUpdate: Out[index] += Updates[:]
+    ScatterUpdate<T>(ctx.GetPlace(), Updates, Index, Out);
+  }
+};
+
+template <typename Place, typename T>
+class ScatterGradientOpKernel : public framework::OpKernel {
+ public:
+  void Compute(const framework::ExecutionContext &ctx) const override {
+    auto *dRef = ctx.Output<Tensor>(framework::GradVarName("Ref"));
+    auto *dUpdates = ctx.Output<Tensor>(framework::GradVarName("Updates"));
+    auto *Index = ctx.Input<Tensor>("Index");
+    auto *dO = ctx.Input<Tensor>(framework::GradVarName("Out"));
+
+    // In place gradient: dRef = dO
+    dRef->ShareDataWith<T>(*dO);
+    dUpdates->mutable_data<T>(ctx.GetPlace());
+    // Gradient by Gather: dUpdates += dO[Index]
+    Gather<T>(ctx.GetPlace(), dO, Index, dUpdates);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/pybind/CMakeLists.txt b/paddle/pybind/CMakeLists.txt
index abb9c248ee..37e186a408 100644
--- a/paddle/pybind/CMakeLists.txt
+++ b/paddle/pybind/CMakeLists.txt
@@ -4,6 +4,7 @@ cc_library(paddle_pybind SHARED
     DEPS pybind python backward
          sgd_op
          gather_op
+         scatter_op
          add_op
          mul_op
          rowwise_add_op
diff --git a/paddle/pybind/pybind.cc b/paddle/pybind/pybind.cc
index 8fa8be2cef..3bc150ccb7 100644
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -47,6 +47,7 @@ USE_OP(scale);
 USE_OP_ITSELF(identity);
 USE_OP(minus);
 USE_CPU_ONLY_OP(gather);
+USE_CPU_ONLY_OP(scatter);
 
 namespace paddle {
 namespace framework {
diff --git a/python/paddle/v2/framework/tests/CMakeLists.txt b/python/paddle/v2/framework/tests/CMakeLists.txt
index fb4686889a..661ebd8964 100644
--- a/python/paddle/v2/framework/tests/CMakeLists.txt
+++ b/python/paddle/v2/framework/tests/CMakeLists.txt
@@ -14,6 +14,7 @@ py_test(test_sigmoid_op SRCS test_sigmoid_op.py)
 py_test(test_softmax_op SRCS test_softmax_op.py)
 py_test(test_cross_entropy_op SRCS test_cross_entropy_op.py)
 py_test(test_gather_op SRCS test_gather_op.py)
+py_test(test_scatter_op SRCS test_scatter_op.py)
 py_test(test_fill_zeros_like_op SRCS test_fill_zeros_like_op.py)
 
 py_test(gradient_checker SRCS gradient_checker.py)
diff --git a/python/paddle/v2/framework/tests/test_gather_op.py b/python/paddle/v2/framework/tests/test_gather_op.py
index e868983042..e3de3fd0a1 100644
--- a/python/paddle/v2/framework/tests/test_gather_op.py
+++ b/python/paddle/v2/framework/tests/test_gather_op.py
@@ -21,12 +21,9 @@ class TestGatherOp(unittest.TestCase):
 
 class TestGatherGradOp(GradientChecker):
     def test_gather_grad(self):
-        print 'creating op'
         op = create_op("gather")
-        print 'creating op done'
         xnp = numpy.random.random((10, 20)).astype("float32")
         inputs = {'X': xnp, 'Index': numpy.array([1, 3, 5]).astype("int32")}
-        print 'correct before check gradient'
         self.check_grad(op, inputs, set("X"), "Out")
 
 
diff --git a/python/paddle/v2/framework/tests/test_scatter_op.py b/python/paddle/v2/framework/tests/test_scatter_op.py
new file mode 100644
index 
0000000000..e7696844d5 --- /dev/null +++ b/python/paddle/v2/framework/tests/test_scatter_op.py @@ -0,0 +1,38 @@ +import unittest +from op_test_util import OpTestMeta +from gradient_checker import GradientChecker, create_op +import numpy +import paddle.v2.framework.core as core +from paddle.v2.framework.op import Operator + + +class TestScatterOp(unittest.TestCase): + __metaclass__ = OpTestMeta + + def setUp(self): + self.type = "scatter" + ref_np = numpy.ones((3, 3)).astype("float32") + index_np = numpy.array([1, 2]).astype("int32") + updates_np = numpy.random.random((2, 3)).astype("float32") + output_np = numpy.copy(ref_np) + output_np[index_np] += updates_np + self.inputs = {'Ref': ref_np, 'Index': index_np, 'Updates': updates_np} + self.outputs = {'Out': output_np} + + +class TestScatterGradOp(GradientChecker): + def test_scatter_grad(self): + op = create_op("scatter") + # test data setup + ref_np = numpy.ones((3, 10)).astype("float32") + index_np = numpy.array([1, 2]).astype("int32") + updates_np = numpy.random.random((2, 10)).astype("float32") + output_np = numpy.copy(ref_np) + output_np[index_np] += updates_np + inputs = {'Ref': ref_np, 'Index': index_np, 'Updates': updates_np} + # check gradient + self.check_grad(op, inputs, set(["Updates", "Ref"]), "Out") + + +if __name__ == "__main__": + unittest.main() From 2edb69a7891c63c1d1e8c760e53eed0c3537dfbf Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Fri, 25 Aug 2017 00:44:29 +0000 Subject: [PATCH 18/64] with in-place option --- .../v2/framework/tests/gradient_checker.py | 21 ++++++++++++------- .../v2/framework/tests/test_scatter_op.py | 3 ++- 2 files changed, 15 insertions(+), 9 deletions(-) diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index 8eb9f3f073..ac37671c77 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -32,7 +32,8 @@ def get_numeric_gradient(op, output_name, input_to_check, delta=0.005, - local_scope=None): + local_scope=None, + in_place=False): """ Get Numeric Gradient for an operator's input. @@ -90,9 +91,10 @@ def get_numeric_gradient(op, # we only compute gradient of one element each time. # we use a for loop to compute the gradient of every element. for i in xrange(tensor_size): - for var_name in input_values: - tensor_ = local_scope.find_var(var_name).get_tensor() - tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) + if in_place: + for var_name in input_values: + tensor_ = local_scope.find_var(var_name).get_tensor() + tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) # get one input element throw it's index i. origin = tensor_to_check.get_float_element(i) @@ -102,9 +104,10 @@ def get_numeric_gradient(op, y_pos = get_output() # plus delta to this element, run op and get the sum of the result tensor. 
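+        # Restoring the inputs is only required for in-place ops such as
+        # scatter, whose kernel writes straight into its Ref input; for
+        # ordinary ops the extra copies are skipped.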
- for var_name in input_values: - tensor_ = local_scope.find_var(var_name).get_tensor() - tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) + if in_place: + for var_name in input_values: + tensor_ = local_scope.find_var(var_name).get_tensor() + tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) x_neg = origin - delta tensor_to_check.set_float_element(i, x_neg) y_neg = get_output() @@ -257,6 +260,7 @@ class GradientChecker(unittest.TestCase): output_name, no_grad_set=None, only_cpu=False, + in_place=False, max_relative_error=0.005): """ :param forward_op: used to create backward_op @@ -289,7 +293,8 @@ class GradientChecker(unittest.TestCase): # get numerical gradients numeric_grads = [ - get_numeric_gradient(forward_op, input_vars, output_name, name) + get_numeric_gradient( + forward_op, input_vars, output_name, name, in_place=in_place) for name in inputs_to_check ] diff --git a/python/paddle/v2/framework/tests/test_scatter_op.py b/python/paddle/v2/framework/tests/test_scatter_op.py index e7696844d5..861fe6cf89 100644 --- a/python/paddle/v2/framework/tests/test_scatter_op.py +++ b/python/paddle/v2/framework/tests/test_scatter_op.py @@ -31,7 +31,8 @@ class TestScatterGradOp(GradientChecker): output_np[index_np] += updates_np inputs = {'Ref': ref_np, 'Index': index_np, 'Updates': updates_np} # check gradient - self.check_grad(op, inputs, set(["Updates", "Ref"]), "Out") + self.check_grad( + op, inputs, set(["Updates", "Ref"]), "Out", in_place=True) if __name__ == "__main__": From f22ece9273b54f1a248f7a787e252eb04a5acea3 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Thu, 24 Aug 2017 19:44:19 -0700 Subject: [PATCH 19/64] Add a document on building using Docker --- Dockerfile | 4 +- doc/howto/dev/build_en.md | 83 ++++++++++++++++++++++++++++++++++ paddle/scripts/docker/build.sh | 6 +-- 3 files changed, 87 insertions(+), 6 deletions(-) create mode 100644 doc/howto/dev/build_en.md diff --git a/Dockerfile b/Dockerfile index 98f61ba586..136db772cc 100644 --- a/Dockerfile +++ b/Dockerfile @@ -10,13 +10,11 @@ RUN /bin/bash -c 'if [[ -n ${UBUNTU_MIRROR} ]]; then sed -i 's#http://archive.ub ARG WITH_GPU ARG WITH_AVX ARG WITH_DOC -ARG WITH_STYLE_CHECK ENV WOBOQ OFF -ENV WITH_GPU=${WITH_GPU:-OFF} +ENV WITH_GPU=${WITH_GPU:-ON} ENV WITH_AVX=${WITH_AVX:-ON} ENV WITH_DOC=${WITH_DOC:-OFF} -ENV WITH_STYLE_CHECK=${WITH_STYLE_CHECK:-OFF} ENV HOME /root # Add bash enhancements diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md new file mode 100644 index 0000000000..80488a147d --- /dev/null +++ b/doc/howto/dev/build_en.md @@ -0,0 +1,83 @@ +# Build PaddlePaddle from Source Code and Run Unit Test + +## What Developers Need + +To contribute to PaddlePaddle, you need + +1. A computer -- Linux, BSD, Windows, MacOS, and +1. Docker. + +Nothing else. Not even Python and GCC, because you can install all build tools into a Docker image. + +## General Process + +1. Retrieve source code. + + ```bash + git clone https://github.com/paddlepaddle/paddle + ``` + +2. Install build tools. + + ```bash + cd paddle; docker build -t paddle:dev . + ``` + +3. Build from source. + + ```bash + docker run -v $PWD:/paddle paddle:dev + ``` + +4. Run unit tests. + + ```bash + docker run -v $PWD:/paddle paddle:dev "cd/build; ctest" + ``` + + +## Docker, Or Not? + +- What is Docker? + + If you haven't heard of it, consider it something like Python's virtualenv. + +- Docker or virtual machine? 
+ + Some people compare Docker with VMs, but Docker doesn't virtualize any hardware, and it doesn't run a guest OS. + +- Why Docker? + + Using a Docker image of build tools standardize the building environment, and easier for others to reproduce your problem, if there is any, and help. + + Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want. + +- Can I don't use Docker? + + Sure, you don't have to install build tools into a Docker image; instead, you can install them onto your local computer. This document exists because Docker would make the development way easier. + +- How difficult is it to learn Docker? + + It takes you ten minutes to read https://docs.docker.com/get-started/ and saves you more than one hour to install all required build tools, configure them, and upgrade them when new versions of PaddlePaddle require some new tools. + +- Docker requires sudo + + An owner of a computer has the administrative privilege, a.k.a., sudo. If you use a shared computer for development, please ask the administrator to install and configure Docker. We will do our best to support rkt, another container technology that doesn't require sudo. + +- Can I use my favorite IDE? + + Yes, of course. The source code resides on your local computer, and you can edit it using whatever editor you like. + + Many PaddlePaddle developers are using Emacs. They add the following few lines into their `~/.emacs` configure file: + + ```emacs + (global-set-key "\C-cc" 'compile) + (setq compile-command + "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev") + ``` + + so they could type `Ctrl-C` and `c` to build PaddlePaddle from source. + +- How many parallel building processes does the Docker container run? + + Our building Docker image runs a Bash script https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh, which calls `make -j$(nproc)` to starts as many processes as the number of your processors. diff --git a/paddle/scripts/docker/build.sh b/paddle/scripts/docker/build.sh index 2941662f34..7bab814ae8 100644 --- a/paddle/scripts/docker/build.sh +++ b/paddle/scripts/docker/build.sh @@ -38,7 +38,7 @@ Configuring cmake in /paddle/build ... -DWITH_SWIG_PY=${WITH_SWIG_PY:-ON} -DCUDNN_ROOT=/usr/ -DWITH_STYLE_CHECK=${WITH_STYLE_CHECK:-OFF} - -DWITH_TESTING=${WITH_TESTING:-OFF} + -DWITH_TESTING=${WITH_TESTING:-ON} -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ======================================== EOF @@ -56,8 +56,8 @@ cmake .. \ -DWITH_C_API=${WITH_C_API:-OFF} \ -DWITH_PYTHON=${WITH_PYTHON:-ON} \ -DCUDNN_ROOT=/usr/ \ - -DWITH_STYLE_CHECK=${WITH_STYLE_CHECK:-OFF} \ - -DWITH_TESTING=${WITH_TESTING:-OFF} \ + -DWITH_STYLE_CHECK=${WITH_STYLE_CHECK:-ON} \ + -DWITH_TESTING=${WITH_TESTING:-ON} \ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON cat < Date: Fri, 25 Aug 2017 11:36:38 +0800 Subject: [PATCH 20/64] Neon depthwise conv with filterSize = 3 and stride = 2. --- paddle/function/neon/NeonDepthwiseConv.cpp | 115 ++++++++++++++++++++- 1 file changed, 114 insertions(+), 1 deletion(-) diff --git a/paddle/function/neon/NeonDepthwiseConv.cpp b/paddle/function/neon/NeonDepthwiseConv.cpp index c017241c92..53d14d9833 100644 --- a/paddle/function/neon/NeonDepthwiseConv.cpp +++ b/paddle/function/neon/NeonDepthwiseConv.cpp @@ -153,6 +153,109 @@ struct DepthwiseConvKernel<3, 1> { } }; +/** + * Each step calculates four elements of the output. + * First step: + * R0[0, 2, 4, 6...] * K[0][0] + * R0[1, 3, 5, 7...] 
* K[0][1] + * R0[2, 4, 6, 8...] * K[0][2] + * R1[0, 2, 4, 6...] * K[1][0] + * R1[1, 3, 5, 7...] * K[1][1] + * R1[2, 4, 6, 8...] * K[1][2] + * R2[0, 2, 4, 6...] * K[2][0] + * R2[1, 3, 5, 7...] * K[2][1] + * R2[2, 4, 6, 8...] * K[2][2] + * ------------------------------ + * Output[0, 1, 2, 3] + */ +template <> +struct DepthwiseConvKernel<3, 2> { + static void run(const float* inputData, + const float* filterData, + int inputHeight, + int inputWidth, + int outputChannels, + int outputHeight, + int outputWidth, + int filterMultiplier, + float* outputData) { + const int steps = outputWidth >> 2; + const int remain = outputWidth & 3; + for (int c = 0; c < outputChannels; c++, filterData += 9) { + // Load the filters + float32x4_t k[3]; + k[0] = vld1q_f32(filterData); + k[1] = vld1q_f32(filterData + 3); + k[2] = vld1q_f32(filterData + 6); + k[0] = vsetq_lane_f32(0.f, k[0], 3); + k[1] = vsetq_lane_f32(0.f, k[1], 3); + k[2] = vsetq_lane_f32(0.f, k[2], 3); + + const float* start = + inputData + (c / filterMultiplier) * (inputHeight * inputWidth); + float32x4_t input[3][3]; + for (int h = 0; h < outputHeight; h++) { + const float* r0 = start + 2 * h * inputWidth; + const float* r1 = start + (2 * h + 1) * inputWidth; + const float* r2 = start + (2 * h + 2) * inputWidth; + for (int s = 0; s < steps; s++) { + // Load the inputs + float32x4_t data1; + float32x4x2_t data2; + + data2 = vld2q_f32(r0); + input[0][0] = data2.val[0]; + input[0][1] = data2.val[1]; + data1 = vld1q_f32(r0 + 8); + input[0][2] = vextq_f32(data2.val[0], data1, 1); + + data2 = vld2q_f32(r1); + input[1][0] = data2.val[0]; + input[1][1] = data2.val[1]; + data1 = vld1q_f32(r1 + 8); + input[1][2] = vextq_f32(data2.val[0], data1, 1); + + data2 = vld2q_f32(r2); + input[2][0] = data2.val[0]; + input[2][1] = data2.val[1]; + data1 = vld1q_f32(r2 + 8); + input[2][2] = vextq_f32(data2.val[0], data1, 1); + + float32x4_t tmp1 = vdupq_n_f32(0.f); + float32x4_t tmp2 = vdupq_n_f32(0.f); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][0], k[0], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][1], k[0], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][2], k[0], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][0], k[1], 0); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][1], k[1], 1); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][2], k[1], 2); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][0], k[2], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][1], k[2], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][2], k[2], 2); + tmp1 = vaddq_f32(tmp1, tmp2); + + vst1q_f32(outputData, tmp1); + r0 += 8; + r1 += 8; + r2 += 8; + outputData += 4; + } + + for (int r = 0; r < remain; r++) { + float32x4_t i0 = vld1q_f32(r0); + float32x4_t i1 = vld1q_f32(r1); + float32x4_t i2 = vld1q_f32(r2); + *outputData = conv3x3(i0, i1, i2, k[0], k[1], k[2]); + r0 += 2; + r1 += 2; + r2 += 2; + outputData++; + } + } + } + } +}; + /** * Each step calculates four elements of the output. 
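 * (This existing kernel handles the stride-1 case; the stride-2 kernels
 * added in this patch keep the same four-outputs-per-step scheme, but they
 * first deinterleave the even and odd input columns with vld2q_f32.)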
*/ @@ -326,7 +429,7 @@ public: } for (size_t i = 0; i < batchSize; i++) { - if (filterWidth == 3) { + if (filterWidth == 3 && strideH() == 1) { DepthwiseConvKernel<3, 1>::run(inputPadding, filterData, inputHeight, @@ -336,6 +439,16 @@ public: outputWidth, filterMultiplier, outputData); + } else if (filterWidth == 3 && strideH() == 2) { + DepthwiseConvKernel<3, 2>::run(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); } else if (filterWidth == 4) { DepthwiseConvKernel<4, 1>::run(inputPadding, filterData, From 9fdf3970d0de568db4a9a3b757335604430ca137 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Thu, 24 Aug 2017 20:37:39 -0700 Subject: [PATCH 21/64] Update unit test running and CUDA --- doc/howto/dev/build_en.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index 80488a147d..de0733f963 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -29,12 +29,25 @@ Nothing else. Not even Python and GCC, because you can install all build tools docker run -v $PWD:/paddle paddle:dev ``` + This builds a CUDA-enabled version and writes all binary outputs to directory `./build` of the local computer, other than the Docker container. If we want to build only the CPU part, we can type + + ```bash + docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev + ``` + 4. Run unit tests. + To run all unit tests using the first GPU of a node: + ```bash - docker run -v $PWD:/paddle paddle:dev "cd/build; ctest" + NV_GPU=0 nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" ``` + If we used `WITH_GPU=OFF` at build time, it generates only CPU-based unit tests, and we don't need nvidia-docker to run them. We can just run + + ```bash + docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" + ``` ## Docker, Or Not? From f00c4112d2ca1d42c60d154002b2347ba2de5cd9 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Fri, 25 Aug 2017 11:53:45 +0800 Subject: [PATCH 22/64] Neon depthwise conv with filterSize = 4 and stride = 2. --- paddle/function/neon/NeonDepthwiseConv.cpp | 122 ++++++++++++++++++++- 1 file changed, 121 insertions(+), 1 deletion(-) diff --git a/paddle/function/neon/NeonDepthwiseConv.cpp b/paddle/function/neon/NeonDepthwiseConv.cpp index 53d14d9833..3fe28b1de3 100644 --- a/paddle/function/neon/NeonDepthwiseConv.cpp +++ b/paddle/function/neon/NeonDepthwiseConv.cpp @@ -364,6 +364,116 @@ struct DepthwiseConvKernel<4, 1> { } }; +/** + * Each step calculates four elements of the output. 
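+ * First step:
+ * R0[0, 2, 4, 6...] * K[0][0]
+ * R0[1, 3, 5, 7...] * K[0][1]
+ * R0[2, 4, 6, 8...] * K[0][2]
+ * R0[3, 5, 7, 9...] * K[0][3]
+ * (and the same pattern for R1 with K[1], R2 with K[2], R3 with K[3])
+ * ------------------------------
+ * Output[0, 1, 2, 3]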
+ */ +template <> +struct DepthwiseConvKernel<4, 2> { + static void run(const float* inputData, + const float* filterData, + int inputHeight, + int inputWidth, + int outputChannels, + int outputHeight, + int outputWidth, + int filterMultiplier, + float* outputData) { + const int steps = outputWidth >> 2; + const int remain = outputWidth & 3; + for (int c = 0; c < outputChannels; c++, filterData += 16) { + // Load the filters + float32x4_t k[4]; + k[0] = vld1q_f32(filterData); + k[1] = vld1q_f32(filterData + 4); + k[2] = vld1q_f32(filterData + 8); + k[3] = vld1q_f32(filterData + 12); + + const float* start = + inputData + (c / filterMultiplier) * (inputHeight * inputWidth); + float32x4_t input[4][4]; + for (int h = 0; h < outputHeight; h++) { + const float* r0 = start + 2 * h * inputWidth; + const float* r1 = start + (2 * h + 1) * inputWidth; + const float* r2 = start + (2 * h + 2) * inputWidth; + const float* r3 = start + (2 * h + 3) * inputWidth; + for (int s = 0; s < steps; s++) { + // Load the inputs + float32x4x2_t data1; + float32x4x2_t data2; + + data1 = vld2q_f32(r0); + data2 = vld2q_f32(r0 + 8); + input[0][0] = data1.val[0]; + input[0][1] = data1.val[1]; + input[0][2] = vextq_f32(data1.val[0], data2.val[0], 1); + input[0][3] = vextq_f32(data1.val[1], data2.val[1], 1); + + data1 = vld2q_f32(r1); + data2 = vld2q_f32(r1 + 8); + input[1][0] = data1.val[0]; + input[1][1] = data1.val[1]; + input[1][2] = vextq_f32(data1.val[0], data2.val[0], 1); + input[1][3] = vextq_f32(data1.val[1], data2.val[1], 1); + + data1 = vld2q_f32(r2); + data2 = vld2q_f32(r2 + 8); + input[2][0] = data1.val[0]; + input[2][1] = data1.val[1]; + input[2][2] = vextq_f32(data1.val[0], data2.val[0], 1); + input[2][3] = vextq_f32(data1.val[1], data2.val[1], 1); + + data1 = vld2q_f32(r3); + data2 = vld2q_f32(r3 + 8); + input[3][0] = data1.val[0]; + input[3][1] = data1.val[1]; + input[3][2] = vextq_f32(data1.val[0], data2.val[0], 1); + input[3][3] = vextq_f32(data1.val[1], data2.val[1], 1); + + float32x4_t tmp1 = vdupq_n_f32(0.f); + float32x4_t tmp2 = vdupq_n_f32(0.f); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][0], k[0], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][1], k[0], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[0][2], k[0], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[0][3], k[0], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][0], k[1], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][1], k[1], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[1][2], k[1], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[1][3], k[1], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][0], k[2], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][1], k[2], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[2][2], k[2], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[2][3], k[2], 3); + tmp1 = vmlaq_laneq_f32(tmp1, input[3][0], k[3], 0); + tmp2 = vmlaq_laneq_f32(tmp2, input[3][1], k[3], 1); + tmp1 = vmlaq_laneq_f32(tmp1, input[3][2], k[3], 2); + tmp2 = vmlaq_laneq_f32(tmp2, input[3][3], k[3], 3); + tmp1 = vaddq_f32(tmp1, tmp2); + + vst1q_f32(outputData, tmp1); + r0 += 8; + r1 += 8; + r2 += 8; + r3 += 8; + outputData += 4; + } + + for (int r = 0; r < remain; r++) { + float32x4_t i0 = vld1q_f32(r0); + float32x4_t i1 = vld1q_f32(r1); + float32x4_t i2 = vld1q_f32(r2); + float32x4_t i3 = vld1q_f32(r3); + *outputData = conv4x4(i0, i1, i2, i3, k[0], k[1], k[2], k[3]); + r0 += 2; + r1 += 2; + r2 += 2; + r3 += 2; + outputData++; + } + } + } + } +}; + template class NeonDepthwiseConvFunction : public ConvFunctionBase { public: @@ -449,7 +559,7 @@ public: outputWidth, filterMultiplier, outputData); - } else if 
(filterWidth == 4) { + } else if (filterWidth == 4 && strideH() == 1) { DepthwiseConvKernel<4, 1>::run(inputPadding, filterData, inputHeight, @@ -459,6 +569,16 @@ public: outputWidth, filterMultiplier, outputData); + } else if (filterWidth == 4 && strideH() == 2) { + DepthwiseConvKernel<4, 2>::run(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); } inputPadding += inputChannels * inputHeight * inputWidth; From 1e61d91f24e9213ab43edc62cf2c6f9e47a62d1f Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Thu, 24 Aug 2017 21:38:13 -0700 Subject: [PATCH 23/64] Update index and add Chinese version --- doc/howto/dev/build_cn.md | 100 ++++++++++++++++++++++++++++++++++++++ doc/howto/dev/build_en.md | 6 ++- doc/howto/index_cn.rst | 1 + doc/howto/index_en.rst | 1 + 4 files changed, 107 insertions(+), 1 deletion(-) create mode 100644 doc/howto/dev/build_cn.md diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md new file mode 100644 index 0000000000..dc372de9fa --- /dev/null +++ b/doc/howto/dev/build_cn.md @@ -0,0 +1,100 @@ +# 编译PaddlePaddle和运行单元测试 + +## 需要的软硬件 + +为了开发PaddlePaddle,我们需要 + +1. 一台电脑,可以装的是 Linux, BSD, Windows 或者 MacOS 操作系统,以及 +1. Docker。 + +不需要其他任何软件了。即便是 Python 和 GCC 都不需要,因为我们会把所有编译工具都安装进一个 Docker image 里。 + +## 总体流程 + +1. 获取源码 + + ```bash + git clone https://github.com/paddlepaddle/paddle + ``` + +2. 安装工具 + + ```bash + cd paddle; docker build -t paddle:dev . + ``` + +3. 编译 + + ```bash + docker run -v $PWD:/paddle paddle:dev + ``` + + 这个命令编译出一个 CUDA-enabled 版本。所有二进制文件会被写到本机的 `./build` 目录,而不是写到 Docker container 里。如果我们只需要编译一个只支持 CPU 的版本,可以用 + + ```bash + docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev + ``` + +4. 运行单元测试 + + 用本机的第一个 GPU 来运行包括 GPU 单元测试在内的所有单元测试: + + ```bash + NV_GPU=0 nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" + ``` + + 如果编译的时候我们用了 `WITH_GPU=OFF` 选项,那么编译过程只会产生 CPU-based 单元测试,那么我们也就不需要 nvidia-docker 来运行单元测试了。我们只需要: + + ```bash + docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" + ``` + +## 为什么要 Docker 呀? + +- 什么是 Docker? + + 如果您没有听说 Docker,可以把它想象为一个类似 virtualenv 的系统,但是虚拟的不仅仅是 Python 的运行环境。 + +- Docker 还是虚拟机? + + 有人用虚拟机来类比 Docker。需要强调的是:Docker 不会虚拟任何硬件,Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的,性能和把编译工具安装在本机运行基本一样。 + +- 为什么用 Docker? + + 把工具和配置都安装在一个 Docker image 里可以标准化编译环境。这样如果遇到问题,其他人可以复现问题以便帮助。 + + 另外,对于习惯使用Windows和MacOS的开发者来说,使用Docker就不用配置交叉编译环境了。 + +- 我可以选择不用Docker吗? + + 当然可以。大家可以用把开发工具安装进入 Docker image 一样的方式,把这些工具安装到本机。这篇文档介绍基于 Docker 的开发流程,是因为这个流程比其他方法都更简便。 + +- 学习 Docker 有多难? + + 理解 Docker 并不难,大概花十分钟看一遍 https://zhuanlan.zhihu.com/p/19902938 即可。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。 + +- Docker 需要 sudo + + 如果用自己的电脑开发,自然也就有管理员权限(sudo)了。如果用公用的电脑开发,需要请管理员安装和配置好 Docker。此外,PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术,比如 rkt。 + +- 我可以用 IDE 吗? + + 当然可以,因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码,我们只需要配置 IDE 来调用 Docker 命令编译源码即可。 + + 很多 PaddlePaddle 开发者使用 Emacs。他们在自己的 `~/.emacs` 配置文件里加两行 + + ```emacs + (global-set-key "\C-cc" 'compile) + (setq compile-command + "docker run --rm -it -v $(git rev-parse --show-toplevel):/paddle paddle:dev") + ``` + + 就可以按 `Ctrl-C` 和 `c` 键来启动编译了。 + +- 可以并行编译吗? + + 是的。我们的 Docker image 运行一个 Bash 脚本 https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh 。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。 + +- Docker on Windows/MacOS? 
+ + Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考 https://github.com/PaddlePaddle/Paddle/issues/627 。 diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index de0733f963..640d126018 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -91,6 +91,10 @@ Nothing else. Not even Python and GCC, because you can install all build tools so they could type `Ctrl-C` and `c` to build PaddlePaddle from source. -- How many parallel building processes does the Docker container run? +- Does Docker do parallel building? Our building Docker image runs a Bash script https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh, which calls `make -j$(nproc)` to starts as many processes as the number of your processors. + +- Docker on Windows/MacOS? + + On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to https://github.com/PaddlePaddle/Paddle/issues/627 for details. diff --git a/doc/howto/index_cn.rst b/doc/howto/index_cn.rst index 26449a6365..0608aa3096 100644 --- a/doc/howto/index_cn.rst +++ b/doc/howto/index_cn.rst @@ -19,6 +19,7 @@ .. toctree:: :maxdepth: 1 + dev/build_cn.rst dev/write_docs_cn.rst dev/contribute_to_paddle_cn.md diff --git a/doc/howto/index_en.rst b/doc/howto/index_en.rst index 1fbfcd260b..1b6034be4e 100644 --- a/doc/howto/index_en.rst +++ b/doc/howto/index_en.rst @@ -18,6 +18,7 @@ Development .. toctree:: :maxdepth: 1 + dev/build_en.rst dev/new_layer_en.rst dev/contribute_to_paddle_en.md From aa28d046fb828814b9849aa1ebfc868be2db98f9 Mon Sep 17 00:00:00 2001 From: caoying03 Date: Fri, 25 Aug 2017 14:11:36 +0800 Subject: [PATCH 24/64] fix a bug of sequence_slice layer when batch_size=1 --- paddle/gserver/layers/SequenceSliceLayer.cpp | 18 ++++++++++-------- .../gserver/tests/test_SeqSliceLayerGrad.cpp | 4 +++- 2 files changed, 13 insertions(+), 9 deletions(-) diff --git a/paddle/gserver/layers/SequenceSliceLayer.cpp b/paddle/gserver/layers/SequenceSliceLayer.cpp index 5d72d37304..aab44c4646 100644 --- a/paddle/gserver/layers/SequenceSliceLayer.cpp +++ b/paddle/gserver/layers/SequenceSliceLayer.cpp @@ -130,6 +130,8 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts, CHECK(starts || ends) << "At least one of the start or end indices " << "should be given."; + bool hasSubseq = getInput(0).hasSubseq(); + outSeqStartPos_.resize(1, 0); outSubSeqStartPos_.resize(1, 0); selectedRows_.clear(); @@ -151,14 +153,13 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts, int seqLen = endPos - begPos + 1; CHECK_GT(seqLen, 0U); for (int m = begPos; m <= endPos; ++m) selectedRows_.push_back(m); - inputSeqInfoVec_.size() > 1 + hasSubseq ? 
outSubSeqStartPos_.push_back(outSubSeqStartPos_.back() + seqLen) : outSeqStartPos_.push_back(outSeqStartPos_.back() + seqLen); } rowIdx++; } - if (inputSeqInfoVec_.size() > 1) - outSeqStartPos_.push_back(outSubSeqStartPos_.back()); + if (hasSubseq) outSeqStartPos_.push_back(outSubSeqStartPos_.back()); } if (useGpu_) { @@ -175,7 +176,7 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts, output_.sequenceStartPositions->copyFrom( outSeqStartPos_.data(), outSeqStartPos_.size(), false); - if (inputSeqInfoVec_.size() > 1) { + if (hasSubseq) { ICpuGpuVector::resizeOrCreate( output_.subSequenceStartPositions, outSubSeqStartPos_.size(), false); output_.subSequenceStartPositions->copyFrom( @@ -203,10 +204,11 @@ void SequenceSliceLayer::forward(PassType passType) { } else copySliceIdsToCpu(); - // calculate the selected row indices in a batch, - // and build the output sequence information. - calSelectedRows(startIdsOnCpu_ ? startIdsOnCpu_ : nullptr, - endIdsOnCpu_ ? endIdsOnCpu_ : nullptr); + /* + * calculate the selected row indices in a batch, and build the output + * sequence information. + */ + calSelectedRows(startIdsOnCpu_, endIdsOnCpu_); resetOutput(selectedRows_.size(), getSize()); diff --git a/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp b/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp index d560ca650b..e1d4ae1617 100644 --- a/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp +++ b/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp @@ -30,6 +30,8 @@ const int MAX_SEQ_NUM = 17; const int MAX_SEQ_LEN = 23; const int MAX_BEAM_SIZE = 13; +const size_t SEED = (size_t)(time(NULL)); + vector randSampling(real range, int n) { CHECK_GE(range, n); vector num(range); @@ -46,7 +48,7 @@ void genSeqInfo(vector& seqStartPos, vector& subSeqStartPos) { seqStartPos.resize(1, 0); subSeqStartPos.resize(1, 0); - srand((size_t)(time(NULL))); + srand(SEED); int seqNum = 1 + (rand() % MAX_SEQ_NUM); for (int i = 0; i < seqNum; ++i) { int subSeqNum = 1 + (rand() % MAX_SEQ_NUM); From 4cc57836f393ada9b65cfeef444662afc34f1109 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Fri, 25 Aug 2017 17:20:28 +0800 Subject: [PATCH 25/64] enable reorder --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 39 +++++------------ paddle/math/MKLDNNMatrix.cpp | 57 +++++++++++++++++++++++++ paddle/math/MKLDNNMatrix.h | 33 ++++++++++++-- 3 files changed, 97 insertions(+), 32 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index a5555c4618..ad50c15a7d 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -61,39 +61,20 @@ void MKLDNNFcLayer::convertWeightsFromPaddle() { return; } - // TODO(TJ): dst format should get from wgtVal_ - int dstFmt = PARAM_FORMAT_MKLDNN_OI; - int srcFmt = weight_->getParameterPtr()->getHeaderFormat(); - if (srcFmt == dstFmt) { - return; - } - - // The weight_ is transposed from initial paddle weight - MatrixPtr paddleWgt = Matrix::create( - weight_->getW()->getData(), iLayerSize_, oc_, false, false); - - // TODO(TJ): remove this print when do not need differ weights - std::ostringstream ostr; - paddleWgt->print(ostr); - VLOG(MKLDNN_ALL) << "Initial Weight from paddle: " << std::endl << ostr.str(); - - // The mkldnn weight is transposed from initial paddle matrix - MatrixPtr paddleWgtT; - paddleWgt->transpose(paddleWgtT, true); - weight_->getW()->copyFrom(*paddleWgtT); - weight_->getParameterPtr()->setHeaderFormat(dstFmt); + CHECK(wgtVal_) << "should have been initialized"; + bool hasNoSpatial_ = 
ih_ == 1 && iw_ == 1; + auto targetDim = wgtVal_->getDims(); + auto srcFmt = hasNoSpatial_ ? memory::format::io : memory::format::ihwo; + wgtVal_->reorderDataFrom(wgtVal_, srcFmt, targetDim); hasInitedWgt_ = true; } void MKLDNNFcLayer::convertWeightsToPaddle() { - MatrixPtr dnnWgt = weight_->getW(); - MatrixPtr paddleWgt; - dnnWgt->transpose(paddleWgt, true); - - // copy paddle weight and override on weight_ - MatrixPtr dnnWgtT = Matrix::create( - dnnWgt->getData(), dnnWgt->getWidth(), dnnWgt->getHeight(), false, false); - dnnWgtT->copyFrom(*paddleWgt); + CHECK(wgtVal_) << "should have been initialized"; + bool hasNoSpatial_ = ih_ == 1 && iw_ == 1; + auto targetDim = wgtVal_->getDims(); + auto dstFmt = hasNoSpatial_ ? memory::format::io : memory::format::ihwo; + wgtVal_->reorderDataTo(wgtVal_, dstFmt, targetDim); } void MKLDNNFcLayer::reshape() { diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp index 94df9c1550..32ae3b1bcf 100644 --- a/paddle/math/MKLDNNMatrix.cpp +++ b/paddle/math/MKLDNNMatrix.cpp @@ -56,6 +56,63 @@ MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, return create(m, pd); } +void MKLDNNMatrix::reorderDataFrom(const MKLDNNMatrixPtr& m, + memory::format srcFmt, + memory::dims targetDim) { + memory::format dstFmt = getFormat(); + if (srcFmt == dstFmt) { + return; + } + CHECK_EQ(getElementCnt(), m->getElementCnt()) << "size should equal"; + real* srcData = getData(); + real* dstData = m->getData(); + reorderOnce(srcData, dstData, srcFmt, dstFmt, targetDim); +} + +void MKLDNNMatrix::reorderDataTo(const MKLDNNMatrixPtr& m, + memory::format dstFmt, + memory::dims targetDim) { + memory::format srcFmt = getFormat(); + if (srcFmt == dstFmt) { + return; + } + CHECK_EQ(getElementCnt(), m->getElementCnt()) << "size should equal"; + real* srcData = getData(); + real* dstData = m->getData(); + reorderOnce(srcData, dstData, srcFmt, dstFmt, targetDim); +} + +void MKLDNNMatrix::reorderOnce(void* srcData, + void* dstData, + memory::format srcFmt, + memory::format dstFmt, + memory::dims dm) { + CHECK(srcData); + CHECK(dstData); + MatrixPtr tmpSrc; + if (dstData == srcData) { + // inplace data + size_t sz = 1; + for (size_t i = 0; i < dm.size(); ++i) { + sz *= dm[i]; + } + tmpSrc = Matrix::create(sz, 1, false, false); + tmpSrc->copyFrom((real*)srcData, sz); + srcData = tmpSrc->getData(); + } + + auto dtype = this->getDtype(); + auto srcMD = memory::desc(dm, dtype, srcFmt); + auto dstMD = memory::desc(dm, dtype, dstFmt); + + auto eg = this->getEngine(); + auto src = memory(memory::primitive_desc(srcMD, eg), srcData); + auto dst = memory(memory::primitive_desc(dstMD, eg), dstData); + + auto r = reorder(src, dst); + stream(stream::kind::eager).submit({r}).wait(); +} + void MKLDNNMatrix::downSpatial() { int fmt = getFormat(); if (!(fmt == memory::format::nchw || fmt == memory::format::oihw)) { diff --git a/paddle/math/MKLDNNMatrix.h b/paddle/math/MKLDNNMatrix.h index 05adc867c2..ea3fd7d461 100644 --- a/paddle/math/MKLDNNMatrix.h +++ b/paddle/math/MKLDNNMatrix.h @@ -21,9 +21,6 @@ limitations under the License. */ namespace paddle { -static const std::map PARAM_FOARMAT_MAP = - {{mkldnn::memory::format::oi, PARAM_FORMAT_MKLDNN_OI}}; - class MKLDNNMatrix; typedef std::shared_ptr MKLDNNMatrixPtr; @@ -57,6 +54,26 @@ public: mkldnn::memory::data_type dtype = mkldnn::memory::data_type::f32); public: + /** + * Reorder this MKLDNNMatrix from other format. + * Support inplace reorder + * Pay attention: this function would only reorder the data layout. 
+ * will NOT change this original dim or format info + */ + void reorderDataFrom(const MKLDNNMatrixPtr& m, + memory::format srcFmt, + memory::dims targetDim); + + /** + * Reorder this MKLDNNMatrix to other format. + * Support inplace reorder + * Pay attention: this function would only reorder the data layout. + * will NOT change the dst dim or format info + */ + void reorderDataTo(const MKLDNNMatrixPtr& m, + memory::format dstFmt, + memory::dims targetDim); + /** * Dimensionality reduction. * Change format "nchw --> nc" or "oihw --> oi" if the h and w are both 1 @@ -113,6 +130,16 @@ public: * Get engine. */ mkldnn::engine getEngine() { return getPD().get_engine(); } + +protected: + /** + * Do once reorder supported inplace. + */ + void reorderOnce(void* srcData, + void* dstData, + memory::format srcFmt, + memory::format dstFmt, + memory::dims dm); }; } // namespace paddle From c8d0c9af865cd0ac47d1cd7461c24793d833eeff Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Fri, 25 Aug 2017 11:24:48 -0700 Subject: [PATCH 26/64] In response to comments from Luo Tao --- doc/howto/dev/build_cn.md | 6 +++--- doc/howto/dev/build_en.md | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md index dc372de9fa..7c95579636 100644 --- a/doc/howto/dev/build_cn.md +++ b/doc/howto/dev/build_cn.md @@ -71,7 +71,7 @@ - 学习 Docker 有多难? - 理解 Docker 并不难,大概花十分钟看一遍 https://zhuanlan.zhihu.com/p/19902938 即可。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。 + 理解 Docker 并不难,大概花十分钟看一下[这篇文章](https://zhuanlan.zhihu.com/p/19902938)。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。 - Docker 需要 sudo @@ -93,8 +93,8 @@ - 可以并行编译吗? - 是的。我们的 Docker image 运行一个 Bash 脚本 https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh 。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。 + 是的。我们的 Docker image 运行一个 [Bash 脚本](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。 - Docker on Windows/MacOS? - Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考 https://github.com/PaddlePaddle/Paddle/issues/627 。 + Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考[这个issue](https://github.com/PaddlePaddle/Paddle/issues/627)。 diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index 640d126018..3be2405ea7 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -71,7 +71,7 @@ Nothing else. Not even Python and GCC, because you can install all build tools - How difficult is it to learn Docker? - It takes you ten minutes to read https://docs.docker.com/get-started/ and saves you more than one hour to install all required build tools, configure them, and upgrade them when new versions of PaddlePaddle require some new tools. + It takes you ten minutes to read [an introductory article](https://docs.docker.com/get-started) and saves you more than one hour to install all required build tools, configure them, and upgrade them when new versions of PaddlePaddle require some new tools. - Docker requires sudo @@ -93,8 +93,8 @@ Nothing else. Not even Python and GCC, because you can install all build tools - Does Docker do parallel building? 
- Our building Docker image runs a Bash script https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh, which calls `make -j$(nproc)` to starts as many processes as the number of your processors. + Our building Docker image runs a [Bash script](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh), which calls `make -j$(nproc)` to starts as many processes as the number of your processors. - Docker on Windows/MacOS? - On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to https://github.com/PaddlePaddle/Paddle/issues/627 for details. + On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to [this issue](https://github.com/PaddlePaddle/Paddle/issues/627) for details. From f71f3935e3ce05a8e90edc971f5ab08d71ed2966 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Fri, 25 Aug 2017 11:51:53 -0700 Subject: [PATCH 27/64] In response to comments from Chen Xi --- doc/howto/dev/build_cn.md | 20 +++++++++++++------- doc/howto/dev/build_en.md | 34 ++++++++++++++++++++-------------- 2 files changed, 33 insertions(+), 21 deletions(-) diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md index 7c95579636..0077d90118 100644 --- a/doc/howto/dev/build_cn.md +++ b/doc/howto/dev/build_cn.md @@ -23,13 +23,17 @@ cd paddle; docker build -t paddle:dev . ``` + 请注意这个命令结尾处的 `.`;它表示 `docker build` 应该读取当前目录下的 [`Dockerfile`文件](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile),按照其内容创建一个名为 `paddle:dev` 的 Docker image,并且把各种开发工具安装进去。 + 3. 编译 + 以下命令启动一个 Docker container 来执行 `paddle:dev` 这个 Docker image,同时把当前目录(源码树根目录)映射为 container 里的 `/paddle` 目录,并且运行 `Dockerfile` 描述的默认入口程序 [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `cmake` 和 `make` 来编译 `/paddle` 里的源码,结果输出到 `/paddle/build`,也就是本地的源码树根目录里的 `build` 子目录。 + ```bash docker run -v $PWD:/paddle paddle:dev ``` - 这个命令编译出一个 CUDA-enabled 版本。所有二进制文件会被写到本机的 `./build` 目录,而不是写到 Docker container 里。如果我们只需要编译一个只支持 CPU 的版本,可以用 + 上述命令编译出一个 CUDA-enabled 版本。如果我们只需要编译一个只支持 CPU 的版本,可以用 ```bash docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev @@ -57,7 +61,7 @@ - Docker 还是虚拟机? - 有人用虚拟机来类比 Docker。需要强调的是:Docker 不会虚拟任何硬件,Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的,性能和把编译工具安装在本机运行基本一样。 + 有人用虚拟机来类比 Docker。需要强调的是:Docker 不会虚拟任何硬件,Docker container 里运行的编译工具实际上都是在本机的 CPU 和操作系统上直接运行的,性能和把编译工具安装在本机运行一样。 - 为什么用 Docker? @@ -73,10 +77,6 @@ 理解 Docker 并不难,大概花十分钟看一下[这篇文章](https://zhuanlan.zhihu.com/p/19902938)。这可以帮您省掉花一小时安装和配置各种开发工具,以及切换机器时需要新安装的辛苦。别忘了 PaddlePaddle 更新可能导致需要新的开发工具。更别提简化问题复现带来的好处了。 -- Docker 需要 sudo - - 如果用自己的电脑开发,自然也就有管理员权限(sudo)了。如果用公用的电脑开发,需要请管理员安装和配置好 Docker。此外,PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术,比如 rkt。 - - 我可以用 IDE 吗? 当然可以,因为源码就在本机上。IDE 默认调用 make 之类的程序来编译源码,我们只需要配置 IDE 来调用 Docker 命令编译源码即可。 @@ -95,6 +95,12 @@ 是的。我们的 Docker image 运行一个 [Bash 脚本](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `make -j$(nproc)` 来启动和 CPU 核一样多的进程来并行编译。 -- Docker on Windows/MacOS? 
+## 可能碰到的问题 + +- Docker 需要 sudo + + 如果用自己的电脑开发,自然也就有管理员权限(sudo)了。如果用公用的电脑开发,需要请管理员安装和配置好 Docker。此外,PaddlePaddle 项目在努力开始支持其他不需要 sudo 的集装箱技术,比如 rkt。 + +- 在 Windows/MacOS 上编译很慢 Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考[这个issue](https://github.com/PaddlePaddle/Paddle/issues/627)。 diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index 3be2405ea7..95752beba0 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -7,7 +7,7 @@ To contribute to PaddlePaddle, you need 1. A computer -- Linux, BSD, Windows, MacOS, and 1. Docker. -Nothing else. Not even Python and GCC, because you can install all build tools into a Docker image. +Nothing else. Not even Python and GCC, because you can install all build tools into a Docker image. We run all the tools by running this image. ## General Process @@ -17,19 +17,23 @@ Nothing else. Not even Python and GCC, because you can install all build tools git clone https://github.com/paddlepaddle/paddle ``` -2. Install build tools. +2. Install build tools into a Docker image. ```bash cd paddle; docker build -t paddle:dev . ``` + Please be aware of the `.` at the end of the command, which refers to the [`./Dockerfile` file](https://github.com/PaddlePaddle/Paddle/blob/develop/Dockerfile). `docker build` follows instructions in this file to create a Docker image named `paddle:dev`, and installs building tools into it. + 3. Build from source. + This following command starts a Docker container that executes the Docker image `paddle:dev`, mapping the current directory to `/paddle/` in the container, and runs the default entry-point [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh) as specified in the Dockefile. `build.sh` invokes `cmake` and `make` to build PaddlePaddle source code, which had been mapped to `/paddle`, and writes outputs to `/paddle/build`, which maps to `build` in the current source directory on the computer. + ```bash docker run -v $PWD:/paddle paddle:dev ``` - This builds a CUDA-enabled version and writes all binary outputs to directory `./build` of the local computer, other than the Docker container. If we want to build only the CPU part, we can type + Above command builds a CUDA-enabled version. If we want to build a CPU-only version, we can type ```bash docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev @@ -57,25 +61,21 @@ Nothing else. Not even Python and GCC, because you can install all build tools - Docker or virtual machine? - Some people compare Docker with VMs, but Docker doesn't virtualize any hardware, and it doesn't run a guest OS. + Some people compare Docker with VMs, but Docker doesn't virtualize any hardware nor running a guest OS, which means there is no compromise on the performance. - Why Docker? - Using a Docker image of build tools standardize the building environment, and easier for others to reproduce your problem, if there is any, and help. + Using a Docker image of build tools standardizes the building environment, which makes it easier for others to reproduce your problems and to help. Also, some build tools don't run on Windows or Mac or BSD, but Docker runs almost everywhere, so developers can use whatever computer they want. -- Can I don't use Docker? +- Can I choose not to use Docker? - Sure, you don't have to install build tools into a Docker image; instead, you can install them onto your local computer. This document exists because Docker would make the development way easier. 
+ Sure, you don't have to install build tools into a Docker image; instead, you can install them in your local computer. This document exists because Docker would make the development way easier. - How difficult is it to learn Docker? - It takes you ten minutes to read [an introductory article](https://docs.docker.com/get-started) and saves you more than one hour to install all required build tools, configure them, and upgrade them when new versions of PaddlePaddle require some new tools. - -- Docker requires sudo - - An owner of a computer has the administrative privilege, a.k.a., sudo. If you use a shared computer for development, please ask the administrator to install and configure Docker. We will do our best to support rkt, another container technology that doesn't require sudo. + It takes you ten minutes to read [an introductory article](https://docs.docker.com/get-started) and saves you more than one hour to install all required build tools, configure them, especially when new versions of PaddlePaddle require some new tools. Not even to mention the time saved when other people trying to reproduce the issue you have. - Can I use my favorite IDE? @@ -93,8 +93,14 @@ Nothing else. Not even Python and GCC, because you can install all build tools - Does Docker do parallel building? - Our building Docker image runs a [Bash script](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh), which calls `make -j$(nproc)` to starts as many processes as the number of your processors. + Our building Docker image runs a [Bash script](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh), which calls `make -j$(nproc)` to starts as many processes as the number of your CPU cores. + +## Some Gotchas + +- Docker requires sudo + + An owner of a computer has the administrative privilege, a.k.a., sudo, and Docker requires this privilege to work properly. If you use a shared computer for development, please ask the administrator to install and configure Docker. We will do our best to support rkt, another container technology that doesn't require sudo. -- Docker on Windows/MacOS? +- Docker on Windows/MacOS builds slowly On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to [this issue](https://github.com/PaddlePaddle/Paddle/issues/627) for details. From 4b0235c1f2792cdecfe7d8f3e0bb1d0c57c6f361 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Fri, 25 Aug 2017 14:31:02 -0700 Subject: [PATCH 28/64] Update build.sh --- paddle/scripts/docker/build.sh | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/paddle/scripts/docker/build.sh b/paddle/scripts/docker/build.sh index 7bab814ae8..1798642022 100644 --- a/paddle/scripts/docker/build.sh +++ b/paddle/scripts/docker/build.sh @@ -63,12 +63,11 @@ cmake .. \ cat < Date: Fri, 25 Aug 2017 14:43:29 -0700 Subject: [PATCH 29/64] Run a specific test --- doc/howto/dev/build_cn.md | 6 ++++++ doc/howto/dev/build_en.md | 6 ++++++ 2 files changed, 12 insertions(+) diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md index 0077d90118..79b4ff9d5a 100644 --- a/doc/howto/dev/build_cn.md +++ b/doc/howto/dev/build_cn.md @@ -53,6 +53,12 @@ docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" ``` + 有时候我们只想运行一个特定的单元测试,比如 `memory_test`,我们可以 + + ```bash + docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + ``` + ## 为什么要 Docker 呀? 
- 什么是 Docker? diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index 95752beba0..e1b55929f9 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -53,6 +53,12 @@ Nothing else. Not even Python and GCC, because you can install all build tools docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" ``` + Sometimes we want to run a specific unit test, say `memory_test`, we can run + + ```bash + docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + ``` + ## Docker, Or Not? - What is Docker? From 97649bf9b251707803b2665dedf1ef8f929d8c88 Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Fri, 25 Aug 2017 22:08:24 +0000 Subject: [PATCH 30/64] fix codes in scatter --- paddle/operators/scatter_op.cc | 26 +++++++++++++------ paddle/operators/scatter_op.h | 6 ++--- .../v2/framework/tests/gradient_checker.py | 13 +++++----- .../v2/framework/tests/test_scatter_op.py | 1 - 4 files changed, 28 insertions(+), 18 deletions(-) diff --git a/paddle/operators/scatter_op.cc b/paddle/operators/scatter_op.cc index cf01ef6279..f901edefa2 100644 --- a/paddle/operators/scatter_op.cc +++ b/paddle/operators/scatter_op.cc @@ -24,8 +24,18 @@ class ScatterOp : public framework::OperatorWithKernel { protected: void InferShape(const framework::InferShapeContext &ctx) const override { - framework::DDim output_dims(ctx.Input("Ref")->dims()); - ctx.Output("Out")->Resize(output_dims); + PADDLE_ENFORCE_EQ(ctx.Input("Index")->dims().size(), 1, + "Update Index should be 1-D."); + PADDLE_ENFORCE_EQ(ctx.Input("Ref")->dims().size(), + ctx.Input("Updates")->dims().size(), + "Reference and Updates should have the same shape size"); + PADDLE_ENFORCE_EQ(ctx.Input("Updates")->dims()[0], + ctx.Input("Index")->dims()[0], + "Updates and Index should have same batch-size."); + framework::DDim data_dim(ctx.Input("Updates")->dims()); + for (int i = 1; i < data_dim.size(); ++i) + PADDLE_ENFORCE_EQ(data_dim[i], ctx.Input("Updates")->dims()[i]); + ctx.Output("Out")->Resize(ctx.Input("Ref")->dims()); } }; @@ -35,13 +45,13 @@ class ScatterGradOp : public framework::OperatorWithKernel { protected: void InferShape(const framework::InferShapeContext &ctx) const override { - auto Updates_grad = ctx.Output(framework::GradVarName("Updates")); - auto Updates = ctx.Input("Updates"); - auto Ref_grad = ctx.Output(framework::GradVarName("Ref")); - auto Ref = ctx.Input("Ref"); + auto *dUpdates = ctx.Output(framework::GradVarName("Updates")); + auto *Updates = ctx.Input("Updates"); + auto *dRef = ctx.Output(framework::GradVarName("Ref")); + auto *Ref = ctx.Input("Ref"); - Ref_grad->Resize(Ref->dims()); - Updates_grad->Resize(Updates->dims()); + dRef->Resize(Ref->dims()); + dUpdates->Resize(Updates->dims()); } }; diff --git a/paddle/operators/scatter_op.h b/paddle/operators/scatter_op.h index c2db3ae37c..e9595638a8 100644 --- a/paddle/operators/scatter_op.h +++ b/paddle/operators/scatter_op.h @@ -46,13 +46,13 @@ class ScatterGradientOpKernel : public framework::OpKernel { auto *dRef = ctx.Output(framework::GradVarName("Ref")); auto *dUpdates = ctx.Output(framework::GradVarName("Updates")); auto *Index = ctx.Input("Index"); - auto *dO = ctx.Input(framework::GradVarName("Out")); + auto *dOut = ctx.Input(framework::GradVarName("Out")); // In place gradient: dRef = dO - dRef->ShareDataWith(*dO); + dRef->ShareDataWith(*dOut); dUpdates->mutable_data(ctx.GetPlace()); // Gradient by Gather: dUpdates += dO[Index] - Gather(ctx.GetPlace(), dO, Index, dUpdates); + Gather(ctx.GetPlace(), 
dOut, Index, dUpdates); } }; diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index ac37671c77..abe0b5391a 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -82,6 +82,11 @@ def get_numeric_gradient(op, def product(dim): return reduce(lambda a, b: a * b, dim, 1) + def copy_tensor(): + for var_name in input_values: + tensor_ = local_scope.find_var(var_name).get_tensor() + tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) + # get the input tensor that we want to get it's numeric gradient. tensor_to_check = local_scope.find_var(input_to_check).get_tensor() tensor_size = product(tensor_to_check.get_dims()) @@ -92,9 +97,7 @@ def get_numeric_gradient(op, # we use a for loop to compute the gradient of every element. for i in xrange(tensor_size): if in_place: - for var_name in input_values: - tensor_ = local_scope.find_var(var_name).get_tensor() - tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) + copy_tensor() # get one input element throw it's index i. origin = tensor_to_check.get_float_element(i) @@ -105,9 +108,7 @@ def get_numeric_gradient(op, # plus delta to this element, run op and get the sum of the result tensor. if in_place: - for var_name in input_values: - tensor_ = local_scope.find_var(var_name).get_tensor() - tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) + copy_tensor() x_neg = origin - delta tensor_to_check.set_float_element(i, x_neg) y_neg = get_output() diff --git a/python/paddle/v2/framework/tests/test_scatter_op.py b/python/paddle/v2/framework/tests/test_scatter_op.py index 861fe6cf89..c1f9444889 100644 --- a/python/paddle/v2/framework/tests/test_scatter_op.py +++ b/python/paddle/v2/framework/tests/test_scatter_op.py @@ -30,7 +30,6 @@ class TestScatterGradOp(GradientChecker): output_np = numpy.copy(ref_np) output_np[index_np] += updates_np inputs = {'Ref': ref_np, 'Index': index_np, 'Updates': updates_np} - # check gradient self.check_grad( op, inputs, set(["Updates", "Ref"]), "Out", in_place=True) From 6f235553fd923d4b0b225fdc4a521570b03fbc24 Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Fri, 25 Aug 2017 22:20:20 +0000 Subject: [PATCH 31/64] scatter op fixed --- paddle/operators/scatter_op.cc | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/paddle/operators/scatter_op.cc b/paddle/operators/scatter_op.cc index f901edefa2..09a2f94dde 100644 --- a/paddle/operators/scatter_op.cc +++ b/paddle/operators/scatter_op.cc @@ -1,16 +1,16 @@ /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ #include "paddle/operators/scatter_op.h" #include "paddle/framework/ddim.h" From 852f341615808b6a5e6249b3b7c1f5f20fd22ec9 Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Fri, 25 Aug 2017 16:48:52 -0700 Subject: [PATCH 32/64] Add clean build section --- doc/howto/dev/build_cn.md | 10 +++++++++- doc/howto/dev/build_en.md | 10 +++++++++- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md index 79b4ff9d5a..d9d520893f 100644 --- a/doc/howto/dev/build_cn.md +++ b/doc/howto/dev/build_cn.md @@ -56,7 +56,15 @@ 有时候我们只想运行一个特定的单元测试,比如 `memory_test`,我们可以 ```bash - docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + ``` + +5. 清理 + + 有时候我们会希望清理掉已经下载的第三方依赖以及已经编译的二进制文件。此时只需要: + + ```bash + rm -rf build ``` ## 为什么要 Docker 呀? diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index e1b55929f9..318bf3d384 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -56,7 +56,15 @@ Nothing else. Not even Python and GCC, because you can install all build tools Sometimes we want to run a specific unit test, say `memory_test`, we can run ```bash - docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + ``` + +5. Clean Build. + + Sometimes, we might want to clean all thirt-party dependents and built binaries. To do so, just + + ```bash + rm -rf build ``` ## Docker, Or Not? From bfeecfd3d25c5c04dbd7d000490e77454944d044 Mon Sep 17 00:00:00 2001 From: zchen0211 Date: Fri, 25 Aug 2017 23:54:46 +0000 Subject: [PATCH 33/64] fix problems --- paddle/operators/scatter_op.cc | 18 +++++++++--------- paddle/operators/scatter_op.cu | 18 +++++++++--------- .../v2/framework/tests/gradient_checker.py | 6 +++--- 3 files changed, 21 insertions(+), 21 deletions(-) diff --git a/paddle/operators/scatter_op.cc b/paddle/operators/scatter_op.cc index 09a2f94dde..f901edefa2 100644 --- a/paddle/operators/scatter_op.cc +++ b/paddle/operators/scatter_op.cc @@ -1,16 +1,16 @@ /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. 
*/ +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #include "paddle/operators/scatter_op.h" #include "paddle/framework/ddim.h" diff --git a/paddle/operators/scatter_op.cu b/paddle/operators/scatter_op.cu index e6a6fa57d9..6716b47883 100644 --- a/paddle/operators/scatter_op.cu +++ b/paddle/operators/scatter_op.cu @@ -1,16 +1,16 @@ /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ #define EIGEN_USE_GPU #include "paddle/operators/scatter_op.h" diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index abe0b5391a..9a7a7fbf5e 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -82,7 +82,7 @@ def get_numeric_gradient(op, def product(dim): return reduce(lambda a, b: a * b, dim, 1) - def copy_tensor(): + def restore_inputs(): for var_name in input_values: tensor_ = local_scope.find_var(var_name).get_tensor() tensor_.set(numpy.copy(input_values[var_name]), core.CPUPlace()) @@ -97,7 +97,7 @@ def get_numeric_gradient(op, # we use a for loop to compute the gradient of every element. for i in xrange(tensor_size): if in_place: - copy_tensor() + restore_inputs() # get one input element throw it's index i. origin = tensor_to_check.get_float_element(i) @@ -108,7 +108,7 @@ def get_numeric_gradient(op, # plus delta to this element, run op and get the sum of the result tensor. if in_place: - copy_tensor() + restore_inputs() x_neg = origin - delta tensor_to_check.set_float_element(i, x_neg) y_neg = get_output() From ec5e20c9f12e89e13b52978b8bb27997c77f059c Mon Sep 17 00:00:00 2001 From: Yi Wang Date: Fri, 25 Aug 2017 17:14:28 -0700 Subject: [PATCH 34/64] Remove stopped containers and dangling images --- doc/howto/dev/build_cn.md | 18 +++++++++++------- doc/howto/dev/build_en.md | 4 ++++ 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/doc/howto/dev/build_cn.md b/doc/howto/dev/build_cn.md index d9d520893f..0b911f7b75 100644 --- a/doc/howto/dev/build_cn.md +++ b/doc/howto/dev/build_cn.md @@ -7,7 +7,7 @@ 1. 一台电脑,可以装的是 Linux, BSD, Windows 或者 MacOS 操作系统,以及 1. 
Docker。 -不需要其他任何软件了。即便是 Python 和 GCC 都不需要,因为我们会把所有编译工具都安装进一个 Docker image 里。 +不需要依赖其他任何软件了。即便是 Python 和 GCC 都不需要,因为我们会把所有编译工具都安装进一个 Docker image 里。 ## 总体流程 @@ -17,7 +17,7 @@ git clone https://github.com/paddlepaddle/paddle ``` -2. 安装工具 +2. 安装开发工具到 Docker image 里 ```bash cd paddle; docker build -t paddle:dev . @@ -30,13 +30,13 @@ 以下命令启动一个 Docker container 来执行 `paddle:dev` 这个 Docker image,同时把当前目录(源码树根目录)映射为 container 里的 `/paddle` 目录,并且运行 `Dockerfile` 描述的默认入口程序 [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/scripts/docker/build.sh)。这个脚本调用 `cmake` 和 `make` 来编译 `/paddle` 里的源码,结果输出到 `/paddle/build`,也就是本地的源码树根目录里的 `build` 子目录。 ```bash - docker run -v $PWD:/paddle paddle:dev + docker run --rm -v $PWD:/paddle paddle:dev ``` 上述命令编译出一个 CUDA-enabled 版本。如果我们只需要编译一个只支持 CPU 的版本,可以用 ```bash - docker run -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev + docker run --rm -e WITH_GPU=OFF -v $PWD:/paddle paddle:dev ``` 4. 运行单元测试 @@ -44,19 +44,19 @@ 用本机的第一个 GPU 来运行包括 GPU 单元测试在内的所有单元测试: ```bash - NV_GPU=0 nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" + NV_GPU=0 nvidia-docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" ``` 如果编译的时候我们用了 `WITH_GPU=OFF` 选项,那么编译过程只会产生 CPU-based 单元测试,那么我们也就不需要 nvidia-docker 来运行单元测试了。我们只需要: ```bash - docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" + docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest" ``` 有时候我们只想运行一个特定的单元测试,比如 `memory_test`,我们可以 ```bash - nvidia-docker run -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" + nvidia-docker run --rm -v $PWD:/paddle paddle:dev bash -c "cd /paddle/build; ctest -V -R memory_test" ``` 5. 清理 @@ -118,3 +118,7 @@ - 在 Windows/MacOS 上编译很慢 Docker 在 Windows 和 MacOS 都可以运行。不过实际上是运行在一个 Linux 虚拟机上。可能需要注意给这个虚拟机多分配一些 CPU 和内存,以保证编译高效。具体做法请参考[这个issue](https://github.com/PaddlePaddle/Paddle/issues/627)。 + +- 磁盘不够 + + 本文中的例子里,`docker run` 命令里都用了 `--rm` 参数,这样保证运行结束之后的 containers 不会保留在磁盘上。可以用 `docker ps -a` 命令看到停止后但是没有删除的 containers。`docker build` 命令有时候会产生一些中间结果,是没有名字的 images,也会占用磁盘。可以参考[这篇文章](https://zaiste.net/posts/removing_docker_containers/)来清理这些内容。 diff --git a/doc/howto/dev/build_en.md b/doc/howto/dev/build_en.md index 318bf3d384..d0048e3714 100644 --- a/doc/howto/dev/build_en.md +++ b/doc/howto/dev/build_en.md @@ -118,3 +118,7 @@ Nothing else. Not even Python and GCC, because you can install all build tools - Docker on Windows/MacOS builds slowly On Windows and MacOS, Docker containers run in a Linux VM. You might want to give this VM some more memory and CPUs so to make the building efficient. Please refer to [this issue](https://github.com/PaddlePaddle/Paddle/issues/627) for details. + +- Not enough disk space + + Examples in this article uses option `--rm` with the `docker run` command. This option ensures that stopped containers do not exist on hard disks. We can use `docker ps -a` to list all containers, including stopped. Sometimes `docker build` generates some intermediate dangling images, which also take disk space. To clean them, please refer to [this article](https://zaiste.net/posts/removing_docker_containers/). 
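A minimal sketch of the cleanup workflow the two documents above describe, assuming a standard Docker CLI (the exact commands are illustrative and do not come from the patches themselves):

```bash
# List all containers, including the stopped ones left behind
# by `docker run` invocations without --rm.
docker ps -a

# Remove every exited container; the subshell collects their IDs.
docker rm $(docker ps -a -q -f status=exited)

# Remove the dangling (untagged) intermediate images that
# `docker build` can leave behind.
docker rmi $(docker images -q -f dangling=true)
```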
From 721b5020fae92600a0aa2c4093dbeeac63b597c6 Mon Sep 17 00:00:00 2001 From: qiaolongfei Date: Sat, 26 Aug 2017 16:34:33 -0700 Subject: [PATCH 35/64] change predict size to label class_num --- python/paddle/v2/framework/tests/mnist.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/paddle/v2/framework/tests/mnist.py b/python/paddle/v2/framework/tests/mnist.py index 9a0b109850..9b2dbed25c 100644 --- a/python/paddle/v2/framework/tests/mnist.py +++ b/python/paddle/v2/framework/tests/mnist.py @@ -181,7 +181,7 @@ images = data_layer(name='pixel', dims=[BATCH_SIZE, 784]) labels = data_layer(name='label', dims=[BATCH_SIZE]) fc1 = fc_layer(net=forward_net, input=images, size=100, act="sigmoid") fc2 = fc_layer(net=forward_net, input=fc1, size=100, act="sigmoid") -predict = fc_layer(net=forward_net, input=fc2, size=100, act="softmax") +predict = fc_layer(net=forward_net, input=fc2, size=10, act="softmax") cost = cross_entropy_layer(net=forward_net, input=predict, label=labels) init_net.complete_add_op(True) @@ -223,7 +223,7 @@ def test(cost_name): sum(error) / float(len(error)))) -PASS_NUM = 1 +PASS_NUM = 10 init_net.run(scope, dev_ctx) for pass_id in range(PASS_NUM): From 787cb8ce5d829435e5d45e8bc6bc51cfdcf49272 Mon Sep 17 00:00:00 2001 From: qiaolongfei Date: Sat, 26 Aug 2017 16:36:28 -0700 Subject: [PATCH 36/64] reset pass num to 1 --- python/paddle/v2/framework/tests/mnist.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/paddle/v2/framework/tests/mnist.py b/python/paddle/v2/framework/tests/mnist.py index 9b2dbed25c..a68f302f9c 100644 --- a/python/paddle/v2/framework/tests/mnist.py +++ b/python/paddle/v2/framework/tests/mnist.py @@ -223,7 +223,7 @@ def test(cost_name): sum(error) / float(len(error)))) -PASS_NUM = 10 +PASS_NUM = 1 init_net.run(scope, dev_ctx) for pass_id in range(PASS_NUM): From 3120ee5cfbbe6ecf3550b6a338a4c14afe6e4ebd Mon Sep 17 00:00:00 2001 From: dongzhihong Date: Sat, 26 Aug 2017 18:46:06 -0700 Subject: [PATCH 37/64] fix backward doc --- paddle/framework/backward.md | 28 +++++++++++++++++++++------- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md index 74c001b06a..c8fa3fefe5 100644 --- a/paddle/framework/backward.md +++ b/paddle/framework/backward.md @@ -21,18 +21,32 @@ grad_op_builder(fengjiayi) given a forward network, it generates the backward network. We only care about the Gradients—`OutputGradients`,`InputGradients`. -1. bla bla bla (yuyang) +1. Op + + when the input forward network is a Op, return its gradient Operator Immediately. 2. NetOp - when the input forward network is a NetOp, it need to call the sub NetOp/Operators backward function recursively and ensure them done. During the process, we need to collect the `OutputGradients` name. + when the input forward network is a NetOp, it need to call the sub NetOp/Operators backward function recursively. During the process, we need to collect the `OutputGradients` name according to forward NetOp. + + **shared variable**. As illustrated in the pictures, two operator's `Output` `Gradient` will overwirte their shared input variable. + +

+<p align="center">
+<img src="./images/duplicate_op.png"><br/>
+
+   1. shared variable in two operators.
+
+</p>
+
+   Sharing a variable between operators, or feeding the same input variable to multiple operators, leads to a duplicate gradient variable. As the demo above shows, we need to rename the gradient names recursively and add a generic add operator to replace the overwrite links.
+
+<p align="center">
+<img src="./images/duplicate_op2.png"><br/>
+
+   2. replace shared variable gradient with `Add` Operator
+
+</p>
- Share variable between operators or same input variable used in multiple operators lead to a duplicate gradient variable. As demo show above, we need to rename gradient name recursively, and add a generic add operator instead. -![./images/duplicate_op2]() -​ Then collect the sub graph OutputGradients/InputGradients as the NetOp's and return it. +​ Then collect the sub graph `OutputGradients`/`InputGradients` as the NetOp's and return it. From bb5c656b574b1e518da981d781db0e1e0a0e4d75 Mon Sep 17 00:00:00 2001 From: fengjiayi Date: Sat, 26 Aug 2017 19:15:31 -0700 Subject: [PATCH 38/64] test --- paddle/framework/backward.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md index c717c2f30b..d5dbd57d19 100644 --- a/paddle/framework/backward.md +++ b/paddle/framework/backward.md @@ -6,7 +6,7 @@ In Neural Network, the backpropagation algorithm follows the chain rule, so we n ## Backward Operator Registry -A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients, and then calculate its input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use registry mechanism to save these correspondences, which is quite similar with operator registry itself. +A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use registry mechanism to save these correspondences, which is quite similar with operator registry itself. For example, we have got a `add_two_op`, and is registered by the following code: From f646f7991ae49eff00370a03beb958fc88ac62ad Mon Sep 17 00:00:00 2001 From: qingqing01 Date: Sun, 27 Aug 2017 12:01:46 +0800 Subject: [PATCH 39/64] Add chinese doc about how to write new operators. --- doc/howto/dev/new_op_cn.md | 300 +++++++++++++++++++++++++++++++++++++ 1 file changed, 300 insertions(+) create mode 100644 doc/howto/dev/new_op_cn.md diff --git a/doc/howto/dev/new_op_cn.md b/doc/howto/dev/new_op_cn.md new file mode 100644 index 0000000000..df20c15ec6 --- /dev/null +++ b/doc/howto/dev/new_op_cn.md @@ -0,0 +1,300 @@ +# 如何写新的Operator + + - [概念简介](#概念简介) + - [实现C++类](#实现C++类) + - [定义ProtoMaker类](#定义ProtoMaker类) + - [定义Operator类](#定义Operator类) + - [定义`OpKernel`类](#定义`OpKernel`类) + - [注册类](#注册类) + - [编译](#编译) + - [绑定Python](#绑定Python) + - [实现单元测试](#实现单元测试) + + +## 概念简介 + +简单介绍需要用到基类,详细介绍请参考设计文档。 + +- `framework::OperatorBase`: Operator(简写,Op)基类。 +- `framework::OpKernel`: Op计算函数的基类,称作Kernel。 +- `framework::OperatorWithKernel`:继承自OperatorBase,Op有计算函数,称作有Kernel。 +- `class OpProtoAndCheckerMaker`:描述该Op的输入、输出、属性、注释,主要用于Python API接口生成 + +依据是否包含kernel,将Op分为两种:包含Kernel的Op和不包含kernel的Op,前者Op的定义继承自`OperatorBase`,后者继承自`OperatorWithKernel`。本教程主要介绍带Kernel的Op如何写,简单总结如下: + +Forward Op需要包含: + + - OpProtoMake定义 + - Op定义 + - Kernel实现 + +与之对应的Backward Op包含: + + - Op定义 + - Kernel实现 + +下面以矩阵乘操作,即[MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc)为例来介绍如何写带Kernel的Operator。 + + +## 实现C++类 + + +### 1. 
定义ProtoMaker类 + +矩阵乘的公式:$$Out = X * Y$$ ,可见该计算由两个输入,一个输出组成。首先定义`ProtoMaker`来描述该Op的输入、输出及注释: + + + + ``` + class MulOpMaker : public framework::OpProtoAndCheckerMaker { + public: + MulOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) + : OpProtoAndCheckerMaker(proto, op_checker) { + AddInput("X", "The first input of mul op"); + AddInput("Y", "The second input of mul op"); + AddOutput("Out", "The output of mul op"); + AddComment(R"DOC( + Two Element Mul Operator. + The equation is: Out = X * Y + )DOC"); + } + }; + ``` + +[`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L43)继承自`framework::OpProtoAndCheckerMaker`,构造函数包括2个: + + - `framework::OpProto` : 前者存储Op的输入输出和参数属性,将用于Python API接口的生成。 + - `framework::OpAttrChecker` :后者用于检查参数属性的合法性。 + +构造函数里通过`AddInput`添加输入参数,通过`AddOutput`添加输出参数,通过`AddComment`添加该Op的注释,这些函数会将对应内容添加到`OpProto`中。 + +在`MulOp`中添加两个输入`X`和`Y`,添加了一个输出`Out`,并解释了各自含义,该命名尽可能的规范。 + + +再举个[`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37)的例子: + +```C++ + template +class ScaleOpMaker : public framework::OpProtoAndCheckerMaker { + public: + ScaleOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) + : OpProtoAndCheckerMaker(proto, op_checker) { + AddInput("X", "The input tensor of scale operator.").NotInGradient(); + AddOutput("Out", "The output tensor of scale operator.").NotInGradient(); + AddComment(R"DOC(Scale operator +The equation is: Out = scale*X +)DOC"); + AddAttr("scale", "scale of scale operator.").SetDefault(1.0); + } +}; +``` + + 在这个例子里,两处不同: + + - `AddInput("X","...").NotInGradient()` : 表示`X`这个输入不参与`ScaleOp`对应的梯度Op计算之中。 + - `AddAttr("scale", "...").SetDefault(1.0);` : 增加`scale`系数,作为参数属性,并且设置默认值为1.0。 + + +### 2. 定义Operator类 + + + ```C++ + class MulOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(const framework::InferShapeContext &ctx) const override { + auto dim0 = ctx.Input("X")->dims(); + auto dim1 = ctx.Input("Y")->dims(); + PADDLE_ENFORCE_EQ(dim0.size(), 2, + "input X(%s) should be a tensor with 2 dims, a matrix", + ctx.op_.Input("X")); + PADDLE_ENFORCE_EQ(dim1.size(), 2, + "input Y(%s) should be a tensor with 2 dims, a matrix", + ctx.op_.Input("Y")); + PADDLE_ENFORCE_EQ( + dim0[1], dim1[0], + "First matrix's width must be equal with second matrix's height."); + ctx.Output("Out")->Resize({dim0[0], dim1[1]}); + } + }; + ``` + +[`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L22)继承自`OperatorWithKernel`。`public`成员: + +```C++ +using framework::OperatorWithKernel::OperatorWithKernel; +``` + +这句表示使用基类`OperatorWithKernel`的构造函数,也可写成: + +```C++ + MulOp(const std::string &type, const framework::VariableNameMap &inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorWithKernel(type, inputs, outputs, attrs) {} +``` + +还需要重写`InferShape`接口。`InferShape`为const函数,不能修改Op的成员变量,参数为`const framework::InferShapeContext &ctx`,通过该参数可获取到输入输出以及属性。它的功能是: + - 1). 做检查, 尽早报错:检查输入数据维度、类型等是否合法 + - 2). 设置输出Tensor的形状 + +通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和要讲到的注册函数一起放在`.cc`中 + +### 3. 
定义`OpKernel`类 + +```C++ +template +class MulKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* X = context.Input("X"); + auto* Y = context.Input("Y"); + auto* Z = context.Output("Out"); + Z->mutable_data(context.GetPlace()); + auto* device_context = + const_cast(context.device_context_); + math::matmul(*X, false, *Y, false, 1, Z, 0, device_context); + } +}; +``` + +`MulKernel`继承自`framework::OpKernel`,带有模板参数: + + - `typename Place`: 表示设备类型,不同设备(CPU、GPU)共享同一个Kernel时,需加该模板参数,不共享则不加,一个不共享的例子是[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。 + + - `typename T` : 表示数据类型,如`float`, `double`等。 + +`MulKernel`需要重写`Compute`接口,该接口参数为`const framework::ExecutionContext& context`, `ExecutionContext`相比`InferShapeContext`增加了设备类型,同样可获取到输入输出和属性参数,`Compute`函数里写具体实现时。 + +注意,不同设备(CPU、GPU)共享一个Op定义,是否则共享同一个`OpKernel`,取决于`Compute`调用的函数是否支持不同设备。`MulOp`的CPU、GPU实现共享同一个`Kernel`,`OpKernel`不共享的例子可以参考[`OnehotCrossEntropyOpKernel`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.h#L43)。 + +到此前向Op实现完成,需要在`.cc`文件中注册该op和kernel。反向Op类的定义和Kernel定义与前向Op类似,这里不再重复。但注意,反向Op没有`ProtoMaker`。 + +### 4. 注册类 + +在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。 + + ```C++ + namespace ops = paddle::operators; + REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad); + REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel); + REGISTER_OP_CPU_KERNEL(mul_grad, + ops::MulGradKernel); + ``` + + - `REGISTER_OP` : 注册`ops::MulOp`类,类型名为`mul`,该类的`ProtoMaker`为`ops::MulOpMaker`,注册`ops::MulOpGrad`,类型名为`mul_grad`, + - `REGISTER_OP_WITHOUT_GRADIENT` : 用于注册没有反向的Op。 + - `REGISTER_OP_CPU_KERNEL` :注册`ops::MulKernel`类,并特化模板参数为`paddle::platform::CPUPlace`和`float`类型,同理,注册`ops::MulKernel`类。 + +在 `.cu`文件中注册GPU Kernel。 + + ``` + namespace ops = paddle::operators; + REGISTER_OP_GPU_KERNEL(mul, ops::MulKernel); + REGISTER_OP_GPU_KERNEL(mul_grad, + ops::MulGradKernel); + ``` + +### 5. 
编译 + +在[paddle/operators/CMakeLists.txt](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/CMakeLists.txt)文件中添加编译。 + + ``` + op_library(mul_op SRCS mul_op.cc mul_op.cu DEPS math_function) + ``` + +下面命令可以编译: + + ``` + make mul_op + ``` + +## 绑定Python + + - 绑定Python + + 在 [`paddle/pybind/pybind.cc +`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/pybind.cc)文件中添加该类: + + ``` + USE_OP(mul); + ``` + 如果只实现了CPU版本,则使用`USE_CPU_ONLY_OP`: + + ``` + USE_CPU_ONLY_OP(gather); + ``` + + 使用`USE_OP`告知编译器需要链接该Op的目标文件,具体解释参考[代码注释](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/op_registry.h#L81)。 + + + - 生成库 + + 在 [`paddle/pybind/CMakeLists.txt`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/CMakeLists.txt)文件添加类到`DEPS`中。 + + ``` + if(WITH_PYTHON) +cc_library(paddle_pybind SHARED + SRCS pybind.cc + DEPS pybind python backward + mul_op + minus_op) +endif(WITH_PYTHON) + ``` + +## 实现单元测试 + +单测包括对比前向Op不同设备(CPU、GPU)的实现、对比反向OP不同设备(CPU、GPU)的实现、反向Op的梯度测试。下面介绍介绍[`MulOp`的单测](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py)。 + +- 前向Op单测 + +前向Op单测继承自`unittest.TestCase`,并定义元类`__metaclass__ = OpTestMeta`,具体单测流程在`OpTestMeta`里完成。需在`setUp`函数定义输入输出和属性参数,以及Python对比的输出值。 + +``` +import unittest +import numpy as np +from gradient_checker import GradientChecker, create_op +from op_test_util import OpTestMeta + +class TestMulOp(unittest.TestCase): + __metaclass__ = OpTestMeta + + def setUp(self): + self.type = "mul" + self.inputs = { + 'X': np.random.random((32, 84)).astype("float32"), + 'Y': np.random.random((84, 100)).astype("float32") + } + self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])} +``` + 首先需要`import`必要的包,下面详细解释其他值: + + - `self.type = "mul" ` : 定义类型,和注册的类型一致。 + - `self.inputs` : 定义输入,类型为Numpy.array,并初始化。 + - `self.outputs` : 定义输出,并得到Python结算结果。 + + + - 反向Op单测 + +反向Op单测继承自`GradientChecker`,而`GradientChecker`集成自`unittest.TestCase`,所以反向单测函数需要`test_`开头。 + + ``` + class MulGradOpTest(GradientChecker): + def test_mul(self): + op = create_op("mul") + inputs = { + 'X': np.random.random((32, 84)).astype("float32"), + 'Y': np.random.random((84, 100)).astype("float32") + } + self.compare_grad(op, inputs) + # mul op will enlarge the relative error + self.check_grad( + op, inputs, set(["X", "Y"]), "Out", max_relative_error=0.5) + ``` + + - 调用`create_op("mul")`创建反向Op对应的前向Op。 + - 定义输入`inputs`。 + - 调用`compare_grad`函数对比CPU、GPU计算结果。 + - 调用`check_grad`检查梯度稳定性。 From d78521d491d8c6625146137406f3b7402aebe143 Mon Sep 17 00:00:00 2001 From: qingqing01 Date: Sun, 27 Aug 2017 12:11:15 +0800 Subject: [PATCH 40/64] fix doc format. --- doc/howto/dev/new_op_cn.md | 160 ++++++++++++++++++------------------- 1 file changed, 80 insertions(+), 80 deletions(-) diff --git a/doc/howto/dev/new_op_cn.md b/doc/howto/dev/new_op_cn.md index df20c15ec6..ebd2cf3ff0 100644 --- a/doc/howto/dev/new_op_cn.md +++ b/doc/howto/dev/new_op_cn.md @@ -4,11 +4,13 @@ - [实现C++类](#实现C++类) - [定义ProtoMaker类](#定义ProtoMaker类) - [定义Operator类](#定义Operator类) - - [定义`OpKernel`类](#定义`OpKernel`类) + - [定义OpKernel类](#定义OpKernel类) - [注册类](#注册类) - [编译](#编译) - [绑定Python](#绑定Python) - [实现单元测试](#实现单元测试) + - [前向Operator单测](#前向Operator单测) + - [反向Operator单测](#反向Operator单测) ## 概念简介 @@ -41,25 +43,23 @@ Forward Op需要包含: ### 1. 
定义ProtoMaker类 -矩阵乘的公式:$$Out = X * Y$$ ,可见该计算由两个输入,一个输出组成。首先定义`ProtoMaker`来描述该Op的输入、输出及注释: - +矩阵乘的公式:$Out = X * Y$, 可见该计算由两个输入,一个输出组成。首先定义`ProtoMaker`来描述该Op的输入、输出及注释: - - ``` - class MulOpMaker : public framework::OpProtoAndCheckerMaker { - public: - MulOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) - : OpProtoAndCheckerMaker(proto, op_checker) { - AddInput("X", "The first input of mul op"); - AddInput("Y", "The second input of mul op"); - AddOutput("Out", "The output of mul op"); - AddComment(R"DOC( - Two Element Mul Operator. - The equation is: Out = X * Y - )DOC"); - } - }; - ``` +``` +class MulOpMaker : public framework::OpProtoAndCheckerMaker { + public: + MulOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) + : OpProtoAndCheckerMaker(proto, op_checker) { + AddInput("X", "The first input of mul op"); + AddInput("Y", "The second input of mul op"); + AddOutput("Out", "The output of mul op"); + AddComment(R"DOC( +Two Element Mul Operator. +The equation is: Out = X * Y +)DOC"); + } +}; +``` [`MulOpMaker`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L43)继承自`framework::OpProtoAndCheckerMaker`,构造函数包括2个: @@ -73,8 +73,8 @@ Forward Op需要包含: 再举个[`ScaleOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/scale_op.cc#L37)的例子: -```C++ - template +``` +template class ScaleOpMaker : public framework::OpProtoAndCheckerMaker { public: ScaleOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker) @@ -98,42 +98,42 @@ The equation is: Out = scale*X ### 2. 定义Operator类 - ```C++ - class MulOp : public framework::OperatorWithKernel { - public: - using framework::OperatorWithKernel::OperatorWithKernel; - - protected: - void InferShape(const framework::InferShapeContext &ctx) const override { - auto dim0 = ctx.Input("X")->dims(); - auto dim1 = ctx.Input("Y")->dims(); - PADDLE_ENFORCE_EQ(dim0.size(), 2, - "input X(%s) should be a tensor with 2 dims, a matrix", - ctx.op_.Input("X")); - PADDLE_ENFORCE_EQ(dim1.size(), 2, - "input Y(%s) should be a tensor with 2 dims, a matrix", - ctx.op_.Input("Y")); - PADDLE_ENFORCE_EQ( - dim0[1], dim1[0], - "First matrix's width must be equal with second matrix's height."); - ctx.Output("Out")->Resize({dim0[0], dim1[1]}); - } - }; - ``` +```c++ +class MulOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(const framework::InferShapeContext &ctx) const override { + auto dim0 = ctx.Input("X")->dims(); + auto dim1 = ctx.Input("Y")->dims(); + PADDLE_ENFORCE_EQ(dim0.size(), 2, + "input X(%s) should be a tensor with 2 dims, a matrix", + ctx.op_.Input("X")); + PADDLE_ENFORCE_EQ(dim1.size(), 2, + "input Y(%s) should be a tensor with 2 dims, a matrix", + ctx.op_.Input("Y")); + PADDLE_ENFORCE_EQ( + dim0[1], dim1[0], + "First matrix's width must be equal with second matrix's height."); + ctx.Output("Out")->Resize({dim0[0], dim1[1]}); + } +}; +``` [`MulOp`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc#L22)继承自`OperatorWithKernel`。`public`成员: -```C++ +```c++ using framework::OperatorWithKernel::OperatorWithKernel; ``` 这句表示使用基类`OperatorWithKernel`的构造函数,也可写成: -```C++ - MulOp(const std::string &type, const framework::VariableNameMap &inputs, - const framework::VariableNameMap &outputs, - const framework::AttributeMap &attrs) - : OperatorWithKernel(type, inputs, outputs, attrs) {} +```c++ +MulOp(const std::string &type, const framework::VariableNameMap 
&inputs, + const framework::VariableNameMap &outputs, + const framework::AttributeMap &attrs) + : OperatorWithKernel(type, inputs, outputs, attrs) {} ``` 还需要重写`InferShape`接口。`InferShape`为const函数,不能修改Op的成员变量,参数为`const framework::InferShapeContext &ctx`,通过该参数可获取到输入输出以及属性。它的功能是: @@ -142,7 +142,7 @@ using framework::OperatorWithKernel::OperatorWithKernel; 通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和要讲到的注册函数一起放在`.cc`中 -### 3. 定义`OpKernel`类 +### 3. 定义OpKernel类 ```C++ template @@ -176,13 +176,13 @@ class MulKernel : public framework::OpKernel { 在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。 - ```C++ - namespace ops = paddle::operators; - REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad); - REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel); - REGISTER_OP_CPU_KERNEL(mul_grad, - ops::MulGradKernel); - ``` +```c++ +namespace ops = paddle::operators; +REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulOpGrad); +REGISTER_OP_CPU_KERNEL(mul, ops::MulKernel); +REGISTER_OP_CPU_KERNEL(mul_grad, + ops::MulGradKernel); +``` - `REGISTER_OP` : 注册`ops::MulOp`类,类型名为`mul`,该类的`ProtoMaker`为`ops::MulOpMaker`,注册`ops::MulOpGrad`,类型名为`mul_grad`, - `REGISTER_OP_WITHOUT_GRADIENT` : 用于注册没有反向的Op。 @@ -190,32 +190,32 @@ class MulKernel : public framework::OpKernel { 在 `.cu`文件中注册GPU Kernel。 - ``` - namespace ops = paddle::operators; - REGISTER_OP_GPU_KERNEL(mul, ops::MulKernel); - REGISTER_OP_GPU_KERNEL(mul_grad, - ops::MulGradKernel); - ``` +```c++ +namespace ops = paddle::operators; +REGISTER_OP_GPU_KERNEL(mul, ops::MulKernel); +REGISTER_OP_GPU_KERNEL(mul_grad, + ops::MulGradKernel); +``` ### 5. 编译 在[paddle/operators/CMakeLists.txt](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/CMakeLists.txt)文件中添加编译。 - ``` - op_library(mul_op SRCS mul_op.cc mul_op.cu DEPS math_function) - ``` +``` +op_library(mul_op SRCS mul_op.cc mul_op.cu DEPS math_function) +``` 下面命令可以编译: - ``` - make mul_op - ``` +``` +make mul_op +``` ## 绑定Python - - 绑定Python +- 绑定Python - 在 [`paddle/pybind/pybind.cc + 在 [`paddle/pybind/pybind.cc `](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/pybind.cc)文件中添加该类: ``` @@ -232,23 +232,23 @@ class MulKernel : public framework::OpKernel { - 生成库 - 在 [`paddle/pybind/CMakeLists.txt`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/CMakeLists.txt)文件添加类到`DEPS`中。 + 在 [`paddle/pybind/CMakeLists.txt`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/pybind/CMakeLists.txt)文件添加类到`DEPS`中,使得该Op可以链接到生成的lib库中。 ``` if(WITH_PYTHON) -cc_library(paddle_pybind SHARED - SRCS pybind.cc - DEPS pybind python backward - mul_op - minus_op) -endif(WITH_PYTHON) + cc_library(paddle_pybind SHARED + SRCS pybind.cc + DEPS pybind python backward + mul_op + minus_op) + endif(WITH_PYTHON) ``` ## 实现单元测试 单测包括对比前向Op不同设备(CPU、GPU)的实现、对比反向OP不同设备(CPU、GPU)的实现、反向Op的梯度测试。下面介绍介绍[`MulOp`的单测](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/test_mul_op.py)。 -- 前向Op单测 +### 前向Operator单测 前向Op单测继承自`unittest.TestCase`,并定义元类`__metaclass__ = OpTestMeta`,具体单测流程在`OpTestMeta`里完成。需在`setUp`函数定义输入输出和属性参数,以及Python对比的输出值。 @@ -276,7 +276,7 @@ class TestMulOp(unittest.TestCase): - `self.outputs` : 定义输出,并得到Python结算结果。 - - 反向Op单测 +### 反向Operator单测 反向Op单测继承自`GradientChecker`,而`GradientChecker`集成自`unittest.TestCase`,所以反向单测函数需要`test_`开头。 From 4a83dde594d0aa6d19aeff7471b040277a8a839f Mon Sep 17 00:00:00 2001 From: caoying03 Date: Sun, 27 Aug 2017 11:28:05 +0800 Subject: [PATCH 41/64] save parameters into ordered dict. 
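Why an `OrderedDict`? A plain Python `dict` makes no guarantee about iteration order, while `collections.OrderedDict` preserves insertion order, so parameters can be visited in the order the layers created them. Below is a minimal, self-contained sketch of the behavior this commit relies on; the parameter names and sizes are hypothetical, not taken from a real network.

```python
from collections import OrderedDict

# Parameters are inserted in the order the layers create them, so iterating
# over the OrderedDict visits them from the bottom of the network to the top.
param_conf = OrderedDict()
param_conf['___fc_layer_0__.w0'] = 784 * 100      # hypothetical name and size
param_conf['___fc_layer_0__.wbias'] = 100
param_conf['___fc_layer_1__.w0'] = 100 * 10

for name, size in param_conf.items():
    print(name, size)                             # always in creation order
```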
--- python/paddle/v2/parameters.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/python/paddle/v2/parameters.py b/python/paddle/v2/parameters.py index b8af5abaea..475067ef22 100644 --- a/python/paddle/v2/parameters.py +++ b/python/paddle/v2/parameters.py @@ -14,6 +14,7 @@ import numpy as np from paddle.proto.ParameterConfig_pb2 import ParameterConfig +from collections import OrderedDict import paddle.trainer.config_parser as cp import struct import tarfile @@ -62,7 +63,7 @@ class Parameters(object): """ def __init__(self): - self.__param_conf__ = dict() + self.__param_conf__ = OrderedDict() self.__gradient_machines__ = [] self.__tmp_params__ = dict() @@ -231,6 +232,9 @@ class Parameters(object): :rtype: np.ndarray """ import py_paddle.swig_paddle as api + if self.__param_conf__[key].is_static: + return np.zeros(self.__param_conf__[key].size, dtype=np.float32) + return self.__getter_inner(key, api.PARAMETER_GRADIENT) def set(self, parameter_name, value): From 4590f793f111dd4fc5134ca9bbd0a213b41962b7 Mon Sep 17 00:00:00 2001 From: fengjiayi Date: Sun, 27 Aug 2017 17:37:41 -0700 Subject: [PATCH 42/64] Update backward document --- paddle/framework/backward.md | 24 ++++++++---------------- 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md index b4205fed2e..133b17c7be 100644 --- a/paddle/framework/backward.md +++ b/paddle/framework/backward.md @@ -2,32 +2,24 @@ ## Motivation -In Neural Network, the backpropagation algorithm follows the chain rule, so we need to compound the fundmental gradient operators/expressions together with chain rule . Every forward network need a backward network to construct the full computation lineage, the operator/ expression's Backward feature will generate the backward pass respect to forward pass. - +In Neural Network, the backpropagation algorithm follows the chain rule, so we need to compound the fundmental gradient operators/expressions together with chain rule . Every forward network need a backward network to construct the full computation lineage, the operator/expression's backward pass will be generated respect to forward pass. + ## Backward Operator Registry -A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use registry mechanism to save these correspondences, which is quite similar with operator registry itself. +A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use registry mechanism to save these correspondences. For example, we have got a `add_two_op`, and is registered by the following code: ```cpp -REGISTER_OP(add_two, AddTwoOp, AddTwoOpMaker); +REGISTER_OP(add_two, AddTwoOp, AddTwoOpMaker, add_two_grad, AddTwoGradOp); ``` `add_two` is the operator's type. `AddTwoOp` and `AddTwoOpMaker` are the operator class and the operator maker class respectively. -Assume that we have also got the backward operator of `add_two_op`, which calculating the gradients of `add_two_op`'s inputs. 
Then we register it by the following way: - -```cpp -REGISTER_GRADIENT_OP(add_two, add_two_grad, AddTwoGradOp); -``` - `add_two_grad` is the type of backward operator, and `AddTwoGradOp` is its class name. ## Backward Opeartor Creating -### Usage - Given a certain forward operator, we can get its corresponding backward opeartor by calling: ```cpp @@ -36,13 +28,13 @@ OperatorBase* bwd_op = BuildGradOp(const OperatorBase* fwd_op); The function `BuildGradOp` will sequentially execute following processes: -1. Getting the `type_` of given forward operator, and then creating the corresponding backward operator. +1. Get the `type_` of given forward operator, and then get the corresponding backward operator's type by looking up the `OpInfoMap`. -2. Copying all the attributes of forward operator expect `input_format` and `output_format`(if it has), for their elements differ between forward and backward operators. +2. Build two maps named `inputs` and `outputs` to temporary storage backward operator's inputs and outputs. Copy forward operator's `inputs_` and `outputs_` to map `inputs`, except these are not necessary for gradient computing. -3. Copying forward operator's `inputs_` and `outputs_` to backward operator's `inputs_`. And adding forward inputs' gradient variables into backward `output_`, adding forward outputs' gradient variables into backward `input_`. +3. Add forward inputs' gradient variables into map `output`, adding forward outputs' gradient variables into map `input`. -4. Building backward operator's `input_format`, `output_format` (if necessary) and `in_out_idxs_` according to its `inputs_` and `outputs_` just created. +4. Building backward operator with `inputs`, `outputs` and forward operator's attributes. ## Backward Network Building From 98b7c6736445de1f287156e933b0d625f648e6da Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Mon, 28 Aug 2017 09:52:58 +0800 Subject: [PATCH 43/64] add todo --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index ad50c15a7d..d38e6a2099 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -184,15 +184,14 @@ void MKLDNNFcLayer::resetBwd() { const MatrixPtr& wgt = weight_->getWGrad(); const MatrixPtr& bias = hasBias ? biases_->getWGrad() : nullptr; + // TODO(TJ): merge topdiffs if (nextIsMKLDNN()) { // can not directly cast outputgrad to mkldnnmatrix, // since each layer can not write the inputgrad to mkldnn inputgrad. // So just create from matrix with outputvalue format. 
const MatrixPtr& out = getOutput(MKLDNN_DEVICE).grad; outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD()); - // TODO: maybe need merge topdiffs } else { - // TODO: merge topdiffs const MatrixPtr& out = getOutput(CPU_DEVICE).grad; // fc do not need to convert from cpu device since output always nc // only need create from cpu device @@ -234,8 +233,7 @@ void MKLDNNFcLayer::resetBwd() { return; } if (getInput(0, MKLDNN_DEVICE).getAllCount() > 1) { - // TODO: many mkldnn bots - // add sum handle + // TODO(TJ): use outputMaps_ ways when merge topdiff done } else { inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); } @@ -245,8 +243,7 @@ void MKLDNNFcLayer::resetBwd() { return; } if (getInput(0, CPU_DEVICE).getAllCount() > 1) { - // TODO: many bots - // add sum handle + // TODO(TJ): use outputMaps_ ways when merge topdiff done } else { inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); } From be4c0123c4c6cccfaa8fafa9063ce84415854c28 Mon Sep 17 00:00:00 2001 From: caoying03 Date: Mon, 28 Aug 2017 10:11:54 +0800 Subject: [PATCH 44/64] follow comments. --- python/paddle/v2/parameters.py | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/python/paddle/v2/parameters.py b/python/paddle/v2/parameters.py index 475067ef22..cc3adf6f48 100644 --- a/python/paddle/v2/parameters.py +++ b/python/paddle/v2/parameters.py @@ -43,9 +43,26 @@ def create(layers): class Parameters(object): """ - Parameters is a dictionary contains Paddle's parameter. The key of - Parameters is the name of parameter. The value of Parameters is a plain - :code:`numpy.ndarry` . + `Parameters` manages all the learnable parameters in a neural network. + It stores parameters' information in an OrderedDict, key of which is + the name of a parameter, and value related to a key is a parameter's + configuration, such as initialization mean and std, its size, whether it is + a static parameter, and so on. + + :param __param_conf__: this member stores the configurations of learnable + parameters in a network in an OrderedDict. The parameters are added by + following their creation order in the neural network one by one: + parameters of the previous layers in a network are careted first. + When a user iterates over this dict, he can visit parameters in the + network from button to up. + :type __param_conf__: OrderedDict + :param __gradient_machines__: all of the parameters in a neural network are + appended to a Paddle gradient machine, which is used internally to copy + the parameter values between the C++ and Python end. + :type __gradient_machines__: list + :param __tmp_params__: a dict to store dummy parameters if no + __gradient_machines__ is appended to `Parameters`. 
+ :type __tmp_params__: dict Basically usage is From 346630f413a2e9aa9cbbdf2af4595a461ec09ac0 Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Mon, 28 Aug 2017 11:19:53 +0800 Subject: [PATCH 45/64] Remove "About" tab in "Documentation" --- doc/about/index_cn.md | 11 ----------- doc/about/index_en.rst | 14 -------------- doc/index_en.rst | 1 - 3 files changed, 26 deletions(-) delete mode 100644 doc/about/index_cn.md delete mode 100644 doc/about/index_en.rst diff --git a/doc/about/index_cn.md b/doc/about/index_cn.md deleted file mode 100644 index 3bf030004d..0000000000 --- a/doc/about/index_cn.md +++ /dev/null @@ -1,11 +0,0 @@ -关于PaddlePaddle -================ - -PaddlePaddle是一个最早由百度科学家和工程师共同研发的并行分布式深度学习平台,兼备易用性、高效性、灵活性和可扩展性,目前已被百度内部多个产品线广泛使用。 -PaddlePaddle目前已经开放源码, 但是远未完善,我们希望能在这个基础上不断的改进、扩展和延伸。 -同时我们希望广大开发者积极提供反馈和贡献源代码,建立一个活跃的开源社区。 - -致谢 --------- - -在此,特别感谢PaddlePaddle的[所有贡献者](https://github.com/PaddlePaddle/Paddle/graphs/contributors)。 diff --git a/doc/about/index_en.rst b/doc/about/index_en.rst deleted file mode 100644 index 065c430cde..0000000000 --- a/doc/about/index_en.rst +++ /dev/null @@ -1,14 +0,0 @@ -ABOUT -======= - -PaddlPaddle is an easy-to-use, efficient, flexible and scalable deep learning platform, -which is originally developed by Baidu scientists and engineers for the purpose of applying deep learning to many products at Baidu. - -PaddlePaddle is now open source but far from complete, which is intended to be built upon, improved, scaled, and extended. -We hope to build an active open source community both by providing feedback and by actively contributing to the source code. - - -Credits --------- - -We owe many thanks to `all contributors and developers `_ of PaddlePaddle! diff --git a/doc/index_en.rst b/doc/index_en.rst index 168c7667c6..64684b8b9b 100644 --- a/doc/index_en.rst +++ b/doc/index_en.rst @@ -7,4 +7,3 @@ PaddlePaddle Documentation getstarted/index_en.rst howto/index_en.rst api/index_en.rst - about/index_en.rst From fe51f726a2da85b0cb96734bd9b156760b044cf9 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Mon, 28 Aug 2017 10:44:31 +0800 Subject: [PATCH 46/64] fix cmake --- paddle/math/CMakeLists.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/paddle/math/CMakeLists.txt b/paddle/math/CMakeLists.txt index 8afe6b509d..68b5296228 100644 --- a/paddle/math/CMakeLists.txt +++ b/paddle/math/CMakeLists.txt @@ -16,10 +16,10 @@ file(GLOB MATH_HEADERS . *.h) file(GLOB MATH_SOURCES . *.cpp) if(NOT WITH_MKLDNN) - file(GLOB_RECURSE DNN_HEADER RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.h") - file(GLOB_RECURSE DNN_SOURCES RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "MKLDNN*.cpp") - list(REMOVE_ITEM MATH_HEADERS ${DNN_HEADER}) - list(REMOVE_ITEM MATH_SOURCES ${DNN_SOURCES}) + set(DNN_HEADER "${CMAKE_CURRENT_SOURCE_DIR}/MKLDNNMatrix.h") + set(DNN_SOURCE "${CMAKE_CURRENT_SOURCE_DIR}/MKLDNNMatrix.cpp") + list(REMOVE_ITEM MATH_HEADERS "${DNN_HEADER}") + list(REMOVE_ITEM MATH_SOURCES "${DNN_SOURCE}") message(STATUS "Skip compiling with MKLDNNMatrix") else() message(STATUS "Compile with MKLDNNMatrix") From f0b25c4cfb21b41e8bc7222d44f05a9818dc9b47 Mon Sep 17 00:00:00 2001 From: caoying03 Date: Mon, 28 Aug 2017 12:20:28 +0800 Subject: [PATCH 47/64] follow comments to refine the comments. 
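The `Parameters` docstring being refined in these commits also describes `__tmp_params__`, a plain dict that parks parameter values while no gradient machine has been appended yet. Below is a stand-alone sketch of that fallback; the class is illustrative only, not the real `Parameters` implementation.

```python
import numpy as np

class ParametersSketch(object):
    def __init__(self):
        self.__gradient_machines__ = []   # no C++ gradient machine attached yet
        self.__tmp_params__ = dict()

    def set(self, name, value):
        if not self.__gradient_machines__:
            # nowhere to copy the value into yet, so park it as a dummy param
            self.__tmp_params__[name] = value
        # else: copy the value into every attached gradient machine (omitted)

    def get(self, name):
        return self.__tmp_params__[name]

p = ParametersSketch()
p.set('w', np.zeros(8, dtype=np.float32))
print(p.get('w').shape)   # (8,)
```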
--- python/paddle/v2/parameters.py | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/python/paddle/v2/parameters.py b/python/paddle/v2/parameters.py index cc3adf6f48..4cfd91882e 100644 --- a/python/paddle/v2/parameters.py +++ b/python/paddle/v2/parameters.py @@ -44,21 +44,20 @@ def create(layers): class Parameters(object): """ `Parameters` manages all the learnable parameters in a neural network. - It stores parameters' information in an OrderedDict, key of which is - the name of a parameter, and value related to a key is a parameter's - configuration, such as initialization mean and std, its size, whether it is - a static parameter, and so on. - - :param __param_conf__: this member stores the configurations of learnable - parameters in a network in an OrderedDict. The parameters are added by - following their creation order in the neural network one by one: - parameters of the previous layers in a network are careted first. - When a user iterates over this dict, he can visit parameters in the - network from button to up. + It stores parameters' information in an OrderedDict. The key is + the name of a parameter, and value is a parameter's configuration(in + protobuf format), such as initialization mean and std, its size, whether it + is a static parameter, and so on. + + :param __param_conf__: store the configurations of learnable parameters in + the network in an OrderedDict. Parameter is added one by one into the + dict by following their created order in the network: parameters of + the previous layers in a network are careted first. You can visit the + parameters from bottom to top by iterating over this dict. :type __param_conf__: OrderedDict :param __gradient_machines__: all of the parameters in a neural network are - appended to a Paddle gradient machine, which is used internally to copy - the parameter values between the C++ and Python end. + appended to a PaddlePaddle gradient machine, which is used internally to + copy parameter values between C++ and Python end. :type __gradient_machines__: list :param __tmp_params__: a dict to store dummy parameters if no __gradient_machines__ is appended to `Parameters`. @@ -271,7 +270,7 @@ class Parameters(object): append gradient machine to parameters. This method is used internally in Trainer.train. - :param gradient_machine: Paddle C++ GradientMachine object. + :param gradient_machine: PaddlePaddle C++ GradientMachine object. :type gradient_machine: api.GradientMachine :return: """ From 227fdfb65dcb45921398690610886ebdb9b34d98 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Mon, 28 Aug 2017 13:35:51 +0800 Subject: [PATCH 48/64] Refine NeonDepthwiseConvFunction. --- paddle/function/neon/NeonDepthwiseConv.cpp | 70 ++++++++-------------- 1 file changed, 26 insertions(+), 44 deletions(-) diff --git a/paddle/function/neon/NeonDepthwiseConv.cpp b/paddle/function/neon/NeonDepthwiseConv.cpp index 3fe28b1de3..f09e98587d 100644 --- a/paddle/function/neon/NeonDepthwiseConv.cpp +++ b/paddle/function/neon/NeonDepthwiseConv.cpp @@ -509,10 +509,9 @@ public: size_t filterMultiplier = outputChannels / groups_; CHECK_EQ(inputChannels, groups_); - // only support + // only support strideH() == strideW() and filterHeight == filterWidth. 
CHECK_EQ(strideH(), strideW()); CHECK_EQ(filterHeight, filterWidth); - CHECK_LT(strideH(), size_t(3)); float* inputData = inputs[0].data(); float* filterData = inputs[1].data(); @@ -538,49 +537,32 @@ public: inputWidth += 2 * paddingW(); } - for (size_t i = 0; i < batchSize; i++) { - if (filterWidth == 3 && strideH() == 1) { - DepthwiseConvKernel<3, 1>::run(inputPadding, - filterData, - inputHeight, - inputWidth, - outputChannels, - outputHeight, - outputWidth, - filterMultiplier, - outputData); - } else if (filterWidth == 3 && strideH() == 2) { - DepthwiseConvKernel<3, 2>::run(inputPadding, - filterData, - inputHeight, - inputWidth, - outputChannels, - outputHeight, - outputWidth, - filterMultiplier, - outputData); - } else if (filterWidth == 4 && strideH() == 1) { - DepthwiseConvKernel<4, 1>::run(inputPadding, - filterData, - inputHeight, - inputWidth, - outputChannels, - outputHeight, - outputWidth, - filterMultiplier, - outputData); - } else if (filterWidth == 4 && strideH() == 2) { - DepthwiseConvKernel<4, 2>::run(inputPadding, - filterData, - inputHeight, - inputWidth, - outputChannels, - outputHeight, - outputWidth, - filterMultiplier, - outputData); - } + std::function + DepthWiseConv; + + if (filterWidth == 3 && strideW() == 1) { + DepthWiseConv = DepthwiseConvKernel<3, 1>::run; + } else if (filterWidth == 3 && strideW() == 2) { + DepthWiseConv = DepthwiseConvKernel<3, 2>::run; + } else if (filterWidth == 4 && strideW() == 1) { + DepthWiseConv = DepthwiseConvKernel<4, 1>::run; + } else if (filterWidth == 4 && strideW() == 2) { + DepthWiseConv = DepthwiseConvKernel<4, 2>::run; + } else { + LOG(FATAL) << "Not supported"; + } + for (size_t i = 0; i < batchSize; i++) { + DepthWiseConv(inputPadding, + filterData, + inputHeight, + inputWidth, + outputChannels, + outputHeight, + outputWidth, + filterMultiplier, + outputData); inputPadding += inputChannels * inputHeight * inputWidth; outputData += outputChannels * outputHeight * outputWidth; } From 3a75b4b70cd21449691eaca82f1805759622e640 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Mon, 28 Aug 2017 14:49:11 +0800 Subject: [PATCH 49/64] Fix CMakeLists.text --- paddle/function/CMakeLists.txt | 2 +- paddle/function/DepthwiseConvOpTest.cpp | 4 ++++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/paddle/function/CMakeLists.txt b/paddle/function/CMakeLists.txt index 05f808a6a1..f43f15e5ca 100644 --- a/paddle/function/CMakeLists.txt +++ b/paddle/function/CMakeLists.txt @@ -44,11 +44,11 @@ if(WITH_GPU) add_simple_unittest(RowConvOpTest) add_simple_unittest(BlockExpandOpTest) add_simple_unittest(CropOpTest) - add_simple_unittest(DepthwiseConvOpTest) endif() add_simple_unittest(Im2ColTest) add_simple_unittest(GemmConvOpTest) +add_simple_unittest(DepthwiseConvOpTest) endif() add_style_check_target(paddle_function ${h_files}) diff --git a/paddle/function/DepthwiseConvOpTest.cpp b/paddle/function/DepthwiseConvOpTest.cpp index bdace2c372..d8e8c889d5 100644 --- a/paddle/function/DepthwiseConvOpTest.cpp +++ b/paddle/function/DepthwiseConvOpTest.cpp @@ -34,9 +34,13 @@ TEST(DepthwiseConv, BackwardFilter) { } #endif +#if defined(__ARM_NEON__) || defined(__ARM_NEON) + TEST(DepthwiseConv, Forward) { DepthwiseConvolution( "GemmConv-CPU", "NeonDepthwiseConv-CPU", forward); } +#endif + } // namespace paddle From 34a92ab41a407679d454f437f1f3118b81dd1b34 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Mon, 28 Aug 2017 14:58:00 +0800 Subject: [PATCH 50/64] ExpandConvLayer adds support of arm-neon acceleration. 
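Both the `NeonDepthwiseConv` refactor above and the layer change below follow the same pattern: select the concrete kernel once, before the per-sample loop, instead of re-branching for every sample. A minimal Python sketch of that dispatch pattern (all names below are illustrative placeholders):

```python
def conv3x3_s1(sample): return "3x3/s1:" + sample     # placeholder kernels
def conv3x3_s2(sample): return "3x3/s2:" + sample
def conv4x4_s1(sample): return "4x4/s1:" + sample
def conv4x4_s2(sample): return "4x4/s2:" + sample

KERNELS = {(3, 1): conv3x3_s1, (3, 2): conv3x3_s2,
           (4, 1): conv4x4_s1, (4, 2): conv4x4_s2}

def depthwise_conv(batch, filter_size, stride):
    kernel = KERNELS.get((filter_size, stride))
    if kernel is None:
        raise NotImplementedError("Not supported")    # mirrors LOG(FATAL)
    return [kernel(sample) for sample in batch]       # one lookup, reused

print(depthwise_conv(["img0", "img1"], 3, 2))
```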
--- paddle/gserver/layers/ExpandConvLayer.cpp | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/paddle/gserver/layers/ExpandConvLayer.cpp b/paddle/gserver/layers/ExpandConvLayer.cpp index 0ece279931..0e84581769 100644 --- a/paddle/gserver/layers/ExpandConvLayer.cpp +++ b/paddle/gserver/layers/ExpandConvLayer.cpp @@ -29,6 +29,10 @@ namespace paddle { REGISTER_LAYER(exconv, ExpandConvLayer); REGISTER_LAYER(exconvt, ExpandConvLayer); +inline bool isDepthwiseConv(int channels, int groups) { + return channels == groups; +} + bool ExpandConvLayer::init(const LayerMap &layerMap, const ParameterMap ¶meterMap) { /* Initialize the basic convolutional parent class */ @@ -47,14 +51,23 @@ bool ExpandConvLayer::init(const LayerMap &layerMap, std::vector paddings = {(size_t)paddingY_[i], (size_t)padding_[i]}; std::vector strides = {(size_t)strideY_[i], (size_t)stride_[i]}; - if (useGpu_ && (size_t)groups_[i] == (size_t)channels_[i] && !isDeconv_) { + // Convolution Layer uses the GemmConv function by default. + convType = "GemmConv"; + convGradInputType = "GemmConvGradInput"; + convGradFilterType = "GemmConvGradFilter"; + + // If depth wise convolution and useGpu == true + if (useGpu_ && isDepthwiseConv(channels_[i], groups_[i]) && !isDeconv_) { convType = "DepthwiseConv"; convGradInputType = "DepthwiseConvGradInput"; convGradFilterType = "DepthwiseConvGradFilter"; - } else { - convType = "GemmConv"; - convGradInputType = "GemmConvGradInput"; - convGradFilterType = "GemmConvGradFilter"; + } + + // If depth wise convolution and useGpu == false and ARM-NEON + if (!useGpu_ && isDepthwiseConv(channels_[i], groups_[i]) && !isDeconv_) { +#if defined(__ARM_NEON__) || defined(__ARM_NEON) + convType = "NeonDepthwiseConv"; +#endif } if (FLAGS_use_nnpack && !isDeconv_) { From e63ad0a6bdb36967d417633a074e0e966ca55e78 Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Mon, 28 Aug 2017 15:15:26 +0800 Subject: [PATCH 51/64] HuberRegressionLoss and HuberTwoClassification support multi-dimension data --- paddle/gserver/layers/CostLayer.cpp | 67 ++++++++++++++++++----------- 1 file changed, 41 insertions(+), 26 deletions(-) diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp index 7f648070f2..aa4a26a83f 100644 --- a/paddle/gserver/layers/CostLayer.cpp +++ b/paddle/gserver/layers/CostLayer.cpp @@ -611,22 +611,26 @@ void HuberRegressionLoss::forwardImp(Matrix& output, Matrix& target) { HuberCost::forwardImp(output, label, target); size_t numSamples = target.getHeight(); + size_t dim = output.getWidth(); CHECK(label.value); CHECK_EQ((*label.value).getHeight(), numSamples); CHECK_EQ(output.getHeight(), numSamples); - CHECK_EQ(output.getWidth(), (*label.value).getWidth()); + CHECK_EQ(dim, (*label.value).getWidth()); CHECK_EQ(target.getWidth(), (size_t)1); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); real* lbl = useGpu_ ? 
tmpCpuInput_[1].value->getData() : (*label.value).getData(); - std::vector cost(numSamples); + std::vector cost(numSamples, 0); for (size_t i = 0; i < numSamples; ++i) { - real a = std::abs(lbl[i] - out[i]); - if (a <= delta_) - cost[i] = a * a / 2; - else - cost[i] = delta_ * (a - delta_ / 2); + for (size_t j = 0; j < dim; ++j) { + int index = i * dim + j; + real a = std::abs(lbl[index] - out[index]); + if (a <= delta_) + cost[i] += a * a / 2; + else + cost[i] += delta_ * (a - delta_ / 2); + } } target.copyFrom(cost.data(), numSamples); } @@ -635,18 +639,22 @@ void HuberRegressionLoss::backwardImp(Matrix& output, Argument& label, Matrix& outputG) { size_t numSamples = output.getHeight(); + size_t dim = output.getWidth(); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); real* lbl = useGpu_ ? tmpCpuInput_[1].value->getData() : (*label.value).getData(); real* grad = useGpu_ ? tmpCpuInput_[0].grad->getData() : outputG.getData(); for (size_t i = 0; i < numSamples; ++i) { - real a = lbl[i] - out[i]; - if (std::abs(a) <= delta_) - grad[i] += -a; - else - grad[i] += a > 0 ? -delta_ : delta_; + for (size_t j = 0; j < dim; ++j) { + int index = i * dim + j; + real a = lbl[index] - out[index]; + if (std::abs(a) <= delta_) + grad[index] += -a; + else + grad[index] += a > 0 ? -delta_ : delta_; + } } - if (useGpu_) outputG.copyFrom(grad, numSamples); + if (useGpu_) outputG.copyFrom(grad, numSamples * dim); } // @@ -664,23 +672,25 @@ void HuberTwoClassification::forwardImp(Matrix& output, Matrix& target) { HuberCost::forwardImp(output, label, target); size_t numSamples = target.getHeight(); + size_t dim = output.getWidth(); CHECK(label.ids); CHECK_EQ((*label.ids).getSize(), numSamples); CHECK_EQ(output.getHeight(), numSamples); - CHECK_EQ(output.getWidth(), (size_t)1); CHECK_EQ(target.getWidth(), (size_t)1); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); int* lbl = useGpu_ ? tmpCpuInput_[1].ids->getData() : (*label.ids).getData(); - std::vector cost(numSamples); + std::vector cost(numSamples, 0); for (size_t i = 0; i < numSamples; ++i) { int y = 2 * lbl[i] - 1; - if (out[i] * y < -1) - cost[i] = -4 * out[i] * y; - else if (out[i] * y < 1) - cost[i] = (1 - out[i] * y) * (1 - out[i] * y); - else - cost[i] = 0; + for (size_t j = 0; j < dim; ++j) { + int index = i * dim + j; + real a = out[index] * y; + if (a < -1) + cost[i] += -4 * a; + else if (a < 1) + cost[i] += (1 - a) * (1 - a); + } } target.copyFrom(cost.data(), numSamples); } @@ -689,17 +699,22 @@ void HuberTwoClassification::backwardImp(Matrix& output, Argument& label, Matrix& outputG) { size_t numSamples = output.getHeight(); + size_t dim = output.getWidth(); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); int* lbl = useGpu_ ? tmpCpuInput_[1].ids->getData() : (*label.ids).getData(); real* grad = useGpu_ ? tmpCpuInput_[0].grad->getData() : outputG.getData(); for (size_t i = 0; i < numSamples; ++i) { int y = 2 * lbl[i] - 1; - if (y * out[i] < -1) - grad[i] += -4 * y; - else if (y * out[i] < 1) - grad[i] += -2 * (1 - y * out[i]) * y; + for (size_t j = 0; j < dim; ++j) { + int index = i * dim + j; + real a = out[index] * y; + if (a < -1) + grad[index] += -4 * y; + else if (a < 1) + grad[index] += -2 * (1 - a) * y; + } } - if (useGpu_) outputG.copyFrom(grad, numSamples); + if (useGpu_) outputG.copyFrom(grad, numSamples * dim); } /** * This cost layer compute the sum of its input as loss. 
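For reference, the costs computed by the two loss layers patched above can be stated compactly. The formulas below only restate the code in math form: $\delta$ is the layer's `delta_`, $D$ is the output width, and for the classification case each output component $o$ is paired with $y = 2\,\mathrm{label} - 1 \in \{-1, +1\}$.

$$HuberRegression(o, l) = \sum_{j=1}^{D} \begin{cases} \frac{1}{2}(l_j - o_j)^2, & |l_j - o_j| \le \delta \\ \delta \left(|l_j - o_j| - \frac{\delta}{2}\right), & \text{otherwise} \end{cases}$$

$$HuberTwoClass(o, y) = \begin{cases} -4\,o\,y, & o\,y < -1 \\ (1 - o\,y)^2, & -1 \le o\,y < 1 \\ 0, & \text{otherwise} \end{cases}$$

(For multi-dimensional output the classification cost is likewise summed over the components.)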
From 5df384d67ff498c9438b2ef7dc9566af7d50c97a Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Mon, 28 Aug 2017 19:36:18 +0800 Subject: [PATCH 52/64] Remove NeonDepthwiseConv.h --- paddle/function/neon/NeonDepthwiseConv.h | 25 ------------------------ 1 file changed, 25 deletions(-) delete mode 100644 paddle/function/neon/NeonDepthwiseConv.h diff --git a/paddle/function/neon/NeonDepthwiseConv.h b/paddle/function/neon/NeonDepthwiseConv.h deleted file mode 100644 index 23e4be1921..0000000000 --- a/paddle/function/neon/NeonDepthwiseConv.h +++ /dev/null @@ -1,25 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -namespace paddle { - -namespace neon { - -template -struct DepthwiseConvKernel {}; - -} // namespace neon -} // namespace paddle From 4f0c071e4909ff041f3a86c3a40c482becf50845 Mon Sep 17 00:00:00 2001 From: qijun Date: Mon, 28 Aug 2017 22:18:11 +0800 Subject: [PATCH 53/64] refine backward --- paddle/framework/backward.cc | 5 ++++- paddle/operators/net_op.cc | 9 ++++++--- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/paddle/framework/backward.cc b/paddle/framework/backward.cc index bfda18724c..6b4c612cd8 100644 --- a/paddle/framework/backward.cc +++ b/paddle/framework/backward.cc @@ -124,6 +124,9 @@ static std::unique_ptr BackwardRecursive( std::list insert_position; for (auto& dup_output_op : dup_output_ops) { const std::string& name = dup_output_op.first; + // duplicate @Empty@ don't need to be added + if (name == kEmptyVarName) continue; + auto& dup_op = dup_output_op.second; // no duplicate output if (dup_op.size() == 1) continue; @@ -209,7 +212,7 @@ std::unique_ptr Backward( const OperatorBase& forwardOp, const std::unordered_set& no_grad_vars) { std::unordered_set no_grad_names; - no_grad_names.reserve(no_grad_vars.size()); + no_grad_names.reserve(no_grad_vars.size() + 1); no_grad_names.insert(std::string(kEmptyVarName) + kGradVarSuffix); diff --git a/paddle/operators/net_op.cc b/paddle/operators/net_op.cc index 44d925f0b0..78b5e27678 100644 --- a/paddle/operators/net_op.cc +++ b/paddle/operators/net_op.cc @@ -31,10 +31,13 @@ void NetOp::CompleteAddOp(bool calc) { for (auto& op : ops_) { for (auto& ipt : op->Inputs()) { for (auto& var_name : ipt.second) { - if (!Contains(output_set, var_name)) { // Not other op's output - input_set.insert(var_name); - } else { + // If input variable has been in output set, then it will be + // added into intermediate_outputs_. Otherwise, it will be + // added into input set. 
+ if (Contains(output_set, var_name)) { intermediate_outputs_.insert(var_name); + } else { + input_set.insert(var_name); } } } From 980edfa69a72f57dea689d1d5b1bff6b388e7a71 Mon Sep 17 00:00:00 2001 From: fengjiayi Date: Mon, 28 Aug 2017 11:34:24 -0700 Subject: [PATCH 54/64] Refine backward document --- paddle/framework/backward.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md index 133b17c7be..ce324a73f0 100644 --- a/paddle/framework/backward.md +++ b/paddle/framework/backward.md @@ -6,9 +6,16 @@ In Neural Network, the backpropagation algorithm follows the chain rule, so we n ## Backward Operator Registry -A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. In most cases, there is a one-to-one correspondence between forward and backward operators. We use registry mechanism to save these correspondences. +A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. -For example, we have got a `add_two_op`, and is registered by the following code: +-| | forward operator | backward operator +-| ---------------------- | ---------------- |------------------------- | +-| **Operator::inputs_** | Inputs | Inputs, Outputs, OutputGradients | +-| **Operator::outputs_** | Outputs | InputGradients | + + In most cases, there is a one-to-one correspondence between forward and backward operators. These correspondences are recorded by a global hash map(`OpInfoMap`). To follow the philosophy of minimum core and make operators pluggable, the registry mechanism is introduced. + +For example, we have got a `add_two_op`, and we can register it's information and corresponding backward operator by the following macro: ```cpp REGISTER_OP(add_two, AddTwoOp, AddTwoOpMaker, add_two_grad, AddTwoGradOp); From eaeb69f98f70bbea4fe4aae9f7c7b830f75959c5 Mon Sep 17 00:00:00 2001 From: fengjiayi Date: Mon, 28 Aug 2017 13:47:37 -0700 Subject: [PATCH 55/64] Follow reviewer's comments --- paddle/framework/backward.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/paddle/framework/backward.md b/paddle/framework/backward.md index ce324a73f0..8aa6728a95 100644 --- a/paddle/framework/backward.md +++ b/paddle/framework/backward.md @@ -2,28 +2,28 @@ ## Motivation -In Neural Network, the backpropagation algorithm follows the chain rule, so we need to compound the fundmental gradient operators/expressions together with chain rule . Every forward network need a backward network to construct the full computation lineage, the operator/expression's backward pass will be generated respect to forward pass. +In Neural Network, the backpropagation algorithm follows the chain rule, so we need to compound the fundmental gradient operators/expressions together with chain rule . Every forward network need a backward network to construct the full computation graph, the operator/expression's backward pass will be generated respect to forward pass. ## Backward Operator Registry A backward network is built up with several backward operators. Backward operators take forward operators' inputs, outputs and output gradients and then calculate its input gradients. 
--| | forward operator | backward operator --| ---------------------- | ---------------- |------------------------- | --| **Operator::inputs_** | Inputs | Inputs, Outputs, OutputGradients | --| **Operator::outputs_** | Outputs | InputGradients | +| | forward operator | backward operator +| ---------------------- | ---------------- |------------------------- | +| **Operator::inputs_** | Inputs | Inputs, Outputs, OutputGradients | +| **Operator::outputs_** | Outputs | InputGradients | In most cases, there is a one-to-one correspondence between forward and backward operators. These correspondences are recorded by a global hash map(`OpInfoMap`). To follow the philosophy of minimum core and make operators pluggable, the registry mechanism is introduced. -For example, we have got a `add_two_op`, and we can register it's information and corresponding backward operator by the following macro: +For example, we have got a `mul_op`, and we can register it's information and corresponding backward operator by the following macro: ```cpp -REGISTER_OP(add_two, AddTwoOp, AddTwoOpMaker, add_two_grad, AddTwoGradOp); +REGISTER_OP(mul, MulOp, MulOpMaker, mul_grad, MulOpGrad); ``` -`add_two` is the operator's type. `AddTwoOp` and `AddTwoOpMaker` are the operator class and the operator maker class respectively. +`mul` is the operator's type. `MulOp` and `MulOpMaker` are the operator class and the operator maker class respectively. -`add_two_grad` is the type of backward operator, and `AddTwoGradOp` is its class name. +`mul_grad` is the type of backward operator, and `MulOpGrad` is its class name. ## Backward Opeartor Creating From c19eae4c8e7923aa52dc05560dcc91b8b6d58de8 Mon Sep 17 00:00:00 2001 From: qingqing01 Date: Tue, 29 Aug 2017 15:46:52 +0800 Subject: [PATCH 56/64] update doc about how to write new operators. 
--- doc/howto/dev/new_op_cn.md | 56 +++++++++++++------ .../v2/framework/tests/gradient_checker.py | 2 +- 2 files changed, 41 insertions(+), 17 deletions(-) diff --git a/doc/howto/dev/new_op_cn.md b/doc/howto/dev/new_op_cn.md index ebd2cf3ff0..228b3fd643 100644 --- a/doc/howto/dev/new_op_cn.md +++ b/doc/howto/dev/new_op_cn.md @@ -5,12 +5,13 @@ - [定义ProtoMaker类](#定义ProtoMaker类) - [定义Operator类](#定义Operator类) - [定义OpKernel类](#定义OpKernel类) - - [注册类](#注册类) + - [注册Operator](#注册Operator) - [编译](#编译) - [绑定Python](#绑定Python) - [实现单元测试](#实现单元测试) - [前向Operator单测](#前向Operator单测) - [反向Operator单测](#反向Operator单测) + - [编译和执行](#编译和执行) ## 概念简介 @@ -22,19 +23,17 @@ - `framework::OperatorWithKernel`:继承自OperatorBase,Op有计算函数,称作有Kernel。 - `class OpProtoAndCheckerMaker`:描述该Op的输入、输出、属性、注释,主要用于Python API接口生成 -依据是否包含kernel,将Op分为两种:包含Kernel的Op和不包含kernel的Op,前者Op的定义继承自`OperatorBase`,后者继承自`OperatorWithKernel`。本教程主要介绍带Kernel的Op如何写,简单总结如下: +依据是否包含kernel,将Op分为两种:包含Kernel的Op和不包含kernel的Op,前者Op的定义继承自`OperatorBase`,后者继承自`OperatorWithKernel`。本教程主要介绍带Kernel的Op如何写,简单总结Op需要包含的内容如下: -Forward Op需要包含: - - - OpProtoMake定义 - - Op定义 - - Kernel实现 + + 内容 | 定义位置 +-------------- | :---------------------- +OpProtoMake定义 | `.cc`文件,Backward Op不需要定义OpProtoMake +Op定义 | `.cc`文件 +Kernel实现 | CPU、GPU共享Kernel在`.h`文件,否则,CPU可以在`.cc`文件,GPU可在`.cu`文件。 +注册Op | Op注册在`.cc`文件;Kernel注册CPU在`.cc`文件,GPU在`.cu`文件 + -与之对应的Backward Op包含: - - - Op定义 - - Kernel实现 - 下面以矩阵乘操作,即[MulOp](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/mul_op.cc)为例来介绍如何写带Kernel的Operator。 @@ -137,8 +136,9 @@ MulOp(const std::string &type, const framework::VariableNameMap &inputs, ``` 还需要重写`InferShape`接口。`InferShape`为const函数,不能修改Op的成员变量,参数为`const framework::InferShapeContext &ctx`,通过该参数可获取到输入输出以及属性。它的功能是: - - 1). 做检查, 尽早报错:检查输入数据维度、类型等是否合法 - - 2). 设置输出Tensor的形状 + + - 1). 做检查, 尽早报错:检查输入数据维度、类型等是否合法。 + - 2). 设置输出Tensor的形状。 通常`OpProtoMaker`和`Op`类的定义写在`.cc`文件中,和要讲到的注册函数一起放在`.cc`中 @@ -172,7 +172,7 @@ class MulKernel : public framework::OpKernel { 到此前向Op实现完成,需要在`.cc`文件中注册该op和kernel。反向Op类的定义和Kernel定义与前向Op类似,这里不再重复。但注意,反向Op没有`ProtoMaker`。 -### 4. 注册类 +### 4. 注册Operator 在`.cc`文件中注册前向、反向Op类,注册CPU Kernel。 @@ -297,4 +297,28 @@ class TestMulOp(unittest.TestCase): - 调用`create_op("mul")`创建反向Op对应的前向Op。 - 定义输入`inputs`。 - 调用`compare_grad`函数对比CPU、GPU计算结果。 - - 调用`check_grad`检查梯度稳定性。 + - 调用`check_grad`检查梯度稳定性,这里采用数值法检测梯度正确性。 + - 第一个参数`op` : 前向op。 + - 第二个参数`inputs` : 输入词典,词典的Key和`ProtoMaker`定义保持一致。 + - 第三个参数`set(["X", "Y"])` : 指定对输入变量`X`、`Y`做梯度检测。 + - 第四个参数`"Out"` : 指定前向网络最终的输出目标变量`Out` + + +### 编译和执行 + +单测完成之后,在[`python/paddle/v2/framework/tests/CMakeLists.txt`](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/framework/tests/CMakeLists.txt)里添加编译: + +``` +py_test(test_mul_op SRCS test_mul_op.py) +``` + +编译完成之后即可执行单测: + +``` +make test ARGS="-R test_mul_op -V" +``` +或者: + +``` +ctest -R test_mul_op +``` diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index 9a7a7fbf5e..02cfb9b2c4 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -268,7 +268,7 @@ class GradientChecker(unittest.TestCase): :param input_vars: numpy value of input variable. The following computation will use these variables. :param inputs_to_check: inputs var names that should check gradient. - :param output_name: output name that used to + :param output_name: the final output variable name. :param max_relative_error: The relative tolerance parameter. 
:param no_grad_set: used when create backward ops :param only_cpu: only compute and check gradient on cpu kernel. From b336119424d3fc0d9ffa39688612a83c23c6e10e Mon Sep 17 00:00:00 2001 From: qingqing01 Date: Tue, 29 Aug 2017 16:03:07 +0800 Subject: [PATCH 57/64] Add WITH_TESTING=ON for cmake in the operators writing guide doc. --- doc/howto/dev/new_op_cn.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/howto/dev/new_op_cn.md b/doc/howto/dev/new_op_cn.md index 228b3fd643..7f8da2da5a 100644 --- a/doc/howto/dev/new_op_cn.md +++ b/doc/howto/dev/new_op_cn.md @@ -312,7 +312,7 @@ class TestMulOp(unittest.TestCase): py_test(test_mul_op SRCS test_mul_op.py) ``` -编译完成之后即可执行单测: +编译时需要打开`WITH_TESTING`, 即 `cmake paddle_dir -DWITH_TESTING=ON`,编译成功之后执行单测命令为: ``` make test ARGS="-R test_mul_op -V" From b709af616f99c7f4e3ab300297608054638886a8 Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Tue, 29 Aug 2017 16:21:45 +0800 Subject: [PATCH 58/64] HuberTwoClassification only support one dimension --- paddle/gserver/layers/CostLayer.cpp | 31 +++++++++++------------------ 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/paddle/gserver/layers/CostLayer.cpp b/paddle/gserver/layers/CostLayer.cpp index aa4a26a83f..ce071323ff 100644 --- a/paddle/gserver/layers/CostLayer.cpp +++ b/paddle/gserver/layers/CostLayer.cpp @@ -672,10 +672,10 @@ void HuberTwoClassification::forwardImp(Matrix& output, Matrix& target) { HuberCost::forwardImp(output, label, target); size_t numSamples = target.getHeight(); - size_t dim = output.getWidth(); CHECK(label.ids); CHECK_EQ((*label.ids).getSize(), numSamples); CHECK_EQ(output.getHeight(), numSamples); + CHECK_EQ(output.getWidth(), (size_t)1); CHECK_EQ(target.getWidth(), (size_t)1); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); @@ -683,14 +683,11 @@ void HuberTwoClassification::forwardImp(Matrix& output, std::vector cost(numSamples, 0); for (size_t i = 0; i < numSamples; ++i) { int y = 2 * lbl[i] - 1; - for (size_t j = 0; j < dim; ++j) { - int index = i * dim + j; - real a = out[index] * y; - if (a < -1) - cost[i] += -4 * a; - else if (a < 1) - cost[i] += (1 - a) * (1 - a); - } + real a = out[i] * y; + if (a < -1) + cost[i] = -4 * a; + else if (a < 1) + cost[i] = (1 - a) * (1 - a); } target.copyFrom(cost.data(), numSamples); } @@ -699,22 +696,18 @@ void HuberTwoClassification::backwardImp(Matrix& output, Argument& label, Matrix& outputG) { size_t numSamples = output.getHeight(); - size_t dim = output.getWidth(); real* out = useGpu_ ? tmpCpuInput_[0].value->getData() : output.getData(); int* lbl = useGpu_ ? tmpCpuInput_[1].ids->getData() : (*label.ids).getData(); real* grad = useGpu_ ? tmpCpuInput_[0].grad->getData() : outputG.getData(); for (size_t i = 0; i < numSamples; ++i) { int y = 2 * lbl[i] - 1; - for (size_t j = 0; j < dim; ++j) { - int index = i * dim + j; - real a = out[index] * y; - if (a < -1) - grad[index] += -4 * y; - else if (a < 1) - grad[index] += -2 * (1 - a) * y; - } + real a = out[i] * y; + if (a < -1) + grad[i] += -4 * y; + else if (a < 1) + grad[i] += -2 * (1 - a) * y; } - if (useGpu_) outputG.copyFrom(grad, numSamples * dim); + if (useGpu_) outputG.copyFrom(grad, numSamples); } /** * This cost layer compute the sum of its input as loss. From 751d8533e830eae0ca6b9ee6e62e771a4b72a14b Mon Sep 17 00:00:00 2001 From: qingqing01 Date: Tue, 29 Aug 2017 16:45:20 +0800 Subject: [PATCH 59/64] follow wuyi's comments. 
--- python/paddle/v2/framework/tests/gradient_checker.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/paddle/v2/framework/tests/gradient_checker.py b/python/paddle/v2/framework/tests/gradient_checker.py index 02cfb9b2c4..518f828bac 100644 --- a/python/paddle/v2/framework/tests/gradient_checker.py +++ b/python/paddle/v2/framework/tests/gradient_checker.py @@ -268,7 +268,7 @@ class GradientChecker(unittest.TestCase): :param input_vars: numpy value of input variable. The following computation will use these variables. :param inputs_to_check: inputs var names that should check gradient. - :param output_name: the final output variable name. + :param output_name: the output variable name of forward network. :param max_relative_error: The relative tolerance parameter. :param no_grad_set: used when create backward ops :param only_cpu: only compute and check gradient on cpu kernel. From bfbd066fdd1c4a81266864bf837d89742b3f2ad6 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Tue, 29 Aug 2017 19:55:44 +0800 Subject: [PATCH 60/64] refine --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 117 ++++++++++++------------ paddle/gserver/layers/MKLDNNFcLayer.h | 2 + paddle/gserver/layers/MKLDNNLayer.h | 48 +++++++--- paddle/math/MKLDNNMatrix.cpp | 25 ++--- paddle/math/MKLDNNMatrix.h | 29 +++--- 5 files changed, 118 insertions(+), 103 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index d38e6a2099..a08cca318e 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -77,6 +77,24 @@ void MKLDNNFcLayer::convertWeightsToPaddle() { wgtVal_->reorderDataTo(wgtVal_, dstFmt, targetDim); } +void MKLDNNFcLayer::convertOutputToOtherDevice() { + copyOutputInfoToOtherDevice(); + // find other cpu device and reorder output to cpu device + int cnt = 0; + for (size_t i = 0; i < outputOtherDevice_.size(); i++) { + if (outputOtherDevice_[i].deviceId == CPU_DEVICE) { + // fc cpu output value do not need convert + // just share point + outputOtherDevice_[i].value = output_.value; + ++cnt; + } + } + + if (cnt > 1) { + LOG(WARNING) << "should not have more than one CPU devie"; + } +} + void MKLDNNFcLayer::reshape() { const Argument& input = getInput(0, getPrev(0)->getDeviceId()); int batchSize = input.getBatchSize(); @@ -116,7 +134,7 @@ void MKLDNNFcLayer::resetFwd() { const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr; const MatrixPtr& out = output_.value; - if (prevIsMKLDNN()) { + if (prevIsOnlyMKLDNN()) { const MatrixPtr& in = getInputValue(0); inVal_ = std::dynamic_pointer_cast(in); CHECK(inVal_) << "Input should be MKLDNNMatrix"; @@ -136,30 +154,21 @@ void MKLDNNFcLayer::resetFwd() { // change original output value to mkldnn output value output_.value = std::dynamic_pointer_cast(outVal_); - if (!nextIsMKLDNN()) { - Argument cpuOutput; - for (size_t i = 0; i < outputOtherDevice_.size(); i++) { - if (outputOtherDevice_[i].deviceId == CPU_DEVICE) { - cpuOutput = outputOtherDevice_[i]; - } - } - cpuOutput.setFrameHeight(output_.getFrameHeight()); - cpuOutput.setFrameWidth(output_.getFrameWidth()); - - // fc cpu output value do not need convert - cpuOutput.value = output_.value; + if (!nextIsOnlyMKLDNN()) { + convertOutputToOtherDevice(); } // create forward handle prop_kind pk = prop_kind::forward; - fc_fwd::desc fwdDesc = - hasBias ? 
fc_fwd::desc(pk, - inVal_->getMD(), - wgtVal_->getMD(), - biasVal_->getMD(), - outVal_->getMD()) - : fc_fwd::desc( - pk, inVal_->getMD(), wgtVal_->getMD(), outVal_->getMD()); + fc_fwd::desc fwdDesc = hasBias ? fc_fwd::desc(pk, + inVal_->getMemoryDesc(), + wgtVal_->getMemoryDesc(), + biasVal_->getMemoryDesc(), + outVal_->getMemoryDesc()) + : fc_fwd::desc(pk, + inVal_->getMemoryDesc(), + wgtVal_->getMemoryDesc(), + outVal_->getMemoryDesc()); fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_); if (hasBias) { fwd_.reset(new fc_fwd(fwdPD, *inVal_, *wgtVal_, *biasVal_, *outVal_)); @@ -184,36 +193,38 @@ void MKLDNNFcLayer::resetBwd() { const MatrixPtr& wgt = weight_->getWGrad(); const MatrixPtr& bias = hasBias ? biases_->getWGrad() : nullptr; - // TODO(TJ): merge topdiffs - if (nextIsMKLDNN()) { + // TODO(TJ): merge outgrad + if (nextIsOnlyMKLDNN()) { // can not directly cast outputgrad to mkldnnmatrix, // since each layer can not write the inputgrad to mkldnn inputgrad. // So just create from matrix with outputvalue format. const MatrixPtr& out = getOutput(MKLDNN_DEVICE).grad; - outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD()); + outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc()); } else { const MatrixPtr& out = getOutput(CPU_DEVICE).grad; // fc do not need to convert from cpu device since output always nc // only need create from cpu device - outGrad_ = MKLDNNMatrix::create(out, outVal_->getPD()); + outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc()); } - wgtGrad_ = MKLDNNMatrix::create(wgt, wgtVal_->getPD()); - biasGrad_ = hasBias ? MKLDNNMatrix::create(bias, biasVal_->getPD()) : nullptr; + wgtGrad_ = MKLDNNMatrix::create(wgt, wgtVal_->getPrimitiveDesc()); + biasGrad_ = hasBias ? MKLDNNMatrix::create(bias, biasVal_->getPrimitiveDesc()) + : nullptr; // create memory primitive desc fc_fwd::desc fwdDesc = fc_fwd::desc(prop_kind::forward, - inVal_->getMD(), - wgtGrad_->getMD(), - outGrad_->getMD()); + inVal_->getMemoryDesc(), + wgtGrad_->getMemoryDesc(), + outGrad_->getMemoryDesc()); fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_); - fc_bwdWgt::desc bwdWgtDesc = - hasBias ? fc_bwdWgt::desc(inVal_->getMD(), - wgtGrad_->getMD(), - biasGrad_->getMD(), - outGrad_->getMD()) - : fc_bwdWgt::desc( - inVal_->getMD(), wgtGrad_->getMD(), outGrad_->getMD()); + fc_bwdWgt::desc bwdWgtDesc = hasBias + ? fc_bwdWgt::desc(inVal_->getMemoryDesc(), + wgtGrad_->getMemoryDesc(), + biasGrad_->getMemoryDesc(), + outGrad_->getMemoryDesc()) + : fc_bwdWgt::desc(inVal_->getMemoryDesc(), + wgtGrad_->getMemoryDesc(), + outGrad_->getMemoryDesc()); fc_bwdWgt::primitive_desc bwdWgtPD = fc_bwdWgt::primitive_desc(bwdWgtDesc, engine_, fwdPD); @@ -227,30 +238,20 @@ void MKLDNNFcLayer::resetBwd() { pipelineBwd_.push_back(*bwdWgt_); /// backward data - if (prevIsMKLDNN()) { - const MatrixPtr& in = getInputGrad(0, MKLDNN_DEVICE); - if (in == nullptr) { - return; - } - if (getInput(0, MKLDNN_DEVICE).getAllCount() > 1) { - // TODO(TJ): use outputMaps_ ways when merge topdiff done - } else { - inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); - } + int device = prevIsOnlyMKLDNN() ? 
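+      // grads are written straight to the MKLDNN buffer when the previous
+      // layer is MKLDNN-only, otherwise to the shared CPU buffer: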
MKLDNN_DEVICE : CPU_DEVICE; + const MatrixPtr& in = getInputGrad(0, device); + if (in == nullptr) { + return; + } + if (getInput(0, device).getAllCount() > 1) { + // TODO(TJ): use outputMaps_ ways when merge outgrad done } else { - const MatrixPtr& in = getInputGrad(0, CPU_DEVICE); - if (in == nullptr) { - return; - } - if (getInput(0, CPU_DEVICE).getAllCount() > 1) { - // TODO(TJ): use outputMaps_ ways when merge topdiff done - } else { - inGrad_ = MKLDNNMatrix::create(in, inVal_->getPD()); - } + inGrad_ = MKLDNNMatrix::create(in, inVal_->getPrimitiveDesc()); } - fc_bwdData::desc bwdDataDesc = - fc_bwdData::desc(inVal_->getMD(), wgtGrad_->getMD(), outGrad_->getMD()); + fc_bwdData::desc bwdDataDesc = fc_bwdData::desc(inVal_->getMemoryDesc(), + wgtGrad_->getMemoryDesc(), + outGrad_->getMemoryDesc()); fc_bwdData::primitive_desc bwdDataPD = fc_bwdData::primitive_desc(bwdDataDesc, engine_, fwdPD); diff --git a/paddle/gserver/layers/MKLDNNFcLayer.h b/paddle/gserver/layers/MKLDNNFcLayer.h index e2657a8d5e..e138a6faf1 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.h +++ b/paddle/gserver/layers/MKLDNNFcLayer.h @@ -72,6 +72,8 @@ protected: * only would be called when needed */ void resetBwd(); + + void convertOutputToOtherDevice() override; }; } // namespace paddle diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index 3dd17a36ff..8fe9630e82 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -86,10 +86,7 @@ public: CHECK(FLAGS_use_mkldnn) << "MkldnnLayers only support use_mkldnn." << "Please set WITH_MKLDNN=ON " << "and set use_mkldnn=True"; - if (useGpu_ == true) { - LOG(WARNING) << "Do not support GPU yet, will change to useGpu = false"; - useGpu_ = false; - } + CHECK(!useGpu_) << "Do not support GPU yet"; // set device id before Layer::init setDevice(MKLDNN_DEVICE); @@ -116,6 +113,12 @@ public: */ virtual void convertWeightsToPaddle() {} + /** + * convert MKLDNN output to other device. + * only support CPU device yet + */ + virtual void convertOutputToOtherDevice() {} + /** * print info about sizes */ @@ -147,22 +150,25 @@ public: protected: /** - * If next layer only has MKLDNN type. - * Otherwise, only support otherdevice CPU device. + * copy image size and sequence info to other device */ - bool nextIsMKLDNN() { + void copyOutputInfoToOtherDevice() { for (size_t i = 0; i < outputOtherDevice_.size(); i++) { - CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE) - << "Only support other device is CPU yet"; + outputOtherDevice_[i].setFrameHeight(output_.getFrameHeight()); + outputOtherDevice_[i].setFrameWidth(output_.getFrameWidth()); + outputOtherDevice_[i].sequenceStartPositions = + output_.sequenceStartPositions; + outputOtherDevice_[i].subSequenceStartPositions = + output_.subSequenceStartPositions; + outputOtherDevice_[i].cpuSequenceDims = output_.cpuSequenceDims; } - return outputOtherDevice_.size() == 0; } /** - * Is previous layer MKLDNN type. - * Otherwise, only support otherdevice CPU device. + * Is previous layer only has MKLDNN type. + * Otherwise, only support the previous layer using CPU device. */ - bool prevIsMKLDNN(int index = 0) { + bool prevIsOnlyMKLDNN(int index = 0) { int prevDevice = getPrev(index)->getDeviceId(); if (prevDevice == MKLDNN_DEVICE) { return true; @@ -173,11 +179,23 @@ protected: } } + /** + * If output only has MKLDNN device. + * Otherwise, other devices should only using CPU device. 
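+   * i.e. true only when outputOtherDevice_ is empty; any extra output
+   * device is required to be CPU.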
+ */ + bool nextIsOnlyMKLDNN() { + for (size_t i = 0; i < outputOtherDevice_.size(); i++) { + CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE) + << "Only support other device is CPU yet"; + } + return outputOtherDevice_.size() == 0; + } + /** * Sync input value data */ void syncInputValue() { - if (prevIsMKLDNN()) { + if (prevIsOnlyMKLDNN()) { return; } real* iData = getInputValue(0, CPU_DEVICE)->getData(); @@ -190,7 +208,7 @@ protected: * Sync output grad data */ void syncOutputGrad() { - if (nextIsMKLDNN()) { + if (nextIsOnlyMKLDNN()) { return; } diff --git a/paddle/math/MKLDNNMatrix.cpp b/paddle/math/MKLDNNMatrix.cpp index 32ae3b1bcf..0a355e2644 100644 --- a/paddle/math/MKLDNNMatrix.cpp +++ b/paddle/math/MKLDNNMatrix.cpp @@ -31,7 +31,6 @@ MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::primitive_desc pd) { if (m == nullptr) { size_t height = dims[0]; size_t width = cnts / dims[0]; - // LOG(INFO) << height << "," << width; m = Matrix::create(height, width, false, false); } @@ -40,10 +39,8 @@ MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::primitive_desc pd) { CHECK(cpuMatrix) << "Only support create from CPU matrix yet"; CHECK_EQ(cnts, m->getElementCnt()) << "Count size does not match"; - size_t width = m->getWidth(); - size_t height = m->getHeight(); - real* data = m->getData(); - return std::make_shared(data, height, width, pd); + return std::make_shared( + m->getData(), m->getHeight(), m->getWidth(), pd); } MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, @@ -51,9 +48,7 @@ MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::format fmt, engine& eg, mkldnn::memory::data_type dtype) { - memory::desc md = memory::desc(dims, dtype, fmt); - memory::primitive_desc pd = memory::primitive_desc(md, eg); - return create(m, pd); + return create(m, memory::primitive_desc(memory::desc(dims, dtype, fmt), eg)); } void MKLDNNMatrix::reorderDataFrom(const MKLDNNMatrixPtr& m, @@ -64,9 +59,7 @@ void MKLDNNMatrix::reorderDataFrom(const MKLDNNMatrixPtr& m, return; } CHECK_EQ(getElementCnt(), m->getElementCnt()) << "size should equal"; - real* srcData = getData(); - real* dstData = m->getData(); - reorderOnce(srcData, dstData, srcFmt, dstFmt, targetDim); + reorderOnce(getData(), m->getData(), srcFmt, dstFmt, targetDim); } void MKLDNNMatrix::reorderDataTo(const MKLDNNMatrixPtr& m, @@ -77,9 +70,7 @@ void MKLDNNMatrix::reorderDataTo(const MKLDNNMatrixPtr& m, return; } CHECK_EQ(getElementCnt(), m->getElementCnt()) << "size should equal"; - real* srcData = getData(); - real* dstData = m->getData(); - reorderOnce(srcData, dstData, srcFmt, dstFmt, targetDim); + reorderOnce(getData(), m->getData(), srcFmt, dstFmt, targetDim); } void MKLDNNMatrix::reorderOnce(void* srcData, @@ -120,8 +111,9 @@ void MKLDNNMatrix::downSpatial() { return; } - memory::dims srcDims = getDims(); + // TODO(TJ): change H(height) and W(width) if support nhwc or more const int H = 2, W = 3; + memory::dims srcDims = getDims(); if (srcDims[H] != 1 || srcDims[W] != 1) { // can not down spatial return; @@ -141,13 +133,12 @@ void MKLDNNMatrix::downSpatial() { } memory::desc md = memory::desc(dstDims, getDtype(), dstFmt); memory::primitive_desc pd = memory::primitive_desc(md, getEngine()); - void* data = getData(); mkldnn_primitive_t result; mkldnn::error::wrap_c_api( mkldnn_primitive_create(&result, pd.get(), nullptr, nullptr), "could not create a memory primitive"); reset(result); - set_data_handle(data); + set_data_handle(getData()); } } // namespace paddle diff --git a/paddle/math/MKLDNNMatrix.h 
b/paddle/math/MKLDNNMatrix.h index ea3fd7d461..e50f698b49 100644 --- a/paddle/math/MKLDNNMatrix.h +++ b/paddle/math/MKLDNNMatrix.h @@ -56,9 +56,9 @@ public: public: /** * Reorder this MKLDNNMatrix from other format. - * Support inplace reorder - * Pay attention: this function would only reorder the data layout. - * will NOT change this original dim or format info + * Support inplace reorder. + * @note: this function would only reorder the data layout. + * will NOT change this original dim or format info */ void reorderDataFrom(const MKLDNNMatrixPtr& m, memory::format srcFmt, @@ -66,9 +66,9 @@ public: /** * Reorder this MKLDNNMatrix to other format. - * Support inplace reorder - * Pay attention: this function would only reorder the data layout. - * will NOT change the dst dim or format info + * Support inplace reorder. + * @note: this function would only reorder the data layout. + * will NOT change the dst dim or format info */ void reorderDataTo(const MKLDNNMatrixPtr& m, memory::format dstFmt, @@ -90,18 +90,20 @@ public: /** * Get primitive descriptor. */ - mkldnn::memory::primitive_desc getPD() { return this->get_primitive_desc(); } + mkldnn::memory::primitive_desc getPrimitiveDesc() { + return this->get_primitive_desc(); + } /** * Get memory descriptor. */ - mkldnn::memory::desc getMD() { return getPD().desc(); } + mkldnn::memory::desc getMemoryDesc() { return getPrimitiveDesc().desc(); } /** * Get dimensions. */ mkldnn::memory::dims getDims() { - mkldnn::memory::desc md = getMD(); + mkldnn::memory::desc md = getMemoryDesc(); const int* src = md.data.dims; int ndims = md.data.ndims; mkldnn::memory::dims dst; @@ -116,24 +118,25 @@ public: * Get format. */ mkldnn::memory::format getFormat() { - return (mkldnn::memory::format)(getMD().data.format); + return (mkldnn::memory::format)(getMemoryDesc().data.format); } /** * Get memory data type. */ mkldnn::memory::data_type getDtype() { - return (mkldnn::memory::data_type)(getMD().data.data_type); + return (mkldnn::memory::data_type)(getMemoryDesc().data.data_type); } /** * Get engine. */ - mkldnn::engine getEngine() { return getPD().get_engine(); } + mkldnn::engine getEngine() { return getPrimitiveDesc().get_engine(); } protected: /** - * Do once reorder supported inplace. + * Do reorder once. + * Can support inplace. */ void reorderOnce(void* srcData, void* dstData, From 168707caddf9c0ed67a2d87074a5f05b7a63a5c9 Mon Sep 17 00:00:00 2001 From: hedaoyuan Date: Wed, 30 Aug 2017 11:35:19 +0800 Subject: [PATCH 61/64] Fix a small bug. 
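The accessor renames above (`getPD()` → `getPrimitiveDesc()`, `getMD()` → `getMemoryDesc()`) are mechanical. A hedged usage sketch of the resulting `MKLDNNMatrix` API follows; the variable names and sizes are assumptions, and the `create` call is the five-argument overload shown in MKLDNNMatrix.cpp above:

```cpp
// Assumed setup: a CPU MatrixPtr `m` holding batchSize x layerSize data
// and an mkldnn::engine `eng` already exist.
MKLDNNMatrixPtr dnn = MKLDNNMatrix::create(
    m,
    {batchSize, layerSize},                // memory::dims
    mkldnn::memory::format::nc,            // plain row-major 2-D layout
    eng,
    mkldnn::memory::data_type::f32);

auto pd = dnn->getPrimitiveDesc();         // formerly getPD()
auto md = dnn->getMemoryDesc();            // formerly getMD(); equals pd.desc()
mkldnn::memory::dims dims = dnn->getDims();     // {batchSize, layerSize}
mkldnn::memory::format fmt = dnn->getFormat();  // memory::format::nc
```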
--- paddle/gserver/layers/ExpandConvLayer.cpp | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/paddle/gserver/layers/ExpandConvLayer.cpp b/paddle/gserver/layers/ExpandConvLayer.cpp index 0e84581769..20de475fc3 100644 --- a/paddle/gserver/layers/ExpandConvLayer.cpp +++ b/paddle/gserver/layers/ExpandConvLayer.cpp @@ -66,7 +66,11 @@ bool ExpandConvLayer::init(const LayerMap &layerMap, // If depth wise convolution and useGpu == false and ARM-NEON if (!useGpu_ && isDepthwiseConv(channels_[i], groups_[i]) && !isDeconv_) { #if defined(__ARM_NEON__) || defined(__ARM_NEON) - convType = "NeonDepthwiseConv"; + if ((filterSize_[i] == filterSizeY_[i]) && + (filterSize_[i] == 3 || filterSize_[i] == 4) && + (stride_[i] == strideY_[i]) && (stride_[i] == 1 || stride_[i] == 2)) { + convType = "NeonDepthwiseConv"; + } #endif } From c5183caa04557628340983d17a64097f939db132 Mon Sep 17 00:00:00 2001 From: tensor-tang Date: Wed, 30 Aug 2017 13:37:51 +0800 Subject: [PATCH 62/64] rename --- paddle/gserver/layers/MKLDNNFcLayer.cpp | 29 +++++++++++-------------- paddle/gserver/layers/MKLDNNLayer.h | 12 +++++----- 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/paddle/gserver/layers/MKLDNNFcLayer.cpp b/paddle/gserver/layers/MKLDNNFcLayer.cpp index a08cca318e..8318c8c519 100644 --- a/paddle/gserver/layers/MKLDNNFcLayer.cpp +++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp @@ -134,7 +134,7 @@ void MKLDNNFcLayer::resetFwd() { const MatrixPtr& bias = hasBias ? biases_->getW() : nullptr; const MatrixPtr& out = output_.value; - if (prevIsOnlyMKLDNN()) { + if (inputIsOnlyMKLDNN()) { const MatrixPtr& in = getInputValue(0); inVal_ = std::dynamic_pointer_cast(in); CHECK(inVal_) << "Input should be MKLDNNMatrix"; @@ -154,7 +154,7 @@ void MKLDNNFcLayer::resetFwd() { // change original output value to mkldnn output value output_.value = std::dynamic_pointer_cast(outVal_); - if (!nextIsOnlyMKLDNN()) { + if (!outputIsOnlyMKLDNN()) { convertOutputToOtherDevice(); } @@ -194,19 +194,16 @@ void MKLDNNFcLayer::resetBwd() { const MatrixPtr& bias = hasBias ? biases_->getWGrad() : nullptr; // TODO(TJ): merge outgrad - if (nextIsOnlyMKLDNN()) { - // can not directly cast outputgrad to mkldnnmatrix, - // since each layer can not write the inputgrad to mkldnn inputgrad. - // So just create from matrix with outputvalue format. - const MatrixPtr& out = getOutput(MKLDNN_DEVICE).grad; - outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc()); - } else { - const MatrixPtr& out = getOutput(CPU_DEVICE).grad; - // fc do not need to convert from cpu device since output always nc - // only need create from cpu device - outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc()); - } - + int device = outputIsOnlyMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE; + // for MKLDNN device: + // can not directly cast outputgrad to mkldnnmatrix, + // since each layer can not write the inputgrad to mkldnn inputgrad. + // So just create from matrix with outputvalue format. + // for CPU device: + // fc do not need to convert from cpu device since output is always nc format + // only need create from cpu device + const MatrixPtr& out = getOutput(device).grad; + outGrad_ = MKLDNNMatrix::create(out, outVal_->getPrimitiveDesc()); wgtGrad_ = MKLDNNMatrix::create(wgt, wgtVal_->getPrimitiveDesc()); biasGrad_ = hasBias ? MKLDNNMatrix::create(bias, biasVal_->getPrimitiveDesc()) : nullptr; @@ -238,7 +235,7 @@ void MKLDNNFcLayer::resetBwd() { pipelineBwd_.push_back(*bwdWgt_); /// backward data - int device = prevIsOnlyMKLDNN() ? 
MKLDNN_DEVICE : CPU_DEVICE; + device = inputIsOnlyMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE; const MatrixPtr& in = getInputGrad(0, device); if (in == nullptr) { return; diff --git a/paddle/gserver/layers/MKLDNNLayer.h b/paddle/gserver/layers/MKLDNNLayer.h index 8fe9630e82..b983b833d5 100644 --- a/paddle/gserver/layers/MKLDNNLayer.h +++ b/paddle/gserver/layers/MKLDNNLayer.h @@ -151,6 +151,8 @@ public: protected: /** * copy image size and sequence info to other device + * @note: can not directly use Layer::copyOutputToOtherDevice since here only + * copy base info and do not copy data value */ void copyOutputInfoToOtherDevice() { for (size_t i = 0; i < outputOtherDevice_.size(); i++) { @@ -165,10 +167,10 @@ protected: } /** - * Is previous layer only has MKLDNN type. + * If input only has MKLDNN device. * Otherwise, only support the previous layer using CPU device. */ - bool prevIsOnlyMKLDNN(int index = 0) { + bool inputIsOnlyMKLDNN(int index = 0) { int prevDevice = getPrev(index)->getDeviceId(); if (prevDevice == MKLDNN_DEVICE) { return true; @@ -183,7 +185,7 @@ protected: * If output only has MKLDNN device. * Otherwise, other devices should only using CPU device. */ - bool nextIsOnlyMKLDNN() { + bool outputIsOnlyMKLDNN() { for (size_t i = 0; i < outputOtherDevice_.size(); i++) { CHECK_EQ(outputOtherDevice_[i].deviceId, CPU_DEVICE) << "Only support other device is CPU yet"; @@ -195,7 +197,7 @@ protected: * Sync input value data */ void syncInputValue() { - if (prevIsOnlyMKLDNN()) { + if (inputIsOnlyMKLDNN()) { return; } real* iData = getInputValue(0, CPU_DEVICE)->getData(); @@ -208,7 +210,7 @@ protected: * Sync output grad data */ void syncOutputGrad() { - if (nextIsOnlyMKLDNN()) { + if (outputIsOnlyMKLDNN()) { return; } From 31632a694c718ac31b890b1b46788f9d70d570c8 Mon Sep 17 00:00:00 2001 From: Luo Tao Date: Wed, 30 Aug 2017 14:48:03 +0800 Subject: [PATCH 63/64] remove unused ubuntu Debian install doc --- doc/getstarted/build_and_install/index_cn.rst | 4 +- doc/getstarted/build_and_install/index_en.rst | 3 +- .../build_and_install/ubuntu_install_cn.rst | 71 ------------------- .../build_and_install/ubuntu_install_en.rst | 25 ------- 4 files changed, 2 insertions(+), 101 deletions(-) delete mode 100644 doc/getstarted/build_and_install/ubuntu_install_cn.rst delete mode 100644 doc/getstarted/build_and_install/ubuntu_install_en.rst diff --git a/doc/getstarted/build_and_install/index_cn.rst b/doc/getstarted/build_and_install/index_cn.rst index a24df6c518..dd9923697a 100644 --- a/doc/getstarted/build_and_install/index_cn.rst +++ b/doc/getstarted/build_and_install/index_cn.rst @@ -6,14 +6,12 @@ 安装流程 ++++++++ -PaddlePaddle提供数个预编译的二进制来进行安装,包括Docker镜像,ubuntu的deb安装包等。我们推荐使用Docker镜像来部署环境,同时欢迎贡献更多的安装包。 +PaddlePaddle提供Docker镜像来部署环境。 .. toctree:: :maxdepth: 1 docker_install_cn.rst - ubuntu_install_cn.rst - 编译流程 diff --git a/doc/getstarted/build_and_install/index_en.rst b/doc/getstarted/build_and_install/index_en.rst index 1bfd4f75c0..8a53588e04 100644 --- a/doc/getstarted/build_and_install/index_en.rst +++ b/doc/getstarted/build_and_install/index_en.rst @@ -8,14 +8,13 @@ Install PaddlePaddle :maxdepth: 1 docker_install_en.rst - ubuntu_install_en.rst Build from Source ----------------- .. warning:: - Please use :code:`deb` package or :code:`docker` image to install paddle. The building guide is used for hacking or contributing PaddlePaddle source code. + Please use :code:`docker` image to install paddle. The building guide is used for hacking or contributing PaddlePaddle source code. .. 
toctree:: :maxdepth: 1 diff --git a/doc/getstarted/build_and_install/ubuntu_install_cn.rst b/doc/getstarted/build_and_install/ubuntu_install_cn.rst deleted file mode 100644 index 9e39ccb00f..0000000000 --- a/doc/getstarted/build_and_install/ubuntu_install_cn.rst +++ /dev/null @@ -1,71 +0,0 @@ -Ubuntu部署PaddlePaddle -=================================== - -PaddlePaddle提供了ubuntu 14.04 deb安装包。 - -安装 ------- - -安装包的下载地址是\: https://github.com/PaddlePaddle/Paddle/releases - -它包含四个版本\: - -* cpu版本: 支持主流x86处理器平台, 使用了avx指令集。 - -* cpu-noavx版本:支持主流x86处理器平台,没有使用avx指令集。 - -* gpu版本:支持主流x86处理器平台,支持nvidia cuda平台,使用了avx指令集。 - -* gpu-noavx版本:支持主流x86处理器平台,支持nvidia cuda平台,没有使用avx指令集。 - -下载完相关安装包后,执行: - -.. code-block:: shell - - sudo apt-get install gdebi - gdebi paddle-*-cpu.deb - -或者: - -.. code-block:: shell - - dpkg -i paddle-*-cpu.deb - apt-get install -f - - -在 :code:`dpkg -i` 的时候如果报一些依赖未找到的错误是正常的, -在 :code:`apt-get install -f` 里会继续安装 PaddlePaddle。 - -安装完成后,可以使用命令 :code:`paddle version` 查看安装后的paddle 版本: - -.. code-block:: shell - - PaddlePaddle 0.8.0b1, compiled with - with_avx: ON - with_gpu: OFF - with_double: OFF - with_python: ON - with_rdma: OFF - with_timer: OFF - with_predict_sdk: - - -可能遇到的问题 --------------- - -libcudart.so/libcudnn.so找不到 -++++++++++++++++++++++++++++++ - -安装完成后,运行 :code:`paddle train` 报错\: - -.. code-block:: shell - - 0831 12:36:04.151525 1085 hl_dso_loader.cc:70] Check failed: nullptr != *dso_handle For Gpu version of PaddlePaddle, it couldn't find CUDA library: libcudart.so Please make sure you already specify its path.Note: for training data on Cpu using Gpu version of PaddlePaddle,you must specify libcudart.so via LD_LIBRARY_PATH. - -原因是未设置cuda运行时环境变量。 如果使用GPU版本的PaddlePaddle,请安装CUDA 7.5 和CUDNN 5到本地环境中,并设置: - -.. code-block:: shell - - export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH - export PATH=/usr/local/cuda/bin:$PATH - diff --git a/doc/getstarted/build_and_install/ubuntu_install_en.rst b/doc/getstarted/build_and_install/ubuntu_install_en.rst deleted file mode 100644 index ea8042085b..0000000000 --- a/doc/getstarted/build_and_install/ubuntu_install_en.rst +++ /dev/null @@ -1,25 +0,0 @@ -Debian Package installation guide -================================= - -PaddlePaddle supports :code:`deb` pacakge. The installation of this :code:`deb` package is tested in ubuntu 14.04, but it should be support other debian based linux, too. - -There are four versions of debian package, :code:`cpu`, :code:`gpu`, :code:`cpu-noavx`, :code:`gpu-noavx`. And :code:`noavx` version is used to support CPU which does not contain :code:`AVX` instructions. The download url of :code:`deb` package is \: https://github.com/baidu/Paddle/releases/ - - -After downloading PaddlePaddle deb packages, you can use :code:`gdebi` install. - -.. code-block:: bash - - gdebi paddle-*.deb - -If :code:`gdebi` is not installed, you can use :code:`sudo apt-get install gdebi` to install it. - -Or you can use following commands to install PaddlePaddle. - -.. code-block:: bash - - dpkg -i paddle-*.deb - apt-get install -f - -And if you use GPU version deb package, you need to install CUDA toolkit and cuDNN, and set related environment variables(such as LD_LIBRARY_PATH) first. It is normal when `dpkg -i` get errors. `apt-get install -f` will continue install paddle, and install dependences. 
- From 2563e32bb12b363c41d608bf0f6f1060ea769f8b Mon Sep 17 00:00:00 2001 From: qijun Date: Wed, 30 Aug 2017 17:57:26 +0800 Subject: [PATCH 64/64] fix clang build error --- paddle/gserver/layers/CostLayer.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/paddle/gserver/layers/CostLayer.h b/paddle/gserver/layers/CostLayer.h index 0ce72ef40a..0f655b48ee 100644 --- a/paddle/gserver/layers/CostLayer.h +++ b/paddle/gserver/layers/CostLayer.h @@ -318,7 +318,9 @@ public: void forwardImp(Matrix& output, Argument& label, Matrix& cost) override; - void backwardImp(Matrix& outputValue, Argument& label, Matrix& outputGrad) {} + void backwardImp(Matrix& outputValue, + Argument& label, + Matrix& outputGrad) override {} }; /**