From 1c710053292394154d41db6c44ea808a8feaf65c Mon Sep 17 00:00:00 2001
From: Helin Wang <ustc.harry@gmail.com>
Date: Sun, 10 Sep 2017 15:03:58 -0700
Subject: [PATCH 01/21] Design Doc: Session

---
 doc/design/session.md | 62 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 doc/design/session.md

diff --git a/doc/design/session.md b/doc/design/session.md
new file mode 100644
index 0000000000..2e8c0ece7a
--- /dev/null
+++ b/doc/design/session.md
@@ -0,0 +1,62 @@
+# Design Doc: Session
+
+## Abstract
+
+This design doc proposes to have an object called *Session* which
+encapsulates the environment in which the computation graph is
+executed.
+
+## Background
+
+A computation graph is executed in an environment which contains the
+[scope](./scope.md) and other states. PaddlePaddle used to only have
+an implicit global session on which `paddle.eval()` is executed.
+
+This has the limitation that the user can not create two independent
+environments. For example, in reinforcement learning, the user may
+want to have a stale model for inference and a fresh model for
+training, and only replace the stale model with the fresh model
+periodically. Also, we have no concept that can encapsulate a remote
+environment that could execute a computation graph.
+
+## Session
+
+Session is an object that owns all runtime states such as scope,
+reader OP's file handles, connection to a remote PaddlePaddle cluster,
+etc.
+
+Session has two methods: `eval` and `close`. `eval` executes the
+target OP in a given graph, and `close` closes the session and
+releases all related resources:
+
+```Python
+a = paddle.constant(1.0)
+b = paddle.constant(2.0)
+c = a + b
+sess = paddle.session()
+sess.eval(c)
+sess.close()
+```
+
+### Remote Session
+
+Paddle Cloud will support user creating a remote session pointing to
+the Paddle Cloud cluster. The user can send the computation graph to
+be executed on the Paddle Cloud. In this way, the user can control a
+cluster from her local computer:
+
+```Python
+reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
+image = reader.column(0)
+label = reader.column(1)
+fc1 = paddle.op.fc(image, size=256, act="sigmoid")
+fc2 = paddle.op.fc(fc1, size=10, act="softmax")
+cost = paddle.op.cross_entropy(fc2)
+opt = paddle.optimizer.sgd(cost)
+
+remote_config = ... # remote configuration such as endpoint, number of nodes and authentication.
+sess = paddle.remoteSession(remote_config)
+for i in range(1000):
+    sess.eval(opt)
+sess.close()
+```

From 94dfd8649e06108bc0c03e6f53eb43ab13f30332 Mon Sep 17 00:00:00 2001
From: Helin Wang <ustc.harry@gmail.com>
Date: Mon, 25 Sep 2017 16:02:58 -0700
Subject: [PATCH 02/21] fix according to comments

---
 doc/design/session.md | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/doc/design/session.md b/doc/design/session.md
index 2e8c0ece7a..dc034c3906 100644
--- a/doc/design/session.md
+++ b/doc/design/session.md
@@ -6,26 +6,36 @@ This design doc proposes to have an object called *Session* which
 encapsulates the environment in which the computation graph is
 executed.
 
+The session is able to distinguish running a graph locally or
+remotely, using CPU only or using one or more GPUs. Different sessions
+have different runtime environments such as [scopes](./scope.md) and
+device contexts.
+
+
 ## Background
 
-A computation graph is executed in an environment which contains the
-[scope](./scope.md) and other states. PaddlePaddle used to only have
-an implicit global session on which `paddle.eval()` is executed.
+A computation graph runs in an environment which contains states such
+as the scope and device contexts. The current design has an implicit
+global session on which `paddle.eval()` is executed.
+
+Since the user is not able to explicitly switch between runtime
+environments such as the scope and the device contexts, the user
+cannot run a topology in two independent environments. For example, in
+reinforcement learning, the user may want to have a stale model for
+inference and a fresh model for training, and only replace the stale
+model with the fresh model periodically. Also, we have no concept that
+can encapsulate a remote environment that could execute a computation
+graph.
 
-This has the limitation that the user can not create two independent
-environments. For example, in reinforcement learning, the user may
-want to have a stale model for inference and a fresh model for
-training, and only replace the stale model with the fresh model
-periodically. Also, we have no concept that can encapsulate a remote
-environment that could execute a computation graph.
+We need a session concept to address above issues.
 
 ## Session
 
-Session is an object that owns all runtime states such as scope,
+A session is an object that owns all runtime states such as scope,
 reader OP's file handles, connection to a remote PaddlePaddle cluster,
 etc.
 
-Session has two methods: `eval` and `close`. `eval` executes the
+The session has two methods: `eval` and `close`. `eval` executes the
 target OP in a given graph, and `close` closes the session and
 releases all related resources:
 
@@ -51,7 +61,7 @@ image = reader.column(0)
 label = reader.column(1)
 fc1 = paddle.op.fc(image, size=256, act="sigmoid")
 fc2 = paddle.op.fc(fc1, size=10, act="softmax")
-cost = paddle.op.cross_entropy(fc2)
+cost = paddle.op.cross_entropy(fc2, label)
 opt = paddle.optimizer.sgd(cost)
 
 remote_config = ... # remote configuration such as endpoint, number of nodes and authentication.

From f24b5dffc42063ddfe28229fdb242c3df4ec1aa7 Mon Sep 17 00:00:00 2001
From: Helin Wang <ustc.harry@gmail.com>
Date: Tue, 26 Sep 2017 18:03:37 -0700
Subject: [PATCH 03/21] Update Session design doc

---
 doc/design/refactor/session.md | 160 +++++++++++++++++++++++++++++++++
 doc/design/session.md          |  72 ---------------
 2 files changed, 160 insertions(+), 72 deletions(-)
 create mode 100644 doc/design/refactor/session.md
 delete mode 100644 doc/design/session.md

diff --git a/doc/design/refactor/session.md b/doc/design/refactor/session.md
new file mode 100644
index 0000000000..5f58148f01
--- /dev/null
+++ b/doc/design/refactor/session.md
@@ -0,0 +1,160 @@
+# Design Doc: Session
+
+## Abstract
+
+The *session* object encapsulates the environment in which the
+computation graph is executed.
+
+We will have *local* session and *remote* session, they offer the
+same [interface](#interface). The local session encapsulates the local
+runtime environment and the remote session encapsulates the cluster
+runtime envrionment.
+
+The local runtime envrionment contains:
+
+1. computation devices (i.e., CPU, GPU) handles, and
+1. the [scope](../scope.md) which holds all variables.
+
+The remote runtime envrionment contains:
+
+1. computation devices (i.e., CPU and GPU on node 0, 1) in a cluster,
+   and
+1. the distributed [scope](../scope.md) in a cluster which holds all
+   variables.
+
+The user can create a remote session on Paddle Cloud and evaluate the
+computation graph with it. In this way, the user can control the
+remote computation resource in a cluster from his local computer.
+
+
+## Background
+
+The current design has an implicit global session on which
+`paddle.eval()` is executed. The pain point is:
+
+Since the user is not able to explicitly switch between runtime
+environments such as the scope and the device contexts, the user
+cannot run a topology in two independent environments.
+
+For example, in reinforcement learning, the user may want to have a
+stale model for inference and a fresh model for training, and only
+replace the stale model with the fresh model periodically.
+
+Furthermore, we have no concept that encapsulates a remote environment
+that executes a computation graph.
+
+We need the session object to address above issues.
+
+
+## Session
+
+A session is an object that owns the runtime environment. All
+computations are executed through `session.eval`.
+
+
+### Interface
+
+```
+eval(
+    targets,
+    feed_dict=None,
+)
+```
+
+Evaluates the target Operations or Variables in `targets`.
+
+- *targets*: the evaluation targets. Can be a single Operation or
+  Variable, or a list with the Operations or Variables as elements.
+
+  The value returned by `eval()` has the same shape as the `target`
+  argument.
+
+  The computation graph is implicitly inferred from the targets.
+
+- *feed_dict*: a dictionary that contains the tensors which overrides
+  the edges of the computation graph.
+
+```
+close()
+```
+
+Closes the session. Calling this method releases the scope.
+
+
+### Create a Local Session
+
+```
+session(
+    gpu_ids=None
+)
+```
+
+Creates a new session. One session owns one scope, so creating
+multiple sessions will create different scopes.
+
+- *gpu_ids*: a single `int` or a list of `int` of the GPU IDs to be
+  used as the computation devices. If not specified, all avaiable GPUs
+  will be used.
+
+
+#### Example
+
+```Python
+a = paddle.constant(1.0)
+b = paddle.constant(2.0)
+c = a + b
+sess = paddle.session(gpu_ids=[0,1])
+sess.eval(c)
+sess.close()
+```
+
+### Create a Remote Session
+
+```
+create_cloud_job(
+    name,
+    num_trainer,
+    mem_per_trainer,
+    gpu_per_trainer,
+    cpu_per_trainer,
+    num_ps,
+    mem_per_ps,
+    cpu_per_ps,
+)
+```
+
+Creates a Paddle Cloud job. Fails if the job name exists.
+
+```
+get_cloud_job(
+    name
+)
+```
+
+Gets a Paddle Cloud job.
+
+```
+remote_session(
+    job
+)
+```
+
+- *job*: the Paddle Cloud job.
+
+#### Example
+
+```Python
+reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
+image = reader.column(0)
+label = reader.column(1)
+fc1 = paddle.op.fc(image, size=256, act="sigmoid")
+fc2 = paddle.op.fc(fc1, size=10, act="softmax")
+cost = paddle.op.cross_entropy(fc2, label)
+opt = paddle.optimizer.sgd(cost)
+
+job = paddle.create_cloud_job("test", 3, "1G", 1, 1, 2, "1G", 1)
+sess = paddle.remote_ession(job)
+for i in range(1000):
+    sess.eval(opt)
+sess.close()
+```
diff --git a/doc/design/session.md b/doc/design/session.md
deleted file mode 100644
index dc034c3906..0000000000
--- a/doc/design/session.md
+++ /dev/null
@@ -1,72 +0,0 @@
-# Design Doc: Session
-
-## Abstract
-
-This design doc proposes to have an object called *Session* which
-encapsulates the environment in which the computation graph is
-executed.
-
-The session is able to distinguish running a graph locally or
-remotely, using CPU only or using one or more GPUs. Different sessions
-have different runtime environments such as [scopes](./scope.md) and
-device contexts.
-
-
-## Background
-
-A computation graph runs in an environment which contains states such
-as the scope and device contexts. The current design has an implicit
-global session on which `paddle.eval()` is executed.
-
-Since the user is not able to explicitly switch between runtime
-environments such as the scope and the device contexts, the user
-cannot run a topology in two independent environments. For example, in
-reinforcement learning, the user may want to have a stale model for
-inference and a fresh model for training, and only replace the stale
-model with the fresh model periodically. Also, we have no concept that
-can encapsulate a remote environment that could execute a computation
-graph.
-
-We need a session concept to address above issues.
-
-## Session
-
-A session is an object that owns all runtime states such as scope,
-reader OP's file handles, connection to a remote PaddlePaddle cluster,
-etc.
-
-The session has two methods: `eval` and `close`. `eval` executes the
-target OP in a given graph, and `close` closes the session and
-releases all related resources:
-
-```Python
-a = paddle.constant(1.0)
-b = paddle.constant(2.0)
-c = a + b
-sess = paddle.session()
-sess.eval(c)
-sess.close()
-```
-
-### Remote Session
-
-Paddle Cloud will support user creating a remote session pointing to
-the Paddle Cloud cluster. The user can send the computation graph to
-be executed on the Paddle Cloud. In this way, the user can control a
-cluster from her local computer:
-
-```Python
-reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
-image = reader.column(0)
-label = reader.column(1)
-fc1 = paddle.op.fc(image, size=256, act="sigmoid")
-fc2 = paddle.op.fc(fc1, size=10, act="softmax")
-cost = paddle.op.cross_entropy(fc2, label)
-opt = paddle.optimizer.sgd(cost)
-
-remote_config = ... # remote configuration such as endpoint, number of nodes and authentication.
-sess = paddle.remoteSession(remote_config)
-for i in range(1000):
-    sess.eval(opt)
-sess.close()
-```

From 757c76b83f3701c29efc88c546ec90a18952f98a Mon Sep 17 00:00:00 2001
From: Helin Wang <ustc.harry@gmail.com>
Date: Thu, 28 Sep 2017 15:07:17 -0700
Subject: [PATCH 04/21] update according to comments

---
 doc/design/refactor/session.md | 74 +++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/doc/design/refactor/session.md b/doc/design/refactor/session.md
index 5f58148f01..9a7451ece5 100644
--- a/doc/design/refactor/session.md
+++ b/doc/design/refactor/session.md
@@ -5,17 +5,17 @@
 The *session* object encapsulates the environment in which the
 computation graph is executed.
 
-We will have *local* session and *remote* session, they offer the
+We will have the *local* session and *remote* session, they offer the
 same [interface](#interface). The local session encapsulates the local
 runtime environment and the remote session encapsulates the cluster
-runtime envrionment.
+runtime environment.
 
-The local runtime envrionment contains:
+The local runtime environment contains:
 
 1. computation devices (i.e., CPU, GPU) handles, and
 1. the [scope](../scope.md) which holds all variables.
 
-The remote runtime envrionment contains:
+The remote runtime environment contains:
 
 1. computation devices (i.e., CPU and GPU on node 0, 1) in a cluster,
    and
@@ -29,12 +29,12 @@ remote computation resource in a cluster from his local computer.
 
 ## Background
 
-The current design has an implicit global session on which
+The current design has an implicit global session in which
 `paddle.eval()` is executed. The pain point is:
 
 Since the user is not able to explicitly switch between runtime
-environments such as the scope and the device contexts, the user
-cannot run a topology in two independent environments.
+environments, the user cannot run a topology in two independent
+environments.
 
 For example, in reinforcement learning, the user may want to have a
 stale model for inference and a fresh model for training, and only
@@ -49,12 +49,12 @@ We need the session object to address above issues.
 ## Session
 
 A session is an object that owns the runtime environment. All
-computations are executed through `session.eval`.
+computations are executed through `session.eval()`.
 
 
 ### Interface
 
-```
+```python
 eval(
     targets,
     feed_dict=None,
@@ -64,37 +64,57 @@ eval(
 Evaluates the target Operations or Variables in `targets`.
 
 - *targets*: the evaluation targets. Can be a single Operation or
-  Variable, or a list with the Operations or Variables as elements.
+  Variable, or a list with the Operations or Variables as
+  elements. The value returned by `eval()` has the same shape as the
+  `target` argument.
+
+  The PaddlePaddle program is represented by
+  the [ProgramDesc](../design/program.md), `eval()` will infer the
+  ProgramDesc from the given targets and run the PaddlePaddle
+  program. Please
+  see
+  [this graph](./distributed_architecture.md#local-training-architecture) for
+  the detailed illustration for the local session
+  and
+  [this graph](./distributed_architecture.md#distributed-training-architecture) for
+  the detailed illustration for the remote session.
+
+- *feed_dict*: a dictionary that contains the tensors which override
+  the edges of the computation graph.
 
-  The value returned by `eval()` has the same shape as the `target`
-  argument.
+  feed_dict not only can provide the input data, it can override any
+  OP's input as well:
 
-  The computation graph is implicitly inferred from the targets.
+  ```python
+  a = pd.constant(1.0, name="a")
+  b = pd.constant(2.0)
+  c = pd.mul(a,b)
+  sess.eval(targets=c, feed_dict={"a":3.0}) # returns 6.0
+  ```
 
-- *feed_dict*: a dictionary that contains the tensors which overrides
-  the edges of the computation graph.
-
-```
+```python
 close()
 ```
 
-Closes the session. Calling this method releases the scope.
+Closes the session and releases the scope that the session owns.
 
 
 ### Create a Local Session
 
-```
+```python
 session(
-    gpu_ids=None
+    devices=None
 )
 ```
 
 Creates a new session. One session owns one scope, so creating
 multiple sessions will create different scopes.
 
-- *gpu_ids*: a single `int` or a list of `int` of the GPU IDs to be
-  used as the computation devices. If not specified, all avaiable GPUs
-  will be used.
+- *devices*: a single `string` or a list of `string` of device names,
+  the corresponding devices will be the computation devices for
+  `eval()`. If not specified, all available devices (e.g., all GPUs)
+  will be used. The user doesn't need to specify the CPU device since
+  it will be always used.
 
 
 #### Example
@@ -103,14 +123,14 @@ multiple sessions will create different scopes.
 a = paddle.constant(1.0)
 b = paddle.constant(2.0)
 c = a + b
-sess = paddle.session(gpu_ids=[0,1])
+sess = paddle.session(devices=["gpu:0", "gpu:1", "fpga:0"])
 sess.eval(c)
 sess.close()
 ```
 
 ### Create a Remote Session
 
-```
+```python
 create_cloud_job(
     name,
     num_trainer,
@@ -125,7 +145,7 @@ create_cloud_job(
 
 Creates a Paddle Cloud job. Fails if the job name exists.
 
-```
+```python
 get_cloud_job(
     name
 )
@@ -133,7 +153,7 @@ get_cloud_job(
 
 Gets a Paddle Cloud job.
 
-```
+```python
 remote_session(
     job
 )

From 5f51d0afc49f4bd4c624ec62aa1e3ccd31840aee Mon Sep 17 00:00:00 2001
From: Yi Wang <yiwang01@baidu.com>
Date: Wed, 4 Oct 2017 10:00:39 -0700
Subject: [PATCH 05/21] Add -D PADDLE_WITH_CUDA in cmake/configure.cmake

---
 cmake/configure.cmake | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/cmake/configure.cmake b/cmake/configure.cmake
index 51c3b918cc..4e044ca421 100644
--- a/cmake/configure.cmake
+++ b/cmake/configure.cmake
@@ -49,11 +49,16 @@ if(NOT WITH_GOLANG)
 endif(NOT WITH_GOLANG)
 
 if(NOT WITH_GPU)
+    # Will gradually remove uses of PADDLE_ONLY_CPU in source files,
+    # so could we remove -DPADDLE_ONLY_CPU.
+    # c.f. https://github.com/PaddlePaddle/Paddle/issues/4588
     add_definitions(-DPADDLE_ONLY_CPU)
     add_definitions(-DHPPL_STUB_FUNC)
 
     list(APPEND CMAKE_CXX_SOURCE_FILE_EXTENSIONS cu)
 else()
+    add_definitions(-DPADDLE_WITH_CUDA)
+
     FIND_PACKAGE(CUDA REQUIRED)
 
     if(${CUDA_VERSION_MAJOR} VERSION_LESS 7)

From a9e298bebef29390b815e431e1a475ab1417015a Mon Sep 17 00:00:00 2001
From: Helin Wang <ustc.harry@gmail.com>
Date: Wed, 4 Oct 2017 13:40:32 -0700
Subject: [PATCH 06/21] fix according to comments

---
 doc/design/refactor/session.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/design/refactor/session.md b/doc/design/refactor/session.md
index 9a7451ece5..1d9a26683c 100644
--- a/doc/design/refactor/session.md
+++ b/doc/design/refactor/session.md
@@ -86,10 +86,10 @@ Evaluates the target Operations or Variables in `targets`.
   OP's input as well:
 
   ```python
-  a = pd.constant(1.0, name="a")
-  b = pd.constant(2.0)
+  a = pd.constant(2.0, name="a")
+  b = pd.variable(name="b")
   c = pd.mul(a,b)
-  sess.eval(targets=c, feed_dict={"a":3.0}) # returns 6.0
+  sess.eval(targets=c, feed_dict={"b":3.0}) # returns 6.0
   ```
 
 ```python
@@ -107,14 +107,14 @@ session(
 )
 ```
 
-Creates a new session. One session owns one scope, so creating
+Creates a new session. One session owns one global scope, so creating
 multiple sessions will create different scopes.
 
 - *devices*: a single `string` or a list of `string` of device names,
   the corresponding devices will be the computation devices for
   `eval()`. If not specified, all available devices (e.g., all GPUs)
   will be used. The user doesn't need to specify the CPU device since
-  it will be always used.
+  it will be always used. Multiple sessions can use the same device.
 
 
 #### Example

From 4558807c48035ee1d279d16975d6e2c607cb35f5 Mon Sep 17 00:00:00 2001
From: Yi Wang <yiwang01@baidu.com>
Date: Wed, 4 Oct 2017 14:01:34 -0700
Subject: [PATCH 07/21] Use PADDLE_WITH_CUDA instead of PADDLE_WITH_GPU

---
 paddle/api/Util.cpp                           |  2 +-
 paddle/capi/Matrix.cpp                        |  2 +-
 paddle/framework/grad_op_builder_test.cc      |  2 +-
 paddle/framework/lod_tensor.h                 |  4 +--
 paddle/framework/op_proto_maker_test.cc       |  2 +-
 paddle/framework/op_registry.h                |  2 +-
 paddle/framework/op_registry_test.cc          |  2 +-
 paddle/framework/operator.cc                  |  2 +-
 paddle/framework/tensor_impl.h                |  4 +--
 paddle/framework/tensor_test.cc               |  8 +++---
 paddle/function/BlockExpandOp.cpp             |  2 +-
 paddle/function/ContextProjectionOp.cpp       |  2 +-
 paddle/function/CosSimOp.cpp                  |  2 +-
 paddle/function/CropOp.cpp                    |  2 +-
 paddle/function/CrossMapNormalOp.cpp          |  2 +-
 paddle/function/DepthwiseConvOp.cpp           |  2 +-
 paddle/function/DepthwiseConvOpTest.cpp       |  2 +-
 paddle/function/GemmConvOp.cpp                |  2 +-
 paddle/function/GemmConvOpTest.cpp            |  2 +-
 paddle/function/Im2ColTest.cpp                |  2 +-
 paddle/function/MulOp.cpp                     |  2 +-
 paddle/function/PadOp.cpp                     |  2 +-
 paddle/function/RowConvOp.cpp                 |  2 +-
 paddle/function/SwitchOp.cpp                  |  2 +-
 paddle/gserver/layers/BatchNormBaseLayer.cpp  |  2 +-
 .../layers/BatchNormalizationLayer.cpp        |  6 ++---
 paddle/gserver/layers/PoolLayer.cpp           |  4 +--
 paddle/gserver/tests/LayerGradUtil.cpp        |  2 +-
 paddle/gserver/tests/test_BatchNorm.cpp       |  2 +-
 paddle/gserver/tests/test_ConvUnify.cpp       |  2 +-
 paddle/gserver/tests/test_DetectionOutput.cpp |  2 +-
 paddle/gserver/tests/test_Evaluator.cpp       |  2 +-
 paddle/gserver/tests/test_KmaxSeqScore.cpp    |  2 +-
 paddle/gserver/tests/test_LayerGrad.cpp       | 26 +++++++++----------
 paddle/gserver/tests/test_NetworkCompare.cpp  |  2 +-
 paddle/gserver/tests/test_PriorBox.cpp        |  2 +-
 .../gserver/tests/test_ProtoDataProvider.cpp  |  6 ++---
 paddle/gserver/tests/test_PyDataProvider.cpp  |  4 +--
 .../gserver/tests/test_SelectiveFCLayer.cpp   |  8 +++---
 .../gserver/tests/test_SeqSliceLayerGrad.cpp  |  2 +-
 paddle/gserver/tests/test_WarpCTCLayer.cpp    |  2 +-
 paddle/math/Matrix.cpp                        |  6 ++---
 paddle/math/SparseMatrix.cpp                  |  2 +-
 paddle/math/Vector.cpp                        |  6 ++---
 paddle/math/tests/test_Allocator.cpp          |  4 +--
 paddle/math/tests/test_BaseMatrix.cpp         |  2 +-
 paddle/math/tests/test_CpuGpuVector.cpp       |  2 +-
 paddle/math/tests/test_ExecViaCpu.cpp         |  2 +-
 paddle/math/tests/test_GpuProfiler.cpp        |  2 +-
 paddle/math/tests/test_Matrix.cpp             |  2 +-
 paddle/math/tests/test_SparseMatrix.cpp       |  6 ++---
 paddle/math/tests/test_TrainingAlgorithm.cpp  |  2 +-
 paddle/math/tests/test_batchTranspose.cpp     |  2 +-
 paddle/math/tests/test_matrixCompare.cpp      |  2 +-
 paddle/math/tests/test_perturbation.cpp       |  2 +-
 .../math/tests/test_sparseMatrixCompare.cpp   |  2 +-
 paddle/memory/detail/buddy_allocator.cc       |  2 +-
 paddle/memory/detail/system_allocator.cc      |  2 +-
 paddle/memory/detail/system_allocator.h       |  2 +-
 paddle/memory/detail/system_allocator_test.cc |  2 +-
 paddle/memory/memcpy.cc                       |  2 +-
 paddle/memory/memcpy.h                        |  2 +-
 paddle/memory/memory.cc                       |  2 +-
 paddle/memory/memory_test.cc                  |  2 +-
 paddle/operators/detail/strided_memcpy.h      |  2 +-
 paddle/operators/math/im2col_test.cc          |  4 +--
 paddle/operators/math/math_function_test.cc   |  2 +-
 paddle/operators/strided_memcpy_test.cc       |  4 +--
 paddle/platform/device_context.cc             |  2 +-
 paddle/platform/device_context.h              |  4 +--
 paddle/platform/enforce.h                     |  4 +--
 paddle/platform/enforce_test.cc               |  2 +-
 paddle/platform/gpu_info.h                    |  2 +-
 paddle/platform/variant.h                     |  2 +-
 paddle/pserver/test/SocketTest.cpp            |  2 +-
 paddle/pserver/test/test_ProtoServer.cpp      |  2 +-
 paddle/pybind/pybind.cc                       | 12 ++++-----
 paddle/pybind/tensor_py.h                     |  2 +-
 paddle/string/to_string_test.cc               |  2 +-
 paddle/trainer/MergeModel.cpp                 |  2 +-
 paddle/trainer/tests/test_Compare.cpp         |  2 +-
 paddle/trainer/tests/test_CompareSparse.cpp   |  4 +--
 paddle/trainer/tests/test_Trainer.cpp         |  4 +--
 paddle/trainer/tests/test_TrainerOnePass.cpp  |  6 ++---
 .../test_recurrent_machine_generation.cpp     |  2 +-
 paddle/utils/Flags.cpp                        |  2 +-
 paddle/utils/Util.h                           |  2 +-
 paddle/utils/Version.h                        |  2 +-
 88 files changed, 134 insertions(+), 134 deletions(-)

diff --git a/paddle/api/Util.cpp b/paddle/api/Util.cpp
index 7446d892fd..11bd05c09d 100644
--- a/paddle/api/Util.cpp
+++ b/paddle/api/Util.cpp
@@ -47,7 +47,7 @@ bool isUsingGpu() { return FLAGS_use_gpu; }
 void setUseGpu(bool useGpu) { FLAGS_use_gpu = useGpu; }
 
 bool isGpuVersion() {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   return false;
 #else
   return true;
diff --git a/paddle/capi/Matrix.cpp b/paddle/capi/Matrix.cpp
index 5b3737a759..4547afaf1d 100644
--- a/paddle/capi/Matrix.cpp
+++ b/paddle/capi/Matrix.cpp
@@ -46,7 +46,7 @@ paddle_error paddle_matrix_set_row(paddle_matrix mat,
   if (rowID >= ptr->mat->getHeight()) return kPD_OUT_OF_RANGE;
   paddle::real* buf = ptr->mat->getRowBuf(rowID);
   size_t width = ptr->mat->getWidth();
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   hl_memcpy(buf, rowArray, sizeof(paddle::real) * width);
 #else
   std::copy(rowArray, rowArray + width, buf);
diff --git a/paddle/framework/grad_op_builder_test.cc b/paddle/framework/grad_op_builder_test.cc
index 2dbc2e6620..793780ea44 100644
--- a/paddle/framework/grad_op_builder_test.cc
+++ b/paddle/framework/grad_op_builder_test.cc
@@ -183,4 +183,4 @@ TEST(GradOpDescBuilder, IOIgnoredInGradient) {
                 {f::GradVarName("in3_1"), f::GradVarName("in3_2")}));
   delete forw_op;
   delete grad_op;
-}
\ No newline at end of file
+}
diff --git a/paddle/framework/lod_tensor.h b/paddle/framework/lod_tensor.h
index b12c95b6b7..4db36ee766 100644
--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@@ -15,7 +15,7 @@
 #pragma once
 
 #include <memory>
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include <thrust/device_vector.h>
 #include <thrust/host_vector.h>
 #include <thrust/system/cuda/experimental/pinned_allocator.h>
@@ -29,7 +29,7 @@
 namespace paddle {
 namespace framework {
 
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
 template <typename T>
 using Vector = std::vector<T>;
 #else
diff --git a/paddle/framework/op_proto_maker_test.cc b/paddle/framework/op_proto_maker_test.cc
index b01e30f753..988a14cf4d 100644
--- a/paddle/framework/op_proto_maker_test.cc
+++ b/paddle/framework/op_proto_maker_test.cc
@@ -48,4 +48,4 @@ TEST(ProtoMaker, DuplicatedInOut) {
   paddle::framework::OpAttrChecker op_checker;
   auto proto_maker = TestInOutProtoMaker(&op_proto, &op_checker);
   ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
-}
\ No newline at end of file
+}
diff --git a/paddle/framework/op_registry.h b/paddle/framework/op_registry.h
index aca6579f36..958cf581f5 100644
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -211,7 +211,7 @@ class OpKernelRegistrar : public Registrar {
 // TODO(fengjiayi): The following macros
 // seems ugly, do we have better method?
 
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
 #define USE_OP_KERNEL(op_type) USE_OP_DEVICE_KERNEL(op_type, CPU)
 #else
 #define USE_OP_KERNEL(op_type)        \
diff --git a/paddle/framework/op_registry_test.cc b/paddle/framework/op_registry_test.cc
index f89f40b444..b860fe6cac 100644
--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -183,4 +183,4 @@ class CosineOpComplete : public paddle::framework::CosineOp {
 TEST(OperatorRegistrar, Test) {
   using namespace paddle::framework;
   OperatorRegistrar<CosineOpComplete, CosineOpProtoAndCheckerMaker> reg("cos");
-}
\ No newline at end of file
+}
diff --git a/paddle/framework/operator.cc b/paddle/framework/operator.cc
index 21c1c6f9e6..2ca838f838 100644
--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -25,7 +25,7 @@ Eigen::DefaultDevice& ExecutionContext::GetEigenDevice<
   return *device_context_.GetEigenDevice<platform::CPUPlace>();
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 template <>
 Eigen::GpuDevice&
 ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
diff --git a/paddle/framework/tensor_impl.h b/paddle/framework/tensor_impl.h
index 1cde1f74b8..379eac94f9 100644
--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -65,7 +65,7 @@ inline T* Tensor::mutable_data(platform::Place place) {
       holder_.reset(new PlaceholderImpl<T, platform::CPUPlace>(
           boost::get<platform::CPUPlace>(place), size));
     } else if (platform::is_gpu_place(place)) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
       PADDLE_THROW("'GPUPlace' is not supported in CPU only device.");
     }
 #else
@@ -103,7 +103,7 @@ inline void Tensor::CopyFrom(const Tensor& src,
     memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
                  boost::get<platform::CPUPlace>(src_place), src_ptr, size);
   }
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   else if (platform::is_gpu_place(src_place) &&
            platform::is_cpu_place(dst_place)) {
     memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
diff --git a/paddle/framework/tensor_test.cc b/paddle/framework/tensor_test.cc
index 86c6945ab5..58cf0fc3cb 100644
--- a/paddle/framework/tensor_test.cc
+++ b/paddle/framework/tensor_test.cc
@@ -74,7 +74,7 @@ TEST(Tensor, MutableData) {
     EXPECT_EQ(p1, p2);
   }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   {
     Tensor src_tensor;
     float* p1 = nullptr;
@@ -126,7 +126,7 @@ TEST(Tensor, ShareDataWith) {
     ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
   }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   {
     Tensor src_tensor;
     Tensor dst_tensor;
@@ -163,7 +163,7 @@ TEST(Tensor, Slice) {
     EXPECT_EQ(src_data_address + 3 * 4 * 1 * sizeof(int), slice_data_address);
   }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   {
     Tensor src_tensor;
     src_tensor.mutable_data<double>(make_ddim({6, 9}), GPUPlace());
@@ -218,7 +218,7 @@ TEST(Tensor, CopyFrom) {
       EXPECT_EQ(dst_ptr[i], slice_ptr[i]);
     }
   }
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   {
     Tensor src_tensor;
     Tensor gpu_tensor;
diff --git a/paddle/function/BlockExpandOp.cpp b/paddle/function/BlockExpandOp.cpp
index ad78f5f584..bd0fe119ce 100644
--- a/paddle/function/BlockExpandOp.cpp
+++ b/paddle/function/BlockExpandOp.cpp
@@ -194,7 +194,7 @@ public:
 
 REGISTER_TYPED_FUNC(BlockExpand, CPU, BlockExpandForward);
 REGISTER_TYPED_FUNC(BlockExpandGrad, CPU, BlockExpandBackward);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(BlockExpand, GPU, BlockExpandForward);
 REGISTER_TYPED_FUNC(BlockExpandGrad, GPU, BlockExpandBackward);
 #endif
diff --git a/paddle/function/ContextProjectionOp.cpp b/paddle/function/ContextProjectionOp.cpp
index ab18c39df8..23916c0f4b 100644
--- a/paddle/function/ContextProjectionOp.cpp
+++ b/paddle/function/ContextProjectionOp.cpp
@@ -395,7 +395,7 @@ REGISTER_TYPED_FUNC(ContextProjectionForward,
 REGISTER_TYPED_FUNC(ContextProjectionBackward,
                     CPU,
                     ContextProjectionBackwardFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(ContextProjectionForward,
                     GPU,
                     ContextProjectionForwardFunc);
diff --git a/paddle/function/CosSimOp.cpp b/paddle/function/CosSimOp.cpp
index 4418f144d3..2e5c281f37 100644
--- a/paddle/function/CosSimOp.cpp
+++ b/paddle/function/CosSimOp.cpp
@@ -233,7 +233,7 @@ private:
 
 REGISTER_TYPED_FUNC(CosSimForward, CPU, CosSimForwardFunc);
 REGISTER_TYPED_FUNC(CosSimBackward, CPU, CosSimBackwardFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(CosSimForward, GPU, CosSimForwardFunc);
 REGISTER_TYPED_FUNC(CosSimBackward, GPU, CosSimBackwardFunc);
 #endif
diff --git a/paddle/function/CropOp.cpp b/paddle/function/CropOp.cpp
index 39504cc2c1..46f98f12c1 100644
--- a/paddle/function/CropOp.cpp
+++ b/paddle/function/CropOp.cpp
@@ -169,7 +169,7 @@ private:
 
 REGISTER_TYPED_FUNC(Crop, CPU, CropFunc);
 REGISTER_TYPED_FUNC(CropGrad, CPU, CropGradFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(Crop, GPU, CropFunc);
 REGISTER_TYPED_FUNC(CropGrad, GPU, CropGradFunc);
 #endif
diff --git a/paddle/function/CrossMapNormalOp.cpp b/paddle/function/CrossMapNormalOp.cpp
index 1cf0918bed..9e88669d37 100644
--- a/paddle/function/CrossMapNormalOp.cpp
+++ b/paddle/function/CrossMapNormalOp.cpp
@@ -336,7 +336,7 @@ private:
 
 REGISTER_TYPED_FUNC(CrossMapNormal, CPU, CrossMapNormalFunc);
 REGISTER_TYPED_FUNC(CrossMapNormalGrad, CPU, CrossMapNormalGradFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(CrossMapNormal, GPU, CrossMapNormalFunc);
 REGISTER_TYPED_FUNC(CrossMapNormalGrad, GPU, CrossMapNormalGradFunc);
 #endif
diff --git a/paddle/function/DepthwiseConvOp.cpp b/paddle/function/DepthwiseConvOp.cpp
index 7656ab3d0a..9863e3ae1d 100644
--- a/paddle/function/DepthwiseConvOp.cpp
+++ b/paddle/function/DepthwiseConvOp.cpp
@@ -292,7 +292,7 @@ REGISTER_TYPED_FUNC(DepthwiseConvGradInput,
 REGISTER_TYPED_FUNC(DepthwiseConvGradFilter,
                     CPU,
                     DepthwiseConvGradFilterFunction);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(DepthwiseConv, GPU, DepthwiseConvFunction);
 REGISTER_TYPED_FUNC(DepthwiseConvGradInput,
                     GPU,
diff --git a/paddle/function/DepthwiseConvOpTest.cpp b/paddle/function/DepthwiseConvOpTest.cpp
index 39033ecb2b..b1a90da7db 100644
--- a/paddle/function/DepthwiseConvOpTest.cpp
+++ b/paddle/function/DepthwiseConvOpTest.cpp
@@ -17,7 +17,7 @@ limitations under the License. */
 
 namespace paddle {
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(DepthwiseConv, Forward) {
   DepthwiseConvolution<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU>(
       "GemmConv-CPU", "DepthwiseConv-GPU", forward);
diff --git a/paddle/function/GemmConvOp.cpp b/paddle/function/GemmConvOp.cpp
index 68e08c1480..bdb56ddac3 100644
--- a/paddle/function/GemmConvOp.cpp
+++ b/paddle/function/GemmConvOp.cpp
@@ -340,7 +340,7 @@ public:
 REGISTER_TYPED_FUNC(GemmConv, CPU, GemmConvFunction);
 REGISTER_TYPED_FUNC(GemmConvGradInput, CPU, GemmConvGradInputFunction);
 REGISTER_TYPED_FUNC(GemmConvGradFilter, CPU, GemmConvGradFilterFunction);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(GemmConv, GPU, GemmConvFunction);
 REGISTER_TYPED_FUNC(GemmConvGradInput, GPU, GemmConvGradInputFunction);
 REGISTER_TYPED_FUNC(GemmConvGradFilter, GPU, GemmConvGradFilterFunction);
diff --git a/paddle/function/GemmConvOpTest.cpp b/paddle/function/GemmConvOpTest.cpp
index bd1cf3c6a4..b5b5e1f35b 100644
--- a/paddle/function/GemmConvOpTest.cpp
+++ b/paddle/function/GemmConvOpTest.cpp
@@ -24,7 +24,7 @@ TEST(GemmConv, NaiveConv) {
       "NaiveConv-CPU", "GemmConv-CPU", forward);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(GemmConv, Forward) {
   Convolution<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU>(
       "GemmConv-CPU", "GemmConv-GPU", forward);
diff --git a/paddle/function/Im2ColTest.cpp b/paddle/function/Im2ColTest.cpp
index 55325e94b5..a0a01a5fc7 100644
--- a/paddle/function/Im2ColTest.cpp
+++ b/paddle/function/Im2ColTest.cpp
@@ -116,7 +116,7 @@ void TestIm2ColFunctor() {
 
 TEST(Im2ColFunctor, CPU) { TestIm2ColFunctor<DEVICE_TYPE_CPU, float>(); }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 TEST(Im2ColFunctor, GPU) { TestIm2ColFunctor<DEVICE_TYPE_GPU, float>(); }
 
diff --git a/paddle/function/MulOp.cpp b/paddle/function/MulOp.cpp
index 655026320c..704a8c4132 100644
--- a/paddle/function/MulOp.cpp
+++ b/paddle/function/MulOp.cpp
@@ -341,7 +341,7 @@ private:
 };
 
 REGISTER_TYPED_FUNC(MulOp, CPU, MulFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(MulOp, GPU, MulFunc);
 #endif
 }  // namespace paddle
diff --git a/paddle/function/PadOp.cpp b/paddle/function/PadOp.cpp
index 24c9bf4e72..eed2f2e308 100644
--- a/paddle/function/PadOp.cpp
+++ b/paddle/function/PadOp.cpp
@@ -207,7 +207,7 @@ private:
 
 REGISTER_TYPED_FUNC(Pad, CPU, PadFunc);
 REGISTER_TYPED_FUNC(PadGrad, CPU, PadGradFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(Pad, GPU, PadFunc);
 REGISTER_TYPED_FUNC(PadGrad, GPU, PadGradFunc);
 #endif
diff --git a/paddle/function/RowConvOp.cpp b/paddle/function/RowConvOp.cpp
index 09e702f71a..7c802d6627 100644
--- a/paddle/function/RowConvOp.cpp
+++ b/paddle/function/RowConvOp.cpp
@@ -217,7 +217,7 @@ public:
 
 REGISTER_TYPED_FUNC(RowConv, CPU, RowConvFunc);
 REGISTER_TYPED_FUNC(RowConvGrad, CPU, RowConvGradFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(RowConv, GPU, RowConvFunc);
 REGISTER_TYPED_FUNC(RowConvGrad, GPU, RowConvGradFunc);
 #endif
diff --git a/paddle/function/SwitchOp.cpp b/paddle/function/SwitchOp.cpp
index db839b5b76..597723a2dd 100644
--- a/paddle/function/SwitchOp.cpp
+++ b/paddle/function/SwitchOp.cpp
@@ -132,7 +132,7 @@ public:
 
 REGISTER_TYPED_FUNC(NCHW2NHWC, CPU, NCHW2NHWCFunc);
 REGISTER_TYPED_FUNC(NHWC2NCHW, CPU, NHWC2NCHWFunc);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 REGISTER_TYPED_FUNC(NCHW2NHWC, GPU, NCHW2NHWCFunc);
 REGISTER_TYPED_FUNC(NHWC2NCHW, GPU, NHWC2NCHWFunc);
 #endif
diff --git a/paddle/gserver/layers/BatchNormBaseLayer.cpp b/paddle/gserver/layers/BatchNormBaseLayer.cpp
index 55f52816ab..bc7d1c83a4 100644
--- a/paddle/gserver/layers/BatchNormBaseLayer.cpp
+++ b/paddle/gserver/layers/BatchNormBaseLayer.cpp
@@ -16,7 +16,7 @@ limitations under the License. */
 #include "BatchNormalizationLayer.h"
 #include "Layer.h"
 #include "paddle/utils/Stat.h"
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include "CudnnBatchNormLayer.h"
 #endif
 
diff --git a/paddle/gserver/layers/BatchNormalizationLayer.cpp b/paddle/gserver/layers/BatchNormalizationLayer.cpp
index 33cf24431d..dacff25e59 100644
--- a/paddle/gserver/layers/BatchNormalizationLayer.cpp
+++ b/paddle/gserver/layers/BatchNormalizationLayer.cpp
@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include "paddle/utils/Stat.h"
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include "hl_batch_transpose.h"
 #endif
 #include "BatchNormalizationLayer.h"
@@ -90,7 +90,7 @@ void BatchNormalizationLayer::expandMat(const MatrixPtr& in, MatrixPtr& out) {
   size_t batchSize = in->getHeight();
   CHECK_EQ(out->getHeight(), batchSize * imgPixels_);
   if (useGpu_) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
     LOG(FATAL) << "paddle is compiled only for cpu";
 #else
     batchTranspose(
@@ -127,7 +127,7 @@ void BatchNormalizationLayer::shrinkMat(const MatrixPtr& in, MatrixPtr& out) {
   }
   CHECK_EQ(in->getHeight(), static_cast<size_t>(batchSize * imgPixels_));
   if (useGpu_) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
     LOG(FATAL) << "paddle is compiled only for cpu";
 #else
     batchTranspose(
diff --git a/paddle/gserver/layers/PoolLayer.cpp b/paddle/gserver/layers/PoolLayer.cpp
index 43ab4e4d47..7b932d5a76 100644
--- a/paddle/gserver/layers/PoolLayer.cpp
+++ b/paddle/gserver/layers/PoolLayer.cpp
@@ -15,7 +15,7 @@ limitations under the License. */
 #include "PoolLayer.h"
 #include "PoolProjectionLayer.h"
 #include "paddle/utils/Logging.h"
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include "CudnnPoolLayer.h"
 #endif
 namespace paddle {
@@ -53,7 +53,7 @@ Layer* PoolLayer::create(const LayerConfig& config) {
   const std::string& pool = config.inputs(0).pool_conf().pool_type();
   if (pool == "max-projection" || pool == "avg-projection") {
     return new PoolProjectionLayer(config);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   } else if (CudnnPoolLayer::typeCheck(pool)) {
     return new CudnnPoolLayer(config);
 #endif
diff --git a/paddle/gserver/tests/LayerGradUtil.cpp b/paddle/gserver/tests/LayerGradUtil.cpp
index 59df057a80..cd957c7c0b 100644
--- a/paddle/gserver/tests/LayerGradUtil.cpp
+++ b/paddle/gserver/tests/LayerGradUtil.cpp
@@ -674,7 +674,7 @@ void testLayerGradKernel(TestConfig testConf,
                          bool useGpu,
                          bool useWeight,
                          float epsilon) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   if (useGpu) return;
 #endif
   FLAGS_use_gpu = useGpu;
diff --git a/paddle/gserver/tests/test_BatchNorm.cpp b/paddle/gserver/tests/test_BatchNorm.cpp
index c1c85f8fac..050fde9d0a 100644
--- a/paddle/gserver/tests/test_BatchNorm.cpp
+++ b/paddle/gserver/tests/test_BatchNorm.cpp
@@ -119,7 +119,7 @@ TEST(Layer, batchNorm) {
   CHECK_EQ(static_cast<int>(convLayer->getOutputValue()->getWidth()), 576);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 void batchNormInference(int n, int c, int h, int w) {
   MatrixPtr input = std::make_shared<GpuMatrix>(n, c * h * w);
   MatrixPtr cudnnOut = std::make_shared<GpuMatrix>(n, c * h * w);
diff --git a/paddle/gserver/tests/test_ConvUnify.cpp b/paddle/gserver/tests/test_ConvUnify.cpp
index 16556469cb..ffcc47e2a8 100644
--- a/paddle/gserver/tests/test_ConvUnify.cpp
+++ b/paddle/gserver/tests/test_ConvUnify.cpp
@@ -117,7 +117,7 @@ MatrixPtr doOneConvTest(size_t imgSize,
 }
 
 TEST(Layer, convParaUnified) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   MatrixPtr input, resultCpu, resultGpu;
 
   /// TEST1 for conv ///
diff --git a/paddle/gserver/tests/test_DetectionOutput.cpp b/paddle/gserver/tests/test_DetectionOutput.cpp
index 1a83f48fae..dc39c97a87 100644
--- a/paddle/gserver/tests/test_DetectionOutput.cpp
+++ b/paddle/gserver/tests/test_DetectionOutput.cpp
@@ -150,7 +150,7 @@ TEST(Layer, detectionOutputLayerFwd) {
                            useGpu,
                            result2);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   // GPU case 1.
   useGpu = true;
   inputLoc = Matrix::create(1, 16, false, useGpu);
diff --git a/paddle/gserver/tests/test_Evaluator.cpp b/paddle/gserver/tests/test_Evaluator.cpp
index 42bb570572..62a131171f 100644
--- a/paddle/gserver/tests/test_Evaluator.cpp
+++ b/paddle/gserver/tests/test_Evaluator.cpp
@@ -51,7 +51,7 @@ void testEvaluator(TestConfig testConf,
                    string testEvaluatorName,
                    size_t batchSize,
                    bool useGpu) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   if (useGpu) return;
 #endif
   FLAGS_use_gpu = useGpu;
diff --git a/paddle/gserver/tests/test_KmaxSeqScore.cpp b/paddle/gserver/tests/test_KmaxSeqScore.cpp
index 1594de8502..6386259882 100644
--- a/paddle/gserver/tests/test_KmaxSeqScore.cpp
+++ b/paddle/gserver/tests/test_KmaxSeqScore.cpp
@@ -97,7 +97,7 @@ TEST(Layer, kmaxSeqScoreLayer) {
       Matrix::create(subSeqStartPosition.back(), 1, false, false);
 
   std::vector<bool> mode = {false};
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   mode.push_back(true);
 #endif
 
diff --git a/paddle/gserver/tests/test_LayerGrad.cpp b/paddle/gserver/tests/test_LayerGrad.cpp
index e887dee5f9..90a3352898 100644
--- a/paddle/gserver/tests/test_LayerGrad.cpp
+++ b/paddle/gserver/tests/test_LayerGrad.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include <cudnn.h>
 #endif
 #include <gtest/gtest.h>
@@ -258,7 +258,7 @@ void testProjectionConv(size_t groups, bool isDeconv) {
                      true);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(Projection, conv) {
   /// test ConvProjection
   testProjectionConv(1, false);
@@ -422,7 +422,7 @@ TEST(Layer, depthwiseConvLayer) {
   //  'depthwise_conv' is a sepecial case of 'exconv' whose
   //  groups size equals to the input channels size.
   testDepthwiseConvLayer("exconv", /* useGpu= */ false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testDepthwiseConvLayer("exconv", /* useGpu= */ true);
 #endif
 }
@@ -480,7 +480,7 @@ void testConvLayer(const string& type, bool trans, bool useGpu) {
 
 TEST(Layer, convLayer) {
   testConvLayer("exconv", /* trans= */ false, /* useGpu= */ false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testConvLayer("exconv", /* trans= */ false, /* useGpu= */ true);
   testConvLayer("cudnn_conv", /* trans= */ false, /* useGpu= */ true);
 #endif
@@ -525,7 +525,7 @@ TEST(Layer, convTransLayer) {
   for (auto useGpu : {false, true}) {
     testConvTransLayer("exconvt", /* trans= */ false, /* useGpu= */ useGpu);
   }
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testConvTransLayer("cudnn_convt", /* trans= */ false, /* useGpu= */ true);
 #endif
 }
@@ -638,7 +638,7 @@ TEST(Layer, SelectiveFullyConnectedLayer) {
                 /* trans= */ false,
                 /* useGup= */ false,
                 false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testLayerGrad(config,
                 "selective_fc",
                 100,
@@ -1210,7 +1210,7 @@ void testPoolLayer(const string& poolType, bool trans, bool useGpu) {
   testLayerGrad(config, "pool", 100, trans, useGpu);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 void testPoolLayer2(const string& poolType, bool trans, bool useGpu) {
   TestConfig config;
   config.inputDefs.push_back({INPUT_DATA, "layer_0", 3200, 0});
@@ -1236,7 +1236,7 @@ TEST(Layer, PoolLayer) {
   testPoolLayer("avg-projection", /* trans= */ false, /* useGpu= */ false);
   testPoolLayer("max-projection", /* trans= */ false, /* useGpu= */ false);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testPoolLayer("avg-projection", /* trans= */ false, /* useGpu= */ true);
   testPoolLayer("max-projection", /* trans= */ false, /* useGpu= */ true);
   testPoolLayer("cudnn-max-pool", /* trans= */ false, /* useGpu= */ true);
@@ -1309,7 +1309,7 @@ void testPool3DLayer(const string& poolType, bool trans, bool useGpu) {
 TEST(Layer, Pool3DLayer) {
   testPool3DLayer("avg", /* trans= */ false, /* useGpu= */ false);
   testPool3DLayer("max", /* trans= */ false, /* useGpu= */ false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testPool3DLayer("avg", /* trans= */ false, /* useGpu= */ true);
   testPool3DLayer("max", /* trans= */ false, /* useGpu= */ true);
 #endif
@@ -1695,7 +1695,7 @@ void testBatchNormLayer(const string& type, bool trans, bool useGpu) {
 
 TEST(Layer, BatchNormalizationLayer) {
   testBatchNormLayer("batch_norm", false, false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testBatchNormLayer("batch_norm", false, true);
   if (hl_get_cudnn_lib_version() >= int(4000)) {
     testBatchNormLayer("cudnn_batch_norm", false, true);
@@ -1744,7 +1744,7 @@ void testBatchNorm3DLayer(const string& type, bool trans, bool useGpu) {
 
 TEST(Layer, testBatchNorm3DLayer) {
   testBatchNorm3DLayer("batch_norm", false, false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testBatchNorm3DLayer("batch_norm", false, true);
   if (hl_get_cudnn_lib_version() >= int(4000)) {
     testBatchNorm3DLayer("cudnn_batch_norm", false, true);
@@ -2262,7 +2262,7 @@ void test3DConvLayer(const string& type, bool trans, bool useGpu) {
 
 TEST(Layer, test3DConvLayer) {
   test3DConvLayer("conv3d", /* trans= */ false, /* useGpu= */ false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   test3DConvLayer("conv3d", /* trans= */ false, /* useGpu= */ true);
 #endif
 }
@@ -2339,7 +2339,7 @@ void test3DDeConvLayer(const string& type, bool trans, bool useGpu) {
 
 TEST(Layer, test3DDeConvLayer) {
   test3DDeConvLayer("deconv3d", /* trans= */ false, /* useGpu= */ false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   test3DDeConvLayer("deconv3d", /* trans= */ false, /* useGpu= */ true);
 #endif
 }
diff --git a/paddle/gserver/tests/test_NetworkCompare.cpp b/paddle/gserver/tests/test_NetworkCompare.cpp
index e322fef9a4..2b92211936 100644
--- a/paddle/gserver/tests/test_NetworkCompare.cpp
+++ b/paddle/gserver/tests/test_NetworkCompare.cpp
@@ -243,7 +243,7 @@ TEST(Compare, concat_slice) {
   compareNetwork(config_file_a, config_file_b);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(Compare, img_pool) {
   std::string config_file_a = "./gserver/tests/img_pool_a.conf";
   std::string config_file_b = "./gserver/tests/img_pool_b.conf";
diff --git a/paddle/gserver/tests/test_PriorBox.cpp b/paddle/gserver/tests/test_PriorBox.cpp
index cbc0fff7b8..8dc5568784 100644
--- a/paddle/gserver/tests/test_PriorBox.cpp
+++ b/paddle/gserver/tests/test_PriorBox.cpp
@@ -151,7 +151,7 @@ TEST(Layer, priorBoxLayerFwd) {
                     useGpu,
                     result);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   // reset the input parameters
   variance[1] = 0.1;
   variance[3] = 0.2;
diff --git a/paddle/gserver/tests/test_ProtoDataProvider.cpp b/paddle/gserver/tests/test_ProtoDataProvider.cpp
index 988dbc2513..af6472619d 100644
--- a/paddle/gserver/tests/test_ProtoDataProvider.cpp
+++ b/paddle/gserver/tests/test_ProtoDataProvider.cpp
@@ -485,7 +485,7 @@ TEST(ProtoDataProvider, test) {
               // Currently in async mode, useGpu is not supported
               continue;
             }
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
             if (useGpu) {
               continue;
             }
@@ -525,7 +525,7 @@ TEST(ProtoDataProvider, constant_slots) {
       for (int numConstantSlots : {1, 2}) {
         for (int useGpu : numTwoArray) {
           for (int dataCompression : numTwoArray) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
             if (useGpu) {
               continue;
             }
@@ -708,7 +708,7 @@ TEST(ProtoSequenceDataProvider, test) {
               // Currently in async mode, useGpu is not supported
               continue;
             }
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
             if (useGpu) {
               continue;
             }
diff --git a/paddle/gserver/tests/test_PyDataProvider.cpp b/paddle/gserver/tests/test_PyDataProvider.cpp
index f6522febf8..fe54799259 100644
--- a/paddle/gserver/tests/test_PyDataProvider.cpp
+++ b/paddle/gserver/tests/test_PyDataProvider.cpp
@@ -37,7 +37,7 @@ TEST(PyDataProvider, py_fill_slots) {
   config.clear_files();
   std::string dataFile = "gserver/tests/pyDataProvider/pyDataProviderList";
   config.set_files(dataFile);
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   bool useGpu = false;
 #else
   bool useGpu = true;
@@ -71,7 +71,7 @@ TEST(PyDataProvider, py_fill_nest_slots) {
   std::string dataFile = "gserver/tests/pyDataProvider/pyDataProviderList";
   config.set_files(dataFile);
   EXPECT_EQ(config.IsInitialized(), true);
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   bool useGpu = false;
 #else
   bool useGpu = true;
diff --git a/paddle/gserver/tests/test_SelectiveFCLayer.cpp b/paddle/gserver/tests/test_SelectiveFCLayer.cpp
index b25d32fb2c..4c87fe1bba 100644
--- a/paddle/gserver/tests/test_SelectiveFCLayer.cpp
+++ b/paddle/gserver/tests/test_SelectiveFCLayer.cpp
@@ -321,7 +321,7 @@ TEST(Layer, SelectiveFcLayer_train_dense_mul) {
       "filelist=gserver/tests/SelectiveFcTest/dense_mul_list";
 
   for (auto useGpu : {false, true}) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
     if (useGpu) {
       break;
     }
@@ -388,7 +388,7 @@ void testSelectiveFcLayerTrainSparseMul(const LayerConfig& config,
                           outMatSelfc->getWidth(),
                           outMatSelfc->getElementCnt()));
   cpuOutMatSelfc->copyFrom(*outMatSelfc, HPPL_STREAM_DEFAULT);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   if (useGpu) {
     hl_stream_synchronize(HPPL_STREAM_DEFAULT);
   }
@@ -418,7 +418,7 @@ void testSelectiveFcLayerTrainSparseMul(const LayerConfig& config,
   MatrixPtr cpuOutMatFc(
       new CpuMatrix(outMatFc->getHeight(), outMatFc->getWidth()));
   cpuOutMatFc->copyFrom(*outMatFc, HPPL_STREAM_DEFAULT);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   if (useGpu) {
     hl_stream_synchronize(HPPL_STREAM_DEFAULT);
   }
@@ -443,7 +443,7 @@ TEST(Layer, SelectiveFcLayer_train_sparse_mul) {
   selLayerConfig.set_size(fcLayerWidth);
 
   testSelectiveFcLayerTrainSparseMul(selLayerConfig, false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testSelectiveFcLayerTrainSparseMul(selLayerConfig, true);
 #endif
 }
diff --git a/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp b/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp
index f28149081b..3366002ca1 100644
--- a/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp
+++ b/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp
@@ -195,7 +195,7 @@ TEST(Layer, SeqSliceLayer) {
   vector<vector<real>> ends;
 
   std::vector<bool> mode = {false};
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   mode.push_back(true);
 #endif
   genSeqInfo(seqStartPos, subSeqStartPos);
diff --git a/paddle/gserver/tests/test_WarpCTCLayer.cpp b/paddle/gserver/tests/test_WarpCTCLayer.cpp
index ae5b64257f..da82946006 100644
--- a/paddle/gserver/tests/test_WarpCTCLayer.cpp
+++ b/paddle/gserver/tests/test_WarpCTCLayer.cpp
@@ -199,7 +199,7 @@ TEST(Layer, WarpCTCLayer) {
     for (auto batchSize : {1, 10, 32}) {
       for (auto normByTimes : {false, true}) {
         for (auto useGpu : {false, true}) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
           if (useGpu) continue;
 #endif
           LOG(INFO) << "layerSize=" << layerSize << " batchSize=" << batchSize
diff --git a/paddle/math/Matrix.cpp b/paddle/math/Matrix.cpp
index de02f9c0d5..c3e34d5309 100644
--- a/paddle/math/Matrix.cpp
+++ b/paddle/math/Matrix.cpp
@@ -670,7 +670,7 @@ void GpuMatrix::leftMul(Matrix& a, real scaleAB, real scaleT) {
 }
 
 void GpuMatrix::selectRows(Matrix& table, IVector& ids) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   CHECK(dynamic_cast<GpuMatrix*>(&table));
   CHECK(table.useGpu());
   CHECK(ids.useGpu());
@@ -694,7 +694,7 @@ void GpuMatrix::selectRows(Matrix& table, IVector& ids) {
 }
 
 void GpuMatrix::addToRows(Matrix& table, IVector& ids) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   CHECK(dynamic_cast<GpuMatrix*>(&table));
   CHECK(table.useGpu());
   CHECK(ids.useGpu());
@@ -741,7 +741,7 @@ void GpuMatrix::rowMax(Matrix& max) {
 }
 
 void GpuMatrix::rowMax(IVector& maxIds, Matrix& maxVal) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   CHECK(maxIds.useGpu() && maxVal.useGpu()) << "Matrix type are not equal";
   size_t numSamples = getHeight();
   size_t beam = maxVal.getWidth();
diff --git a/paddle/math/SparseMatrix.cpp b/paddle/math/SparseMatrix.cpp
index 1f31082ae8..284b68d590 100644
--- a/paddle/math/SparseMatrix.cpp
+++ b/paddle/math/SparseMatrix.cpp
@@ -836,7 +836,7 @@ void GpuSparseMatrix::zeroMem() {
 }
 
 void GpuSparseMatrix::rowMax(IVector& maxIds, Matrix& maxVal) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   CHECK(maxIds.useGpu() && maxVal.useGpu()) << "Matrix type are not equal";
   size_t numSamples = getHeight();
   size_t beam = maxVal.getWidth();
diff --git a/paddle/math/Vector.cpp b/paddle/math/Vector.cpp
index 54e57b255d..ff72672e3a 100644
--- a/paddle/math/Vector.cpp
+++ b/paddle/math/Vector.cpp
@@ -172,7 +172,7 @@ void GpuVectorT<T>::isEqualTo(const VectorT<T>& b, const T& value) {
 
 template <class T>
 void GpuVectorT<T>::selectFrom(const VectorT<T>& src, const VectorT<int>& ids) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   hl_vector_select_from<T>(this->getData(),
                            this->getSize(),
                            src.getData(),
@@ -850,7 +850,7 @@ CpuGpuVectorT<T>::CpuGpuVectorT(CpuGpuVectorT<T>& src,
                                 size_t size)
     : sync_(nullptr) {
   CHECK_LE(offset + size, static_cast<size_t>(src.getSize()));
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   SyncedFlag* flag = src.getSync();
   if (*flag == DATA_AT_CPU) {
     src.copyToGpu();  // will set synchronous data between CPU and GPU
@@ -861,7 +861,7 @@ CpuGpuVectorT<T>::CpuGpuVectorT(CpuGpuVectorT<T>& src,
   auto cMemHandle = (src.getVector(false))->getMemoryHandle();
   cpuVectorT_ = std::make_shared<CpuVectorT<T>>(
       size, std::dynamic_pointer_cast<CpuMemoryHandle>(cMemHandle), offset);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   auto gMemHandle = (src.getVector(true))->getMemoryHandle();
   gpuVectorT_ = std::make_shared<GpuVectorT<T>>(
       size, std::dynamic_pointer_cast<GpuMemoryHandle>(gMemHandle), offset);
diff --git a/paddle/math/tests/test_Allocator.cpp b/paddle/math/tests/test_Allocator.cpp
index cf2f66aea1..1fecf659e5 100644
--- a/paddle/math/tests/test_Allocator.cpp
+++ b/paddle/math/tests/test_Allocator.cpp
@@ -68,7 +68,7 @@ void testPoolAllocator() {
 
 TEST(Allocator, Pool) {
   testPoolAllocator<CpuAllocator>();
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testPoolAllocator<GpuAllocator>();
 #endif
 }
@@ -92,7 +92,7 @@ TEST(MemoryHandle, Cpu) {
   EXPECT_EQ(ptr1, ptr2);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(MemoryHandle, Gpu) {
   int numGpu = hl_get_device_count();
 
diff --git a/paddle/math/tests/test_BaseMatrix.cpp b/paddle/math/tests/test_BaseMatrix.cpp
index 730759f3db..1766257860 100644
--- a/paddle/math/tests/test_BaseMatrix.cpp
+++ b/paddle/math/tests/test_BaseMatrix.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 /**
  * This test file use autotest::AutoCompare and cmpWithoutArg to compares the
  * implementation of CPU and GPU member function in
diff --git a/paddle/math/tests/test_CpuGpuVector.cpp b/paddle/math/tests/test_CpuGpuVector.cpp
index ccb4a902b0..c72f89c824 100644
--- a/paddle/math/tests/test_CpuGpuVector.cpp
+++ b/paddle/math/tests/test_CpuGpuVector.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 #include <gtest/gtest.h>
 #include "paddle/math/Vector.h"
diff --git a/paddle/math/tests/test_ExecViaCpu.cpp b/paddle/math/tests/test_ExecViaCpu.cpp
index 2d439cd060..25e0ba11de 100644
--- a/paddle/math/tests/test_ExecViaCpu.cpp
+++ b/paddle/math/tests/test_ExecViaCpu.cpp
@@ -94,7 +94,7 @@ void testWrapper(F&& f) {
   }
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(ExecViaCpu, test1) {
   testWrapper(f);
   testWrapper(&f);
diff --git a/paddle/math/tests/test_GpuProfiler.cpp b/paddle/math/tests/test_GpuProfiler.cpp
index 6dab187e3e..9402bd3ec4 100644
--- a/paddle/math/tests/test_GpuProfiler.cpp
+++ b/paddle/math/tests/test_GpuProfiler.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 #include <gtest/gtest.h>
 #include "paddle/math/Matrix.h"
diff --git a/paddle/math/tests/test_Matrix.cpp b/paddle/math/tests/test_Matrix.cpp
index 7a145eae6a..2f99fa3581 100644
--- a/paddle/math/tests/test_Matrix.cpp
+++ b/paddle/math/tests/test_Matrix.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 /**
  * This test file use autotest::AutoCompare and cmpWithArg to compares the
  * implementation of CPU and GPU member function in Matrix.cpp.
diff --git a/paddle/math/tests/test_SparseMatrix.cpp b/paddle/math/tests/test_SparseMatrix.cpp
index 8151dde106..8abbe8d82e 100644
--- a/paddle/math/tests/test_SparseMatrix.cpp
+++ b/paddle/math/tests/test_SparseMatrix.cpp
@@ -47,7 +47,7 @@ struct MatrixPara {
   SparseFormat format;
 };
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 void test_sparse_matrix_mul(MatrixPara paraA,
                             MatrixPara paraB,
                             MatrixPara paraC) {
@@ -452,7 +452,7 @@ TEST(Matrix, SparseMatrixCSRFormatTrimFrom) {
   matB->trimFrom(*mat);
   checkSMatrixEqual2(matA, matB);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   GpuSparseMatrixPtr matC = std::make_shared<GpuSparseMatrix>(
       height, trimedWidth, height, FLOAT_VALUE, SPARSE_CSR, true);
   matC->trimFrom(*mat);
@@ -546,7 +546,7 @@ TEST(Matrix, SparseMatrixCSCFormatTrimFrom) {
   matB->trimFrom(*mat);
   checkSMatrixEqual2(matA, matB);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   GpuSparseMatrixPtr matC = std::make_shared<GpuSparseMatrix>(
       height, trimedWidth, height, FLOAT_VALUE, SPARSE_CSC, true);
   matC->trimFrom(*mat);
diff --git a/paddle/math/tests/test_TrainingAlgorithm.cpp b/paddle/math/tests/test_TrainingAlgorithm.cpp
index 36ac024007..5ae0aa036f 100644
--- a/paddle/math/tests/test_TrainingAlgorithm.cpp
+++ b/paddle/math/tests/test_TrainingAlgorithm.cpp
@@ -91,7 +91,7 @@ int VectorCheckErr(const VectorPtr& vector1, const VectorPtr& vector2) {
 typedef std::function<void(size_t size, bool useGpu)> testMatrixFunc;
 
 void testCase(testMatrixFunc matrixFunc) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   for (auto useGpu : {false, true}) {
 #else
   for (auto useGpu : {false}) {
diff --git a/paddle/math/tests/test_batchTranspose.cpp b/paddle/math/tests/test_batchTranspose.cpp
index 0189e534eb..b70a619764 100644
--- a/paddle/math/tests/test_batchTranspose.cpp
+++ b/paddle/math/tests/test_batchTranspose.cpp
@@ -17,7 +17,7 @@ limitations under the License. */
 
 using namespace paddle;  // NOLINT
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(MatrixBatchTransTest, test_batch_matrix_transpose) {
   const int nx = 100;
   const int ny = 50;
diff --git a/paddle/math/tests/test_matrixCompare.cpp b/paddle/math/tests/test_matrixCompare.cpp
index 7735877ac8..7e5a1db44a 100644
--- a/paddle/math/tests/test_matrixCompare.cpp
+++ b/paddle/math/tests/test_matrixCompare.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 /// This unittest checks GpuMatrix/CpuMatrix get same result, so disable when
 /// only cpu version.
 
diff --git a/paddle/math/tests/test_perturbation.cpp b/paddle/math/tests/test_perturbation.cpp
index dff18136ae..c7c07c817a 100644
--- a/paddle/math/tests/test_perturbation.cpp
+++ b/paddle/math/tests/test_perturbation.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 #include <cuda_runtime.h>
 #include <gtest/gtest.h>
diff --git a/paddle/math/tests/test_sparseMatrixCompare.cpp b/paddle/math/tests/test_sparseMatrixCompare.cpp
index e39cc0a2f6..2b2a391b9d 100644
--- a/paddle/math/tests/test_sparseMatrixCompare.cpp
+++ b/paddle/math/tests/test_sparseMatrixCompare.cpp
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 /// This unittest checks GpuSparseMatrix/CpuSparseMatrix get same result,
 //  so disable when
 /// only cpu version.
diff --git a/paddle/memory/detail/buddy_allocator.cc b/paddle/memory/detail/buddy_allocator.cc
index ed0c3374ff..fdc5ed19dc 100644
--- a/paddle/memory/detail/buddy_allocator.cc
+++ b/paddle/memory/detail/buddy_allocator.cc
@@ -175,7 +175,7 @@ void* BuddyAllocator::SystemAlloc(size_t size) {
 }
 
 BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   if (system_allocator_->UseGpu()) {
     if ((total_used_ + total_free_) == 0) {
       // Compute the maximum allocation size for the first allocation.
diff --git a/paddle/memory/detail/system_allocator.cc b/paddle/memory/detail/system_allocator.cc
index 64f8182b5c..6c9a46dd09 100644
--- a/paddle/memory/detail/system_allocator.cc
+++ b/paddle/memory/detail/system_allocator.cc
@@ -62,7 +62,7 @@ void CPUAllocator::Free(void* p, size_t size, size_t index) {
 
 bool CPUAllocator::UseGpu() const { return false; }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 void* GPUAllocator::Alloc(size_t& index, size_t size) {
   // CUDA documentation doesn't explain if cudaMalloc returns nullptr
diff --git a/paddle/memory/detail/system_allocator.h b/paddle/memory/detail/system_allocator.h
index 6b1f40347b..ee9b012f91 100644
--- a/paddle/memory/detail/system_allocator.h
+++ b/paddle/memory/detail/system_allocator.h
@@ -40,7 +40,7 @@ class CPUAllocator : public SystemAllocator {
   virtual bool UseGpu() const;
 };
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 class GPUAllocator : public SystemAllocator {
  public:
   virtual void* Alloc(size_t& index, size_t size);
diff --git a/paddle/memory/detail/system_allocator_test.cc b/paddle/memory/detail/system_allocator_test.cc
index 57d5443d50..cd563844e7 100644
--- a/paddle/memory/detail/system_allocator_test.cc
+++ b/paddle/memory/detail/system_allocator_test.cc
@@ -56,7 +56,7 @@ TEST(CPUAllocator, LockMem) {
   TestAllocator(a, 0);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(GPUAllocator, Alloc) {
   paddle::memory::detail::GPUAllocator a;
   TestAllocator(a, 2048);
diff --git a/paddle/memory/memcpy.cc b/paddle/memory/memcpy.cc
index 184d0f8fa7..790420a8ab 100644
--- a/paddle/memory/memcpy.cc
+++ b/paddle/memory/memcpy.cc
@@ -26,7 +26,7 @@ void Copy<platform::CPUPlace, platform::CPUPlace>(platform::CPUPlace, void* dst,
   std::memcpy(dst, src, num);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 template <>
 void Copy<platform::CPUPlace, platform::GPUPlace>(platform::CPUPlace dst_place,
                                                   void* dst,
diff --git a/paddle/memory/memcpy.h b/paddle/memory/memcpy.h
index 7142831d43..0bccee58c3 100644
--- a/paddle/memory/memcpy.h
+++ b/paddle/memory/memcpy.h
@@ -33,7 +33,7 @@ namespace memory {
 template <typename DstPlace, typename SrcPlace>
 void Copy(DstPlace, void* dst, SrcPlace, const void* src, size_t num);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 /**
  * \brief   Copy memory from one place to another place.
diff --git a/paddle/memory/memory.cc b/paddle/memory/memory.cc
index 6d5a74dafe..355b6218d0 100644
--- a/paddle/memory/memory.cc
+++ b/paddle/memory/memory.cc
@@ -62,7 +62,7 @@ size_t Used<platform::CPUPlace>(platform::CPUPlace place) {
   return GetCPUBuddyAllocator()->Used();
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
   using BuddyAllocVec = std::vector<BuddyAllocator*>;
diff --git a/paddle/memory/memory_test.cc b/paddle/memory/memory_test.cc
index 7a617f04dc..0d402038a0 100644
--- a/paddle/memory/memory_test.cc
+++ b/paddle/memory/memory_test.cc
@@ -80,7 +80,7 @@ TEST(BuddyAllocator, CPUMultAlloc) {
   }
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 size_t align(size_t size, paddle::platform::GPUPlace place) {
   size += sizeof(paddle::memory::detail::Metadata);
diff --git a/paddle/operators/detail/strided_memcpy.h b/paddle/operators/detail/strided_memcpy.h
index 9f05a26322..068c82f399 100644
--- a/paddle/operators/detail/strided_memcpy.h
+++ b/paddle/operators/detail/strided_memcpy.h
@@ -34,7 +34,7 @@ struct StridedMemcpyFunctor<T, 1> {
       auto& cpu_place = boost::get<platform::CPUPlace>(place);
       memory::Copy(cpu_place, dst, cpu_place, src, sizeof(T) * dst_dim.head);
     } else {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
       auto& gpu_place = boost::get<platform::GPUPlace>(place);
       auto& cuda_ctx =
           reinterpret_cast<const platform::CUDADeviceContext&>(dev_ctx);
diff --git a/paddle/operators/math/im2col_test.cc b/paddle/operators/math/im2col_test.cc
index 3d040ca2b5..40bdbfe733 100644
--- a/paddle/operators/math/im2col_test.cc
+++ b/paddle/operators/math/im2col_test.cc
@@ -71,7 +71,7 @@ void testIm2col() {
     context =
         new paddle::platform::CPUDeviceContext(paddle::platform::CPUPlace());
   } else {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
     context =
         new paddle::platform::CUDADeviceContext(paddle::platform::GPUPlace());
 #else
@@ -116,7 +116,7 @@ void testIm2col() {
 
 TEST(math, im2col) {
   testIm2col<paddle::platform::CPUPlace>();
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   testIm2col<paddle::platform::GPUPlace>();
 #endif
 }
diff --git a/paddle/operators/math/math_function_test.cc b/paddle/operators/math/math_function_test.cc
index 2252268620..9945ba101d 100644
--- a/paddle/operators/math/math_function_test.cc
+++ b/paddle/operators/math/math_function_test.cc
@@ -1,7 +1,7 @@
 #include "paddle/operators/math/math_function.h"
 #include "gtest/gtest.h"
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(math_function, notrans_mul_trans) {
   paddle::framework::Tensor input1;
   paddle::framework::Tensor input1_gpu;
diff --git a/paddle/operators/strided_memcpy_test.cc b/paddle/operators/strided_memcpy_test.cc
index e0dd7b19f1..68f064eaee 100644
--- a/paddle/operators/strided_memcpy_test.cc
+++ b/paddle/operators/strided_memcpy_test.cc
@@ -72,7 +72,7 @@ TEST(StridedMemcpy, CPUConcat) {
   }
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(StridedMemcpy, GPUCrop) {
   // clang-format off
   int src[] = {
@@ -157,4 +157,4 @@ TEST(StridedMemcpy, GPUConcat) {
 
 #endif
 }  // namespace operators
-}  // namespace paddle
\ No newline at end of file
+}  // namespace paddle
diff --git a/paddle/platform/device_context.cc b/paddle/platform/device_context.cc
index 8dcc357a16..a9b6b79903 100644
--- a/paddle/platform/device_context.cc
+++ b/paddle/platform/device_context.cc
@@ -35,7 +35,7 @@ Eigen::DefaultDevice* CPUDeviceContext::eigen_device() const {
 
 Place CPUDeviceContext::GetPlace() const { return CPUPlace(); }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 template <>
 Eigen::GpuDevice*
diff --git a/paddle/platform/device_context.h b/paddle/platform/device_context.h
index c1c4c7f760..ef5f19214d 100644
--- a/paddle/platform/device_context.h
+++ b/paddle/platform/device_context.h
@@ -14,7 +14,7 @@ limitations under the License. */
 #include "paddle/platform/enforce.h"
 #include "paddle/platform/place.h"
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 #include "paddle/platform/dynload/cublas.h"
 #include "paddle/platform/dynload/cudnn.h"
 #include "paddle/platform/gpu_info.h"
@@ -61,7 +61,7 @@ class CPUDeviceContext : public DeviceContext {
   std::unique_ptr<Eigen::DefaultDevice> eigen_device_;
 };
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 template <>
 struct EigenDeviceConverter<platform::GPUPlace> {
   using EigenDeviceType = Eigen::GpuDevice;
diff --git a/paddle/platform/enforce.h b/paddle/platform/enforce.h
index f9fe521d50..15d8446cd8 100644
--- a/paddle/platform/enforce.h
+++ b/paddle/platform/enforce.h
@@ -29,7 +29,7 @@ limitations under the License. */
 #include <cxxabi.h>  // for __cxa_demangle
 #endif
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 #include "paddle/platform/dynload/cublas.h"
 #include "paddle/platform/dynload/cudnn.h"
@@ -113,7 +113,7 @@ inline typename std::enable_if<sizeof...(Args) != 0, void>::type throw_on_error(
   }
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 template <typename... Args>
 inline typename std::enable_if<sizeof...(Args) != 0, void>::type throw_on_error(
diff --git a/paddle/platform/enforce_test.cc b/paddle/platform/enforce_test.cc
index 80bdee3d9d..8206a055ea 100644
--- a/paddle/platform/enforce_test.cc
+++ b/paddle/platform/enforce_test.cc
@@ -213,4 +213,4 @@ TEST(ENFORCE_USER_DEFINED_CLASS, EQ) {
 TEST(ENFORCE_USER_DEFINED_CLASS, NE) {
   Dims a{{1, 2, 3, 4}}, b{{5, 6, 7, 8}};
   ASSERT_THROW(PADDLE_ENFORCE_EQ(a, b), paddle::platform::EnforceNotMet);
-}
\ No newline at end of file
+}
diff --git a/paddle/platform/gpu_info.h b/paddle/platform/gpu_info.h
index ac884386dd..e47c9b4a2a 100644
--- a/paddle/platform/gpu_info.h
+++ b/paddle/platform/gpu_info.h
@@ -14,7 +14,7 @@ limitations under the License. */
 
 #pragma once
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 #include <cuda_runtime.h>
 #include <stddef.h>
diff --git a/paddle/platform/variant.h b/paddle/platform/variant.h
index 8145799dfd..619897ca19 100644
--- a/paddle/platform/variant.h
+++ b/paddle/platform/variant.h
@@ -16,7 +16,7 @@
 
 #include <boost/config.hpp>
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 
 // Because boost's variadic templates has bug on nvcc, boost will disable
 // variadic template support when GPU enabled on nvcc.
diff --git a/paddle/pserver/test/SocketTest.cpp b/paddle/pserver/test/SocketTest.cpp
index 96724530f5..b43461d61b 100644
--- a/paddle/pserver/test/SocketTest.cpp
+++ b/paddle/pserver/test/SocketTest.cpp
@@ -215,7 +215,7 @@ int main(int argc, char** argv) {
 
   uint64_t dataSize = FLAGS_dim * sizeof(real);
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   GpuVector gpuParam(FLAGS_dim);
   GpuVector gpuGrad(FLAGS_dim);
 #else
diff --git a/paddle/pserver/test/test_ProtoServer.cpp b/paddle/pserver/test/test_ProtoServer.cpp
index 74ab1f2f77..ad8ffed9c1 100644
--- a/paddle/pserver/test/test_ProtoServer.cpp
+++ b/paddle/pserver/test/test_ProtoServer.cpp
@@ -99,7 +99,7 @@ TEST(ProtoServer, regular) {
 }
 
 TEST(ProtoServer, extended) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   ProtoClient* client;
   if (FLAGS_rdma_tcp == "rdma")
     client = new ProtoClient(FLAGS_server_addr, FLAGS_port, F_RDMA);
diff --git a/paddle/pybind/pybind.cc b/paddle/pybind/pybind.cc
index 761d82fc4d..cff54b1741 100644
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -34,7 +34,7 @@ static size_t UniqueIntegerGenerator() {
 }
 
 bool IsCompileGPU() {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   return false;
 #else
   return true;
@@ -78,7 +78,7 @@ PYBIND11_PLUGIN(core) {
       .def("set", PyCPUTensorSetFromArray<float>)
       .def("set", PyCPUTensorSetFromArray<int>)
       .def("set", PyCPUTensorSetFromArray<double>)
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
       .def("set", PyCUDATensorSetFromArray<float>)
       .def("set", PyCUDATensorSetFromArray<int>)
       .def("set", PyCUDATensorSetFromArray<double>)
@@ -96,7 +96,7 @@ PYBIND11_PLUGIN(core) {
       .def(
           "__init__",
           [](LoDTensor &instance, const std::vector<std::vector<size_t>> &lod) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
             new (&instance) LoDTensor(lod);
 #else
              LoD new_lod;
@@ -107,7 +107,7 @@ PYBIND11_PLUGIN(core) {
           })
       .def("set_lod",
            [](LoDTensor &self, const std::vector<std::vector<size_t>> &lod) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
              self.set_lod(lod);
 #else
              LoD new_lod;
@@ -117,7 +117,7 @@ PYBIND11_PLUGIN(core) {
 #endif
            })
       .def("lod", [](LoDTensor &self) -> std::vector<std::vector<size_t>> {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
         return self.lod();
 #else
            auto lod = self.lod();
@@ -203,7 +203,7 @@ All parameter, weight, gradient are variables in Paddle.
       .def_static("create",
                   [](paddle::platform::GPUPlace& place)
                       -> paddle::platform::DeviceContext* {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
                     PADDLE_THROW("GPUPlace is not supported in CPU device.");
 #else
                     return new paddle::platform::CUDADeviceContext(place);
diff --git a/paddle/pybind/tensor_py.h b/paddle/pybind/tensor_py.h
index 62e85fa54f..9e73f79cbd 100644
--- a/paddle/pybind/tensor_py.h
+++ b/paddle/pybind/tensor_py.h
@@ -106,7 +106,7 @@ void PyCPUTensorSetFromArray(
   std::memcpy(dst, array.data(), sizeof(T) * array.size());
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 template <typename T>
 void PyCUDATensorSetFromArray(
     framework::Tensor &self,
diff --git a/paddle/string/to_string_test.cc b/paddle/string/to_string_test.cc
index 542c771a98..971484dd0c 100644
--- a/paddle/string/to_string_test.cc
+++ b/paddle/string/to_string_test.cc
@@ -36,4 +36,4 @@ TEST(to_string, user_defined) {
   using namespace paddle::string;
   UserDefinedClass instance;
   ASSERT_EQ(kOutputString, to_string(instance));
-}
\ No newline at end of file
+}
diff --git a/paddle/trainer/MergeModel.cpp b/paddle/trainer/MergeModel.cpp
index a37d53bc72..6c52eaf449 100644
--- a/paddle/trainer/MergeModel.cpp
+++ b/paddle/trainer/MergeModel.cpp
@@ -29,7 +29,7 @@ int main(int argc, char** argv) {
   initMain(argc, argv);
   initPython(argc, argv);
   string confFile = TrainerConfigHelper::getConfigNameFromPath(FLAGS_model_dir);
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   FLAGS_use_gpu = false;
 #endif
   auto config = std::make_shared<TrainerConfigHelper>(confFile);
diff --git a/paddle/trainer/tests/test_Compare.cpp b/paddle/trainer/tests/test_Compare.cpp
index b5d29da45a..f3a964acb6 100644
--- a/paddle/trainer/tests/test_Compare.cpp
+++ b/paddle/trainer/tests/test_Compare.cpp
@@ -146,7 +146,7 @@ void compareGradient(comData& comDataCpu, comData& comDataGpu) {
 }
 
 int main(int argc, char** argv) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   exit(0);
 #endif
   paddle::initMain(argc, argv);
diff --git a/paddle/trainer/tests/test_CompareSparse.cpp b/paddle/trainer/tests/test_CompareSparse.cpp
index 4da9ce20fb..5f1834bd73 100644
--- a/paddle/trainer/tests/test_CompareSparse.cpp
+++ b/paddle/trainer/tests/test_CompareSparse.cpp
@@ -174,7 +174,7 @@ TEST(compareSparse, multiGradientMachine) {
     FLAGS_local = local;
     FLAGS_ports_num_for_sparse = 5;
     for (bool useGpu : {false, true}) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
       if (useGpu) continue;
 #endif
       FLAGS_parallel_nn = useGpu;
@@ -198,7 +198,7 @@ TEST(compareSparse, NeuralNetwork) {
     FLAGS_local = local;
     FLAGS_ports_num_for_sparse = 5;
     for (bool useGpu : {false, true}) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
       if (useGpu) continue;
 #endif
       FLAGS_parallel_nn = useGpu;
diff --git a/paddle/trainer/tests/test_Trainer.cpp b/paddle/trainer/tests/test_Trainer.cpp
index f69e1aafee..425b3d10a3 100644
--- a/paddle/trainer/tests/test_Trainer.cpp
+++ b/paddle/trainer/tests/test_Trainer.cpp
@@ -51,7 +51,7 @@ void checkGradientTest(const string& configFile,
 
 TEST(checkGradient, cpu) { checkGradientTest(configFile1, false, false); }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(checkGradient, gpu) { checkGradientTest(configFile1, true, false); }
 
 TEST(checkGradient, multiGpu) {
@@ -97,7 +97,7 @@ TEST(checkGradient, hsigmoid) { checkGradientTest(configFile2, false, false); }
 
 TEST(checkGradient, chunk) {
   checkGradientTest(configFile3, false, false);
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   checkGradientTest(configFile3, true, true);
 #endif
 }
diff --git a/paddle/trainer/tests/test_TrainerOnePass.cpp b/paddle/trainer/tests/test_TrainerOnePass.cpp
index 4c4d124fa9..b2a93d4d5e 100644
--- a/paddle/trainer/tests/test_TrainerOnePass.cpp
+++ b/paddle/trainer/tests/test_TrainerOnePass.cpp
@@ -79,7 +79,7 @@ void trainerOnePassTest(const string& configFile,
 // 1. test trainer (cpu, gpu).
 TEST(trainerOnePass, cpu) { trainerOnePassTest(configFile1, false, false); }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(trainerOnePass, gpu) { trainerOnePassTest(configFile1, true, false); }
 
 TEST(trainerOnePass, gpu2) { trainerOnePassTest(configFile1, true, false, 2); }
@@ -94,7 +94,7 @@ TEST(trainerOnePass, parallel) {
 #endif
 
 // 2. test average_window.
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(average_window, gpu) {
   trainerOnePassTest(configFile1, true, false, 4, 0.01);
 }
@@ -266,7 +266,7 @@ TEST(checkRemoteUpdater, cpuTrainerOldUpdater) {
   checkRemoteParameterUpdaterTest(configFile1, false, false, 1, true);
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST(checkRemoteUpdater, gpuTrainer) {
   checkRemoteParameterUpdaterTest(configFile1, true, false);
 }
diff --git a/paddle/trainer/tests/test_recurrent_machine_generation.cpp b/paddle/trainer/tests/test_recurrent_machine_generation.cpp
index 74b4fed7ed..a8fbe31c2b 100644
--- a/paddle/trainer/tests/test_recurrent_machine_generation.cpp
+++ b/paddle/trainer/tests/test_recurrent_machine_generation.cpp
@@ -113,7 +113,7 @@ void testGeneration(const string& configFile,
 #ifndef PADDLE_TYPE_DOUBLE
 
 TEST(RecurrentGradientMachine, test_generation) {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   const auto useGpuConfs = {false};
 #else
   const auto useGpuConfs = {true, false};
diff --git a/paddle/utils/Flags.cpp b/paddle/utils/Flags.cpp
index 32155ded35..8f100f02e9 100644
--- a/paddle/utils/Flags.cpp
+++ b/paddle/utils/Flags.cpp
@@ -14,7 +14,7 @@ limitations under the License. */
 
 #include "Flags.h"
 
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
 DEFINE_bool(use_gpu, false, "Only support CPU training");
 #else
 DEFINE_bool(use_gpu, true, "Whether to use GPU for training");
diff --git a/paddle/utils/Util.h b/paddle/utils/Util.h
index 904d0f5061..9579881ea3 100644
--- a/paddle/utils/Util.h
+++ b/paddle/utils/Util.h
@@ -218,7 +218,7 @@ protected:
  * *d2* is peer device to enable direct access to by the d1 device.
  */
 inline void enablePeerAccess(int d1, int d2) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
   if (hl_device_can_access_peer(d1, d2)) {
     SetDevice dev(d1);
     hl_device_enable_peer_access(d2);
diff --git a/paddle/utils/Version.h b/paddle/utils/Version.h
index 611fda83d9..004d62451c 100644
--- a/paddle/utils/Version.h
+++ b/paddle/utils/Version.h
@@ -48,7 +48,7 @@ void printVersion(std::ostream& os);
  * @return return true if paddle compiled with GPU
  */
 constexpr bool isWithGpu() {
-#ifndef PADDLE_WITH_GPU
+#ifndef PADDLE_WITH_CUDA
   return false;
 #else
   return true;

From 2b204f048bf6599bdb9ba799769404dc5fd206a8 Mon Sep 17 00:00:00 2001
From: Yi Wang <yiwang01@baidu.com>
Date: Wed, 4 Oct 2017 14:09:19 -0700
Subject: [PATCH 08/21] Rename platform::GetDeviceCount into
 platform::GetCUDADeviceCount

---
 paddle/memory/memory.cc                | 2 +-
 paddle/platform/device_context_test.cc | 4 ++--
 paddle/platform/gpu_info.cc            | 4 ++--
 paddle/platform/gpu_info.h             | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/paddle/memory/memory.cc b/paddle/memory/memory.cc
index 6d5a74dafe..f816962890 100644
--- a/paddle/memory/memory.cc
+++ b/paddle/memory/memory.cc
@@ -77,7 +77,7 @@ BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
 
   // GPU buddy allocator initialization
   std::call_once(gpu_allocator_flag, [&]() {
-    int gpu_num = platform::GetDeviceCount();
+    int gpu_num = platform::GetCUDADeviceCount();
     allocators.reserve(gpu_num);
     for (int gpu = 0; gpu < gpu_num; gpu++) {
       platform::SetDeviceId(gpu);
diff --git a/paddle/platform/device_context_test.cc b/paddle/platform/device_context_test.cc
index f4b00c57de..8bf5174c4a 100644
--- a/paddle/platform/device_context_test.cc
+++ b/paddle/platform/device_context_test.cc
@@ -20,7 +20,7 @@ TEST(Device, Init) {
   using paddle::platform::CUDADeviceContext;
   using paddle::platform::GPUPlace;
 
-  int count = paddle::platform::GetDeviceCount();
+  int count = paddle::platform::GetCUDADeviceCount();
   for (int i = 0; i < count; i++) {
     DeviceContext* device_context = new CUDADeviceContext(GPUPlace(i));
     Eigen::GpuDevice* gpu_device =
@@ -34,7 +34,7 @@ TEST(Device, CUDADeviceContext) {
   using paddle::platform::CUDADeviceContext;
   using paddle::platform::GPUPlace;
 
-  int count = paddle::platform::GetDeviceCount();
+  int count = paddle::platform::GetCUDADeviceCount();
   for (int i = 0; i < count; i++) {
     CUDADeviceContext* device_context = new CUDADeviceContext(GPUPlace(i));
     Eigen::GpuDevice* gpu_device = device_context->eigen_device();
diff --git a/paddle/platform/gpu_info.cc b/paddle/platform/gpu_info.cc
index be381a4e26..70ad611d5d 100644
--- a/paddle/platform/gpu_info.cc
+++ b/paddle/platform/gpu_info.cc
@@ -26,11 +26,11 @@ DEFINE_double(fraction_of_gpu_memory_to_use, 0.95,
 namespace paddle {
 namespace platform {
 
-int GetDeviceCount() {
+int GetCUDADeviceCount() {
   int count;
   PADDLE_ENFORCE(
       cudaGetDeviceCount(&count),
-      "cudaGetDeviceCount failed in paddle::platform::GetDeviceCount");
+      "cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCount");
   return count;
 }
 
diff --git a/paddle/platform/gpu_info.h b/paddle/platform/gpu_info.h
index ac884386dd..276783bbe4 100644
--- a/paddle/platform/gpu_info.h
+++ b/paddle/platform/gpu_info.h
@@ -28,7 +28,7 @@ const std::string kEnvFractionGpuMemoryToUse =
     "PADDLE_FRACTION_GPU_MEMORY_TO_USE";
 
 //! Get the total number of GPU devices in system.
-int GetDeviceCount();
+int GetCUDADeviceCount();
 
 //! Get the current GPU device id in system.
 int GetCurrentDeviceId();

From ee7b3ed09e699da191fe238ab409f32318637380 Mon Sep 17 00:00:00 2001
From: qiaolongfei <qiaolongfei@baidu.com>
Date: Wed, 4 Oct 2017 15:33:44 -0700
Subject: [PATCH 09/21] use EigenScalar to get learning_rate from GPU device

---
 paddle/operators/sgd_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/paddle/operators/sgd_op.h b/paddle/operators/sgd_op.h
index d72d333a9a..954fd48272 100644
--- a/paddle/operators/sgd_op.h
+++ b/paddle/operators/sgd_op.h
@@ -23,6 +23,9 @@ using Tensor = framework::Tensor;
 template <typename T, int MajorType = Eigen::RowMajor,
           typename IndexType = Eigen::DenseIndex>
 using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>;
 
 template <typename Place, typename T>
 class SGDOpKernel : public framework::OpKernel<T> {
@@ -31,13 +34,14 @@ class SGDOpKernel : public framework::OpKernel<T> {
     auto param = ctx.Input<Tensor>("Param");
     auto grad = ctx.Input<Tensor>("Grad");
     auto param_out = ctx.Output<Tensor>("ParamOut");
-    float lr = ctx.Input<Tensor>("LearningRate")->data<float>()[0];
+    auto learning_rate = ctx.Input<Tensor>("LearningRate");
 
     param_out->mutable_data<T>(ctx.GetPlace());
 
     auto p = EigenVector<T>::Flatten(*param);
     auto g = EigenVector<T>::Flatten(*grad);
     auto o = EigenVector<T>::Flatten(*param_out);
+    auto lr = EigenScalar<T>::From(*learning_rate);
     auto place = ctx.GetEigenDevice<Place>();
 
     o.device(place) = p - lr * g;

From 775c60246b66469e06f01a50c89b7b39594a3b63 Mon Sep 17 00:00:00 2001
From: qiaolongfei <qiaolongfei@baidu.com>
Date: Wed, 4 Oct 2017 16:53:21 -0700
Subject: [PATCH 10/21] remove using in sgd header file

---
 paddle/operators/sgd_op.h | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/paddle/operators/sgd_op.h b/paddle/operators/sgd_op.h
index 954fd48272..b501d244d7 100644
--- a/paddle/operators/sgd_op.h
+++ b/paddle/operators/sgd_op.h
@@ -19,32 +19,25 @@ limitations under the License. */
 namespace paddle {
 namespace operators {
 
-using Tensor = framework::Tensor;
-template <typename T, int MajorType = Eigen::RowMajor,
-          typename IndexType = Eigen::DenseIndex>
-using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
-template <typename T, int MajorType = Eigen::RowMajor,
-          typename IndexType = Eigen::DenseIndex>
-using EigenScalar = framework::EigenScalar<T, MajorType, IndexType>;
-
 template <typename Place, typename T>
 class SGDOpKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& ctx) const override {
-    auto param = ctx.Input<Tensor>("Param");
-    auto grad = ctx.Input<Tensor>("Grad");
-    auto param_out = ctx.Output<Tensor>("ParamOut");
-    auto learning_rate = ctx.Input<Tensor>("LearningRate");
+    auto param = ctx.Input<framework::Tensor>("Param");
+    auto grad = ctx.Input<framework::Tensor>("Grad");
+    auto param_out = ctx.Output<framework::Tensor>("ParamOut");
+    auto learning_rate = ctx.Input<framework::Tensor>("LearningRate");
 
     param_out->mutable_data<T>(ctx.GetPlace());
 
-    auto p = EigenVector<T>::Flatten(*param);
-    auto g = EigenVector<T>::Flatten(*grad);
-    auto o = EigenVector<T>::Flatten(*param_out);
-    auto lr = EigenScalar<T>::From(*learning_rate);
+    auto p = framework::EigenVector<T>::Flatten(*param);
+    auto g = framework::EigenVector<T>::Flatten(*grad);
+    auto o = framework::EigenVector<T>::Flatten(*param_out);
+    auto lr = framework::EigenVector<T>::From(*learning_rate);
     auto place = ctx.GetEigenDevice<Place>();
 
-    o.device(place) = p - lr * g;
+    Eigen::DSizes<int, 2> grad_dsize(grad->dims()[0], grad->dims()[1]);
+    o.device(place) = p - lr.broadcast(grad_dsize) * g;
   }
 };
 

From 8ebc31d9358c919fdd6f50d502f4ee071a91d38e Mon Sep 17 00:00:00 2001
From: qiaolongfei <qiaolongfei@baidu.com>
Date: Wed, 4 Oct 2017 17:13:02 -0700
Subject: [PATCH 11/21] optimize the dsize

---
 paddle/operators/sgd_op.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/paddle/operators/sgd_op.h b/paddle/operators/sgd_op.h
index b501d244d7..26f4012f25 100644
--- a/paddle/operators/sgd_op.h
+++ b/paddle/operators/sgd_op.h
@@ -33,10 +33,10 @@ class SGDOpKernel : public framework::OpKernel<T> {
     auto p = framework::EigenVector<T>::Flatten(*param);
     auto g = framework::EigenVector<T>::Flatten(*grad);
     auto o = framework::EigenVector<T>::Flatten(*param_out);
-    auto lr = framework::EigenVector<T>::From(*learning_rate);
+    auto lr = framework::EigenVector<T>::Flatten(*learning_rate);
     auto place = ctx.GetEigenDevice<Place>();
 
-    Eigen::DSizes<int, 2> grad_dsize(grad->dims()[0], grad->dims()[1]);
+    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
     o.device(place) = p - lr.broadcast(grad_dsize) * g;
   }
 };

From 46530f9e666012dd83c8763563f7acae8f5aad30 Mon Sep 17 00:00:00 2001
From: Kavya Srinet <kavyasrinet@baidu.com>
Date: Wed, 4 Oct 2017 19:02:22 -0700
Subject: [PATCH 12/21] Added Leaky Relu activation

---
 paddle/operators/activation_op.cc             | 19 ++++++++++++
 paddle/operators/activation_op.h              | 30 ++++++++++++++++++-
 .../v2/framework/tests/test_activation_op.py  | 17 +++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/paddle/operators/activation_op.cc b/paddle/operators/activation_op.cc
index 7ae4d2f6b6..5f2ecc2673 100644
--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -69,6 +69,22 @@ class ReluOpMaker : public framework::OpProtoAndCheckerMaker {
   }
 };
 
+template <typename AttrType>
+class LeakyReluOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  LeakyReluOpMaker(framework::OpProto *proto,
+                   framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "Input of LeakyRelu operator");
+    AddOutput("Y", "Output of LeakyRelu operator");
+    AddComment(
+        "LeakyRelu activation operator, "
+        "leaky_relu = max(x, alpha * x)");
+    AddAttr<AttrType>("alpha", "The small negative slope")
+        .SetDefault(static_cast<AttrType>(0.02f));
+  }
+};
+
 class TanhOpMaker : public framework::OpProtoAndCheckerMaker {
  public:
   TanhOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
@@ -240,6 +256,9 @@ REGISTER_OP(softsign, ops::ActivationOp, ops::SoftsignOpMaker, softsign_grad,
 REGISTER_OP(brelu, ops::ActivationOp, ops::BReluOpMaker<float>, brelu_grad,
             ops::ActivationOpGrad);
 
+REGISTER_OP(leaky_relu, ops::ActivationOp, ops::LeakyReluOpMaker<float>,
+            leaky_relu_grad, ops::ActivationOpGrad);
+
 REGISTER_OP(soft_relu, ops::ActivationOp, ops::SoftReluOpMaker<float>,
             soft_relu_grad, ops::ActivationOpGrad);
 
diff --git a/paddle/operators/activation_op.h b/paddle/operators/activation_op.h
index ff35c2d97e..dae66cc77d 100644
--- a/paddle/operators/activation_op.h
+++ b/paddle/operators/activation_op.h
@@ -309,6 +309,33 @@ struct SoftReluGradFunctor : public BaseActivationFunctor<T> {
   }
 };
 
+template <typename T>
+struct LeakyReluFunctor : public BaseActivationFunctor<T> {
+  float alpha;
+  typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
+    return {{"alpha", &alpha}};
+  }
+
+  template <typename Device, typename X, typename Y>
+  void operator()(Device d, X x, Y y) const {
+    y.device(d) = x.cwiseMax(alpha * x);
+  }
+};
+
+template <typename T>
+struct LeakyReluGradFunctor : public BaseActivationFunctor<T> {
+  float alpha;
+  typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
+    return {{"alpha", &alpha}};
+  }
+  template <typename Device, typename X, typename Y, typename dY, typename dX>
+  void operator()(Device d, X x, Y y, dY dy, dX dx) const {
+    auto temp1 = alpha * (x < static_cast<T>(0)).template cast<T>().eval();
+    auto temp2 = (x >= static_cast<T>(0)).template cast<T>().eval();
+    dx.device(d) = dy * (temp1 + temp2).template cast<T>();
+  }
+};
+
 template <typename T>
 struct PowFunctor : public BaseActivationFunctor<T> {
   float factor;
@@ -379,4 +406,5 @@ struct STanhGradFunctor : public BaseActivationFunctor<T> {
   __macro(soft_relu, SoftReluFunctor, SoftReluGradFunctor);      \
   __macro(pow, PowFunctor, PowGradFunctor);                      \
   __macro(stanh, STanhFunctor, STanhGradFunctor);                \
-  __macro(softsign, SoftsignFunctor, SoftsignGradFunctor)
+  __macro(softsign, SoftsignFunctor, SoftsignGradFunctor);       \
+  __macro(leaky_relu, LeakyReluFunctor, LeakyReluGradFunctor)
diff --git a/python/paddle/v2/framework/tests/test_activation_op.py b/python/paddle/v2/framework/tests/test_activation_op.py
index c44eb84906..ce6dec7748 100644
--- a/python/paddle/v2/framework/tests/test_activation_op.py
+++ b/python/paddle/v2/framework/tests/test_activation_op.py
@@ -122,6 +122,23 @@ class TestBRelu(OpTest):
         self.check_grad(['X'], 'Y', max_relative_error=0.02)
 
 
+class TestLeakyRelu(OpTest):
+    def setUp(self):
+        self.op_type = "leaky_relu"
+        alpha = 0.02
+        self.attrs = {'alpha': alpha}
+        self.inputs = {'X': np.random.uniform(-3, 3, [4, 4]).astype("float32")}
+        self.outputs = {
+            'Y': np.maximum(self.inputs['X'], alpha * self.inputs['X'])
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(['X'], 'Y', max_relative_error=0.008)
+
+
 class TestSoftRelu(OpTest):
     def setUp(self):
         self.op_type = "soft_relu"

From 47c994be07d608471734d5d14f980d83f2e0a7a6 Mon Sep 17 00:00:00 2001
From: Kavya Srinet <kavyasrinet@baidu.com>
Date: Thu, 5 Oct 2017 08:50:51 -0700
Subject: [PATCH 13/21] Updated the reltive error

---
 python/paddle/v2/framework/tests/test_activation_op.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/paddle/v2/framework/tests/test_activation_op.py b/python/paddle/v2/framework/tests/test_activation_op.py
index ce6dec7748..f232996a55 100644
--- a/python/paddle/v2/framework/tests/test_activation_op.py
+++ b/python/paddle/v2/framework/tests/test_activation_op.py
@@ -136,7 +136,7 @@ class TestLeakyRelu(OpTest):
         self.check_output()
 
     def test_check_grad(self):
-        self.check_grad(['X'], 'Y', max_relative_error=0.008)
+        self.check_grad(['X'], 'Y', max_relative_error=0.007)
 
 
 class TestSoftRelu(OpTest):

From 60af56c1b8e60240238d877c093bf9c99706fefe Mon Sep 17 00:00:00 2001
From: Kavya Srinet <kavyasrinet@baidu.com>
Date: Wed, 4 Oct 2017 19:02:22 -0700
Subject: [PATCH 14/21] Added Leaky Relu activation

---
 paddle/operators/activation_op.cc             | 19 ++++++++++++
 paddle/operators/activation_op.h              | 30 ++++++++++++++++++-
 .../v2/framework/tests/test_activation_op.py  | 17 +++++++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/paddle/operators/activation_op.cc b/paddle/operators/activation_op.cc
index 7ae4d2f6b6..5f2ecc2673 100644
--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -69,6 +69,22 @@ class ReluOpMaker : public framework::OpProtoAndCheckerMaker {
   }
 };
 
+template <typename AttrType>
+class LeakyReluOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  LeakyReluOpMaker(framework::OpProto *proto,
+                   framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "Input of LeakyRelu operator");
+    AddOutput("Y", "Output of LeakyRelu operator");
+    AddComment(
+        "LeakyRelu activation operator, "
+        "leaky_relu = max(x, alpha * x)");
+    AddAttr<AttrType>("alpha", "The small negative slope")
+        .SetDefault(static_cast<AttrType>(0.02f));
+  }
+};
+
 class TanhOpMaker : public framework::OpProtoAndCheckerMaker {
  public:
   TanhOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
@@ -240,6 +256,9 @@ REGISTER_OP(softsign, ops::ActivationOp, ops::SoftsignOpMaker, softsign_grad,
 REGISTER_OP(brelu, ops::ActivationOp, ops::BReluOpMaker<float>, brelu_grad,
             ops::ActivationOpGrad);
 
+REGISTER_OP(leaky_relu, ops::ActivationOp, ops::LeakyReluOpMaker<float>,
+            leaky_relu_grad, ops::ActivationOpGrad);
+
 REGISTER_OP(soft_relu, ops::ActivationOp, ops::SoftReluOpMaker<float>,
             soft_relu_grad, ops::ActivationOpGrad);
 
diff --git a/paddle/operators/activation_op.h b/paddle/operators/activation_op.h
index ff35c2d97e..dae66cc77d 100644
--- a/paddle/operators/activation_op.h
+++ b/paddle/operators/activation_op.h
@@ -309,6 +309,33 @@ struct SoftReluGradFunctor : public BaseActivationFunctor<T> {
   }
 };
 
+template <typename T>
+struct LeakyReluFunctor : public BaseActivationFunctor<T> {
+  float alpha;
+  typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
+    return {{"alpha", &alpha}};
+  }
+
+  template <typename Device, typename X, typename Y>
+  void operator()(Device d, X x, Y y) const {
+    y.device(d) = x.cwiseMax(alpha * x);
+  }
+};
+
+template <typename T>
+struct LeakyReluGradFunctor : public BaseActivationFunctor<T> {
+  float alpha;
+  typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
+    return {{"alpha", &alpha}};
+  }
+  template <typename Device, typename X, typename Y, typename dY, typename dX>
+  void operator()(Device d, X x, Y y, dY dy, dX dx) const {
+    auto temp1 = alpha * (x < static_cast<T>(0)).template cast<T>().eval();
+    auto temp2 = (x >= static_cast<T>(0)).template cast<T>().eval();
+    dx.device(d) = dy * (temp1 + temp2).template cast<T>();
+  }
+};
+
 template <typename T>
 struct PowFunctor : public BaseActivationFunctor<T> {
   float factor;
@@ -379,4 +406,5 @@ struct STanhGradFunctor : public BaseActivationFunctor<T> {
   __macro(soft_relu, SoftReluFunctor, SoftReluGradFunctor);      \
   __macro(pow, PowFunctor, PowGradFunctor);                      \
   __macro(stanh, STanhFunctor, STanhGradFunctor);                \
-  __macro(softsign, SoftsignFunctor, SoftsignGradFunctor)
+  __macro(softsign, SoftsignFunctor, SoftsignGradFunctor);       \
+  __macro(leaky_relu, LeakyReluFunctor, LeakyReluGradFunctor)
diff --git a/python/paddle/v2/framework/tests/test_activation_op.py b/python/paddle/v2/framework/tests/test_activation_op.py
index c44eb84906..ce6dec7748 100644
--- a/python/paddle/v2/framework/tests/test_activation_op.py
+++ b/python/paddle/v2/framework/tests/test_activation_op.py
@@ -122,6 +122,23 @@ class TestBRelu(OpTest):
         self.check_grad(['X'], 'Y', max_relative_error=0.02)
 
 
+class TestLeakyRelu(OpTest):
+    def setUp(self):
+        self.op_type = "leaky_relu"
+        alpha = 0.02
+        self.attrs = {'alpha': alpha}
+        self.inputs = {'X': np.random.uniform(-3, 3, [4, 4]).astype("float32")}
+        self.outputs = {
+            'Y': np.maximum(self.inputs['X'], alpha * self.inputs['X'])
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(['X'], 'Y', max_relative_error=0.008)
+
+
 class TestSoftRelu(OpTest):
     def setUp(self):
         self.op_type = "soft_relu"

From 11070e5f36be24be8da6fa2b70a2dae13212c513 Mon Sep 17 00:00:00 2001
From: Kavya Srinet <kavyasrinet@baidu.com>
Date: Thu, 5 Oct 2017 08:50:51 -0700
Subject: [PATCH 15/21] Updated the reltive error

---
 python/paddle/v2/framework/tests/test_activation_op.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/paddle/v2/framework/tests/test_activation_op.py b/python/paddle/v2/framework/tests/test_activation_op.py
index ce6dec7748..f232996a55 100644
--- a/python/paddle/v2/framework/tests/test_activation_op.py
+++ b/python/paddle/v2/framework/tests/test_activation_op.py
@@ -136,7 +136,7 @@ class TestLeakyRelu(OpTest):
         self.check_output()
 
     def test_check_grad(self):
-        self.check_grad(['X'], 'Y', max_relative_error=0.008)
+        self.check_grad(['X'], 'Y', max_relative_error=0.007)
 
 
 class TestSoftRelu(OpTest):

From 828c5b3e1dd3c80b955a1c65179ca6d5a27d852d Mon Sep 17 00:00:00 2001
From: Abhinav Arora <abhinavarora28@gmail.com>
Date: Thu, 5 Oct 2017 13:07:55 -0700
Subject: [PATCH 16/21] Adding Adadelta optimization operator (#4576)

* Adding Adadelta optimization operator
* Making inputs and outputs conform to naming convention
* Removing type alias from header files
* Fixing Adadelta documentation in comments
* Addressing code review feedback
---
 paddle/operators/adadelta_op.cc               | 115 ++++++++++++++++++
 paddle/operators/adadelta_op.cu               |  20 +++
 paddle/operators/adadelta_op.h                |  69 +++++++++++
 .../v2/framework/tests/test_adadelta_op.py    |  96 +++++++++++++++
 4 files changed, 300 insertions(+)
 create mode 100644 paddle/operators/adadelta_op.cc
 create mode 100644 paddle/operators/adadelta_op.cu
 create mode 100644 paddle/operators/adadelta_op.h
 create mode 100644 python/paddle/v2/framework/tests/test_adadelta_op.py

diff --git a/paddle/operators/adadelta_op.cc b/paddle/operators/adadelta_op.cc
new file mode 100644
index 0000000000..bd8c93b4a1
--- /dev/null
+++ b/paddle/operators/adadelta_op.cc
@@ -0,0 +1,115 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/adadelta_op.h"
+
+namespace paddle {
+namespace operators {
+
+class AdadeltaOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContextBase *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Param"),
+                   "Input(Param) of AdadeltaOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Grad"),
+                   "Input(Grad) of AdadeltaOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("AvgSquaredGrad"),
+                   "Input(AvgSquaredGrad) of AdadeltaOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("AvgSquaredUpdate"),
+                   "Input(AvgSquaredUpdate) of AdadeltaOp should not be null.");
+
+    PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
+                   "Output(ParamOut) of AdadeltaOp should not be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("AvgSquaredGradOut"),
+        "Output(AvgSquaredGradOut) of AdadeltaOp should not be null.");
+    PADDLE_ENFORCE(
+        ctx->HasOutput("AvgSquaredUpdateOut"),
+        "Output(AvgSquaredUpdateOut) of AdadeltaOp should not be null.");
+
+    auto param_dim = ctx->GetInputDim("Param");
+    PADDLE_ENFORCE_EQ(
+        param_dim, ctx->GetInputDim("Grad"),
+        "param and grad input of AdadeltaOp should have same dimension");
+    PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredGrad"),
+                      "Param and AvgSquaredGrad input of AdadeltaOp "
+                      "should have same dimension");
+    PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredUpdate"),
+                      "Param and AvgSquaredUpdate input of AdadeltaOp "
+                      "should have same dimension");
+
+    ctx->SetOutputDim("ParamOut", param_dim);
+    ctx->SetOutputDim("AvgSquaredGradOut", param_dim);
+    ctx->SetOutputDim("AvgSquaredUpdateOut", param_dim);
+  }
+};
+
+class AdadeltaOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  AdadeltaOpMaker(framework::OpProto *proto,
+                  framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Param", "(Tensor) Input parameter");
+    AddInput("Grad", "(Tensor) Input gradient");
+    AddInput("AvgSquaredGrad",
+             "(Tensor) Input expectation of squared gradient");
+    AddInput("AvgSquaredUpdate",
+             "(Tensor) Input expectation of squared parameter updates");
+
+    AddOutput("ParamOut", "(Tensor) Output parameter");
+    AddOutput("AvgSquaredGradOut",
+              "(Tensor) Output expectation of squared gradient");
+    AddOutput("AvgSquaredUpdateOut",
+              "(Tensor) Output expectation of squared parameter updates");
+
+    AddAttr<float>("rho",
+                   "(float, default 0.95) Exponential decay rate "
+                   "for squared gradients.")
+        .SetDefault(0.95f);
+    AddAttr<float>("epsilon",
+                   "(float, default 1.0e-6) Constant for "
+                   "numerical stability")
+        .SetDefault(1.0e-6f);
+    AddComment(R"DOC(
+Adadelta Updates Operator.
+
+This implements the Adadelta optimizer[1]. Adadelta is a per-dimension
+adaptive learning rate method for gradient descent.
+
+Adadelta updates:
+
+avg_squared_grad_out = rho * avg_squared_grad + (1 - rho) * grad * grad
+param_update =  - sqrt((avg_squared_update + epsilon) /
+                       (avg_squared_grad_out + epsilon)) * grad
+avg_squared_update_out = rho * avg_squared_update + (1 - rho) * param_update**2
+param_out = param + param_update
+
+References:
+  [1] ADADELTA: An Adaptive Learning Rate Method
+      https://arxiv.org/abs/1212.5701
+
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(adadelta, ops::AdadeltaOp, ops::AdadeltaOpMaker);
+REGISTER_OP_CPU_KERNEL(
+    adadelta, ops::AdadeltaOpKernel<paddle::platform::CPUPlace, float>);
diff --git a/paddle/operators/adadelta_op.cu b/paddle/operators/adadelta_op.cu
new file mode 100644
index 0000000000..3af1c8c8e9
--- /dev/null
+++ b/paddle/operators/adadelta_op.cu
@@ -0,0 +1,20 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#define EIGEN_USE_GPU
+#include "paddle/operators/adadelta_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    adadelta, ops::AdadeltaOpKernel<paddle::platform::GPUPlace, float>);
diff --git a/paddle/operators/adadelta_op.h b/paddle/operators/adadelta_op.h
new file mode 100644
index 0000000000..d29e15c435
--- /dev/null
+++ b/paddle/operators/adadelta_op.h
@@ -0,0 +1,69 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename Place, typename T>
+class AdadeltaOpKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
+    auto avg_squared_grad_out_tensor =
+        ctx.Output<framework::Tensor>("AvgSquaredGradOut");
+    auto avg_squared_update_out_tensor =
+        ctx.Output<framework::Tensor>("AvgSquaredUpdateOut");
+
+    param_out_tensor->mutable_data<T>(ctx.GetPlace());
+    avg_squared_grad_out_tensor->mutable_data<T>(ctx.GetPlace());
+    avg_squared_update_out_tensor->mutable_data<T>(ctx.GetPlace());
+
+    float rho = ctx.Attr<float>("rho");
+    float epsilon = ctx.Attr<float>("epsilon");
+
+    auto param = framework::EigenVector<T>::Flatten(
+        *ctx.Input<framework::Tensor>("Param"));
+    auto grad = framework::EigenVector<T>::Flatten(
+        *ctx.Input<framework::Tensor>("Grad"));
+    // Squared gradient accumulator
+    auto avg_squared_grad = framework::EigenVector<T>::Flatten(
+        *ctx.Input<framework::Tensor>("AvgSquaredGrad"));
+    // Squared updates accumulator
+    auto avg_squared_update = framework::EigenVector<T>::Flatten(
+        *ctx.Input<framework::Tensor>("AvgSquaredUpdate"));
+    auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
+    auto avg_squared_grad_out =
+        framework::EigenVector<T>::Flatten(*avg_squared_grad_out_tensor);
+    auto avg_squared_update_out =
+        framework::EigenVector<T>::Flatten(*avg_squared_update_out_tensor);
+    auto place = ctx.GetEigenDevice<Place>();
+
+    avg_squared_grad_out.device(place) =
+        rho * avg_squared_grad + (1 - rho) * grad.square();
+    auto update =
+        -((avg_squared_update + epsilon) / (avg_squared_grad_out + epsilon))
+             .sqrt() *
+        grad;
+    avg_squared_update_out.device(place) =
+        rho * avg_squared_update + (1 - rho) * update.square();
+    param_out.device(place) = param + update;
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
diff --git a/python/paddle/v2/framework/tests/test_adadelta_op.py b/python/paddle/v2/framework/tests/test_adadelta_op.py
new file mode 100644
index 0000000000..7105593a98
--- /dev/null
+++ b/python/paddle/v2/framework/tests/test_adadelta_op.py
@@ -0,0 +1,96 @@
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestAdadeltaOp1(OpTest):
+    def setUp(self):
+        self.op_type = "adadelta"
+        param = np.random.uniform(-1, 1, (102, 105)).astype("float32")
+        grad = np.random.uniform(-1, 1, (102, 105)).astype("float32")
+        # The squared gradient is positive
+        avg_squared_grad = np.random.random((102, 105)).astype("float32")
+        # The squared update is positive
+        avg_squared_update = np.random.random((102, 105)).astype("float32")
+
+        rho = 0.95
+        epsilon = 1e-6
+
+        self.inputs = {
+            'Param': param,
+            'Grad': grad,
+            'AvgSquaredGrad': avg_squared_grad,
+            'AvgSquaredUpdate': avg_squared_update
+        }
+
+        self.attrs = {'rho': rho, 'epsilon': epsilon}
+
+        avg_squared_grad_out = rho * avg_squared_grad + \
+            (1 - rho) * np.square(grad)
+        update = -np.multiply(
+            np.sqrt(
+                np.divide(avg_squared_update + epsilon, avg_squared_grad_out +
+                          epsilon)), grad)
+
+        avg_squared_update_out = rho * avg_squared_update + \
+            (1 - rho) * np.square(update)
+
+        param_out = param + update
+
+        self.outputs = {
+            'ParamOut': param_out,
+            'AvgSquaredGradOut': avg_squared_grad_out,
+            'AvgSquaredUpdateOut': avg_squared_update_out
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+class TestAdadeltaOp2(OpTest):
+    '''Test Adadelta op with default attribute values
+    '''
+
+    def setUp(self):
+        self.op_type = "adadelta"
+        param = np.random.uniform(-1, 1, (102, 105)).astype("float32")
+        grad = np.random.uniform(-1, 1, (102, 105)).astype("float32")
+        # The squared gradient is positive
+        avg_squared_grad = np.random.random((102, 105)).astype("float32")
+        # The squared update is positive
+        avg_squared_update = np.random.random((102, 105)).astype("float32")
+
+        rho = 0.95
+        epsilon = 1e-6
+
+        self.inputs = {
+            'Param': param,
+            'Grad': grad,
+            'AvgSquaredGrad': avg_squared_grad,
+            'AvgSquaredUpdate': avg_squared_update
+        }
+
+        avg_squared_grad_out = rho * avg_squared_grad + \
+            (1 - rho) * np.square(grad)
+        update = -np.multiply(
+            np.sqrt(
+                np.divide(avg_squared_update + epsilon, avg_squared_grad_out +
+                          epsilon)), grad)
+
+        avg_squared_update_out = rho * avg_squared_update + \
+            (1 - rho) * np.square(update)
+
+        param_out = param + update
+
+        self.outputs = {
+            'ParamOut': param_out,
+            'AvgSquaredGradOut': avg_squared_grad_out,
+            'AvgSquaredUpdateOut': avg_squared_update_out
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+
+if __name__ == "__main__":
+    unittest.main()

From 45c4dcaabb4cbf140384dcffe3392d2e10b2a6d7 Mon Sep 17 00:00:00 2001
From: qijun <qijun1994@hotmail.com>
Date: Thu, 5 Oct 2017 15:54:44 -0700
Subject: [PATCH 17/21] add fetch operator

---
 paddle/framework/executor.cc      | 18 ++++----
 paddle/framework/executor_test.cc | 67 ++++++++++++++++++++++++++++++
 paddle/framework/scope.cc         |  5 ++-
 paddle/operators/activation_op.cu | 18 ++++----
 paddle/operators/feed_op.cc       |  6 +--
 paddle/operators/fetch_op.cc      | 68 +++++++++++++++++++++++++++++++
 paddle/operators/fetch_op.cu      | 18 ++++++++
 paddle/operators/fetch_op.h       | 40 ++++++++++++++++++
 8 files changed, 218 insertions(+), 22 deletions(-)
 create mode 100644 paddle/operators/fetch_op.cc
 create mode 100644 paddle/operators/fetch_op.cu
 create mode 100644 paddle/operators/fetch_op.h

diff --git a/paddle/framework/executor.cc b/paddle/framework/executor.cc
index aafef12554..51ddb7e58e 100644
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -75,15 +75,15 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope) {
     device_context->Wait();
   }
   // // print tensor value
-  for (auto& var : block.vars()) {
-    std::cout << var.name() << std::endl;
-    auto v = scope->FindVar(var.name());
-    const LoDTensor& t = v->Get<LoDTensor>();
-    for (int i = 0; i < t.numel(); ++i) {
-      std::cout << t.data<float>()[i] << " ";
-    }
-    std::cout << std::endl;
-  }
+  // for (auto& var : block.vars()) {
+  //   std::cout << var.name() << std::endl;
+  //   auto v = scope->FindVar(var.name());
+  //   const LoDTensor& t = v->Get<LoDTensor>();
+  //   for (int i = 0; i < t.numel(); ++i) {
+  //     std::cout << t.data<float>()[i] << " ";
+  //   }
+  //   std::cout << std::endl;
+  // }
 }
 
 }  // namespace framework
diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
index 0856d1f32e..980f5f579c 100644
--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
@@ -25,6 +25,7 @@ limitations under the License. */
 USE_OP(elementwise_add);
 USE_OP(gaussian_random);
 USE_OP(feed);
+USE_OP(fetch);
 
 using std::string;
 using namespace paddle::platform;
@@ -94,6 +95,41 @@ void add_feed_op(string var_name, int index, proto_block* block) {
   Out->add_arguments(var_name);
 }
 
+void add_fetch_op(string var_name, int index, proto_block* block) {
+  std::vector<int> dim{3};
+
+  // insert variable
+  auto a = block->add_vars();
+  a->set_name(var_name);
+  auto a_lt = a->mutable_lod_tensor();
+  a_lt->set_data_type(paddle::framework::DataType::FP32);
+  for (int i : dim) {
+    a_lt->add_dims(i);
+  }
+
+  // insert operation
+  auto op = block->add_ops();
+  op->set_type("fetch");
+
+  // set dims attr
+  auto dims = op->add_attrs();
+  dims->set_name("dims");
+  dims->set_type(paddle::framework::AttrType::INTS);
+  for (int i : dim) {
+    dims->add_ints(i);
+  }
+
+  // set col attr
+  auto col = op->add_attrs();
+  col->set_name("col");
+  col->set_type(paddle::framework::AttrType::INT);
+  col->set_i(index);
+
+  auto Out = op->add_inputs();
+  Out->set_parameter("Input");
+  Out->add_arguments(var_name);
+}
+
 std::once_flag set_variable_flag;
 
 template <typename T>
@@ -119,6 +155,27 @@ void set_feed_variable(const std::vector<std::vector<T>>& inputs) {
   }
 }
 
+template <typename T>
+std::vector<std::vector<T>> get_fetch_variable() {
+  typedef std::vector<paddle::framework::Tensor> FetchOutputs;
+  Variable* g_fetch_value = GetScope()->FindVar("fetch_value");
+  FetchOutputs& fetch_outputs = *(g_fetch_value->GetMutable<FetchOutputs>());
+  auto size = fetch_outputs.size();
+
+  std::vector<std::vector<T>> result;
+  result.reserve(size);
+
+  for (size_t i = 0; i < size; i++) {
+    std::vector<T> tmp;
+    tmp.reserve(fetch_outputs[i].numel());
+    memcpy(tmp.data(), fetch_outputs[i].data<T>(),
+           fetch_outputs[i].numel() * sizeof(T));
+    result.push_back(tmp);
+  }
+
+  return result;
+}
+
 class ExecutorTesterRandom : public ::testing::Test {
  public:
   virtual void SetUp() override {
@@ -181,6 +238,8 @@ class ExecutorTesterFeed : public ::testing::Test {
     Out->set_parameter("Out");
     Out->add_arguments("c");
 
+    add_fetch_op("c", 0, root_block);
+
     std::vector<float> vec1 = {1.0, 2.0, 3.0};
     std::vector<float> vec2 = {4.0, 5.0, 6.0};
     inputs_.push_back(vec1);
@@ -213,8 +272,16 @@ TEST_F(ExecutorTesterFeed, CPU) {
   // 3 mini-batch
   for (int i = 0; i < 3; i++) {
     // need to set feed variable before Executor::Run
+    std::cout << "start mini-batch " << i << std::endl;
     set_feed_variable<float>(inputs_);
     executor->Run(pdesc_, GetScope());
+    std::vector<std::vector<float>> result = get_fetch_variable<float>();
+    for (auto& vec : result) {
+      for (auto& num : vec) {
+        std::cout << num << " ";
+      }
+      std::cout << std::endl;
+    }
   }
 
   delete executor;
diff --git a/paddle/framework/scope.cc b/paddle/framework/scope.cc
index b04120abf2..2c416570cf 100644
--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -74,7 +74,10 @@ std::unique_ptr<T> make_unique(Args&&... args) {
 framework::Scope* GetScope() {
   static std::unique_ptr<framework::Scope> g_scope =
       make_unique<framework::Scope>();
-  std::call_once(feed_variable_flag, [&]() { g_scope->NewVar("feed_value"); });
+  std::call_once(feed_variable_flag, [&]() {
+    g_scope->NewVar("feed_value");
+    g_scope->NewVar("fetch_value");
+  });
   return g_scope.get();
 }
 
diff --git a/paddle/operators/activation_op.cu b/paddle/operators/activation_op.cu
index 44a6aaf9cb..93e9f1c694 100644
--- a/paddle/operators/activation_op.cu
+++ b/paddle/operators/activation_op.cu
@@ -1,16 +1,16 @@
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
 
-http://www.apache.org/licenses/LICENSE-2.0
+   http://www.apache.org/licenses/LICENSE-2.0
 
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
 
 #define EIGEN_USE_GPU
 #include "paddle/operators/activation_op.h"
diff --git a/paddle/operators/feed_op.cc b/paddle/operators/feed_op.cc
index 5ae882bc8a..a61855cb99 100644
--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
@@ -49,9 +49,9 @@ class FeedOpMaker : public framework::OpProtoAndCheckerMaker {
     AddAttr<int>("data_type", "output data type")
         .SetDefault(framework::DataType::FP32);
     AddAttr<int>("col", "The col in global feed variable").SetDefault(0);
-    AddAttr<std::vector<int>>("dims", "The dimension of random tensor.");
-    AddOutput("Out", "The output of dropout op.");
-    AddComment(R"DOC(Feed data to global feed variable)DOC");
+    AddAttr<std::vector<int>>("dims", "The dimension of feed tensor.");
+    AddOutput("Out", "The output of feed op.");
+    AddComment(R"DOC(Feed data from global feed variable)DOC");
   }
 };
 
diff --git a/paddle/operators/fetch_op.cc b/paddle/operators/fetch_op.cc
new file mode 100644
index 0000000000..68e8d26dbe
--- /dev/null
+++ b/paddle/operators/fetch_op.cc
@@ -0,0 +1,68 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/fetch_op.h"
+
+namespace paddle {
+namespace operators {
+
+class FetchOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(framework::InferShapeContextBase* ctx) const override {
+    typedef std::vector<framework::Tensor> FetchOutputs;
+    PADDLE_ENFORCE(ctx->HasInput("Input"), "Input should be not null.");
+    int col = ctx->Attrs().Get<int>("col");
+    framework::Variable* g_fetch_variable =
+        framework::GetScope()->FindVar("fetch_value");
+
+    FetchOutputs* tensors = g_fetch_variable->GetMutable<FetchOutputs>();
+    if (tensors->size() < col) {
+      tensors->resize(col);
+    }
+
+    auto input_dim = ctx->GetInputDim("Input");
+    framework::Tensor tmp;
+    tmp.Resize(input_dim);
+    (*tensors)[col].Resize(input_dim);
+    // need to handle LodTensor later
+  }
+
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return static_cast<framework::DataType>(Attr<int>("data_type"));
+  }
+};
+
+class FetchOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FetchOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddAttr<int>("data_type", "output data type")
+        .SetDefault(framework::DataType::FP32);
+    AddAttr<int>("col", "The col in global fetch variable").SetDefault(0);
+    AddAttr<std::vector<int>>("dims", "The dimension of fetch tensor.");
+    AddInput("Input", "The output of fetch op.");
+    AddComment(R"DOC(Fetch data to global fetch variable)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(fetch, ops::FetchOp, ops::FetchOpMaker);
+REGISTER_OP_CPU_KERNEL(fetch, ops::FetchKernel<float>);
diff --git a/paddle/operators/fetch_op.cu b/paddle/operators/fetch_op.cu
new file mode 100644
index 0000000000..2e24d3a8ad
--- /dev/null
+++ b/paddle/operators/fetch_op.cu
@@ -0,0 +1,18 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/feed_op.h"
+
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(fetch, ops::FetchKernel<float>);
diff --git a/paddle/operators/fetch_op.h b/paddle/operators/fetch_op.h
new file mode 100644
index 0000000000..95e7986a22
--- /dev/null
+++ b/paddle/operators/fetch_op.h
@@ -0,0 +1,40 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+using Tensor = framework::Tensor;
+
+template <typename T>
+class FetchKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    typedef std::vector<framework::Tensor> FetchOutputs;
+    Tensor* input = ctx.Output<Tensor>("Input");
+    int col = ctx.template Attr<int>("col");
+    framework::Variable* g_fetch_variable =
+        framework::GetScope()->FindVar("fetch_value");
+    FetchOutputs tensors = g_fetch_variable->Get<FetchOutputs>();
+    tensors[col].mutable_data<T>(platform::CPUPlace());
+    tensors[col].CopyFrom<T>(*input, platform::CPUPlace());
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle

From 48b080db9fcc4f34535c98878112e6633d6d8d7d Mon Sep 17 00:00:00 2001
From: qijun <qijun1994@hotmail.com>
Date: Thu, 5 Oct 2017 20:48:04 -0700
Subject: [PATCH 18/21] ensure global BuddyAllocator is initialized before
 global Scope

---
 paddle/framework/executor_test.cc | 94 +++++++++++++++++--------------
 paddle/operators/feed_op.cc       |  4 +-
 paddle/operators/feed_op.h        |  2 +-
 paddle/operators/fetch_op.cc      |  7 ++-
 paddle/operators/fetch_op.h       |  8 +--
 5 files changed, 62 insertions(+), 53 deletions(-)

diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
index 980f5f579c..d3ea18d154 100644
--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
@@ -13,8 +13,6 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include "paddle/framework/executor.h"
-#include <memory>  // for unique_ptr
-#include <mutex>   // for call_once
 #include <vector>
 #include "gtest/gtest.h"
 #include "paddle/framework/attribute.h"
@@ -34,9 +32,8 @@ using namespace paddle::framework;
 typedef paddle::framework::BlockDesc proto_block;
 typedef paddle::framework::OpDesc proto_op;
 
-void add_gaussian_random_op(string var_name, proto_block* block) {
-  std::vector<int> dim{2, 3};
-
+void add_gaussian_random_op(string var_name, std::vector<int>& dim,
+                            proto_block* block) {
   // insert variable
   auto a = block->add_vars();
   a->set_name(var_name);
@@ -60,9 +57,8 @@ void add_gaussian_random_op(string var_name, proto_block* block) {
   Out->add_arguments(var_name);
 }
 
-void add_feed_op(string var_name, int index, proto_block* block) {
-  std::vector<int> dim{3};
-
+void add_feed_op(string var_name, std::vector<int>& dim, int index,
+                 proto_block* block) {
   // insert variable
   auto a = block->add_vars();
   a->set_name(var_name);
@@ -95,9 +91,8 @@ void add_feed_op(string var_name, int index, proto_block* block) {
   Out->add_arguments(var_name);
 }
 
-void add_fetch_op(string var_name, int index, proto_block* block) {
-  std::vector<int> dim{3};
-
+void add_fetch_op(string var_name, std::vector<int>& dim, int index,
+                  proto_block* block) {
   // insert variable
   auto a = block->add_vars();
   a->set_name(var_name);
@@ -138,20 +133,11 @@ void set_feed_variable(const std::vector<std::vector<T>>& inputs) {
   Variable* g_feed_value = GetScope()->FindVar("feed_value");
   FeedInputs& feed_inputs = *(g_feed_value->GetMutable<FeedInputs>());
   auto size = inputs.size();
-
-  std::call_once(set_variable_flag, [&]() {
-    feed_inputs.reserve(size);
-    for (size_t i = 0; i < size; i++) {
-      paddle::framework::Tensor tmp;
-      tmp.mutable_data<T>(make_ddim({static_cast<int64_t>(inputs[i].size())}),
-                          CPUPlace());
-      feed_inputs.push_back(tmp);
-    }
-  });
-
+  feed_inputs.resize(size);
   for (size_t i = 0; i < size; i++) {
-    memcpy(feed_inputs[i].data<T>(), inputs[i].data(),
-           inputs[i].size() * sizeof(T));
+    T* dst = feed_inputs[i].mutable_data<T>(
+        make_ddim({static_cast<int64_t>(inputs[i].size())}), CPUPlace());
+    memcpy(dst, inputs[i].data(), inputs[i].size() * sizeof(T));
   }
 }
 
@@ -160,19 +146,17 @@ std::vector<std::vector<T>> get_fetch_variable() {
   typedef std::vector<paddle::framework::Tensor> FetchOutputs;
   Variable* g_fetch_value = GetScope()->FindVar("fetch_value");
   FetchOutputs& fetch_outputs = *(g_fetch_value->GetMutable<FetchOutputs>());
-  auto size = fetch_outputs.size();
 
+  auto size = fetch_outputs.size();
   std::vector<std::vector<T>> result;
   result.reserve(size);
-
   for (size_t i = 0; i < size; i++) {
     std::vector<T> tmp;
-    tmp.reserve(fetch_outputs[i].numel());
+    tmp.resize(fetch_outputs[i].numel());
     memcpy(tmp.data(), fetch_outputs[i].data<T>(),
            fetch_outputs[i].numel() * sizeof(T));
     result.push_back(tmp);
   }
-
   return result;
 }
 
@@ -183,8 +167,9 @@ class ExecutorTesterRandom : public ::testing::Test {
     root_block->set_idx(0);
     root_block->set_parent_idx(-1);
 
-    add_gaussian_random_op("a", root_block);
-    add_gaussian_random_op("b", root_block);
+    std::vector<int> dim{2, 3};
+    add_gaussian_random_op("a", dim, root_block);
+    add_gaussian_random_op("b", dim, root_block);
 
     auto c = root_block->add_vars();
     c->set_name("c");
@@ -203,12 +188,11 @@ class ExecutorTesterRandom : public ::testing::Test {
     Out->set_parameter("Out");
     Out->add_arguments("c");
 
-    scope_ = GetScope();
+    add_fetch_op("c", dim, 0, root_block);
   }
 
  protected:
   ProgramDesc pdesc_;
-  Scope* scope_;
 };
 
 class ExecutorTesterFeed : public ::testing::Test {
@@ -218,8 +202,10 @@ class ExecutorTesterFeed : public ::testing::Test {
     root_block->set_idx(0);
     root_block->set_parent_idx(-1);
 
-    add_feed_op("a", 0, root_block);
-    add_feed_op("b", 1, root_block);
+    std::vector<int> dim{6};
+
+    add_feed_op("a", dim, 0, root_block);
+    add_feed_op("b", dim, 1, root_block);
 
     auto c = root_block->add_vars();
     c->set_name("c");
@@ -238,10 +224,10 @@ class ExecutorTesterFeed : public ::testing::Test {
     Out->set_parameter("Out");
     Out->add_arguments("c");
 
-    add_fetch_op("c", 0, root_block);
+    add_fetch_op("c", dim, 0, root_block);
 
-    std::vector<float> vec1 = {1.0, 2.0, 3.0};
-    std::vector<float> vec2 = {4.0, 5.0, 6.0};
+    std::vector<float> vec1 = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
+    std::vector<float> vec2 = {4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
     inputs_.push_back(vec1);
     inputs_.push_back(vec2);
   }
@@ -253,12 +239,24 @@ class ExecutorTesterFeed : public ::testing::Test {
 
 TEST_F(ExecutorTesterRandom, CPU) {
   std::vector<Place> places;
-  CPUPlace cpu_place1, cpu_place2;
-  places.push_back(cpu_place1);
-  places.push_back(cpu_place2);
+  CPUPlace cpu_place;
+  places.push_back(cpu_place);
+
+  // We have a global Scope and BuddyAllocator, and we must ensure
+  // global BuddyAllocator is initialized before global Scope. Thus,
+  // global Scope will deconstruct before BuddyAllocator. Otherwise,
+  // "pointer being freed was not allocated" error will appear.
+  paddle::memory::Used(cpu_place);
 
   Executor* executor = new Executor(places);
-  executor->Run(pdesc_, scope_);
+  executor->Run(pdesc_, GetScope());
+  std::vector<std::vector<float>> result = get_fetch_variable<float>();
+  for (auto& vec : result) {
+    for (auto& num : vec) {
+      std::cout << num << " ";
+    }
+    std::cout << std::endl;
+  }
   delete executor;
 }
 
@@ -267,6 +265,12 @@ TEST_F(ExecutorTesterFeed, CPU) {
   CPUPlace cpu_place;
   places.push_back(cpu_place);
 
+  // We have a global Scope and BuddyAllocator, and we must ensure
+  // global BuddyAllocator is initialized before global Scope. Thus,
+  // global Scope will deconstruct before BuddyAllocator. Otherwise,
+  // "pointer being freed was not allocated" error will appear.
+  paddle::memory::Used(cpu_place);
+
   Executor* executor = new Executor(places);
 
   // 3 mini-batch
@@ -293,8 +297,10 @@ TEST_F(ExecutorTesterRandom, GPU) {
   GPUPlace gpu_place(0);
   places.push_back(gpu_place);
 
+  paddle::memory::Used(gpu_place);
+
   Executor* executor = new Executor(places);
-  executor->Run(pdesc_, scope_);
+  executor->Run(pdesc_, GetScope());
   delete executor;
 }
 
@@ -303,11 +309,13 @@ TEST_F(ExecutorTesterFeed, GPU) {
   GPUPlace gpu_place(0);
   places.push_back(gpu_place);
 
+  paddle::memory::Used(gpu_place);
+
   Executor* executor = new Executor(places);
 
   // need to set feed variable before Executor::Run
   set_feed_variable<float>(inputs_);
-  executor->Run(pdesc_, scope_);
+  executor->Run(pdesc_, GetScope());
 
   delete executor;
 }
diff --git a/paddle/operators/feed_op.cc b/paddle/operators/feed_op.cc
index a61855cb99..d40db3ff2e 100644
--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
@@ -29,11 +29,11 @@ class FeedOp : public framework::OperatorWithKernel {
     framework::Variable* g_feed_variable =
         framework::GetScope()->FindVar("feed_value");
 
-    FeedInputs tensors = g_feed_variable->Get<FeedInputs>();
+    const FeedInputs& tensors = g_feed_variable->Get<FeedInputs>();
 
     auto in_dim = tensors[col].dims();
     ctx->SetOutputDim("Out", in_dim);
-    // need to handle LodTensor later
+    // TODO(qijun) need to handle LodTensor later
   }
 
   framework::DataType IndicateDataType(
diff --git a/paddle/operators/feed_op.h b/paddle/operators/feed_op.h
index 57781e205f..cf93b6f434 100644
--- a/paddle/operators/feed_op.h
+++ b/paddle/operators/feed_op.h
@@ -31,7 +31,7 @@ class FeedKernel : public framework::OpKernel<T> {
     framework::Variable* g_feed_variable =
         framework::GetScope()->FindVar("feed_value");
     int col = ctx.template Attr<int>("col");
-    FeedInputs tensors = g_feed_variable->Get<FeedInputs>();
+    const FeedInputs& tensors = g_feed_variable->Get<FeedInputs>();
     out->CopyFrom<T>(tensors[col], ctx.GetPlace());
   }
 };
diff --git a/paddle/operators/fetch_op.cc b/paddle/operators/fetch_op.cc
index 68e8d26dbe..a885deacc8 100644
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
@@ -30,15 +30,16 @@ class FetchOp : public framework::OperatorWithKernel {
         framework::GetScope()->FindVar("fetch_value");
 
     FetchOutputs* tensors = g_fetch_variable->GetMutable<FetchOutputs>();
-    if (tensors->size() < col) {
-      tensors->resize(col);
+    if (tensors->size() < static_cast<size_t>(col + 1)) {
+      tensors->resize(col + 1);
     }
 
     auto input_dim = ctx->GetInputDim("Input");
     framework::Tensor tmp;
     tmp.Resize(input_dim);
     (*tensors)[col].Resize(input_dim);
-    // need to handle LodTensor later
+
+    // TODO(qijun) need to handle LodTensor later
   }
 
   framework::DataType IndicateDataType(
diff --git a/paddle/operators/fetch_op.h b/paddle/operators/fetch_op.h
index 95e7986a22..e8d5e3a9c0 100644
--- a/paddle/operators/fetch_op.h
+++ b/paddle/operators/fetch_op.h
@@ -26,13 +26,13 @@ class FetchKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& ctx) const override {
     typedef std::vector<framework::Tensor> FetchOutputs;
-    Tensor* input = ctx.Output<Tensor>("Input");
+    const Tensor* input = ctx.Input<Tensor>("Input");
     int col = ctx.template Attr<int>("col");
     framework::Variable* g_fetch_variable =
         framework::GetScope()->FindVar("fetch_value");
-    FetchOutputs tensors = g_fetch_variable->Get<FetchOutputs>();
-    tensors[col].mutable_data<T>(platform::CPUPlace());
-    tensors[col].CopyFrom<T>(*input, platform::CPUPlace());
+    FetchOutputs* tensors = g_fetch_variable->GetMutable<FetchOutputs>();
+    (*tensors)[col].mutable_data<T>(platform::CPUPlace());
+    (*tensors)[col].CopyFrom<T>(*input, platform::CPUPlace());
   }
 };
 

From bbceb72398f23902fae2f011c2b6c7f2a8b7b8e3 Mon Sep 17 00:00:00 2001
From: qijun <qijun1994@hotmail.com>
Date: Thu, 5 Oct 2017 20:54:16 -0700
Subject: [PATCH 19/21] refine some codes

---
 paddle/framework/executor.cc      | 10 ----------
 paddle/framework/executor_test.cc |  2 ++
 paddle/framework/scope.cc         |  9 ++-------
 paddle/operators/feed_op.cc       |  2 +-
 paddle/operators/fetch_op.cc      |  2 +-
 5 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/paddle/framework/executor.cc b/paddle/framework/executor.cc
index 51ddb7e58e..ee0df039ac 100644
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -74,16 +74,6 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope) {
   for (auto& device_context : device_contexts_) {
     device_context->Wait();
   }
-  // // print tensor value
-  // for (auto& var : block.vars()) {
-  //   std::cout << var.name() << std::endl;
-  //   auto v = scope->FindVar(var.name());
-  //   const LoDTensor& t = v->Get<LoDTensor>();
-  //   for (int i = 0; i < t.numel(); ++i) {
-  //     std::cout << t.data<float>()[i] << " ";
-  //   }
-  //   std::cout << std::endl;
-  // }
 }
 
 }  // namespace framework
diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
index d3ea18d154..5e327cc893 100644
--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
@@ -130,6 +130,7 @@ std::once_flag set_variable_flag;
 template <typename T>
 void set_feed_variable(const std::vector<std::vector<T>>& inputs) {
   typedef std::vector<paddle::framework::Tensor> FeedInputs;
+  // Tensors in feed value variable will only be in CPUPlace
   Variable* g_feed_value = GetScope()->FindVar("feed_value");
   FeedInputs& feed_inputs = *(g_feed_value->GetMutable<FeedInputs>());
   auto size = inputs.size();
@@ -144,6 +145,7 @@ void set_feed_variable(const std::vector<std::vector<T>>& inputs) {
 template <typename T>
 std::vector<std::vector<T>> get_fetch_variable() {
   typedef std::vector<paddle::framework::Tensor> FetchOutputs;
+  // Tensors in fetch value variable will only be in CPUPlace
   Variable* g_fetch_value = GetScope()->FindVar("fetch_value");
   FetchOutputs& fetch_outputs = *(g_fetch_value->GetMutable<FetchOutputs>());
 
diff --git a/paddle/framework/scope.cc b/paddle/framework/scope.cc
index 2c416570cf..b6a9d7fbc2 100644
--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -66,15 +66,10 @@ void Scope::DropKids() {
 
 std::once_flag feed_variable_flag;
 
-template <typename T, typename... Args>
-std::unique_ptr<T> make_unique(Args&&... args) {
-  return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
-}
-
 framework::Scope* GetScope() {
-  static std::unique_ptr<framework::Scope> g_scope =
-      make_unique<framework::Scope>();
+  static std::unique_ptr<framework::Scope> g_scope{nullptr};
   std::call_once(feed_variable_flag, [&]() {
+    g_scope.reset(new framework::Scope());
     g_scope->NewVar("feed_value");
     g_scope->NewVar("fetch_value");
   });
diff --git a/paddle/operators/feed_op.cc b/paddle/operators/feed_op.cc
index d40db3ff2e..f2c498e2e2 100644
--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
@@ -33,7 +33,7 @@ class FeedOp : public framework::OperatorWithKernel {
 
     auto in_dim = tensors[col].dims();
     ctx->SetOutputDim("Out", in_dim);
-    // TODO(qijun) need to handle LodTensor later
+    // TODO(qijun): need to handle LodTensor later
   }
 
   framework::DataType IndicateDataType(
diff --git a/paddle/operators/fetch_op.cc b/paddle/operators/fetch_op.cc
index a885deacc8..f6882cbd03 100644
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
@@ -39,7 +39,7 @@ class FetchOp : public framework::OperatorWithKernel {
     tmp.Resize(input_dim);
     (*tensors)[col].Resize(input_dim);
 
-    // TODO(qijun) need to handle LodTensor later
+    // TODO(qijun): need to handle LodTensor later
   }
 
   framework::DataType IndicateDataType(

From 1f5192a27b968a7980c2eead7b6885e66f09575a Mon Sep 17 00:00:00 2001
From: qijun <qijun1994@hotmail.com>
Date: Fri, 6 Oct 2017 11:06:59 -0700
Subject: [PATCH 20/21] fix executor gpu unittest

---
 paddle/framework/executor.cc      |  2 +-
 paddle/framework/executor_test.cc | 20 +++++++++++++++-----
 paddle/operators/fetch_op.cu      |  2 +-
 paddle/platform/gpu_info.cc       |  3 ++-
 4 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/paddle/framework/executor.cc b/paddle/framework/executor.cc
index ee0df039ac..c18ba049c8 100644
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -30,7 +30,7 @@ Executor::Executor(const std::vector<platform::Place>& places) {
       device_contexts_[i] = new platform::CPUDeviceContext(
           boost::get<platform::CPUPlace>(places[i]));
     } else if (platform::is_gpu_place(places[i])) {
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
       device_contexts_[i] = new platform::CUDADeviceContext(
           boost::get<platform::GPUPlace>(places[i]));
 #else
diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
index 5e327cc893..55e209628b 100644
--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
@@ -293,7 +293,7 @@ TEST_F(ExecutorTesterFeed, CPU) {
   delete executor;
 }
 
-#ifdef PADDLE_WITH_GPU
+#ifdef PADDLE_WITH_CUDA
 TEST_F(ExecutorTesterRandom, GPU) {
   std::vector<Place> places;
   GPUPlace gpu_place(0);
@@ -315,10 +315,20 @@ TEST_F(ExecutorTesterFeed, GPU) {
 
   Executor* executor = new Executor(places);
 
-  // need to set feed variable before Executor::Run
-  set_feed_variable<float>(inputs_);
-  executor->Run(pdesc_, GetScope());
-
+  // 3 mini-batch
+  for (int i = 0; i < 3; i++) {
+    // need to set feed variable before Executor::Run
+    std::cout << "start mini-batch " << i << std::endl;
+    set_feed_variable<float>(inputs_);
+    executor->Run(pdesc_, GetScope());
+    std::vector<std::vector<float>> result = get_fetch_variable<float>();
+    for (auto& vec : result) {
+      for (auto& num : vec) {
+        std::cout << num << " ";
+      }
+      std::cout << std::endl;
+    }
+  }
   delete executor;
 }
 #endif
diff --git a/paddle/operators/fetch_op.cu b/paddle/operators/fetch_op.cu
index 2e24d3a8ad..ca39d24c79 100644
--- a/paddle/operators/fetch_op.cu
+++ b/paddle/operators/fetch_op.cu
@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 
-#include "paddle/operators/feed_op.h"
+#include "paddle/operators/fetch_op.h"
 
 namespace ops = paddle::operators;
 REGISTER_OP_GPU_KERNEL(fetch, ops::FetchKernel<float>);
diff --git a/paddle/platform/gpu_info.cc b/paddle/platform/gpu_info.cc
index 486dcd623a..aa76bb209d 100644
--- a/paddle/platform/gpu_info.cc
+++ b/paddle/platform/gpu_info.cc
@@ -43,7 +43,8 @@ int GetCurrentDeviceId() {
 }
 
 void SetDeviceId(int id) {
-  PADDLE_ENFORCE(id < GetDeviceCount(), "id must less than GPU count");
+  // TODO(qijun): find a better way to cache the cuda device count
+  PADDLE_ENFORCE(id < GetCUDADeviceCount(), "id must less than GPU count");
   PADDLE_ENFORCE(cudaSetDevice(id),
                  "cudaSetDevice failed in paddle::platform::SetDeviceId");
 }

From e8a678e1eecd11fee219a93c6c586ee24663a506 Mon Sep 17 00:00:00 2001
From: qijun <qijun1994@hotmail.com>
Date: Fri, 6 Oct 2017 22:46:04 +0000
Subject: [PATCH 21/21] fix executor gpu unittest runtime error

---
 paddle/framework/executor_test.cc | 19 ++++++++++++++++---
 paddle/operators/fetch_op.cc      |  2 --
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/paddle/framework/executor_test.cc b/paddle/framework/executor_test.cc
index 55e209628b..82f9bd6f2d 100644
--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
@@ -239,6 +239,7 @@ class ExecutorTesterFeed : public ::testing::Test {
   std::vector<std::vector<float>> inputs_;
 };
 
+#ifndef PADDLE_WITH_CUDA
 TEST_F(ExecutorTesterRandom, CPU) {
   std::vector<Place> places;
   CPUPlace cpu_place;
@@ -292,13 +293,19 @@ TEST_F(ExecutorTesterFeed, CPU) {
 
   delete executor;
 }
-
-#ifdef PADDLE_WITH_CUDA
+#else
 TEST_F(ExecutorTesterRandom, GPU) {
   std::vector<Place> places;
   GPUPlace gpu_place(0);
   places.push_back(gpu_place);
 
+  // We have a global Scope and BuddyAllocator, and we must ensure
+  // global BuddyAllocator is initialized before global Scope. Thus,
+  // global Scope will deconstruct before BuddyAllocator. Otherwise,
+  // "pointer being freed was not allocated" error will appear.
+  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
+  // need to be used at first.
+  paddle::memory::Used(CPUPlace());
   paddle::memory::Used(gpu_place);
 
   Executor* executor = new Executor(places);
@@ -310,7 +317,13 @@ TEST_F(ExecutorTesterFeed, GPU) {
   std::vector<Place> places;
   GPUPlace gpu_place(0);
   places.push_back(gpu_place);
-
+  // We have a global Scope and BuddyAllocator, and we must ensure
+  // global BuddyAllocator is initialized before global Scope. Thus,
+  // global Scope will deconstruct before BuddyAllocator. Otherwise,
+  // "pointer being freed was not allocated" error will appear.
+  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
+  // need to be used at first.
+  paddle::memory::Used(CPUPlace());
   paddle::memory::Used(gpu_place);
 
   Executor* executor = new Executor(places);
diff --git a/paddle/operators/fetch_op.cc b/paddle/operators/fetch_op.cc
index f6882cbd03..4b6b3ca85a 100644
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
@@ -35,8 +35,6 @@ class FetchOp : public framework::OperatorWithKernel {
     }
 
     auto input_dim = ctx->GetInputDim("Input");
-    framework::Tensor tmp;
-    tmp.Resize(input_dim);
     (*tensors)[col].Resize(input_dim);
 
     // TODO(qijun): need to handle LodTensor later