Merge branch 'develop' into cross_entropy_over_beam

8 years ago · 3d12a610ca
parent 7a42c92d49 14194ba08d
commit 3d12a610ca
44 changed files with 1033 additions and 122 deletions
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -22,7 +22,7 @@
    -   id: clang-format-with-version-check
        name: clang-format
        description: Format files with ClangFormat.
-        entry: ./.clang_format.hook -i
+        entry: bash ./.clang_format.hook -i
        language: system
        files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto)$
 -   repo: https://github.com/PaddlePaddle/pre-commit-golang
--- a/doc/design/cluster_train/README.md
+++ b/doc/design/cluster_train/README.md
@ -54,17 +54,18 @@ The life cycle of a single task is illustrated below:
 <img src="src/paddle-task-states.png"/>

 1. When a new pass of training starts, all tasks will be placed in the todo queue.
-1. The master server will dispatch few tasks to each trainer at a time, puts them in the pending queue and waits for completion.
-1. The trainer will work on its tasks and tell the master server once a task is completed. The master server will dispatch a new task to that trainer.
-1. If a task timeout. the master server will move it back to the todo queue. The timeout count will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, so it will be discarded.
+1. When a trainer requests a new task, the master server will dispatch a task from the todo queue to it, put the task in the pending queue, and wait for completion.
+1. The trainer will work on its task, tell the master server once the task is completed, and ask for a new task. The master server will then dispatch a new task to that trainer.
+1. If a task fails for any reason in the trainer, or takes longer than a specific period of time, the master server will move the task back to the todo queue. The timeout count for that task will increase by one. If the timeout count is above a threshold, the task is likely to cause a trainer to crash, so it will be discarded.
 1. The master server will move completed task to the done queue. When the todo queue is empty, the master server will start a new pass by moving all tasks in the done queue to todo queue and reset the timeout counter of all tasks to zero.

 ### Trainer Process

 The trainer process will:

-- Receive tasks from the master.
-- Work on the tasks: calculate and upload gradient to parameter servers, and update local model by downloading new parameters from parameter servers.
+- Request tasks from the master.
+- Work on the tasks.
+- Upload gradients to the parameter servers, and update the local model by downloading new parameters from the parameter servers.

 ### Parameter Server Process

@ -119,8 +120,8 @@ When the master is started by the Kubernetes, it executes the following steps at

 1. Grabs a unique *master* lock in etcd, which prevents concurrent master instantiations.
 1. Recovers the task queues from etcd if they already exist, otherwise, the master will create them.
-1. Watches the trainer prefix keys `/trainer/` on etcd to find the live trainers.
-1. Starts dispatching the tasks to the trainers, and updates task queue using an etcd transaction to ensure lock is held during the update.
+1. Writes its IP address to */master/addr* so that trainers can discover it.
+1. Listens for task requests from trainers, dispatches one task per request, and updates the task queue using an etcd transaction to ensure the lock is held during the update.

 When the master server process is dead for any reason, Kubernetes will restart it. It will be online again with all states recovered from etcd in few minutes.

@ -128,13 +129,11 @@ When the master server process is dead for any reason, Kubernetes will restart i

 When the trainer is started by the Kubernetes, it executes the following steps at startup:

-1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count.
-1. Generates a unique ID, and sets key `/trainer/<unique ID>` with its contact address as value. The key will be deleted when the lease expires, so the master will be aware of the trainer being online and offline.
-1. Waits for tasks from the master to start training.
+1. Watches the available parameter server prefix keys `/ps/` on etcd and waits until the count of parameter servers reaches the desired count */ps_desired*.
+1. Finds and watches */master/addr* to get the master's address.
+1. Requests tasks from the master to start training.

-If trainer's etcd lease expires, it will try set key `/trainer/<unique ID>` again so that the master server can discover the trainer again.
-
-When a trainer fails, Kuberentes would try to restart it. The recovered trainer would fetch tasks from the TODO queue and go on training.
+When a trainer fails, Kubernetes would try to restart it. The recovered trainer would fetch tasks from the master and go on training.

 ### Parameter Server Process
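
The task life cycle described in the README hunks above (todo, pending, done, with a per-task timeout count) can be sketched as follows. This is an illustrative in-memory model only, not the Go master's implementation: `TaskQueues`, its method names, and the integer task ids are hypothetical, and the real master persists its queues in etcd.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Hypothetical task record: an id plus the timeout count from the README.
struct Task {
  int id;
  int timeoutCount;
};

// Hypothetical in-memory model of the todo / pending / done queues.
class TaskQueues {
 public:
  explicit TaskQueues(int timeoutThreshold) : threshold_(timeoutThreshold) {
    assert(timeoutThreshold >= 0);
  }

  void add(int id) { todo_.push_back(Task{id, 0}); }

  // A trainer requested work: move one task from todo to pending.
  bool dispatch(int* id) {
    if (todo_.empty()) return false;
    Task t = todo_.front();
    todo_.pop_front();
    *id = t.id;
    pending_.emplace(t.id, t);
    return true;
  }

  // The trainer reported completion: move the task from pending to done.
  void finish(int id) {
    auto it = pending_.find(id);
    if (it == pending_.end()) return;
    done_.push_back(it->second);
    pending_.erase(it);
  }

  // The task failed or timed out: requeue it, or discard it once its
  // timeout count is above the threshold.
  void failOrTimeout(int id) {
    auto it = pending_.find(id);
    if (it == pending_.end()) return;
    Task t = it->second;
    pending_.erase(it);
    if (++t.timeoutCount > threshold_) return;  // discarded
    todo_.push_back(t);
  }

  // When todo is empty, start a new pass: move done back to todo and
  // reset every timeout count to zero.
  void startNewPassIfDrained() {
    if (!todo_.empty()) return;
    for (Task& t : done_) {
      t.timeoutCount = 0;
      todo_.push_back(t);
    }
    done_.clear();
  }

  std::size_t todoSize() const { return todo_.size(); }
  std::size_t pendingSize() const { return pending_.size(); }
  std::size_t doneSize() const { return done_.size(); }

 private:
  int threshold_;
  std::deque<Task> todo_;
  std::map<int, Task> pending_;
  std::deque<Task> done_;
};
```

With a threshold of 1, a task that fails twice is dropped on the second failure, while a task that completes is recycled into the next pass with its count reset.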

--- a/doc/design/cluster_train/src/paddle-etcd.graffle
+++ b/doc/design/cluster_train/src/paddle-etcd.graffle
--- a/doc/design/cluster_train/src/paddle-etcd.png
+++ b/doc/design/cluster_train/src/paddle-etcd.png
--- a/go/master/client.go
+++ b/go/master/client.go
@ -63,13 +63,24 @@ func WithAddr(addr string) func(c *Client) error {
 // WithEtcd sets the client to use etcd for master discovery.
 func WithEtcd(endpoints []string, timeout time.Duration) func(*Client) error {
 	return func(c *Client) error {
-		cli, err := clientv3.New(clientv3.Config{
-			Endpoints:   endpoints,
-			DialTimeout: timeout,
-		})
-		if err != nil {
+		var cli *clientv3.Client
+		f := func() error {
+			var err error
+			cli, err = clientv3.New(clientv3.Config{
+				Endpoints:   endpoints,
+				DialTimeout: timeout,
+			})
 			return err
 		}
+		for {
+			err := f()
+			if err != nil {
+				log.Warningln(err)
+			} else {
+				break
+			}
+			time.Sleep(time.Second)
+		}

 		ch := make(chan string, 1)
 		a, err := GetKey(cli, DefaultAddrPath, timeout)
@ -101,9 +112,6 @@ func NewClient(opts ...func(*Client) error) (*Client, error) {
 		}
 	}
 	c.ch = make(chan record, c.bufSize)
-	// FIXME: connection is created asyncrosly in monitorMaster go routine,
-	//        ensure the connection is ready for use before calling c.addClient.
-	time.Sleep(time.Second)
 	return c, nil
 }

--- a/paddle/capi/CMakeLists.txt
+++ b/paddle/capi/CMakeLists.txt
@ -53,7 +53,10 @@ add_custom_target(paddle_capi_whole ALL
 set_target_properties(paddle_capi_whole
  PROPERTIES IMPORTED_LOCATION ${CMAKE_CURRENT_BINARY_DIR}/${capi_whole_library})

+set(LINK_FLAGS " -Wl,--retain-symbols-file ${CMAKE_CURRENT_SOURCE_DIR}/export.sym -Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/export.map")
+# TODO: merge mkl into paddle_capi_shared
 add_library(paddle_capi_shared SHARED ${CAPI_SOURCES})
+set_target_properties(paddle_capi_shared	PROPERTIES LINK_FLAGS "${LINK_FLAGS}")
 target_include_directories(paddle_capi_shared PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
 link_paddle_exe(paddle_capi_shared)

--- a/paddle/capi/export.map
+++ b/paddle/capi/export.map
@ -0,0 +1,6 @@
+{
+	global:
+		paddle_*;
+	local:
+		*;
+};
--- a/paddle/capi/export.sym
+++ b/paddle/capi/export.sym
--- a/paddle/cuda/include/hl_cuda_cudnn.h
+++ b/paddle/cuda/include/hl_cuda_cudnn.h
@ -214,7 +214,8 @@ extern void hl_conv_workspace(hl_tensor_descriptor input,
                              int* convBwdDataAlgo,
                              size_t* bwdDataLimitBytes,
                              int* convBwdFilterAlgo,
-                              size_t* bwdFilterLimitBytes);
+                              size_t* bwdFilterLimitBytes,
+                              bool useDilation);

 /**
 * @brief   destroy filter descriptor.
@ -242,7 +243,9 @@ extern void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
                                             int padding_height,
                                             int padding_width,
                                             int stride_height,
-                                             int stride_width);
+                                             int stride_width,
+                                             int dilation_h = 1,
+                                             int dilation_w = 1);

 /**
 * @brief   reset convolution descriptor.
@ -262,7 +265,9 @@ extern void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                            int padding_height,
                                            int padding_width,
                                            int stride_height,
-                                            int stride_width);
+                                            int stride_width,
+                                            int dilation_h = 1,
+                                            int dilation_w = 1);

 /**
 * @brief   destroy convolution descriptor.
--- a/paddle/cuda/include/stub/hl_cuda_cudnn_stub.h
+++ b/paddle/cuda/include/stub/hl_cuda_cudnn_stub.h
@ -78,7 +78,9 @@ inline void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
                                             int padding_height,
                                             int padding_width,
                                             int stride_height,
-                                             int stride_width) {}
+                                             int stride_width,
+                                             int dilation_h,
+                                             int dilation_w) {}

 inline void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                            hl_tensor_descriptor image,
@ -86,7 +88,9 @@ inline void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                            int padding_height,
                                            int padding_width,
                                            int stride_height,
-                                            int stride_width) {}
+                                            int stride_width,
+                                            int dilation_h,
+                                            int dilation_w) {}

 inline void hl_destroy_convolution_descriptor(hl_convolution_descriptor conv) {}

@ -99,7 +103,8 @@ inline void hl_conv_workspace(hl_tensor_descriptor input,
                              int* convBwdDataAlgo,
                              size_t* bwdDataLimitBytes,
                              int* convBwdFilterAlgo,
-                              size_t* bwdFilterLimitBytes) {}
+                              size_t* bwdFilterLimitBytes,
+                              bool useDilation) {}

 inline void hl_convolution_forward(hl_tensor_descriptor input,
                                   real* input_data,
--- a/paddle/cuda/src/hl_cuda_cudnn.cc
+++ b/paddle/cuda/src/hl_cuda_cudnn.cc
@ -201,7 +201,8 @@ void hl_conv_workspace(hl_tensor_descriptor input,
                       int* convBwdDataAlgo,
                       size_t* bwdDataLimitBytes,
                       int* convBwdFilterAlgo,
-                       size_t* bwdFilterLimitBytes) {
+                       size_t* bwdFilterLimitBytes,
+                       bool useDilation) {
 #if CUDNN_VERSION >= 4000

  CHECK_NOTNULL(input);
@ -213,21 +214,60 @@ void hl_conv_workspace(hl_tensor_descriptor input,
  size_t memoryLimitBytes =
      (1LL << 20) * FLAGS_cudnn_conv_workspace_limit_in_mb;

+  // For dilation
+  int algo = 0;
+
  // cudnn convolution forward configuration
  cudnnTensorDescriptor_t fwd_src_desc = GET_TENSOR_DESCRIPTOR(input);
  cudnnTensorDescriptor_t fwd_dest_desc = GET_TENSOR_DESCRIPTOR(output);
  cudnnFilterDescriptor_t fwd_filter_desc = GET_FILTER_DESCRIPTOR(filter);
  cudnnConvolutionDescriptor_t fwd_conv_desc = GET_CONVOLUTION_DESCRIPTOR(conv);
+  // cudnn convolution backward data configuration
+  cudnnFilterDescriptor_t bwd_data_filter_desc = GET_FILTER_DESCRIPTOR(filter);
+  cudnnTensorDescriptor_t bwd_data_diff_desc = GET_TENSOR_DESCRIPTOR(output);
+  cudnnTensorDescriptor_t bwd_data_grad_desc = GET_TENSOR_DESCRIPTOR(input);
+  cudnnConvolutionDescriptor_t bwd_data_conv_desc =
+      GET_CONVOLUTION_DESCRIPTOR(conv);
+  // cudnn convolution backward filter configuration
+  cudnnTensorDescriptor_t bwd_filter_src_desc = GET_TENSOR_DESCRIPTOR(input);
+  cudnnTensorDescriptor_t bwd_filter_diff_desc = GET_TENSOR_DESCRIPTOR(output);
+  cudnnConvolutionDescriptor_t bwd_filter_conv_desc =
+      GET_CONVOLUTION_DESCRIPTOR(conv);
+  cudnnFilterDescriptor_t bwd_filter_grad_desc = GET_FILTER_DESCRIPTOR(filter);

-  CHECK_CUDNN(dynload::cudnnGetConvolutionForwardAlgorithm(
-      t_resource.cudnn_handle,
-      fwd_src_desc,
-      fwd_filter_desc,
-      fwd_conv_desc,
-      fwd_dest_desc,
-      CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
-      memoryLimitBytes,
-      reinterpret_cast<cudnnConvolutionFwdAlgo_t*>(convFwdAlgo)));
+  if (useDilation) {
+    *convFwdAlgo = algo;
+    *convBwdDataAlgo = algo;
+    *convBwdFilterAlgo = algo;
+  } else {
+    CHECK_CUDNN(dynload::cudnnGetConvolutionForwardAlgorithm(
+        t_resource.cudnn_handle,
+        fwd_src_desc,
+        fwd_filter_desc,
+        fwd_conv_desc,
+        fwd_dest_desc,
+        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
+        memoryLimitBytes,
+        reinterpret_cast<cudnnConvolutionFwdAlgo_t*>(convFwdAlgo)));
+    CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardDataAlgorithm(
+        t_resource.cudnn_handle,
+        bwd_data_filter_desc,
+        bwd_data_diff_desc,
+        bwd_data_conv_desc,
+        bwd_data_grad_desc,
+        CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT,
+        memoryLimitBytes,
+        reinterpret_cast<cudnnConvolutionBwdDataAlgo_t*>(convBwdDataAlgo)));
+    CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardFilterAlgorithm(
+        t_resource.cudnn_handle,
+        bwd_filter_src_desc,
+        bwd_filter_diff_desc,
+        bwd_filter_conv_desc,
+        bwd_filter_grad_desc,
+        CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT,
+        memoryLimitBytes,
+        reinterpret_cast<cudnnConvolutionBwdFilterAlgo_t*>(convBwdFilterAlgo)));
+  }

  CHECK_CUDNN(dynload::cudnnGetConvolutionForwardWorkspaceSize(
      t_resource.cudnn_handle,
@ -238,23 +278,6 @@ void hl_conv_workspace(hl_tensor_descriptor input,
      static_cast<cudnnConvolutionFwdAlgo_t>(*convFwdAlgo),
      fwdLimitBytes));

-  // cudnn convolution backward data configuration
-  cudnnFilterDescriptor_t bwd_data_filter_desc = GET_FILTER_DESCRIPTOR(filter);
-  cudnnTensorDescriptor_t bwd_data_diff_desc = GET_TENSOR_DESCRIPTOR(output);
-  cudnnTensorDescriptor_t bwd_data_grad_desc = GET_TENSOR_DESCRIPTOR(input);
-  cudnnConvolutionDescriptor_t bwd_data_conv_desc =
-      GET_CONVOLUTION_DESCRIPTOR(conv);
-
-  CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardDataAlgorithm(
-      t_resource.cudnn_handle,
-      bwd_data_filter_desc,
-      bwd_data_diff_desc,
-      bwd_data_conv_desc,
-      bwd_data_grad_desc,
-      CUDNN_CONVOLUTION_BWD_DATA_SPECIFY_WORKSPACE_LIMIT,
-      memoryLimitBytes,
-      reinterpret_cast<cudnnConvolutionBwdDataAlgo_t*>(convBwdDataAlgo)));
-
  CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardDataWorkspaceSize(
      t_resource.cudnn_handle,
      bwd_data_filter_desc,
@ -264,23 +287,6 @@ void hl_conv_workspace(hl_tensor_descriptor input,
      static_cast<cudnnConvolutionBwdDataAlgo_t>(*convBwdDataAlgo),
      bwdDataLimitBytes));

-  // cudnn convolution backward filter configuration
-  cudnnTensorDescriptor_t bwd_filter_src_desc = GET_TENSOR_DESCRIPTOR(input);
-  cudnnTensorDescriptor_t bwd_filter_diff_desc = GET_TENSOR_DESCRIPTOR(output);
-  cudnnConvolutionDescriptor_t bwd_filter_conv_desc =
-      GET_CONVOLUTION_DESCRIPTOR(conv);
-  cudnnFilterDescriptor_t bwd_filter_grad_desc = GET_FILTER_DESCRIPTOR(filter);
-
-  CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardFilterAlgorithm(
-      t_resource.cudnn_handle,
-      bwd_filter_src_desc,
-      bwd_filter_diff_desc,
-      bwd_filter_conv_desc,
-      bwd_filter_grad_desc,
-      CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT,
-      memoryLimitBytes,
-      reinterpret_cast<cudnnConvolutionBwdFilterAlgo_t*>(convBwdFilterAlgo)));
-
  CHECK_CUDNN(dynload::cudnnGetConvolutionBackwardFilterWorkspaceSize(
      t_resource.cudnn_handle,
      bwd_filter_src_desc,
@ -603,7 +609,9 @@ void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
                                      int padding_height,
                                      int padding_width,
                                      int stride_height,
-                                      int stride_width) {
+                                      int stride_width,
+                                      int dilation_h,
+                                      int dilation_w) {
  CHECK_NOTNULL(conv);

  cudnn_convolution_descriptor hl_conv = (cudnn_convolution_descriptor)malloc(
@ -625,18 +633,24 @@ void hl_create_convolution_descriptor(hl_convolution_descriptor* conv,
                                                       padding_width,
                                                       stride_height,
                                                       stride_width,
-                                                       1,
-                                                       1,
+                                                       dilation_h,
+                                                       dilation_w,
                                                       mode,
                                                       data_type));
 #else
+  if (dilation_h > 1 || dilation_w > 1) {
+    LOG(FATAL)
+        << "Current cuDNN version doesn't support dilation convolution. "
+        << "The dilation convolution requires cuDNN >= v6.0.";
+  }
+
  CHECK_CUDNN(dynload::cudnnSetConvolution2dDescriptor(hl_conv->desc,
                                                       padding_height,
                                                       padding_width,
                                                       stride_height,
                                                       stride_width,
-                                                       1,
-                                                       1,
+                                                       dilation_h,
+                                                       dilation_w,
                                                       mode));
 #endif

@ -659,7 +673,9 @@ void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                     int padding_height,
                                     int padding_width,
                                     int stride_height,
-                                     int stride_width) {
+                                     int stride_width,
+                                     int dilation_h,
+                                     int dilation_w) {
  CHECK_NOTNULL(conv);
  CHECK_NOTNULL(image);
  CHECK_NOTNULL(filter);
@ -678,8 +694,8 @@ void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                                       padding_width,
                                                       stride_height,
                                                       stride_width,
-                                                       1,
-                                                       1,
+                                                       dilation_h,
+                                                       dilation_w,
                                                       mode,
                                                       data_type));
 #else
@ -688,8 +704,8 @@ void hl_reset_convolution_descriptor(hl_convolution_descriptor conv,
                                                       padding_width,
                                                       stride_height,
                                                       stride_width,
-                                                       1,
-                                                       1,
+                                                       dilation_h,
+                                                       dilation_w,
                                                       mode));
 #endif

--- a/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp
+++ b/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp
@ -1344,7 +1344,7 @@ void RecurrentGradientMachine::fillGenOutputs() {
      CHECK(!finalPaths_[i].empty());
      Path& path = finalPaths_[i][0];
      generator_.ids.insert(
-          generator_.ids.begin(), path.ids.begin(), path.ids.end());
+          generator_.ids.end(), path.ids.begin(), path.ids.end());
      starts[i + 1] = starts[i] + path.ids.size();
    }
  }
@ -1399,8 +1399,8 @@ void RecurrentGradientMachine::createDataOutlinkCopySizeInfo(
             getBeamSize() > 1 ? finalPaths_.size() : finalPaths_[0].size());
    int* starts = inputSeqStartPos->getMutableData(false);
    int seqId = 0;
-    for (int i = 0; i < finalPaths_.size(); ++i) {
-      for (int j = 0; j < finalPaths_[i].size(); ++j) {
+    for (size_t i = 0; i < finalPaths_.size(); ++i) {
+      for (size_t j = 0; j < finalPaths_[i].size(); ++j) {
        copySize[seqId] = getBeamSize() > 1 ? starts[i + 1] - starts[i]
                                            : starts[j + 1] - starts[j];
        batchMachineStartPos_[seqId + 1] =
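
The `fillGenOutputs` change above appends each path at `end()` rather than inserting it at `begin()`, which keeps the concatenated ids in the same order as the `starts` offsets being accumulated. A minimal sketch of that pattern, where `concatPaths` is a hypothetical helper and not a function in the repository:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: concatenate per-sequence id paths and record the
// start offset of each sequence, so that sequence i occupies
// ids[starts[i]] .. ids[starts[i + 1] - 1].
std::vector<int> concatPaths(const std::vector<std::vector<int>>& paths,
                             std::vector<int>* starts) {
  std::vector<int> ids;
  starts->assign(1, 0);
  for (const std::vector<int>& path : paths) {
    assert(!path.empty());
    // Appending at end() preserves sequence order. Inserting at begin()
    // would reverse the order of the sequences while the offsets still
    // assume forward order.
    ids.insert(ids.end(), path.begin(), path.end());
    starts->push_back(starts->back() + static_cast<int>(path.size()));
  }
  return ids;
}
```

For the paths `{1, 2}` and `{3}`, this yields ids `1, 2, 3` with offsets `0, 2, 3`. Inserting at `begin()` would instead give `3, 1, 2`, so the offsets would no longer point at the right sequence.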
--- a/paddle/gserver/layers/ConvBaseLayer.cpp
+++ b/paddle/gserver/layers/ConvBaseLayer.cpp
@ -32,9 +32,11 @@ bool ConvBaseLayer::init(const LayerMap& layerMap,
    const ConvConfig& conf = inputConfig.conv_conf();
    padding_.push_back(conf.padding());
    stride_.push_back(conf.stride());
+    dilation_.push_back(conf.dilation());
    filterSize_.push_back(conf.filter_size());
    paddingY_.push_back(conf.padding_y());
    strideY_.push_back(conf.stride_y());
+    dilationY_.push_back(conf.dilation_y());
    filterSizeY_.push_back(conf.filter_size_y());
    filterPixels_.push_back(filterSize_.back() * filterSizeY_.back());
    channels_.push_back(conf.channels());
@ -89,7 +91,11 @@ size_t ConvBaseLayer::calOutputSize() {
  size_t layerSize = 0;

  auto setLayerSize = [&](IntV& inH, IntV& inW, IntV& outH, IntV& outW) {
+    size_t filterSizeY;
+    size_t filterSize;
    for (size_t i = 0; i < inputLayers_.size(); i++) {
+      filterSizeY = (filterSizeY_[i] - 1) * dilationY_[i] + 1;
+      filterSize = (filterSize_[i] - 1) * dilation_[i] + 1;
      inH.push_back(inputLayers_[i]->getOutput().getFrameHeight());
      inW.push_back(inputLayers_[i]->getOutput().getFrameWidth());
      const ConvConfig& conf = config_.inputs(i).conv_conf();
@ -98,17 +104,17 @@ size_t ConvBaseLayer::calOutputSize() {
          inH[i] = conf.has_output_y() ? conf.output_y() : conf.output_x();
        if (inW[i] == 0) inW[i] = conf.output_x();
        outH.push_back(imageSize(
-            inH[i], filterSizeY_[i], paddingY_[i], strideY_[i], caffeMode_));
-        outW.push_back(imageSize(
-            inW[i], filterSize_[i], padding_[i], stride_[i], caffeMode_));
+            inH[i], filterSizeY, paddingY_[i], strideY_[i], caffeMode_));
+        outW.push_back(
+            imageSize(inW[i], filterSize, padding_[i], stride_[i], caffeMode_));
      } else {
        if (inH[i] == 0)
          inH[i] = conf.has_img_size_y() ? conf.img_size_y() : conf.img_size();
        if (inW[i] == 0) inW[i] = conf.img_size();
        outH.push_back(outputSize(
-            inH[i], filterSizeY_[i], paddingY_[i], strideY_[i], caffeMode_));
+            inH[i], filterSizeY, paddingY_[i], strideY_[i], caffeMode_));
        outW.push_back(outputSize(
-            inW[i], filterSize_[i], padding_[i], stride_[i], caffeMode_));
+            inW[i], filterSize, padding_[i], stride_[i], caffeMode_));
      }
      CHECK_EQ(outH[i], outH[0]);
      CHECK_EQ(outW[i], outW[0]);
--- a/paddle/gserver/layers/ConvBaseLayer.h
+++ b/paddle/gserver/layers/ConvBaseLayer.h
@ -40,6 +40,10 @@ protected:
  IntV stride_;
  /// The y dimension of the stride.
  IntV strideY_;
+  /// The x dimension of the dilation.
+  IntV dilation_;
+  /// The y dimension of the dilation.
+  IntV dilationY_;
  /// The x dimension of a filter kernel.
  IntV filterSize_;
  /// The y dimension of a filter kernel.
--- a/paddle/gserver/layers/ConvBaseOperator.cpp
+++ b/paddle/gserver/layers/ConvBaseOperator.cpp
@ -59,7 +59,8 @@ void ConvBaseOperator::allocConvWorkSpace() {
                    &bwdDataAlgo_,
                    &bwdDataLimitBytes_,
                    &bwdFilterAlgo_,
-                    &bwdFilterLimitBytes_);
+                    &bwdFilterLimitBytes_,
+                    /*useDilation*/ false);

  size_t maxWorkSpace = 0;
  maxWorkSpace = std::max(fwdLimitBytes_, bwdDataLimitBytes_);
--- a/paddle/gserver/layers/ConvBaseProjection.cpp
+++ b/paddle/gserver/layers/ConvBaseProjection.cpp
@ -41,6 +41,11 @@ void ConvBaseProjection::getConvParams() {
  strideH_ = conf.stride_y();
  strideW_ = conf.stride();

+  dilationH_ = conf.dilation_y();
+  dilationW_ = conf.dilation();
+  CHECK_GT(dilationH_, 0);
+  CHECK_GT(dilationW_, 0);
+
  filterH_ = conf.filter_size_y();
  filterW_ = conf.filter_size();

@ -77,7 +82,9 @@ void ConvBaseProjection::initCudnn() {
                                   paddingH_,
                                   paddingW_,
                                   strideH_,
-                                   strideW_);
+                                   strideW_,
+                                   dilationH_,
+                                   dilationW_);

  // initialize all to default algorithms
  fwdAlgo_ = 0;
@ -131,7 +138,9 @@ void ConvBaseProjection::reshapeTensorDesc(int batchSize) {
                                  paddingH_,
                                  paddingW_,
                                  strideH_,
-                                  strideW_);
+                                  strideW_,
+                                  dilationH_,
+                                  dilationW_);
 }

 void ConvBaseProjection::reshape(int batchSize) {
@ -140,6 +149,10 @@ void ConvBaseProjection::reshape(int batchSize) {
  CHECK_EQ(calInputSize(), in_->value->getWidth());

  reshapeTensorDesc(batchSize);
+  bool useDilation = false;
+  if (dilationH_ > 1 || dilationW_ > 1) {
+    useDilation = true;
+  }
  hl_conv_workspace(imageDesc_,
                    outputDesc_,
                    filterDesc_,
@ -149,7 +162,8 @@ void ConvBaseProjection::reshape(int batchSize) {
                    &bwdDataAlgo_,
                    &bwdDataLimitBytes_,
                    &bwdFilterAlgo_,
-                    &bwdFilterLimitBytes_);
+                    &bwdFilterLimitBytes_,
+                    useDilation);

  size_t maxWorkSpace = 0;
  maxWorkSpace = std::max(fwdLimitBytes_, bwdDataLimitBytes_);
--- a/paddle/gserver/layers/ConvBaseProjection.h
+++ b/paddle/gserver/layers/ConvBaseProjection.h
@ -63,6 +63,7 @@ protected:
  int configChannels_, configNumFilters_;
  int paddingH_, paddingW_;
  int strideH_, strideW_;
+  int dilationH_, dilationW_;
  int filterH_, filterW_;
  /// One group offset of input data.
  int inputOffset_;
--- a/paddle/gserver/layers/ConvProjection.cpp
+++ b/paddle/gserver/layers/ConvProjection.cpp
@ -25,12 +25,12 @@ size_t ConvProjection::calOutputSize() {
  if (imageH_ == 0) imageH_ = configImgH_;
  if (imageW_ == 0) imageW_ = configImgW_;
  outputH_ = outputSize(imageH_,
-                        filterH_,
+                        (filterH_ - 1) * dilationH_ + 1,
                        paddingH_,
                        strideH_,
                        /* caffeMode */ true);
  outputW_ = outputSize(imageW_,
-                        filterW_,
+                        (filterW_ - 1) * dilationW_ + 1,
                        paddingW_,
                        strideW_,
                        /* caffeMode */ true);
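
The `(filter - 1) * dilation + 1` expression used above is the effective extent of a dilated filter; the output size is then computed as for an ordinary convolution with that larger filter. A small worked sketch follows. `dilatedFilterSize` and `caffeOutputSize` are hypothetical names, and the caffe-mode formula `(in - k + 2p) / s + 1` is assumed here rather than copied from the repository's `outputSize`.

```cpp
#include <cassert>

// Effective extent of a filter whose taps are spaced `dilation` apart.
int dilatedFilterSize(int filterSize, int dilation) {
  assert(filterSize > 0 && dilation > 0);
  return (filterSize - 1) * dilation + 1;
}

// Hypothetical stand-in for outputSize() in caffe mode (assumed formula,
// using integer division).
int caffeOutputSize(int imageSize, int filterSize, int padding, int stride) {
  return (imageSize - filterSize + 2 * padding) / stride + 1;
}
```

With the values used in `testConvLayer` (image 16, filter 2, stride 2, dilation 2), the effective filter size is 3. The output width with padding 0 is `(16 - 3) / 2 + 1 = 7`, and the output height with padding 1 is `(16 - 3 + 2) / 2 + 1 = 8`. With dilation 1 the effective size stays 2 and the width is 8.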
--- a/paddle/gserver/layers/CrossEntropyOverBeam.cpp
+++ b/paddle/gserver/layers/CrossEntropyOverBeam.cpp
@ -331,6 +331,8 @@ void CrossEntropyOverBeam::splitBatchBeams() {
                         false,
                         false);
      beamPerSeq_[j].gold[i] = goldSequence_[i]->getData()[j];
+
+      CHECK_LE(beamPerSeq_[j].gold[i], seqStarts[j + 1] - seqStarts[j]);
    }
  }
 }
--- a/paddle/gserver/layers/SequenceSliceLayer.cpp
+++ b/paddle/gserver/layers/SequenceSliceLayer.cpp
@ -130,6 +130,8 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts,
  CHECK(starts || ends) << "At least one of the start or end indices "
                        << "should be given.";

+  bool hasSubseq = getInput(0).hasSubseq();
+
  outSeqStartPos_.resize(1, 0);
  outSubSeqStartPos_.resize(1, 0);
  selectedRows_.clear();
@ -151,14 +153,13 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts,
        int seqLen = endPos - begPos + 1;
        CHECK_GT(seqLen, 0U);
        for (int m = begPos; m <= endPos; ++m) selectedRows_.push_back(m);
-        inputSeqInfoVec_.size() > 1
+        hasSubseq
            ? outSubSeqStartPos_.push_back(outSubSeqStartPos_.back() + seqLen)
            : outSeqStartPos_.push_back(outSeqStartPos_.back() + seqLen);
      }
      rowIdx++;
    }
-    if (inputSeqInfoVec_.size() > 1)
-      outSeqStartPos_.push_back(outSubSeqStartPos_.back());
+    if (hasSubseq) outSeqStartPos_.push_back(outSubSeqStartPos_.back());
  }

  if (useGpu_) {
@ -175,7 +176,7 @@ void SequenceSliceLayer::calSelectedRows(const MatrixPtr starts,
  output_.sequenceStartPositions->copyFrom(
      outSeqStartPos_.data(), outSeqStartPos_.size(), false);

-  if (inputSeqInfoVec_.size() > 1) {
+  if (hasSubseq) {
    ICpuGpuVector::resizeOrCreate(
        output_.subSequenceStartPositions, outSubSeqStartPos_.size(), false);
    output_.subSequenceStartPositions->copyFrom(
@ -200,13 +201,15 @@ void SequenceSliceLayer::forward(PassType passType) {
      startIdsOnCpu_ = getInputValue(1);
      endIdsOnCpu_ = getInputValue(2);
    }
-  } else
+  } else {
    copySliceIdsToCpu();
+  }

-  // calculate the selected row indices in a batch,
-  // and build the output sequence information.
-  calSelectedRows(startIdsOnCpu_ ? startIdsOnCpu_ : nullptr,
-                  endIdsOnCpu_ ? endIdsOnCpu_ : nullptr);
+  /*
+   * calculate the selected row indices in a batch, and build the output
+   * sequence information.
+   */
+  calSelectedRows(startIdsOnCpu_, endIdsOnCpu_);

  resetOutput(selectedRows_.size(), getSize());

--- a/paddle/gserver/tests/test_LayerGrad.cpp
+++ b/paddle/gserver/tests/test_LayerGrad.cpp
@ -12,6 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

+#ifndef PADDLE_ONLY_CPU
+#include <cudnn.h>
+#endif
 #include <gtest/gtest.h>
 #include <string>
 #include <vector>
@ -189,10 +192,16 @@ TEST(Projection, scaling) {
 void testProjectionConv(size_t groups, bool isDeconv) {
  const int NUM_FILTERS = 18;
  const int FILTER_SIZE = 2;
-  const int FILTER_SIZE_Y = 4;
+  const int FILTER_SIZE_Y = 2;
  const int CHANNELS = 3;
  const int IMAGE_SIZE = 16;

+#if CUDNN_VERSION >= 6000
+  const int DILATION = 2;
+#else
+  const int DILATION = 1;
+#endif
+
  ProjectionConfig conf;
  if (isDeconv) {
    conf.set_type("convt");
@ -209,6 +218,8 @@ void testProjectionConv(size_t groups, bool isDeconv) {
  conv->set_padding_y(1);
  conv->set_stride(2);
  conv->set_stride_y(2);
+  conv->set_dilation(DILATION);
+  conv->set_dilation_y(DILATION);
  conv->set_groups(groups);
  if (isDeconv) {
    conv->set_filter_channels(NUM_FILTERS / conv->groups());
@ -217,12 +228,12 @@ void testProjectionConv(size_t groups, bool isDeconv) {
  }
  conv->set_img_size(IMAGE_SIZE);
  int output_x = outputSize(conv->img_size(),
-                            conv->filter_size(),
+                            (conv->filter_size() - 1) * DILATION + 1,
                            conv->padding(),
                            conv->stride(),
                            /* caffeMode */ true);
  int output_y = outputSize(conv->img_size(),
-                            conv->filter_size_y(),
+                            (conv->filter_size_y() - 1) * DILATION + 1,
                            conv->padding_y(),
                            conv->stride_y(),
                            /* caffeMode */ true);
@ -424,27 +435,38 @@ void testConvLayer(const string& type, bool trans, bool useGpu) {
  config.layerConfig.set_partial_sum(1);
  config.layerConfig.set_shared_biases(true);

-  config.inputDefs.push_back({INPUT_DATA, "layer_0", 384, 288});
+  int dilation = 1;
+  if (type == "cudnn_conv") {
+#if CUDNN_VERSION >= 6000
+    dilation = 2;
+#else
+    dilation = 1;
+#endif
+  }
+
+  config.inputDefs.push_back({INPUT_DATA, "layer_0", 768, 192});
  LayerInputConfig* input = config.layerConfig.add_inputs();
  ConvConfig* conv = input->mutable_conv_conf();
  conv->set_filter_size(2);
-  conv->set_filter_size_y(3);
+  conv->set_filter_size_y(2);
  conv->set_channels(3);
  conv->set_padding(0);
  conv->set_padding_y(1);
  conv->set_stride(2);
  conv->set_stride_y(2);
+  conv->set_dilation(dilation);
+  conv->set_dilation_y(dilation);
  conv->set_groups(1);
  conv->set_filter_channels(conv->channels() / conv->groups());
  conv->set_img_size(16);
-  conv->set_img_size_y(8);
+  conv->set_img_size_y(16);
  conv->set_output_x(outputSize(conv->img_size(),
-                                conv->filter_size(),
+                                (conv->filter_size() - 1) * dilation + 1,
                                conv->padding(),
                                conv->stride(),
                                /* caffeMode */ true));
  conv->set_output_y(outputSize(conv->img_size_y(),
-                                conv->filter_size_y(),
+                                (conv->filter_size_y() - 1) * dilation + 1,
                                conv->padding_y(),
                                conv->stride_y(),
                                /* caffeMode */ true));
--- a/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp
+++ b/paddle/gserver/tests/test_SeqSliceLayerGrad.cpp
@ -30,6 +30,8 @@ const int MAX_SEQ_NUM = 17;
 const int MAX_SEQ_LEN = 23;
 const int MAX_BEAM_SIZE = 13;

+const size_t SEED = (size_t)(time(NULL));
+
 vector<real> randSampling(real range, int n) {
  CHECK_GE(range, n);
  vector<real> num(range);
@ -46,7 +48,7 @@ void genSeqInfo(vector<int>& seqStartPos, vector<int>& subSeqStartPos) {
  seqStartPos.resize(1, 0);
  subSeqStartPos.resize(1, 0);

-  srand((size_t)(time(NULL)));
+  srand(SEED);
  int seqNum = 1 + (rand() % MAX_SEQ_NUM);
  for (int i = 0; i < seqNum; ++i) {
    int subSeqNum = 1 + (rand() % MAX_SEQ_NUM);
--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@ -42,6 +42,7 @@ function(op_library TARGET)
 endfunction()

 add_subdirectory(math)
+
 cc_test(gather_test SRCS gather_test.cc DEPS tensor)
 op_library(gather_op SRCS gather_op.cc gather_op.cu)

@ -67,6 +68,7 @@ op_library(sgd_op SRCS sgd_op.cc sgd_op.cu)

 op_library(recurrent_op SRCS recurrent_op.cc rnn/recurrent_op_utils.cc
    DEPS framework_proto tensor op_registry operator net_op)
-op_library(uniform_random_op
-        SRCS uniform_random_op.cc uniform_random_op.cu)
+op_library(uniform_random_op SRCS uniform_random_op.cc uniform_random_op.cu)
+op_library(lookup_table_op SRCS lookup_table_op.cc lookup_table_op.cu)
 op_library(scale_op SRCS scale_op.cc scale_op.cu DEPS net_op)
+op_library(minus_op SRCS minus_op.cc minus_op.cu DEPS scale_op)
--- a/paddle/operators/fill_zeros_like_op.h
+++ b/paddle/operators/fill_zeros_like_op.h
@ -26,7 +26,7 @@ class FillZerosLikeKernel : public framework::OpKernel {
    auto* output = context.Output<framework::Tensor>("Dst");
    output->mutable_data<T>(context.GetPlace());
    auto t = framework::EigenVector<T>::Flatten(*output);
-    t.device(context.GetEigenDevice<Place>()) = t.constant(T(0));
+    t.device(context.GetEigenDevice<Place>()) = t.constant(static_cast<T>(0));
  }
 };

--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@ -0,0 +1,72 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/lookup_table_op.h"
+
+namespace paddle {
+namespace operators {
+
+class LookupTableOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(const framework::InferShapeContext &context) const override {
+    auto table_t = context.Input<Tensor>("W");
+    auto ids_t = context.Input<Tensor>("Ids");
+    auto output_t = context.Output<Tensor>("Out");
+
+    output_t->Resize({ids_t->dims()[0], table_t->dims()[1]});
+  }
+};
+
+class LookupTableOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  LookupTableOpMaker(framework::OpProto *proto,
+                     framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("W",
+             "An input represents embedding tensors,"
+             " which is a learnable parameter.");
+    AddInput("Ids",
+             "An input of type int32 or int64"
+             " that contains the ids to be looked up in W.");
+    AddOutput("Out", "The lookup results, which have the same type as W.");
+    AddComment(
+        "This operator performs lookups on the parameter W,"
+        " then concatenates the results into a dense tensor.");
+  }
+};
+
+class LookupTableOpGrad : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(const framework::InferShapeContext &context) const override {
+    auto table = context.Input<Tensor>("W");
+    auto d_table = context.Output<Tensor>(framework::GradVarName("W"));
+    d_table->Resize(table->dims());
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP(lookup_table, ops::LookupTableOp, ops::LookupTableOpMaker,
+            lookup_table_grad, ops::LookupTableOpGrad);
+
+REGISTER_OP_CPU_KERNEL(lookup_table, ops::LookupTableKernel<float>);
+REGISTER_OP_CPU_KERNEL(lookup_table_grad, ops::LookupTableGradKernel<float>);