Merge remote-tracking branch 'ups/develop' into fea/jit/rnn

7 years ago · 686eaf20ba
parent 0c5ed5f6fc a4497e6808
commit 686eaf20ba
20 changed files with 1315 additions and 49 deletions
--- a/.github/ISSUE_TEMPLATE/---feature-request-.md
+++ b/.github/ISSUE_TEMPLATE/---feature-request-.md
@ -0,0 +1,27 @@
+---
+name: 建议(Feature request)
+about: 您可以提出您的建议。 You could use this template for reporting a suggestion  issue.
+
+---
+
+欢迎您对PaddlePaddle提出建议，非常感谢您对PaddlePaddle的贡献！
+在留下您的建议时，辛苦您同步提供如下信息：
+- 版本、环境信息
+1）PaddlePaddle版本：请提供您的PaddlePaddle版本号，例如1.1
+2）CPU/GPU：您是否使用GPU进行训练，如是，请提供您的CUDA和cuDNN版本号
+3）系统环境：请您描述系统类型、版本，例如Mac OS 10.14
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 建议描述：请您详细描述，您认为需优化的功能
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
+Please make sure that this is a feature request. 
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/CUDNN version
+-OS Platform (eg.Mac OS 10.14)
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe the feature and the current behavior/state.**
+**Any Other info.**
--- a/.github/ISSUE_TEMPLATE/---inference-issue-.md
+++ b/.github/ISSUE_TEMPLATE/---inference-issue-.md
@ -0,0 +1,40 @@
+---
+name: 预测（Inference Issue）
+about: 您可以提问预测中报错、应用等问题。 You could use this template for reporting an inference issue.
+
+---
+
+为使您的问题得到快速解决，在建立Issue前，请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题，为快速解决您的提问，建立issue时请提供如下细节信息：
+- 标题：简洁、精准描述您的问题，例如“最新预测库的API文档在哪儿 ”
+- 版本、环境信息：
+    1）PaddlePaddle版本：请提供您的PaddlePaddle版本号（如1.1）或CommitID
+    2）CPU：预测若用CPU，请提供CPU型号，MKL/OpenBlas/MKLDNN/等数学库使用情况
+    3）GPU：预测若用GPU，请提供GPU型号、CUDA和CUDNN版本号
+    4）系统环境：请您描述系统类型、版本（如Mac OS 10.14），Python版本
+-预测信息
+    1）C++预测：请您提供预测库安装包的版本信息，及其中的version.txt文件
+    2）CMake包含路径的完整命令
+    3）API信息（如调用请提供）
+    4）预测库来源：官网下载/特殊环境（如BCLOUD编译）
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 问题描述：请详细描述您的问题，同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in the github in case that th
+If there is no solution,please make sure that this is an inference issue including the following details :
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/CUDNN version
+-OS Platform (eg.Mac OS 10.14)
+-Python version
+-Cmake orders
+-C++version.txt
+-API information
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
--- a/.github/ISSUE_TEMPLATE/---installation-issue-.md
+++ b/.github/ISSUE_TEMPLATE/---installation-issue-.md
@ -0,0 +1,40 @@
+---
+name: 安装（Installation Issue）
+about: 您可以提问安装、编译出现报错等问题。 You could use this template for reporting an installation
+   issue.
+
+---
+
+为使您的问题得到快速解决，在建立Issue前，请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+建立issue时，为快速解决问题，请您根据使用情况给出如下信息：
+- 标题：请包含关键词“安装错误”/“编译错误”，例如“Mac编译错误”
+- 版本、环境信息：
+    1）PaddlePaddle版本：请提供您的PaddlePaddle版本号（如1.1）或CommitID
+    2）CPU：请提供CPU型号，MKL/OpenBlas/MKLDNN/等数学库的使用情况
+    3）GPU：请提供GPU型号，CUDA和CUDNN版本号
+    4）系统环境：请说明系统类型、版本（如Mac OS 10.14）、Python版本
+- 安装方式信息：
+1）pip安装/docker安装
+2）本地编译：请提供cmake命令，编译命令
+3）docker编译：请提供docker镜像，编译命令            
+  特殊环境请注明：如离线安装等
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 问题描述：请详细描述您的问题，同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in Github in case that there was a similar issue submitted or resolved before.
+If there is no solution,please make sure that this is an installation issue including the following details:
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/CUDNN version
+-OS Platform (eg. Mac OS 10.14)
+-Python version
+- Install method: pip install/install with docker/build from source(without docker)/build within docker
+- Other special cases that you think may be related to this problem, eg. offline install, special internet condition   
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
--- a/.github/ISSUE_TEMPLATE/---model-issue-.md
+++ b/.github/ISSUE_TEMPLATE/---model-issue-.md
@ -0,0 +1,36 @@
+---
+name: 模型（Model Issue）
+about: 您可以提问模型、算法、数据集方向的使用报错等问题。You could use this template for reporting a model/
+  algorithm/dataset  issue.
+
+---
+
+为使您的问题得到快速解决，在建立Issue前，请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+建立issue时，为快速解决问题，请您根据使用情况给出如下信息：
+- 标题：简洁、精准描述您的问题，例如“ssd 模型前置lstm报错  ”
+- 版本、环境信息：
+    1）PaddlePaddle版本：请提供PaddlePaddle版本号，例如1.1或CommitID
+    2）CPU：请提供CPU型号，MKL/OpenBlas/MKLDNN/等数学库的使用情况
+    3）GPU：请提供GPU型号，CUDA和CUDNN版本号
+    4）系统环境：请说明系统类型、版本（例如Mac OS 10.14），Python版本
+- 模型信息
+    1）模型名称 2）使用数据集名称 3）使用算法名称 4）模型链接
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 问题描述：请详细描述您的问题，同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in the github.Probably there was a similar issue submitted or resolved before.
+If there is no solution,please make sure that this is a issue of models including the following details:
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/CUDNN version
+-OS Platform (eg.Mac OS 10.14)
+-Python version
+-Name of Models&Dataset/details of operator
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
--- a/.github/ISSUE_TEMPLATE/---others-.md
+++ b/.github/ISSUE_TEMPLATE/---others-.md
@ -0,0 +1,33 @@
+---
+name: 其他（Others）
+about: 如上述分类未包含您的问题，可在此提出。 You could use this template for reporting other issues
+
+---
+
+为使您的问题得到快速解决，在建立Issues前，请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题，为快速解决您的提问，建立issue时请提供如下细节信息：
+- 标题：简洁、精准概括您的问题
+- 版本、环境信息：
+    1）PaddlePaddle版本：请提供您的PaddlePaddle版本号，例如1.1或CommitID
+    2）CPU/GPU：如果您使用GPU训练，请提供GPU驱动版本、CUDA和cuDNN版本号
+    3）系统环境：请您描述系统类型、版本，例如Mac OS 10.14
+    4）Python版本号
+    5）显存信息
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 问题描述：请详细描述您的问题，同步贴出报错信息、日志/代码关键片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
+If there is no solution,please provide us with the following details :
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/cuDNN version
+-OS Platform and Distribution(eg.Mac OS 10.14)
+-Python version 
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
--- a/.github/ISSUE_TEMPLATE/---training-issue-.md
+++ b/.github/ISSUE_TEMPLATE/---training-issue-.md
@ -0,0 +1,38 @@
+---
+name: 训练（Training issue）
+about: 您可以提问训练中报错、应用、出core等问题。 You could use this template for reporting an training
+   issue.
+
+---
+
+为使您的问题得到快速解决，在建立Issues前，请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】
+
+如果您没有查询到相似问题，为快速解决您的提问，建立issue时请提供如下细节信息：
+- 标题：简洁、精准概括您的问题，例如“Insufficient Memory xxx" ”
+- 版本、环境信息：
+    1）PaddlePaddle版本：请提供您的PaddlePaddle版本号，例如1.1或CommitID
+    2）CPU：预测若用CPU，请提供CPU型号，MKL/OpenBlas/MKLDNN/等数学库使用情况
+    3）GPU：预测若用GPU，请提供GPU型号、CUDA和CUDNN版本号
+    4）系统环境：请您描述系统类型、版本，例如Mac OS 10.14，Python版本
+- 训练信息
+    1）单机/多机，单卡/多卡
+    2）显存信息
+    3）Operator信息
+- 复现信息：如为报错，请给出复现环境、复现步骤
+- 问题描述：请详细描述您的问题，同步贴出报错信息、日志、可复现的代码片段
+
+Thank you for contributing to PaddlePaddle.
+Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before.
+If there is no solution,please make sure that this is a training issue including the following details:
+**System information**
+-PaddlePaddle version （eg.1.1）or CommitID
+-CPU: including CPUMKL/OpenBlas/MKLDNN version
+-GPU: including CUDA/CUDNN version
+-OS Platform (eg.Mac OS 10.14)
+-Other imformation: Distriuted training/informantion of operator/
+Graphics card storage
+**To Reproduce**
+Steps to reproduce the behavior
+**Describe your current behavior**
+**Code to reproduce the issue**
+**Other info / logs**
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -205,6 +205,7 @@ include(external/pybind11)  # download pybind11
 include(external/cares)
 include(external/cub)
 include(external/xxhash)    # download xxhash
+include(external/dlpack)
 include(external/snappy)    # download snappy
 include(external/snappystream) # download snappystream

--- a/cmake/external/dlpack.cmake
+++ b/cmake/external/dlpack.cmake
@ -0,0 +1,31 @@
+include(ExternalProject)
+
+set(DLPACK_SOURCE_DIR ${THIRD_PARTY_PATH}/dlpack)
+set(DLPACK_INCLUDE_DIR ${DLPACK_SOURCE_DIR}/src/extern_dlpack/include)
+
+include_directories(${DLPACK_INCLUDE_DIR})
+
+ExternalProject_Add(
+  extern_dlpack
+  ${EXTERNAL_PROJECT_LOG_ARGS}
+  GIT_REPOSITORY "https://github.com/dmlc/dlpack.git"
+  GIT_TAG        "v0.2"
+  PREFIX         ${DLPACK_SOURCE_DIR}
+  UPDATE_COMMAND ""
+  CONFIGURE_COMMAND ""
+  BUILD_COMMAND     ""
+  INSTALL_COMMAND   ""
+  TEST_COMMAND      ""
+)
+
+if(${CMAKE_VERSION} VERSION_LESS "3.3.0")
+  set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/dlpack_dummy.c)
+  file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";")
+  add_library(dlpack STATIC ${dummyfile})
+else()
+  add_library(dlpack INTERFACE)
+endif()
+
+add_dependencies(dlpack extern_dlpack)
+
+LIST(APPEND externl_project_dependencies dlpack)
--- a/paddle/fluid/framework/CMakeLists.txt
+++ b/paddle/fluid/framework/CMakeLists.txt
@ -192,3 +192,6 @@ cc_test(tuple_test SRCS tuple_test.cc )
 if (NOT WIN32)
 cc_test(rw_lock_test SRCS rw_lock_test.cc)
 endif (NOT WIN32)
+
+cc_library(dlpack_tensor SRCS dlpack_tensor.cc DEPS tensor dlpack)
+cc_test(dlpack_tensor_test SRCS dlpack_tensor_test.cc DEPS dlpack_tensor glog)
--- a/paddle/fluid/framework/dlpack_tensor.cc
+++ b/paddle/fluid/framework/dlpack_tensor.cc
@ -0,0 +1,127 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/framework/dlpack_tensor.h"
+
+namespace paddle {
+namespace framework {
+
+namespace internal {
+template <typename T>
+static ::DLDataType GetDLDataTypeCode() {
+  ::DLDataType dtype;
+  if (std::is_same<T, platform::float16>::value ||
+      std::is_floating_point<T>::value) {
+    dtype.code = kDLFloat;
+  } else if (std::is_unsigned<T>::value) {
+    dtype.code = kDLUInt;
+  } else if (std::is_integral<T>::value) {
+    dtype.code = kDLInt;
+  } else {
+    PADDLE_THROW("Unsupported data type %s", typeid(T).name());
+  }
+  dtype.bits = 8 * sizeof(T);
+  dtype.lanes = 1;
+  return dtype;
+}
+
+static DLDataType GetDLDataTypeFromTypeIndex(const std::type_index &type) {
+#define REG_DL_DATA_TYPE(type) \
+  { std::type_index(typeid(type)), GetDLDataTypeCode<type>() }
+  static const std::unordered_map<std::type_index, ::DLDataType>
+      type_to_dtype_map({
+          REG_DL_DATA_TYPE(platform::float16),  // NOLINT
+          REG_DL_DATA_TYPE(float),              // NOLINT
+          REG_DL_DATA_TYPE(double),             // NOLINT
+          REG_DL_DATA_TYPE(int),                // NOLINT
+          REG_DL_DATA_TYPE(int64_t),            // NOLINT
+          REG_DL_DATA_TYPE(bool),               // NOLINT
+          REG_DL_DATA_TYPE(size_t),             // NOLINT
+          REG_DL_DATA_TYPE(int16_t),            // NOLINT
+          REG_DL_DATA_TYPE(uint8_t),            // NOLINT
+          REG_DL_DATA_TYPE(int8_t)              // NOLINT
+      });
+  static auto type_to_dtype_map_end_it = type_to_dtype_map.end();
+  auto it = type_to_dtype_map.find(type);
+  PADDLE_ENFORCE(it != type_to_dtype_map_end_it, "Unsupported data type %s",
+                 type.name());
+  return it->second;
+#undef REG_DL_DATA_TYPE
+}
+
+struct DLContextVisitor : public boost::static_visitor<::DLContext> {
+  inline ::DLContext operator()(const platform::CPUPlace &place) const {
+    DLContext ctx;
+    ctx.device_type = kDLCPU;
+    ctx.device_id = 0;
+    return ctx;
+  }
+
+  inline ::DLContext operator()(const platform::CUDAPlace &place) const {
+#ifdef PADDLE_WITH_CUDA
+    DLContext ctx;
+    ctx.device_type = kDLGPU;
+    ctx.device_id = place.device;
+    return ctx;
+#else
+    PADDLE_THROW("platform::CUDAPlace is not supported in CPU only version");
+#endif
+  }
+
+  inline ::DLContext operator()(const platform::CUDAPinnedPlace &place) const {
+#ifdef PADDLE_WITH_CUDA
+    DLContext ctx;
+    ctx.device_type = kDLCPUPinned;
+    ctx.device_id = 0;
+    return ctx;
+#else
+    PADDLE_THROW(
+        "platform::CUDAPinnedPlace is not supported in CPU only version");
+#endif
+  }
+};
+}  // namespace internal
+
+DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) {
+  // init data, data buffer
+  t_.data = const_cast<void *>(tensor.data<void>());
+
+  // init ctx, DLContext type with device_type and device_id
+  auto place = tensor.place();
+  t_.ctx = boost::apply_visitor(internal::DLContextVisitor(), place);
+
+  // init dtype
+  t_.dtype = internal::GetDLDataTypeFromTypeIndex(tensor.type());
+  t_.dtype.lanes = lanes;
+
+  // init ndim, tensor rank
+  auto &dims = tensor.dims();
+  using DimType = decltype(t_.ndim);  // int
+  t_.ndim = static_cast<DimType>(dims.size());
+
+  // init shape, tensor dims
+  t_.shape = shape_;
+  for (DimType i = 0; i < t_.ndim; ++i) {
+    t_.shape[i] = dims[i];
+  }
+
+  // init strides, nullptr means the tensor is compact
+  t_.strides = nullptr;
+
+  // init byte_offset
+  t_.byte_offset = 0;
+}
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/dlpack_tensor.h
+++ b/paddle/fluid/framework/dlpack_tensor.h
@ -0,0 +1,45 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#pragma once
+
+#include <dlpack/dlpack.h>
+#include "paddle/fluid/framework/tensor.h"
+
+namespace paddle {
+namespace framework {
+
+class DLPackTensor {
+ public:
+  using LaneType = decltype(::DLTensor::dtype.lanes);  // uint16_t
+  using ShapeType =
+      std::remove_reference<decltype(::DLTensor::shape[0])>::type;  // int64_t
+
+  // lanes is only used in CPU to enable vectorization
+  explicit DLPackTensor(const Tensor& tensor, LaneType lanes = 1);
+
+  inline operator const ::DLTensor&() const { return t_; }
+
+  inline operator ::DLTensor&() { return t_; }
+
+ private:
+  ::DLTensor t_;
+
+  // The shape in DLTensor is defined as int64_t*
+  // Add this member to make TVMTensor init without heap allocation
+  ShapeType shape_[9];
+};
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/framework/dlpack_tensor_test.cc
+++ b/paddle/fluid/framework/dlpack_tensor_test.cc
@ -0,0 +1,113 @@
+// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "paddle/fluid/framework/dlpack_tensor.h"
+#include <glog/logging.h>
+#include <gtest/gtest.h>
+#include <vector>
+
+namespace paddle {
+namespace framework {
+
+namespace {  // NOLINT
+template <typename T>
+constexpr uint8_t GetDLDataTypeCode() {
+  return std::is_same<platform::float16, T>::value ||
+                 std::is_floating_point<T>::value
+             ? static_cast<uint8_t>(kDLFloat)
+             : (std::is_unsigned<T>::value
+                    ? static_cast<uint8_t>(kDLUInt)
+                    : (std::is_integral<T>::value ? static_cast<uint8_t>(kDLInt)
+                                                  : static_cast<uint8_t>(-1)));
+}
+}  // NOLINT
+
+template <typename T>
+void TestMain(const platform::Place &place, uint16_t lanes) {
+  DDim dims{4, 5, 6, 7};
+  Tensor tensor;
+  tensor.Resize(dims);
+  void *p = tensor.mutable_data<T>(place);
+
+  DLPackTensor dlpack_tensor(tensor, lanes);
+  ::DLTensor &dl_tensor = dlpack_tensor;
+
+  CHECK_EQ(p, dl_tensor.data);
+  if (platform::is_cpu_place(place)) {
+    CHECK_EQ(kDLCPU, dl_tensor.ctx.device_type);
+    CHECK_EQ(0, dl_tensor.ctx.device_id);
+  } else if (platform::is_gpu_place(place)) {
+    CHECK_EQ(kDLGPU, dl_tensor.ctx.device_type);
+    CHECK_EQ(boost::get<platform::CUDAPlace>(place).device,
+             dl_tensor.ctx.device_id);
+  } else if (platform::is_cuda_pinned_place(place)) {
+    CHECK_EQ(kDLCPUPinned, dl_tensor.ctx.device_type);
+    CHECK_EQ(0, dl_tensor.ctx.device_id);
+  } else {
+    CHECK_EQ(false, true);
+  }
+
+  CHECK_EQ(dims.size(), dl_tensor.ndim);
+  for (auto i = 0; i < dims.size(); ++i) {
+    CHECK_EQ(dims[i], dl_tensor.shape[i]);
+  }
+
+  CHECK_EQ(dl_tensor.strides == nullptr, true);
+  CHECK_EQ(static_cast<uint64_t>(0), dl_tensor.byte_offset);
+
+  CHECK_EQ(lanes, dl_tensor.dtype.lanes);
+  CHECK_EQ(sizeof(T) * 8, dl_tensor.dtype.bits);
+
+  CHECK_EQ(GetDLDataTypeCode<T>(), dl_tensor.dtype.code);
+}
+
+template <typename T>
+void TestMainLoop() {
+#ifdef PADDLE_WITH_CUDA
+  std::vector<platform::Place> places{platform::CPUPlace(),
+                                      platform::CUDAPlace(0),
+                                      platform::CUDAPinnedPlace()};
+  if (platform::GetCUDADeviceCount() > 1) {
+    places.emplace_back(platform::CUDAPlace(1));
+  }
+#else
+  std::vector<platform::Place> places{platform::CPUPlace()};
+#endif
+  std::vector<uint16_t> lanes{1, 2};
+  for (auto &p : places) {
+    for (auto &l : lanes) {
+      TestMain<T>(p, l);
+    }
+  }
+}
+
+#define PADDLE_DLPACK_TEST(type) \
+  TEST(dlpack, test_##type) { TestMainLoop<type>(); }
+
+using float16 = platform::float16;
+PADDLE_DLPACK_TEST(float16);
+PADDLE_DLPACK_TEST(float);
+PADDLE_DLPACK_TEST(double);
+PADDLE_DLPACK_TEST(int);
+PADDLE_DLPACK_TEST(int64_t);
+PADDLE_DLPACK_TEST(bool);
+PADDLE_DLPACK_TEST(size_t);
+PADDLE_DLPACK_TEST(int16_t);
+PADDLE_DLPACK_TEST(uint8_t);
+PADDLE_DLPACK_TEST(int8_t);
+
+#undef PADDLE_DLPACK_TEST
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/fluid/operators/math/blas_impl.cu.h
+++ b/paddle/fluid/operators/math/blas_impl.cu.h
--- a/paddle/fluid/platform/device_context.h
+++ b/paddle/fluid/platform/device_context.h
@ -143,6 +143,39 @@ class CudnnWorkspaceHandle {
  std::unique_ptr<std::lock_guard<std::mutex>> guard_;
 };

+#if CUDA_VERSION >= 9000
+class ScopedCublasMathMode {
+ public:
+  ScopedCublasMathMode(cublasHandle_t handle, cublasMath_t new_math_mode)
+      : handle_(handle) {
+    need_reset = false;
+    PADDLE_ENFORCE(
+        platform::dynload::cublasGetMathMode(handle_, &old_math_mode_),
+        "Failed to get old cublas math mode");
+    if (old_math_mode_ != new_math_mode) {
+      PADDLE_ENFORCE(
+          platform::dynload::cublasSetMathMode(handle_, new_math_mode),
+          "Failed to set old cublas math mode");
+      need_reset = true;
+    }
+  }
+
+  ~ScopedCublasMathMode() {
+    if (need_reset) {
+      PADDLE_ENFORCE(
+          platform::dynload::cublasSetMathMode(handle_, old_math_mode_),
+          "Failed to set old cublas math mode");
+    }
+  }
+
+ private:
+  cublasHandle_t handle_;
+  cublasMath_t old_math_mode_;
+  bool need_reset;
+};
+
+#endif
+
 class CUDADeviceContext : public DeviceContext {
 public:
  explicit CUDADeviceContext(CUDAPlace place);
@ -199,6 +232,18 @@ class CUDADeviceContext : public DeviceContext {
    callback_manager_->Wait();
  }

+#if CUDA_VERSION >= 9000
+  /*! \brief CublasCall may need to change cublas's config,
+   *  but the cublas may be hold by multi-thread, so we should
+   *  add lock here. */
+  template <typename Callback>
+  void CublasCall(Callback callback, cublasMath_t new_math) {
+    std::lock_guard<std::mutex> guard(cublas_mtx_);
+    ScopedCublasMathMode scoped_cublas_math(cublas_handle_, new_math);
+    callback();
+  }
+#endif
+
 private:
  CUDAPlace place_;

@ -220,6 +265,8 @@ class CUDADeviceContext : public DeviceContext {
  // If we use mtx_ for StreamCallbackManager, deadlock may occur sometimes
  mutable std::mutex callback_mtx_;
  std::unique_ptr<StreamCallbackManager> callback_manager_;
+
+  mutable std::mutex cublas_mtx_;
 };

 template <>
--- a/paddle/fluid/platform/dynload/cublas.h
+++ b/paddle/fluid/platform/dynload/cublas.h
@ -61,9 +61,6 @@ extern void *cublas_dso_handle;
  extern DynLoad__##__name __name
 #endif

-#define DECLARE_DYNAMIC_LOAD_CUBLAS_V2_WRAP(__name) \
-  DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(__name)
-
 #define CUBLAS_BLAS_ROUTINE_EACH(__macro) \
  __macro(cublasSaxpy_v2);                \
  __macro(cublasDaxpy_v2);                \
@ -93,22 +90,23 @@ CUBLAS_BLAS_ROUTINE_EACH(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP)

 // APIs available after CUDA 8.0
 #if CUDA_VERSION >= 8000
-#define CUBLAS_BLAS_ROUTINE_EACH_R2(__macro) \
-  __macro(cublasGemmEx);                     \
-  __macro(cublasSgemmStridedBatched);        \
-  __macro(cublasDgemmStridedBatched);        \
-  __macro(cublasCgemmStridedBatched);        \
-  __macro(cublasZgemmStridedBatched);        \
-  __macro(cublasHgemmStridedBatched);
-
-CUBLAS_BLAS_ROUTINE_EACH_R2(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP)
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmEx);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasSgemmStridedBatched);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasDgemmStridedBatched);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasCgemmStridedBatched);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasZgemmStridedBatched);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasHgemmStridedBatched);
 #endif

 // APIs available after CUDA 9.0
 #if CUDA_VERSION >= 9000
-#define CUBLAS_BLAS_ROUTINE_EACH_R3(__macro) __macro(cublasSetMathMode);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasSetMathMode);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGetMathMode);
+#endif

-CUBLAS_BLAS_ROUTINE_EACH_R3(DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP)
+#if CUDA_VERSION >= 9010
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmBatchedEx);
+DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP(cublasGemmStridedBatchedEx);
 #endif

 #undef DECLARE_DYNAMIC_LOAD_CUBLAS_WRAP
--- a/paddle/fluid/platform/gpu_info.cc
+++ b/paddle/fluid/platform/gpu_info.cc
@ -26,6 +26,16 @@ DEFINE_double(fraction_of_gpu_memory_to_use, 0.92,
              "additional trunks of the same size will be requested from gpu "
              "until the gpu has no memory left for another trunk.");

+DEFINE_bool(
+    enable_cublas_tensor_op_math, false,
+    "The enable_cublas_tensor_op_math indicate whether to use Tensor Core, "
+    "but it may loss precision. Currently, There are two CUDA libraries that"
+    " use Tensor Cores, cuBLAS and cuDNN. cuBLAS uses Tensor Cores to speed up"
+    " GEMM computations(the matrices must be either half precision or single "
+    "precision); cuDNN uses Tensor Cores to speed up both convolutions(the "
+    "input and output must be half precision) and recurrent neural networks "
+    "(RNNs).");
+
 namespace paddle {
 namespace platform {

@ -64,6 +74,16 @@ int GetCUDADriverVersion(int id) {
  return driver_version;
 }

+bool TensorCoreAvailable() {
+#if CUDA_VERSION >= 9000
+  int device = GetCurrentDeviceId();
+  int driver_version = GetCUDAComputeCapability(device);
+  return driver_version >= 70;
+#else
+  return false;
+#endif
+}
+
 int GetCUDAMultiProcessors(int id) {
  PADDLE_ENFORCE_LT(id, GetCUDADeviceCount(), "id must less than GPU count");
  int count;
--- a/paddle/fluid/platform/gpu_info.h
+++ b/paddle/fluid/platform/gpu_info.h
@ -35,6 +35,9 @@ int GetCUDARuntimeVersion(int id);
 //! Get the driver version of the ith GPU
 int GetCUDADriverVersion(int id);

+//! Wheter the current device support TensorCore
+bool TensorCoreAvailable();
+
 //! Get the MultiProcessors of the ith GPU.
 int GetCUDAMultiProcessors(int i);

--- a/python/paddle/fluid/init.py
+++ b/python/paddle/fluid/init.py
@ -133,7 +133,8 @@ def __bootstrap__():
    if core.is_compiled_with_cuda():
        read_env_flags += [
            'fraction_of_gpu_memory_to_use', 'cudnn_deterministic',
-            'conv_workspace_size_limit', 'cudnn_exhaustive_search'
+            'enable_cublas_tensor_op_math', 'conv_workspace_size_limit',
+            'cudnn_exhaustive_search'
        ]
    core.init_gflags([sys.argv[0]] +
                     ["--tryfromenv=" + ",".join(read_env_flags)])
--- a/python/paddle/fluid/contrib/utils/init.py
+++ b/python/paddle/fluid/contrib/utils/init.py
@ -0,0 +1,20 @@
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+
+from . import hdfs_utils
+from .hdfs_utils import *
+
+__all__ = hdfs_utils.__all__
--- a/python/paddle/fluid/contrib/utils/hdfs_utils.py
+++ b/python/paddle/fluid/contrib/utils/hdfs_utils.py