commit 36cd18b549
@@ -0,0 +1,31 @@
include(ExternalProject)

set(DLPACK_SOURCE_DIR ${THIRD_PARTY_PATH}/dlpack)
set(DLPACK_INCLUDE_DIR ${DLPACK_SOURCE_DIR}/src/extern_dlpack/include)

include_directories(${DLPACK_INCLUDE_DIR})

ExternalProject_Add(
  extern_dlpack
  ${EXTERNAL_PROJECT_LOG_ARGS}
  GIT_REPOSITORY    "https://github.com/dmlc/dlpack.git"
  GIT_TAG           "v0.2"
  PREFIX            ${DLPACK_SOURCE_DIR}
  UPDATE_COMMAND    ""
  CONFIGURE_COMMAND ""
  BUILD_COMMAND     ""
  INSTALL_COMMAND   ""
  TEST_COMMAND      ""
)

if(${CMAKE_VERSION} VERSION_LESS "3.3.0")
  set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/dlpack_dummy.c)
  file(WRITE ${dummyfile} "const char *dummy = \"${dummyfile}\";")
  add_library(dlpack STATIC ${dummyfile})
else()
  add_library(dlpack INTERFACE)
endif()

add_dependencies(dlpack extern_dlpack)

LIST(APPEND externl_project_dependencies dlpack)
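For context, the "v0.2" tag fetched here is the header-only dlpack/dlpack.h that the new DLPackTensor below populates. A minimal standalone sketch of filling that struct by hand for a 2x3 float buffer on CPU (field names as in dlpack v0.2; later dlpack releases renamed ctx and kDLGPU):

#include <dlpack/dlpack.h>
#include <cstdint>

int main() {
  float data[6] = {0};          // 2 x 3 row-major buffer
  int64_t shape[2] = {2, 3};

  DLTensor t;
  t.data = data;
  t.ctx = {kDLCPU, 0};          // device_type, device_id
  t.ndim = 2;
  t.dtype = {kDLFloat, 32, 1};  // code, bits, lanes
  t.shape = shape;
  t.strides = nullptr;          // nullptr => compact row-major layout
  t.byte_offset = 0;
  return 0;
}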
@@ -0,0 +1,44 @@
if (NOT WITH_AMD_GPU)
  return()
endif()

# rocprim is "ROCm Parallel Primitives" for short.
# It is a header-only library providing HIP and HC parallel primitives
# for developing performant GPU-accelerated code on AMD ROCm platform.

if("x${HCC_HOME}" STREQUAL "x")
  set(HCC_HOME "/opt/rocm/hcc")
endif()

INCLUDE(ExternalProject)

SET(ROCPRIM_SOURCE_DIR ${THIRD_PARTY_PATH}/rocprim)
SET(ROCPRIM_INSTALL_DIR ${THIRD_PARTY_PATH}/install/rocprim)
SET(ROCPRIM_INCLUDE_DIR ${ROCPRIM_INSTALL_DIR}/include)

ExternalProject_Add(
  extern_rocprim
  GIT_REPOSITORY "https://github.com/ROCmSoftwarePlatform/rocPRIM.git"
  GIT_TAG        5bd41b96ab8d8343330fb2c3e1b96775bde3b3fc
  PREFIX         ${ROCPRIM_SOURCE_DIR}
  UPDATE_COMMAND ""
  CMAKE_ARGS     -DCMAKE_CXX_COMPILER=${HCC_HOME}/bin/hcc
  CMAKE_ARGS     -DONLY_INSTALL=ON
  CMAKE_ARGS     -DBUILD_TEST=OFF
  CMAKE_ARGS     -DCMAKE_INSTALL_PREFIX=${ROCPRIM_INSTALL_DIR}

  INSTALL_DIR    ${ROCPRIM_INSTALL_DIR}
  ${EXTERNAL_PROJECT_LOG_ARGS}
)

INCLUDE_DIRECTORIES(${ROCPRIM_INCLUDE_DIR})

if (${CMAKE_VERSION} VERSION_LESS "3.3.0")
  set(dummyfile ${CMAKE_CURRENT_BINARY_DIR}/rocprim_dummy.c)
  file(WRITE ${dummyfile} "const char *dummy_rocprim = \"${dummyfile}\";")
  add_library(rocprim STATIC ${dummyfile})
else()
  add_library(rocprim INTERFACE)
endif()

add_dependencies(rocprim extern_rocprim)
@@ -0,0 +1,127 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "paddle/fluid/framework/dlpack_tensor.h"

namespace paddle {
namespace framework {

namespace internal {
template <typename T>
static ::DLDataType GetDLDataTypeCode() {
  ::DLDataType dtype;
  if (std::is_same<T, platform::float16>::value ||
      std::is_floating_point<T>::value) {
    dtype.code = kDLFloat;
  } else if (std::is_unsigned<T>::value) {
    dtype.code = kDLUInt;
  } else if (std::is_integral<T>::value) {
    dtype.code = kDLInt;
  } else {
    PADDLE_THROW("Unsupported data type %s", typeid(T).name());
  }
  dtype.bits = 8 * sizeof(T);
  dtype.lanes = 1;
  return dtype;
}

static DLDataType GetDLDataTypeFromTypeIndex(const std::type_index &type) {
#define REG_DL_DATA_TYPE(type) \
  { std::type_index(typeid(type)), GetDLDataTypeCode<type>() }
  static const std::unordered_map<std::type_index, ::DLDataType>
      type_to_dtype_map({
          REG_DL_DATA_TYPE(platform::float16),  // NOLINT
          REG_DL_DATA_TYPE(float),              // NOLINT
          REG_DL_DATA_TYPE(double),             // NOLINT
          REG_DL_DATA_TYPE(int),                // NOLINT
          REG_DL_DATA_TYPE(int64_t),            // NOLINT
          REG_DL_DATA_TYPE(bool),               // NOLINT
          REG_DL_DATA_TYPE(size_t),             // NOLINT
          REG_DL_DATA_TYPE(int16_t),            // NOLINT
          REG_DL_DATA_TYPE(uint8_t),            // NOLINT
          REG_DL_DATA_TYPE(int8_t)              // NOLINT
      });
  static auto type_to_dtype_map_end_it = type_to_dtype_map.end();
  auto it = type_to_dtype_map.find(type);
  PADDLE_ENFORCE(it != type_to_dtype_map_end_it, "Unsupported data type %s",
                 type.name());
  return it->second;
#undef REG_DL_DATA_TYPE
}

struct DLContextVisitor : public boost::static_visitor<::DLContext> {
  inline ::DLContext operator()(const platform::CPUPlace &place) const {
    DLContext ctx;
    ctx.device_type = kDLCPU;
    ctx.device_id = 0;
    return ctx;
  }

  inline ::DLContext operator()(const platform::CUDAPlace &place) const {
#ifdef PADDLE_WITH_CUDA
    DLContext ctx;
    ctx.device_type = kDLGPU;
    ctx.device_id = place.device;
    return ctx;
#else
    PADDLE_THROW("platform::CUDAPlace is not supported in CPU only version");
#endif
  }

  inline ::DLContext operator()(const platform::CUDAPinnedPlace &place) const {
#ifdef PADDLE_WITH_CUDA
    DLContext ctx;
    ctx.device_type = kDLCPUPinned;
    ctx.device_id = 0;
    return ctx;
#else
    PADDLE_THROW(
        "platform::CUDAPinnedPlace is not supported in CPU only version");
#endif
  }
};
}  // namespace internal

DLPackTensor::DLPackTensor(const Tensor &tensor, LaneType lanes) {
  // init data, data buffer
  t_.data = const_cast<void *>(tensor.data<void>());

  // init ctx, DLContext type with device_type and device_id
  auto place = tensor.place();
  t_.ctx = boost::apply_visitor(internal::DLContextVisitor(), place);

  // init dtype
  t_.dtype = internal::GetDLDataTypeFromTypeIndex(tensor.type());
  t_.dtype.lanes = lanes;

  // init ndim, tensor rank
  auto &dims = tensor.dims();
  using DimType = decltype(t_.ndim);  // int
  t_.ndim = static_cast<DimType>(dims.size());

  // init shape, tensor dims
  t_.shape = shape_;
  for (DimType i = 0; i < t_.ndim; ++i) {
    t_.shape[i] = dims[i];
  }

  // init strides, nullptr means the tensor is compact
  t_.strides = nullptr;

  // init byte_offset
  t_.byte_offset = 0;
}

}  // namespace framework
}  // namespace paddle
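The {code, bits, lanes} triple assembled above fully describes a DLPack element type: float maps to {kDLFloat, 32, 1}, uint8_t to {kDLUInt, 8, 1}, int64_t to {kDLInt, 64, 1}. A standalone sketch of the same encoding, assuming only the dlpack v0.2 header and none of the Paddle machinery:

#include <dlpack/dlpack.h>
#include <cassert>
#include <cstdint>

template <typename T>
DLDataType MakeDType(uint8_t code) {
  DLDataType dtype;
  dtype.code = code;
  dtype.bits = 8 * sizeof(T);  // element width in bits
  dtype.lanes = 1;             // scalar; >1 marks a vectorized element
  return dtype;
}

int main() {
  assert(MakeDType<float>(kDLFloat).bits == 32);
  assert(MakeDType<std::uint8_t>(kDLUInt).bits == 8);
  assert(MakeDType<std::int64_t>(kDLInt).bits == 64);
  return 0;
}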
@@ -0,0 +1,45 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <dlpack/dlpack.h>
#include "paddle/fluid/framework/tensor.h"

namespace paddle {
namespace framework {

class DLPackTensor {
 public:
  using LaneType = decltype(::DLTensor::dtype.lanes);  // uint16_t
  using ShapeType =
      std::remove_reference<decltype(::DLTensor::shape[0])>::type;  // int64_t

  // lanes is only used in CPU to enable vectorization
  explicit DLPackTensor(const Tensor& tensor, LaneType lanes = 1);

  inline operator const ::DLTensor&() const { return t_; }

  inline operator ::DLTensor&() { return t_; }

 private:
  ::DLTensor t_;

  // The shape in DLTensor is defined as int64_t*
  // Add this member to make TVMTensor init without heap allocation
  ShapeType shape_[9];
};

}  // namespace framework
}  // namespace paddle
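To illustrate the intended call pattern, here is a hedged usage sketch: a fluid Tensor is wrapped and handed to any DLPack consumer through the implicit ::DLTensor& conversion declared above. ConsumeDLTensor is a hypothetical third-party entry point, not part of this patch, and the sketch assumes the fluid headers added in this diff:

#include "paddle/fluid/framework/dlpack_tensor.h"
#include "paddle/fluid/platform/place.h"

void ConsumeDLTensor(const ::DLTensor& t);  // hypothetical DLPack consumer

void Demo() {
  paddle::framework::Tensor tensor;
  paddle::framework::DDim dims{2, 3};
  tensor.Resize(dims);
  tensor.mutable_data<float>(paddle::platform::CPUPlace());

  // lanes defaults to 1; pass >1 only for vectorized CPU element types
  paddle::framework::DLPackTensor dl(tensor);
  ConsumeDLTensor(dl);  // goes through operator const ::DLTensor&
}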
@@ -0,0 +1,113 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "paddle/fluid/framework/dlpack_tensor.h"
#include <glog/logging.h>
#include <gtest/gtest.h>
#include <vector>

namespace paddle {
namespace framework {

namespace {  // NOLINT
template <typename T>
constexpr uint8_t GetDLDataTypeCode() {
  return std::is_same<platform::float16, T>::value ||
                 std::is_floating_point<T>::value
             ? static_cast<uint8_t>(kDLFloat)
             : (std::is_unsigned<T>::value
                    ? static_cast<uint8_t>(kDLUInt)
                    : (std::is_integral<T>::value ? static_cast<uint8_t>(kDLInt)
                                                  : static_cast<uint8_t>(-1)));
}
}  // NOLINT

template <typename T>
void TestMain(const platform::Place &place, uint16_t lanes) {
  DDim dims{4, 5, 6, 7};
  Tensor tensor;
  tensor.Resize(dims);
  void *p = tensor.mutable_data<T>(place);

  DLPackTensor dlpack_tensor(tensor, lanes);
  ::DLTensor &dl_tensor = dlpack_tensor;

  CHECK_EQ(p, dl_tensor.data);
  if (platform::is_cpu_place(place)) {
    CHECK_EQ(kDLCPU, dl_tensor.ctx.device_type);
    CHECK_EQ(0, dl_tensor.ctx.device_id);
  } else if (platform::is_gpu_place(place)) {
    CHECK_EQ(kDLGPU, dl_tensor.ctx.device_type);
    CHECK_EQ(boost::get<platform::CUDAPlace>(place).device,
             dl_tensor.ctx.device_id);
  } else if (platform::is_cuda_pinned_place(place)) {
    CHECK_EQ(kDLCPUPinned, dl_tensor.ctx.device_type);
    CHECK_EQ(0, dl_tensor.ctx.device_id);
  } else {
    CHECK_EQ(false, true);
  }

  CHECK_EQ(dims.size(), dl_tensor.ndim);
  for (auto i = 0; i < dims.size(); ++i) {
    CHECK_EQ(dims[i], dl_tensor.shape[i]);
  }

  CHECK_EQ(dl_tensor.strides == nullptr, true);
  CHECK_EQ(static_cast<uint64_t>(0), dl_tensor.byte_offset);

  CHECK_EQ(lanes, dl_tensor.dtype.lanes);
  CHECK_EQ(sizeof(T) * 8, dl_tensor.dtype.bits);

  CHECK_EQ(GetDLDataTypeCode<T>(), dl_tensor.dtype.code);
}

template <typename T>
void TestMainLoop() {
#ifdef PADDLE_WITH_CUDA
  std::vector<platform::Place> places{platform::CPUPlace(),
                                      platform::CUDAPlace(0),
                                      platform::CUDAPinnedPlace()};
  if (platform::GetCUDADeviceCount() > 1) {
    places.emplace_back(platform::CUDAPlace(1));
  }
#else
  std::vector<platform::Place> places{platform::CPUPlace()};
#endif
  std::vector<uint16_t> lanes{1, 2};
  for (auto &p : places) {
    for (auto &l : lanes) {
      TestMain<T>(p, l);
    }
  }
}

#define PADDLE_DLPACK_TEST(type) \
  TEST(dlpack, test_##type) { TestMainLoop<type>(); }

using float16 = platform::float16;
PADDLE_DLPACK_TEST(float16);
PADDLE_DLPACK_TEST(float);
PADDLE_DLPACK_TEST(double);
PADDLE_DLPACK_TEST(int);
PADDLE_DLPACK_TEST(int64_t);
PADDLE_DLPACK_TEST(bool);
PADDLE_DLPACK_TEST(size_t);
PADDLE_DLPACK_TEST(int16_t);
PADDLE_DLPACK_TEST(uint8_t);
PADDLE_DLPACK_TEST(int8_t);

#undef PADDLE_DLPACK_TEST

}  // namespace framework
}  // namespace paddle
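Each PADDLE_DLPACK_TEST(type) line above stamps out an ordinary gtest case; PADDLE_DLPACK_TEST(float), for example, expands to:

TEST(dlpack, test_float) { TestMainLoop<float>(); }

This is also why "using float16 = platform::float16;" precedes the list: the test_##type paste requires the element type to be a single identifier.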
@@ -0,0 +1,72 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "paddle/fluid/framework/transfer_scope_cache.h"

namespace paddle {
namespace framework {

std::unordered_map<size_t, Scope*>& global_transfer_data_cache() {
  thread_local auto* x = new std::unordered_map<size_t, Scope*>;
  return *x;
}

std::unordered_set<Scope*>& global_transfer_scope_cache() {
  thread_local auto* x = new std::unordered_set<Scope*>;
  return *x;
}

Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1,
                              const Scope* scope) {
  Scope* new_scope{nullptr};
  size_t infer_cache_key =
      CombineHash(OpKernelType::Hash()(type0), OpKernelType::Hash()(type1));
  infer_cache_key =
      CombineHash(infer_cache_key, std::hash<const Scope*>()(scope));

  auto it = global_transfer_data_cache().find(infer_cache_key);
  if (it != global_transfer_data_cache().end()) {
    new_scope = global_transfer_data_cache()[infer_cache_key];
  } else {
    new_scope = &scope->NewScope();
    global_transfer_data_cache()[infer_cache_key] = new_scope;
  }
  global_transfer_scope_cache().insert(new_scope);
  return new_scope;
}

void RemoveKidsFromTransferScopeCache(Scope* scope) {
  auto it = global_transfer_scope_cache().find(scope);
  if (it != global_transfer_scope_cache().end()) {
    global_transfer_scope_cache().erase(it);
  }
  for (auto* s : scope->kids()) {
    auto it = global_transfer_scope_cache().find(s);
    if (it != global_transfer_scope_cache().end()) {
      global_transfer_scope_cache().erase(it);
    }
  }

  // remove global transfer data cache
  auto& cache = global_transfer_data_cache();
  for (auto it = cache.begin(); it != cache.end();) {
    if (it->second == scope)
      it = cache.erase(it);
    else
      it++;
  }
}

}  // namespace framework
}  // namespace paddle
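CombineHash is declared in transfer_scope_cache.h, which this diff does not touch. As a point of reference only, a boost-style hash_combine mixer is the usual shape of such a helper; the version below is an assumption, not the Paddle implementation:

#include <cstddef>

// Assumed sketch: mixes two size_t hashes. 0x9e3779b9 is the 32-bit
// golden-ratio constant popularized by boost::hash_combine.
inline std::size_t CombineHashSketch(std::size_t seed, std::size_t h) {
  return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}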