Merge branch 'develop' into cmake_protobuf

8 years ago · c27e71e2fd
parent e678546388 d10f6cfbed
commit c27e71e2fd
49 changed files with 432 additions and 247 deletions
--- a/RELEASE.md
+++ b/RELEASE.md
@@ -1,11 +1,103 @@
 # Release v0.10.0

+We are glad to release version 0.10.0, which introduces the new
+[Python API](http://research.baidu.com/paddlepaddles-new-api-simplifies-deep-learning-programs/).
+
+- Our old Python API is out of date.  It's hard to learn and hard to
+  use.  To write a PaddlePaddle program using the old API, we'd have to write
+  at least two Python files: one `data provider` and another one that defines
+  the network topology.  Users start a PaddlePaddle job by running the
+  `paddle_trainer` C++ program, which calls the Python interpreter to run the
+  network topology configuration script and then starts the training loop,
+  which iteratively calls the data provider function to load minibatches.
+  This prevents us from writing a Python program in a modern way, e.g., in the
+  Jupyter Notebook.
+  
+- The new API, which we often refer to as the *v2 API*, allows us to write
+  much shorter Python programs to define the network and the data in a single
+  .py file.  Also, this program can run in Jupyter Notebook, since the entry
+  point is in the Python program and PaddlePaddle runs as a shared library loaded
+  and invoked by this Python program.
+  
+Based on the new API, we delivered an online interactive
+book, [Deep Learning 101](http://book.paddlepaddle.org/index.en.html)
+and [its Chinese version](http://book.paddlepaddle.org/).
+
+We also worked on updating our online documentation to describe the new API.
+This work is ongoing.  We will release more documentation improvements
+in the next version.
+
+We also worked on bringing the new API to distributed model training (via MPI and
+Kubernetes).  This work is ongoing. We will release more about it in the next
+version.
+
 ## New Features

+* Release the [new Python API](http://research.baidu.com/paddlepaddles-new-api-simplifies-deep-learning-programs/).
+* Deep Learning 101 book in [English](http://book.paddlepaddle.org/index.en.html) and [Chinese](http://book.paddlepaddle.org/).
+* Support rectangle input for CNN.
+* Support stride pooling for seqlastin and seqfirstin.
+* Expose `seq_concat_layer/seq_reshape_layer` in `trainer_config_helpers`.
+* Add dataset package: CIFAR, MNIST, IMDB, WMT14, CONLL05, movielens, imikolov.
+* Add Priorbox layer for Single Shot Multibox Detection. 
+* Add smooth L1 cost.
+* Add data reader creator and data reader decorator for v2 API.
+* Add the CPU implementation of cmrnorm projection.
+
 ## Improvements

+* Support Python virtualenv for `paddle_trainer`.
+* Add pre-commit hooks, used to automatically format our code.
+* Upgrade protobuf to version 3.x.
+* Add an option to check data type in Python data provider.
+* Speed up the backward pass of the average layer on GPU.
+* Documentation refinement.
+* Check dead links in documents using Travis-CI.
+* Add an example for explaining `sparse_vector`.
+* Add ReLU in layer_math.py
+* Simplify data processing flow for Quick Start.
+* Support CUDNN Deconv.
+* Add data feeder in v2 API.
+* Support predicting the samples from sys.stdin for sentiment demo.
+* Provide multi-process interface for image preprocessing.
+* Add benchmark document for v1 API.
+* Add ReLU in `layer_math.py`.
+* Add packages for automatically downloading public datasets.
+* Rename `Argument::sumCost` to `Argument::sum` since class `Argument` has nothing to do with cost.
+* Expose `Argument::sum` to Python.
+* Add a new `TensorExpression` implementation for matrix-related expression evaluations.
+* Add lazy assignment for optimizing the calculation of a batch of multiple expressions.
+* Add abstract class `Function` and its implementations:
+  * `PadFunc` and `PadGradFunc`.
+  * `ContextProjectionForwardFunc` and `ContextProjectionBackwardFunc`.
+  * `CosSimBackward` and `CosSimBackwardFunc`.
+  * `CrossMapNormalFunc` and `CrossMapNormalGradFunc`.
+  * `MulFunc`.
+* Add classes `AutoCompare` and `FunctionCompare`, which make it easier to write unit tests that compare the GPU and CPU versions of a function.
+* Generate `libpaddle_test_main.a` and remove the main function inside the test file.
+* Support dense numpy vector in PyDataProvider2.
+* Clean code base, remove some copy-n-pasted code snippets:
+  * Extract `RowBuffer` class for `SparseRowMatrix`.
+  * Clean the interface of `GradientMachine`.
+  * Use `override` keyword in layer.
+  * Simplify `Evaluator::create`, use `ClassRegister` to create `Evaluator`s.
+* Check MD5 checksum when downloading demo's dataset.
+* Add `paddle::Error`, which is intended to replace `LOG(FATAL)` in Paddle.
+
 ## Bug Fixes

+* Check layer input types for `recurrent_group`.
+* Don't run `clang-format` on .cu source files.
+* Fix bugs with `LogActivation`.
+* Fix the bug that runs `test_layerHelpers` multiple times.
+* Fix the bug that the seq2seq demo exceeds protobuf message size limit.
+* Fix the bug in dataprovider converter in GPU mode.
+* Fix a bug in `GatedRecurrentLayer`.
+* Fix a bug in `BatchNorm` when testing more than one model.
+* Fix broken unit test of paramRelu.
+* Fix some compile-time warnings about `CpuSparseMatrix`.
+* Fix `MultiGradientMachine` error when `trainer_count > batch_size`.
+* Fix bugs that prevent asynchronous data loading in `PyDataProvider2`.

 # Release v0.9.0
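
Note on the v2 data reader entry above: the release notes name a "data reader creator" and a "data reader decorator" without showing their shape. A minimal sketch of the pattern in plain Python follows, assuming a reader is a zero-argument function that returns an iterable of samples. The names `range_reader_creator`, `shuffle`, and `batch` are hypothetical stand-ins written for illustration, not PaddlePaddle's implementation.

```python
import random


def range_reader_creator(n):
    """Hypothetical reader creator: builds a reader that yields 0..n-1."""
    def reader():
        for i in range(n):
            yield i
    return reader


def shuffle(reader, buf_size, seed=None):
    """Hypothetical reader decorator: shuffles samples within fixed-size buffers."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                for s in buf:
                    yield s
                buf = []
        if buf:  # flush the final, possibly smaller buffer
            rng.shuffle(buf)
            for s in buf:
                yield s
    return shuffled_reader


def batch(reader, batch_size):
    """Hypothetical reader decorator: groups samples into lists of batch_size."""
    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # emit the trailing partial batch
            yield current
    return batch_reader


# Decorators compose because each one takes a reader and returns a reader.
train_reader = batch(shuffle(range_reader_creator(10), buf_size=4, seed=0), batch_size=3)
for minibatch in train_reader():
    print(minibatch)
```

Because a reader is a function rather than a one-shot iterator, calling `train_reader()` again starts a fresh pass over the data, which suits a multi-pass training loop.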

--- a/cmake/cblas.cmake
+++ b/cmake/cblas.cmake
@@ -44,7 +44,6 @@ if(MKL_INC_DIR AND MKL_CORE_LIB AND MKL_SEQUENTIAL_LIB AND MKL_INTEL_LP64)
  message(STATUS "Found MKL (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
  set(CBLAS_FOUND ON)
  if(${MKL_LAPACK_INC_DIR})
-    add_definitions(-DPADDLE_USE_LAPACK)
    message(STATUS "Found lapack in MKL (include: ${MKL_LAPACK_INC_DIR})")
  endif()
  return() # return file.
@@ -80,7 +79,6 @@ if(ATLAS_INC_DIR AND ATLAS_CBLAS_LIB AND ATLAS_LIB AND NOT CBLAS_FOUND)
  message(STATUS "Found ATLAS (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
  set(CBLAS_FOUND ON)
  if(ATLAS_CLAPACK_INC_DIR)
-    add_definitions(-DPADDLE_USE_LAPACK)
    set(CBLAS_INC_DIR ${CBLAS_INC_DIR} ${ATLAS_CLAPACK_INC_DIR})
    message(STATUS "Found lapack in ATLAS (include: ${ATLAS_CLAPACK_INC_DIR})")
  endif()
@@ -115,7 +113,6 @@ if(OPENBLAS_INC_DIR AND OPENBLAS_LIB)
  message(STATUS "Found OpenBLAS (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
  set(CBLAS_FOUND ON)
  if(OPENBLAS_LAPACKE_INC_DIR)
-    add_definitions(-DPADDLE_USE_LAPACK)
    message(STATUS "Found lapack in OpenBLAS (include: ${OPENBLAS_LAPACKE_INC_DIR})")
  endif()
  return()
--- a/cmake/external/openblas.cmake
+++ b/cmake/external/openblas.cmake
@@ -24,45 +24,17 @@ IF(NOT ${CBLAS_FOUND})
    SET(CBLAS_LIBRARIES "${CBLAS_INSTALL_DIR}/lib/${LIBRARY_PREFIX}openblas${STATIC_LIBRARY_SUFFIX}"
        CACHE FILEPATH "openblas library." FORCE)

-    # check fortran compiler and library
+    SET(COMMON_ARGS CC=${CMAKE_C_COMPILER} NO_LAPACK=1 NO_SHARED=1)
+
    IF(ANDROID)
        SET(OPENBLAS_COMMIT "b5c96fcfcdc82945502a2303116a64d89985daf5")
-        SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 ARM_SOFTFP_ABI=1 NOFORTRAN=1 USE_THREAD=0 libs)
+        SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 ARM_SOFTFP_ABI=1 USE_THREAD=0 libs)
    ELSEIF(RPI)
        SET(OPENBLAS_COMMIT "v0.2.19")
-        SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 NOFORTRAN=1 USE_THREAD=0 libs)
+        SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 USE_THREAD=0 libs)
    ELSE()
-        IF(CMAKE_COMPILER_IS_GNUCC)
-            ENABLE_LANGUAGE(Fortran)
-            if (NOT CMAKE_Fortran_COMPILER_VERSION)
-              # cmake < 3.4 cannot get CMAKE_Fortran_COMPILER_VERSION directly.
-              execute_process(COMMAND ${CMAKE_Fortran_COMPILER} -dumpversion
-                        OUTPUT_VARIABLE CMAKE_Fortran_COMPILER_VERSION)
-            endif()
-            string(REGEX MATCHALL "[0-9]+" Fortran_VERSION ${CMAKE_Fortran_COMPILER_VERSION})
-            list(GET Fortran_VERSION 0 Fortran_MAJOR)
-            list(GET Fortran_VERSION 1 Fortran_MINOR)
-            find_library(GFORTRAN_LIBRARY NAMES gfortran PATHS
-                         /lib
-                         /usr/lib
-                         /usr/lib/gcc/x86_64-linux-gnu/${Fortran_MAJOR}.${Fortran_MINOR}/
-                         /usr/lib/gcc/x86_64-linux-gnu/${Fortran_MAJOR}/)
-            if (NOT GFORTRAN_LIBRARY)
-                message(FATAL_ERROR "Cannot found gfortran library which it is used by openblas")
-            endif()
-            find_package(Threads REQUIRED)
-            LIST(APPEND CBLAS_LIBRARIES ${GFORTRAN_LIBRARY} ${CMAKE_THREAD_LIBS_INIT})
-        ENDIF(CMAKE_COMPILER_IS_GNUCC)
-
-        IF(NOT CMAKE_Fortran_COMPILER)
-            MESSAGE(FATAL_ERROR "To build lapack in libopenblas, "
-                    "you need to set gfortran compiler: cmake .. -DCMAKE_Fortran_COMPILER=...")
-        ENDIF(NOT CMAKE_Fortran_COMPILER)
-
-        ADD_DEFINITIONS(-DPADDLE_USE_LAPACK)
-
        SET(OPENBLAS_COMMIT "v0.2.19")
-        SET(OPENBLAS_ARGS FC=${CMAKE_Fortran_COMPILER} DYNAMIC_ARCH=1 libs netlib)
+        SET(OPENBLAS_ARGS DYNAMIC_ARCH=1 libs)
    ENDIF()

    ExternalProject_Add(
@@ -73,7 +45,7 @@ IF(NOT ${CBLAS_FOUND})
        PREFIX              ${CBLAS_SOURCES_DIR}
        INSTALL_DIR         ${CBLAS_INSTALL_DIR}
        BUILD_IN_SOURCE     1
-        BUILD_COMMAND       ${CMAKE_MAKE_PROGRAM} CC=${CMAKE_C_COMPILER} NO_SHARED=1 ${OPTIONAL_ARGS}
+        BUILD_COMMAND       ${CMAKE_MAKE_PROGRAM} ${COMMON_ARGS} ${OPTIONAL_ARGS}
        INSTALL_COMMAND     ${CMAKE_MAKE_PROGRAM} install NO_SHARED=1 PREFIX=<INSTALL_DIR>
        UPDATE_COMMAND      ""
        CONFIGURE_COMMAND   ""
--- a/cmake/package.cmake
+++ b/cmake/package.cmake
@@ -1,5 +1,4 @@
 set(CPACK_PACKAGE_NAME paddle)
-set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "")
 set(CPACK_PACKAGE_VERSION_MAJOR ${PADDLE_MAJOR_VERSION})
 set(CPACK_PACKAGE_VERSION_MINOR ${PADDLE_MINOR_VERSION})
 set(CPACK_PACKAGE_VERSION_PATCH ${PADDLE_PATCH_VERSION})
@@ -10,8 +9,9 @@ set(CPACK_DEBIAN_PACKAGE_ARCHITECTURE amd64)
 set(CPACK_DEBIAN_PACKAGE_MAINTAINER PaddlePaddle Dev <paddle-dev@baidu.com>)
 set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "Paddle")
 set(CPACK_PACKAGE_DESCRIPTION "")
-set(CPACK_DEBIAN_PACKAGE_DEPENDS "libatlas3-base, libgflags2, libgoogle-glog0, libprotobuf8, libpython2.7, libstdc++6, python-numpy, python-pip, python-pip-whl, python-protobuf")
+set(CPACK_DEBIAN_PACKAGE_DEPENDS "libpython2.7-dev, libstdc++6, python-pip, curl, libgfortran3, python-pip-whl")
 set(CPACK_DEBIAN_PACKAGE_SECTION Devel)
+set(CPACK_DEBIAN_PACKAGE_VERSION ${PADDLE_VERSION})
 set(CPACK_DEBIAN_PACKAGE_CONTROL_EXTRA "${PROJ_ROOT}/paddle/scripts/deb/postinst")
 #set(CPACK_GENERATOR "DEB")
 # Start cpack
--- a/demo/sentiment/trainer_config.py
+++ b/demo/sentiment/trainer_config.py
@@ -29,7 +29,7 @@ settings(
    batch_size=128,
    learning_rate=2e-3,
    learning_method=AdamOptimizer(),
-    average_window=0.5,
+    model_average=ModelAverage(0.5),
    regularization=L2Regularization(8e-4),
    gradient_clipping_threshold=25)

--- a/demo/seqToseq/seqToseq_net.py
+++ b/demo/seqToseq/seqToseq_net.py
@@ -69,7 +69,8 @@ def gru_encoder_decoder(data_conf,
                        encoder_size=512,
                        decoder_size=512,
                        beam_size=3,
-                        max_length=250):
+                        max_length=250,
+                        error_clipping=50):
    """
    A wrapper for an attention version of GRU Encoder-Decoder network
    is_generating: whether this config is used for generating
@@ -90,9 +91,19 @@ def gru_encoder_decoder(data_conf,
        input=src_word_id,
        size=word_vector_dim,
        param_attr=ParamAttr(name='_source_language_embedding'))
-    src_forward = simple_gru(input=src_embedding, size=encoder_size)
+    src_forward = simple_gru(
+        input=src_embedding,
+        size=encoder_size,
+        naive=True,
+        gru_layer_attr=ExtraLayerAttribute(
+            error_clipping_threshold=error_clipping))
    src_backward = simple_gru(
-        input=src_embedding, size=encoder_size, reverse=True)
+        input=src_embedding,
+        size=encoder_size,
+        reverse=True,
+        naive=True,
+        gru_layer_attr=ExtraLayerAttribute(
+            error_clipping_threshold=error_clipping))
    encoded_vector = concat_layer(input=[src_forward, src_backward])

    with mixed_layer(size=decoder_size) as encoded_proj:
@@ -117,11 +128,13 @@ def gru_encoder_decoder(data_conf,
            decoder_inputs += full_matrix_projection(input=context)
            decoder_inputs += full_matrix_projection(input=current_word)

-        gru_step = gru_step_layer(
+        gru_step = gru_step_naive_layer(
            name='gru_decoder',
            input=decoder_inputs,
            output_mem=decoder_mem,
-            size=decoder_size)
+            size=decoder_size,
+            layer_attr=ExtraLayerAttribute(
+                error_clipping_threshold=error_clipping))

        with mixed_layer(
                size=target_dict_dim, bias_attr=True,
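
Note on `error_clipping` in the seqToseq change above: the diff passes `error_clipping_threshold` through `ExtraLayerAttribute` but does not show what the threshold does. The sketch below assumes it clamps each element of the error (the gradient arriving at a layer's output during the backward pass) to `[-threshold, threshold]`, and that a non-positive threshold disables clipping. `clip_error` is a hypothetical helper written for illustration, not a PaddlePaddle function.

```python
def clip_error(grad, threshold):
    """Clamp every element of grad to [-threshold, threshold].

    Assumption: a threshold <= 0 means clipping is turned off.
    """
    if threshold <= 0:
        return list(grad)
    return [max(-threshold, min(threshold, g)) for g in grad]


# With the demo's default of error_clipping=50, outliers are capped
# while values already inside the range pass through unchanged.
print(clip_error([-120.0, 3.5, 75.0], 50))
```

Clamping the backpropagated error in the encoder and decoder GRUs is one common way to keep long sequences from producing exploding gradients.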
--- a/doc/getstarted/index_cn.rst
+++ b/doc/getstarted/index_cn.rst
@@ -2,7 +2,8 @@
 ============

 ..  toctree::
-  :maxdepth: 2
+  :maxdepth: 1

  build_and_install/index_cn.rst
-  basic_usage/index_cn.rst
+
+- `Introductory Deep Learning Course <http://book.paddlepaddle.org/>`_
--- a/doc/getstarted/index_en.rst
+++ b/doc/getstarted/index_en.rst
@@ -2,7 +2,8 @@ GET STARTED
 ============

 ..  toctree::
-  :maxdepth: 2
+  :maxdepth: 1

  build_and_install/index_en.rst
-  basic_usage/index_en.rst
+
+- `Deep Learning 101 <http://book.paddlepaddle.org/index.en.html>`_
--- a/doc/howto/deep_model/rnn/hierarchical_layer_cn.rst
+++ b/doc/howto/deep_model/rnn/hierarchical_layer_cn.rst
@@ -19,18 +19,18 @@

 In PaddlePaddle, the following layers accept a nested (two-level) sequence as input and perform the corresponding computation.

-pooling_layer
-==============
+pooling
+========

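-An example of using pooling_layer follows; for details, see the :ref:`api_trainer_config_helpers_layers_pooling_layer` configuration API.
+An example of using pooling follows; for details, see the :ref:`api_v2.layer_pooling` configuration API.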

 ..	code-block:: bash

-        seq_pool = pooling_layer(input=layer,
-                                 pooling_type=AvgPooling(),
-                                 agg_level=AggregateLevel.EACH_SEQUENCE)
+        seq_pool = pooling(input=layer,
+                           pooling_type=pooling.Max(),
+                           agg_level=AggregateLevel.EACH_SEQUENCE)
        
- `pooling_type` currently supports two types: MaxPooling() and AvgPooling().
+- `pooling_type` currently supports two types: pooling.Max() and pooling.Avg().

 - When `agg_level=AggregateLevel.EACH_TIMESTEP` (the default):

@@ -47,7 +47,7 @@ An example of using pooling_layer follows; for details, see the :ref:`api_trainer_config_helpers
 last_seq and first_seq
 ======================

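-An example of using last_seq follows (:ref:`api_trainer_config_helpers_layers_first_seq` is similar); for details, see the :ref:`api_trainer_config_helpers_layers_last_seq` configuration API.
+An example of using last_seq follows (:ref:`api_v2.layer_first_seq` is similar); for details, see the :ref:`api_v2.layer_last_seq` configuration API.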

 ..	code-block:: bash

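@@ -65,16 +65,16 @@ An example of using last_seq follows (:ref:`api_trainer_config_helpers_layers_first_
  - Input: must be a nested (two-level) sequence
  - Output: a single-level sequence, each element of which is the last (or first) element of the corresponding subseq in the nested sequence.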

-expand_layer
-============
+expand
+======

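-An example of using expand_layer follows; for details, see the :ref:`api_trainer_config_helpers_layers_expand_layer` configuration API.
+An example of using expand follows; for details, see the :ref:`api_v2.layer_expand` configuration API.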

 ..	code-block:: bash

-        expand = expand_layer(input=layer1,
-                              expand_as=layer2,
-                              expand_level=ExpandLevel.FROM_TIMESTEP)
+        ex = expand(input=layer1,
+                    expand_as=layer2,
+                    expand_level=ExpandLevel.FROM_TIMESTEP)
        
 - When `expand_level=ExpandLevel.FROM_TIMESTEP` (the default):

--- a/doc/howto/deep_model/rnn/index_cn.rst
+++ b/doc/howto/deep_model/rnn/index_cn.rst
@@ -4,7 +4,6 @@ RNN-Related Models
 ..  toctree::
  :maxdepth: 1

-  rnn_config_cn.rst
  recurrent_group_cn.md
  hierarchical_layer_cn.rst
  hrnn_rnn_api_compare_cn.rst
--- a/doc/howto/deep_model/rnn/index_en.rst
+++ b/doc/howto/deep_model/rnn/index_en.rst
@@ -1,7 +1,2 @@
 RNN Models
 ==========
-
-..  toctree::
-  :maxdepth: 1
-
-  rnn_config_en.rst
--- a/doc/index_cn.rst
+++ b/doc/index_cn.rst
@@ -5,7 +5,6 @@ PaddlePaddle Documentation
  :maxdepth: 1

  getstarted/index_cn.rst
-  tutorials/index_cn.md
  howto/index_cn.rst
  api/index_cn.rst
  faq/index_cn.rst
--- a/doc/index_en.rst
+++ b/doc/index_en.rst
@@ -5,8 +5,6 @@ PaddlePaddle Documentation
  :maxdepth: 1

  getstarted/index_en.rst
-  tutorials/index_en.md
  howto/index_en.rst
  api/index_en.rst
  about/index_en.rst
- 
--- a/doc_theme/templates/layout.html
+++ b/doc_theme/templates/layout.html
@@ -114,10 +114,7 @@
          </ul>
        </div>
        <ul class="site-page-links">
-          <li><a>Home</a></li>
-          <li><a>Get Started</a></li>
-          <li class="active"><a>Documentation</a></li>
-          <li><a>About Us</a></li>
+          <li><a href="/">Home</a></li>
        </ul>
      </div>
      <div class="doc-module">
@@ -137,7 +134,7 @@
          {{ toctree }}
        {% endblock %}
    </nav>
-    {% if toc %}
+    {% if False %}
    <nav class="local-toc">{{ toc }}</nav>
    {% endif %}
    <section class="doc-content-wrap">
@@ -168,7 +165,8 @@
            VERSION:'{{ release|e }}',
            COLLAPSE_INDEX:false,
            FILE_SUFFIX:'{{ '' if no_search_suffix else file_suffix }}',
-            HAS_SOURCE:  {{ has_source|lower }}
+            HAS_SOURCE:  {{ has_source|lower }},
+            SOURCELINK_SUFFIX: ".txt",
        };
    </script>
    {%- for scriptfile in script_files %}
--- a/paddle/cuda/CMakeLists.txt
+++ b/paddle/cuda/CMakeLists.txt
@@ -21,16 +21,13 @@ set(CUDA_CXX_WITH_GPU_SOURCES

 if(WITH_GPU)
    set(CUDA_CXX_SOURCES
-        src/hl_dso_loader.cc
        src/hl_warpctc_wrap.cc
        ${CUDA_CXX_WITH_GPU_SOURCES})

    set_source_files_properties(${CUDA_CXX_SOURCES}
                                PROPERTIES COMPILE_FLAGS "-D__NVCC__")
 else()
-    set(CUDA_CXX_SOURCES
-        src/hl_dso_loader.cc
-        src/hl_warpctc_wrap.cc)
+    set(CUDA_CXX_SOURCES src/hl_warpctc_wrap.cc)
 endif()

 set(CUDA_CU_SOURCES
@@ -47,7 +44,6 @@ set(CUDA_CU_SOURCES

 set(CUDA_HEADERS
    include/hl_time.h
-    include/hl_dso_loader.h
    include/hl_warpctc_wrap.h
    include/hl_sequence.h
    include/hl_cuda_cublas.h
--- a/paddle/cuda/include/hl_activation_functions.h
+++ b/paddle/cuda/include/hl_activation_functions.h
@@ -40,18 +40,18 @@ public:
 namespace gpu {
 static __device__ Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
 static __device__ Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
-}
+}  // namespace gpu
 #else
 namespace cpu {
 static Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
 static Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
-}
+}  // namespace cpu

 #ifdef __AVX__
 namespace avx {
 static Active<__m256>::forward forward[] = HPPL_ACTIVE_FUNCTION;
 static Active<__m256>::backward backward[] = HPPL_ACTIVE_FUNCTION;
-}
+}  // namespace avx
 #endif
 #endif

--- a/paddle/cuda/include/hl_cnn.h
+++ b/paddle/cuda/include/hl_cnn.h
@@ -273,23 +273,23 @@ extern void hl_bilinear_forward(const real* inData,
                                const real ratioW);

 /**
-* @brief   Bilinear interpolation backward.
-*
-* @param[out]  inGrad      input gradient.
-* @param[in]   inImgH      input image height.
-* @param[in]   inImgW      input image width.
-* @param[in]   inputH      input batchSize.
-* @param[in]   inputW      input image data dim.
-* @param[in]   outGrad     output gradient.
-* @param[in]   outImgH     output image height.
-* @param[in]   outImgW     output image width.
-* @param[in]   outputH     output batchSize.
-* @param[in]   outputW     output image data dim.
-* @param[in]   numChannels number of channels.
-* @param[in]   ratioH      inImgH / outImgH.
-* @param[in]   ratioW      inImgW / outImgW.
-*
-*/
+ * @brief   Bilinear interpolation backward.
+ *
+ * @param[out]  inGrad      input gradient.
+ * @param[in]   inImgH      input image height.
+ * @param[in]   inImgW      input image width.
+ * @param[in]   inputH      input batchSize.
+ * @param[in]   inputW      input image data dim.
+ * @param[in]   outGrad     output gradient.
+ * @param[in]   outImgH     output image height.
+ * @param[in]   outImgW     output image width.
+ * @param[in]   outputH     output batchSize.
+ * @param[in]   outputW     output image data dim.
+ * @param[in]   numChannels number of channels.
+ * @param[in]   ratioH      inImgH / outImgH.
+ * @param[in]   ratioW      inImgW / outImgW.
+ *
+ */
 extern void hl_bilinear_backward(real* inGrad,
                                 const size_t inImgH,
                                 const size_t inImgW,
--- a/paddle/cuda/src/hl_cuda_cublas.cc
+++ b/paddle/cuda/src/hl_cuda_cublas.cc
@@ -14,10 +14,9 @@ limitations under the License. */

 #include "hl_cuda_cublas.h"
 #include <sys/time.h>
-#include <mutex>
 #include "hl_cuda.h"
-#include "hl_dso_loader.h"
 #include "hl_thread.ph"
+#include "paddle/utils/DynamicLoader.h"
 #include "paddle/utils/Logging.h"

 namespace dynload {
--- a/paddle/cuda/src/hl_cuda_cudnn.cc
+++ b/paddle/cuda/src/hl_cuda_cudnn.cc
@@ -15,10 +15,9 @@ limitations under the License. */
 #include "hl_cuda_cudnn.h"
 #include <cudnn.h>
 #include <gflags/gflags.h>
-#include <mutex>
 #include "hl_cuda_cudnn.ph"
-#include "hl_dso_loader.h"
 #include "hl_thread.ph"
+#include "paddle/utils/DynamicLoader.h"
 #include "paddle/utils/Logging.h"

 DEFINE_int32(cudnn_conv_workspace_limit_in_mb,
--- a/paddle/cuda/src/hl_cuda_device.cc
+++ b/paddle/cuda/src/hl_cuda_device.cc
@@ -21,11 +21,10 @@ limitations under the License. */
 #include <sys/syscall.h>
 #include <sys/time.h>
 #include <unistd.h>
-#include <mutex>
 #include "hl_cuda.ph"
 #include "hl_thread.ph"
-#include "hl_dso_loader.h"
 #include "paddle/utils/Logging.h"
+#include "paddle/utils/DynamicLoader.h"
 // clang-format on

 namespace dynload {
--- a/paddle/cuda/src/hl_warpctc_wrap.cc
+++ b/paddle/cuda/src/hl_warpctc_wrap.cc
@@ -14,7 +14,7 @@ limitations under the License. */

 #include "hl_warpctc_wrap.h"
 #include <mutex>
-#include "hl_dso_loader.h"
+#include "paddle/utils/DynamicLoader.h"
 #include "paddle/utils/Logging.h"

 namespace dynload {
--- a/paddle/function/CMakeLists.txt
+++ b/paddle/function/CMakeLists.txt
@@ -12,7 +12,7 @@ endif()

 add_library(paddle_function STATIC ${cpp_files} ${cu_objs})
 add_dependencies(paddle_function ${external_project_dependencies})
-
+add_dependencies(paddle_function gen_proto_cpp)

 if(WITH_GPU)
 if(WITH_TESTING)
--- a/paddle/function/MulOpTest.cpp
+++ b/paddle/function/MulOpTest.cpp
@@ -74,9 +74,9 @@ TEST(MulOp, DDDMatrixMul) {
 }

 /**
-  * C += A * B, B, C dense, A sparse
-  * dense = sparse * dense
-  */
+ * C += A * B, B, C dense, A sparse
+ * dense = sparse * dense
+ */
 void testFuncDSparseDMatrix(
    size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
  real scaleT = 1.0;
@@ -119,9 +119,9 @@ TEST(MuLOp, DSparseDMul) {
 }

 /**
-  * C += A * B, A, C dense, B sparse
-  * dense = dense * sparse
-  */
+ * C += A * B, A, C dense, B sparse
+ * dense = dense * sparse
+ */
 void testFuncDDSparseMatrix(
    size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
  real scaleT = 1.0;
@@ -165,9 +165,9 @@ TEST(MulOp, DDSparseMul) {
 }

 /**
-  * C += A * B, A sparse, B, C dense
-  * sparse = dense * dense
-  */
+ * C += A * B, A sparse, B, C dense
+ * sparse = dense * dense
+ */
 void testFuncSparseDDMatrix(
    size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
  real scaleT = 1.0;
--- a/paddle/gserver/gradientmachines/GradientMachine.cpp
+++ b/paddle/gserver/gradientmachines/GradientMachine.cpp
@@ -21,7 +21,6 @@ limitations under the License. */
 #include "MultiGradientMachine.h"
 #include "MultiNetwork.h"
 #include "NeuralNetwork.h"
-#include "NeuralNetwork.h"
 #include "ParallelNeuralNetwork.h"
 #include "hl_gpu.h"

--- a/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp
+++ b/paddle/gserver/gradientmachines/RecurrentGradientMachine.cpp
@@ -637,7 +637,7 @@ void RecurrentGradientMachine::removeBeamSearchStatisticsCallbacks() {
 /* create scattered id infomation for all realLayer of inFrameLines one time.
 * If hasSubseq, will also create scattered sequenceStartPositions infomation
 * for all realLayer of inFrameLines one time.
-*/
+ */

 void RecurrentGradientMachine::createInFrameInfo(int inlinkId,
                                                 const Argument& input,