Merge branch 'develop' into cmake_protobuf

refactor_docs
Liu Yiqun 8 years ago
commit c27e71e2fd

@@ -1,11 +1,103 @@
# Release v0.10.0
We are glad to release version 0.10.0. In this version, we are happy to release the new
[Python API](http://research.baidu.com/paddlepaddles-new-api-simplifies-deep-learning-programs/).

- Our old Python API is rather outdated. It is hard to learn and hard to
use. To write a PaddlePaddle program using the old API, we had to write
at least two Python files: one `data provider` and another that defines
the network topology. Users start a PaddlePaddle job by running the
`paddle_trainer` C++ program, which calls the Python interpreter to run the
network topology configuration script and then starts the training loop,
which iteratively calls the data provider function to load minibatches.
This prevents us from writing a Python program in a modern way, e.g., in a
Jupyter Notebook.
- The new API, which we often refer to as the *v2 API*, allows us to write
much shorter Python programs that define the network and the data in a single
`.py` file. Such a program can also run in a Jupyter Notebook, since the entry
point is a Python program and PaddlePaddle runs as a shared library loaded
and invoked by that program.
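
As an illustration, here is a minimal sketch of what a v2-style program looks
like, training a softmax classifier on MNIST. It assumes the dataset package
listed under New Features below; exact helper names (e.g., `paddle.batch`) may
differ slightly in this release.

```python
import paddle.v2 as paddle

# PaddlePaddle runs as a shared library inside this Python process.
paddle.init(use_gpu=False, trainer_count=1)

# The data layout and the network topology live in the same .py file.
images = paddle.layer.data(
    name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
    name='label', type=paddle.data_type.integer_value(10))
prediction = paddle.layer.fc(
    input=images, size=10, act=paddle.activation.Softmax())
cost = paddle.layer.classification_cost(input=prediction, label=label)

# Create parameters, pick an optimizer, and run the training loop.
parameters = paddle.parameters.create(cost)
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=paddle.optimizer.Momentum(momentum=0.9))
trainer.train(
    reader=paddle.batch(paddle.dataset.mnist.train(), batch_size=128),
    num_passes=5)
```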
Based on the new API, we delivered an online interactive
book, [Deep Learning 101](http://book.paddlepaddle.org/index.en.html),
and [its Chinese version](http://book.paddlepaddle.org/).

We also worked on updating our online documentation to describe the new API,
but this is ongoing work. We will release more documentation improvements
in the next version.

We also worked on bringing the new API to distributed model training (via MPI and
Kubernetes). This work is ongoing, too. We will release more about it in the next
version.
## New Features
* We released the [new Python API](http://research.baidu.com/paddlepaddles-new-api-simplifies-deep-learning-programs/).
* Deep Learning 101 book in [English](http://book.paddlepaddle.org/index.en.html) and [Chinese](http://book.paddlepaddle.org/).
* Support rectangular input for CNNs.
* Support stride pooling for seqlastin and seqfirstin.
* Expose `seq_concat_layer/seq_reshape_layer` in `trainer_config_helpers`.
* Add the dataset package: CIFAR, MNIST, IMDB, WMT14, CONLL05, movielens, imikolov.
* Add a Priorbox layer for Single Shot Multibox Detection.
* Add a smooth L1 cost.
* Add a data reader creator and data reader decorators for the v2 API (see the sketch after this list).
* Add a CPU implementation of the cmrnorm projection.
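
The reader convention mentioned above is plain Python: a *reader creator*
returns a reader, i.e., a zero-argument callable whose generator yields one
example at a time, and a *decorator* wraps a reader and returns a new one. The
sketch below reimplements the idea for illustration; `counting_reader_creator`
and the local `shuffle` are hypothetical helpers, not the shipped APIs.

```python
import random

def counting_reader_creator(n):
    """A reader creator: returns a reader that yields the examples 0..n-1."""
    def reader():
        for i in range(n):
            yield i  # each yielded item is one training example
    return reader

def shuffle(reader, buf_size):
    """A reader decorator: shuffles examples inside a fixed-size buffer."""
    def decorated():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for b in buf:
                    yield b
                buf = []
        random.shuffle(buf)  # flush the remainder
        for b in buf:
            yield b
    return decorated

train_reader = shuffle(counting_reader_creator(100), buf_size=32)
print(list(train_reader())[:5])  # five shuffled examples
```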
## Improvements
* Support Python virtualenv for `paddle_trainer`.
* Add pre-commit hooks to automatically format our code.
* Upgrade protobuf to version 3.x.
* Add an option to check data types in the Python data provider (see the sketch after this list).
* Speed up the backward pass of the average layer on GPU.
* Documentation refinement.
* Check for dead links in documents using Travis-CI.
* Add an example explaining `sparse_vector`.
* Simplify the data processing flow for Quick Start.
* Support cuDNN Deconv.
* Add a data feeder in the v2 API.
* Support predicting samples from `sys.stdin` for the sentiment demo.
* Provide a multi-process interface for image preprocessing.
* Add a benchmark document for the v1 API.
* Add ReLU in `layer_math.py`.
* Add packages for automatically downloading public datasets.
* Rename `Argument::sumCost` to `Argument::sum`, since class `Argument` is not specific to cost.
* Expose `Argument::sum` to Python.
* Add a new `TensorExpression` implementation for matrix-related expression evaluations.
* Add lazy assignment for optimizing the calculation of a batch of multiple expressions.
* Add the abstract class `Function` and its implementations:
  * `PadFunc` and `PadGradFunc`.
  * `ContextProjectionForwardFunc` and `ContextProjectionBackwardFunc`.
  * `CosSimBackward` and `CosSimBackwardFunc`.
  * `CrossMapNormalFunc` and `CrossMapNormalGradFunc`.
  * `MulFunc`.
* Add the classes `AutoCompare` and `FunctionCompare`, which make it easier to write unit tests comparing the GPU and CPU versions of a function.
* Generate `libpaddle_test_main.a` and remove the main function from test files.
* Support dense numpy vectors in PyDataProvider2.
* Clean the code base, removing some copy-and-pasted code snippets:
  * Extract a `RowBuffer` class for `SparseRowMatrix`.
  * Clean the interface of `GradientMachine`.
  * Use the `override` keyword in layers.
  * Simplify `Evaluator::create`; use `ClassRegister` to create `Evaluator`s.
* Check the MD5 checksum when downloading demo datasets.
* Add `paddle::Error`, which is intended to replace `LOG(FATAL)` in Paddle.
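
For the two PyDataProvider2 items above, here is a hedged sketch; it assumes
the type-checking option is exposed as a `check` argument of the `@provider`
decorator, and the provider body is synthetic rather than reading a real file.

```python
import numpy as np
from paddle.trainer.PyDataProvider2 import provider, dense_vector, integer_value

# check=True validates every yielded sample against the declared input_types;
# dense numpy vectors can be yielded directly in this release.
@provider(
    input_types={'pixel': dense_vector(784), 'label': integer_value(10)},
    check=True)
def process(settings, filename):
    for _ in range(10):  # synthetic samples for illustration
        yield {'pixel': np.random.rand(784).astype('float32'),
               'label': int(np.random.randint(10))}
```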
## Bug Fixes
* Check layer input types for `recurrent_group`.
* Don't run `clang-format` on `.cu` source files.
* Fix bugs in `LogActivation`.
* Fix the bug that caused `test_layerHelpers` to run multiple times.
* Fix the bug that made the seq2seq demo exceed the protobuf message size limit.
* Fix a bug in the data provider converter in GPU mode.
* Fix a bug in `GatedRecurrentLayer`.
* Fix a bug in `BatchNorm` when testing more than one model.
* Fix the broken unit test of `paramRelu`.
* Fix some compile-time warnings about `CpuSparseMatrix`.
* Fix a `MultiGradientMachine` error when `trainer_count > batch_size`.
* Fix bugs that prevented asynchronous data loading in `PyDataProvider2`.
# Release v0.9.0

@@ -44,7 +44,6 @@ if(MKL_INC_DIR AND MKL_CORE_LIB AND MKL_SEQUENTIAL_LIB AND MKL_INTEL_LP64)
message(STATUS "Found MKL (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
set(CBLAS_FOUND ON)
if(${MKL_LAPACK_INC_DIR})
add_definitions(-DPADDLE_USE_LAPACK)
message(STATUS "Found lapack in MKL (include: ${MKL_LAPACK_INC_DIR})")
endif()
return() # return file.
@@ -80,7 +79,6 @@ if(ATLAS_INC_DIR AND ATLAS_CBLAS_LIB AND ATLAS_LIB AND NOT CBLAS_FOUND)
message(STATUS "Found ATLAS (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
set(CBLAS_FOUND ON)
if(ATLAS_CLAPACK_INC_DIR)
add_definitions(-DPADDLE_USE_LAPACK)
set(CBLAS_INC_DIR ${CBLAS_INC_DIR} ${ATLAS_CLAPACK_INC_DIR})
message(STATUS "Found lapack in ATLAS (include: ${ATLAS_CLAPACK_INC_DIR})")
endif()
@@ -115,7 +113,6 @@ if(OPENBLAS_INC_DIR AND OPENBLAS_LIB)
message(STATUS "Found OpenBLAS (include: ${CBLAS_INC_DIR}, library: ${CBLAS_LIBRARIES})")
set(CBLAS_FOUND ON)
if(OPENBLAS_LAPACKE_INC_DIR)
add_definitions(-DPADDLE_USE_LAPACK)
message(STATUS "Found lapack in OpenBLAS (include: ${OPENBLAS_LAPACKE_INC_DIR})")
endif()
return()

@@ -24,45 +24,17 @@ IF(NOT ${CBLAS_FOUND})
SET(CBLAS_LIBRARIES "${CBLAS_INSTALL_DIR}/lib/${LIBRARY_PREFIX}openblas${STATIC_LIBRARY_SUFFIX}"
CACHE FILEPATH "openblas library." FORCE)
# check fortran compiler and library
SET(COMMON_ARGS CC=${CMAKE_C_COMPILER} NO_LAPACK=1 NO_SHARED=1)
IF(ANDROID)
SET(OPENBLAS_COMMIT "b5c96fcfcdc82945502a2303116a64d89985daf5")
SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 ARM_SOFTFP_ABI=1 NOFORTRAN=1 USE_THREAD=0 libs)
SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 ARM_SOFTFP_ABI=1 USE_THREAD=0 libs)
ELSEIF(RPI)
SET(OPENBLAS_COMMIT "v0.2.19")
SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 NOFORTRAN=1 USE_THREAD=0 libs)
SET(OPTIONAL_ARGS HOSTCC=${HOST_C_COMPILER} TARGET=ARMV7 USE_THREAD=0 libs)
ELSE()
IF(CMAKE_COMPILER_IS_GNUCC)
ENABLE_LANGUAGE(Fortran)
if (NOT CMAKE_Fortran_COMPILER_VERSION)
# cmake < 3.4 cannot get CMAKE_Fortran_COMPILER_VERSION directly.
execute_process(COMMAND ${CMAKE_Fortran_COMPILER} -dumpversion
OUTPUT_VARIABLE CMAKE_Fortran_COMPILER_VERSION)
endif()
string(REGEX MATCHALL "[0-9]+" Fortran_VERSION ${CMAKE_Fortran_COMPILER_VERSION})
list(GET Fortran_VERSION 0 Fortran_MAJOR)
list(GET Fortran_VERSION 1 Fortran_MINOR)
find_library(GFORTRAN_LIBRARY NAMES gfortran PATHS
/lib
/usr/lib
/usr/lib/gcc/x86_64-linux-gnu/${Fortran_MAJOR}.${Fortran_MINOR}/
/usr/lib/gcc/x86_64-linux-gnu/${Fortran_MAJOR}/)
if (NOT GFORTRAN_LIBRARY)
message(FATAL_ERROR "Cannot found gfortran library which it is used by openblas")
endif()
find_package(Threads REQUIRED)
LIST(APPEND CBLAS_LIBRARIES ${GFORTRAN_LIBRARY} ${CMAKE_THREAD_LIBS_INIT})
ENDIF(CMAKE_COMPILER_IS_GNUCC)
IF(NOT CMAKE_Fortran_COMPILER)
MESSAGE(FATAL_ERROR "To build lapack in libopenblas, "
"you need to set gfortran compiler: cmake .. -DCMAKE_Fortran_COMPILER=...")
ENDIF(NOT CMAKE_Fortran_COMPILER)
ADD_DEFINITIONS(-DPADDLE_USE_LAPACK)
SET(OPENBLAS_COMMIT "v0.2.19")
SET(OPENBLAS_ARGS FC=${CMAKE_Fortran_COMPILER} DYNAMIC_ARCH=1 libs netlib)
SET(OPENBLAS_ARGS DYNAMIC_ARCH=1 libs)
ENDIF()
ExternalProject_Add(
@@ -73,7 +45,7 @@ IF(NOT ${CBLAS_FOUND})
PREFIX ${CBLAS_SOURCES_DIR}
INSTALL_DIR ${CBLAS_INSTALL_DIR}
BUILD_IN_SOURCE 1
BUILD_COMMAND ${CMAKE_MAKE_PROGRAM} CC=${CMAKE_C_COMPILER} NO_SHARED=1 ${OPTIONAL_ARGS}
BUILD_COMMAND ${CMAKE_MAKE_PROGRAM} ${COMMON_ARGS} ${OPTIONAL_ARGS}
INSTALL_COMMAND ${CMAKE_MAKE_PROGRAM} install NO_SHARED=1 PREFIX=<INSTALL_DIR>
UPDATE_COMMAND ""
CONFIGURE_COMMAND ""

@@ -1,5 +1,4 @@
set(CPACK_PACKAGE_NAME paddle)
set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "")
set(CPACK_PACKAGE_VERSION_MAJOR ${PADDLE_MAJOR_VERSION})
set(CPACK_PACKAGE_VERSION_MINOR ${PADDLE_MINOR_VERSION})
set(CPACK_PACKAGE_VERSION_PATCH ${PADDLE_PATCH_VERSION})
@@ -10,8 +9,9 @@ set(CPACK_DEBIAN_PACKAGE_ARCHITECTURE amd64)
set(CPACK_DEBIAN_PACKAGE_MAINTAINER PaddlePaddle Dev <paddle-dev@baidu.com>)
set(CPACK_PACKAGE_DESCRIPTION_SUMMARY "Paddle")
set(CPACK_PACKAGE_DESCRIPTION "")
set(CPACK_DEBIAN_PACKAGE_DEPENDS "libatlas3-base, libgflags2, libgoogle-glog0, libprotobuf8, libpython2.7, libstdc++6, python-numpy, python-pip, python-pip-whl, python-protobuf")
set(CPACK_DEBIAN_PACKAGE_DEPENDS "libpython2.7-dev, libstdc++6, python-pip, curl, libgfortran3, python-pip-whl")
set(CPACK_DEBIAN_PACKAGE_SECTION Devel)
set(CPACK_DEBIAN_PACKAGE_VERSION ${PADDLE_VERSION})
set(CPACK_DEBIAN_PACKAGE_CONTROL_EXTRA "${PROJ_ROOT}/paddle/scripts/deb/postinst")
#set(CPACK_GENERATOR "DEB")
# Start cpack

@@ -29,7 +29,7 @@ settings(
batch_size=128,
learning_rate=2e-3,
learning_method=AdamOptimizer(),
average_window=0.5,
model_average=ModelAverage(0.5),
regularization=L2Regularization(8e-4),
gradient_clipping_threshold=25)

@@ -69,7 +69,8 @@ def gru_encoder_decoder(data_conf,
encoder_size=512,
decoder_size=512,
beam_size=3,
max_length=250):
max_length=250,
error_clipping=50):
"""
A wrapper for an attention version of GRU Encoder-Decoder network
is_generating: whether this config is used for generating
@@ -90,9 +91,19 @@ def gru_encoder_decoder(data_conf,
input=src_word_id,
size=word_vector_dim,
param_attr=ParamAttr(name='_source_language_embedding'))
src_forward = simple_gru(input=src_embedding, size=encoder_size)
src_forward = simple_gru(
input=src_embedding,
size=encoder_size,
naive=True,
gru_layer_attr=ExtraLayerAttribute(
error_clipping_threshold=error_clipping))
src_backward = simple_gru(
input=src_embedding, size=encoder_size, reverse=True)
input=src_embedding,
size=encoder_size,
reverse=True,
naive=True,
gru_layer_attr=ExtraLayerAttribute(
error_clipping_threshold=error_clipping))
encoded_vector = concat_layer(input=[src_forward, src_backward])
with mixed_layer(size=decoder_size) as encoded_proj:
@@ -117,11 +128,13 @@ def gru_encoder_decoder(data_conf,
decoder_inputs += full_matrix_projection(input=context)
decoder_inputs += full_matrix_projection(input=current_word)
gru_step = gru_step_layer(
gru_step = gru_step_naive_layer(
name='gru_decoder',
input=decoder_inputs,
output_mem=decoder_mem,
size=decoder_size)
size=decoder_size,
layer_attr=ExtraLayerAttribute(
error_clipping_threshold=error_clipping))
with mixed_layer(
size=target_dict_dim, bias_attr=True,

@@ -2,7 +2,8 @@
============
.. toctree::
:maxdepth: 2
:maxdepth: 1
build_and_install/index_cn.rst
basic_usage/index_cn.rst
- `Deep Learning 101 (Chinese edition) <http://book.paddlepaddle.org/>`_

@@ -2,7 +2,8 @@ GET STARTED
============
.. toctree::
:maxdepth: 2
:maxdepth: 1
build_and_install/index_en.rst
basic_usage/index_en.rst
- `Deep Learning 101 <http://book.paddlepaddle.org/index.en.html>`_

@@ -19,18 +19,18 @@
In PaddlePaddle, the following layers can accept a double-level sequence as input and perform the corresponding computation.
pooling_layer
==============
pooling
========
An example of using pooling_layer follows; see the :ref:`api_trainer_config_helpers_layers_pooling_layer` configuration API for details.
An example of using pooling follows; see the :ref:`api_v2.layer_pooling` configuration API for details.
.. code-block:: bash
seq_pool = pooling_layer(input=layer,
seq_pool = pooling(input=layer,
pooling_type=AvgPooling(),
pooling_type=pooling.Max(),
agg_level=AggregateLevel.EACH_SEQUENCE)
- `pooling_type` currently supports two kinds: MaxPooling() and AvgPooling().
- `pooling_type` currently supports two kinds: pooling.Max() and pooling.Avg().
- When `agg_level=AggregateLevel.EACH_TIMESTEP` (the default):
@@ -47,7 +47,7 @@ An example of using pooling_layer follows; see the :ref:`api_trainer_config_helpers
last_seq and first_seq
=====================
An example of using last_seq follows ( :ref:`api_trainer_config_helpers_layers_first_seq` is similar); see the :ref:`api_trainer_config_helpers_layers_last_seq` configuration API for details.
An example of using last_seq follows ( :ref:`api_v2.layer_first_seq` is similar); see the :ref:`api_v2.layer_last_seq` configuration API for details.
.. code-block:: bash
@@ -65,16 +65,16 @@ An example of using last_seq follows ( :ref:`api_trainer_config_helpers_layers_first_
- Input: must be a double-level sequence.
- Output: a single-level sequence, in which each element is the last (or first) element of each subseq in the double-level sequence.
expand_layer
============
expand
======
An example of using expand_layer follows; see the :ref:`api_trainer_config_helpers_layers_expand_layer` configuration API for details.
An example of using expand follows; see the :ref:`api_v2.layer_expand` configuration API for details.
.. code-block:: bash
expand = expand_layer(input=layer1,
ex = expand(input=layer1,
expand_as=layer2,
expand_level=ExpandLevel.FROM_TIMESTEP)
- When `expand_level=ExpandLevel.FROM_TIMESTEP` (the default):

@@ -4,7 +4,6 @@ RNN-related models
.. toctree::
:maxdepth: 1
rnn_config_cn.rst
recurrent_group_cn.md
hierarchical_layer_cn.rst
hrnn_rnn_api_compare_cn.rst

@@ -1,7 +1,2 @@
RNN Models
==========
.. toctree::
:maxdepth: 1
rnn_config_en.rst

@@ -5,7 +5,6 @@ PaddlePaddle Documentation
:maxdepth: 1
getstarted/index_cn.rst
tutorials/index_cn.md
howto/index_cn.rst
api/index_cn.rst
faq/index_cn.rst

@@ -5,8 +5,6 @@ PaddlePaddle Documentation
:maxdepth: 1
getstarted/index_en.rst
tutorials/index_en.md
howto/index_en.rst
api/index_en.rst
about/index_en.rst

@@ -114,10 +114,7 @@
</ul>
</div>
<ul class="site-page-links">
<li><a>Home</a></li>
<li><a href="/">Home</a></li>
<li><a>Get Started</a></li>
<li class="active"><a>Documentation</a></li>
<li><a>About Us</a></li>
</ul>
</div>
<div class="doc-module">
@@ -137,7 +134,7 @@
{{ toctree }}
{% endblock %}
</nav>
{% if toc %}
{% if False %}
<nav class="local-toc">{{ toc }}</nav>
{% endif %}
<section class="doc-content-wrap">
@@ -168,7 +165,8 @@
VERSION:'{{ release|e }}',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'{{ '' if no_search_suffix else file_suffix }}',
HAS_SOURCE: {{ has_source|lower }}
HAS_SOURCE: {{ has_source|lower }},
SOURCELINK_SUFFIX: ".txt",
};
</script>
{%- for scriptfile in script_files %}

@@ -21,16 +21,13 @@ set(CUDA_CXX_WITH_GPU_SOURCES
if(WITH_GPU)
set(CUDA_CXX_SOURCES
src/hl_dso_loader.cc
src/hl_warpctc_wrap.cc
${CUDA_CXX_WITH_GPU_SOURCES})
set_source_files_properties(${CUDA_CXX_SOURCES}
PROPERTIES COMPILE_FLAGS "-D__NVCC__")
else()
set(CUDA_CXX_SOURCES
set(CUDA_CXX_SOURCES src/hl_warpctc_wrap.cc)
src/hl_dso_loader.cc
src/hl_warpctc_wrap.cc)
endif()
set(CUDA_CU_SOURCES
@@ -47,7 +44,6 @@ set(CUDA_CU_SOURCES
set(CUDA_HEADERS
include/hl_time.h
include/hl_dso_loader.h
include/hl_warpctc_wrap.h
include/hl_sequence.h
include/hl_cuda_cublas.h

@@ -40,18 +40,18 @@ public:
namespace gpu {
static __device__ Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static __device__ Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
}  // namespace gpu
#else
namespace cpu {
static Active<real>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static Active<real>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
}  // namespace cpu
#ifdef __AVX__
namespace avx {
static Active<__m256>::forward forward[] = HPPL_ACTIVE_FUNCTION;
static Active<__m256>::backward backward[] = HPPL_ACTIVE_FUNCTION;
}
}  // namespace avx
#endif
#endif

@@ -273,23 +273,23 @@ extern void hl_bilinear_forward(const real* inData,
const real ratioW);
/**
* @brief Bilinear interpolation backward.
*
* @param[out] inGrad input gradient.
* @param[in] inImgH input image height.
* @param[in] inImgW input image width.
* @param[in] inputH input batchSize.
* @param[in] inputW input image data dim.
* @param[in] outGrad output gradient.
* @param[in] outImgH output image height.
* @param[in] outImgW output image width.
* @param[in] outputH output batchSize.
* @param[in] outputW output image data dim.
* @param[in] numChannels number of channels.
* @param[in] ratioH inImgH / outImgH.
* @param[in] ratioW inImgW / outImgW.
*
*/
extern void hl_bilinear_backward(real* inGrad,
const size_t inImgH,
const size_t inImgW,

@@ -14,10 +14,9 @@ limitations under the License. */
#include "hl_cuda_cublas.h"
#include <sys/time.h>
#include <mutex>
#include "hl_cuda.h"
#include "hl_dso_loader.h"
#include "hl_thread.ph"
#include "paddle/utils/DynamicLoader.h"
#include "paddle/utils/Logging.h"
namespace dynload {

@@ -15,10 +15,9 @@ limitations under the License. */
#include "hl_cuda_cudnn.h"
#include <cudnn.h>
#include <gflags/gflags.h>
#include <mutex>
#include "hl_cuda_cudnn.ph"
#include "hl_dso_loader.h"
#include "hl_thread.ph"
#include "paddle/utils/DynamicLoader.h"
#include "paddle/utils/Logging.h"
DEFINE_int32(cudnn_conv_workspace_limit_in_mb,

@@ -21,11 +21,10 @@ limitations under the License. */
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>
#include <mutex>
#include "hl_cuda.ph"
#include "hl_thread.ph"
#include "hl_dso_loader.h"
#include "paddle/utils/Logging.h"
#include "paddle/utils/DynamicLoader.h"
// clang-format on
namespace dynload {

@@ -14,7 +14,7 @@ limitations under the License. */
#include "hl_warpctc_wrap.h"
#include <mutex>
#include "hl_dso_loader.h"
#include "paddle/utils/DynamicLoader.h"
#include "paddle/utils/Logging.h"
namespace dynload {

@@ -12,7 +12,7 @@ endif()
add_library(paddle_function STATIC ${cpp_files} ${cu_objs})
add_dependencies(paddle_function ${external_project_dependencies})
add_dependencies(paddle_function gen_proto_cpp)
if(WITH_GPU)
if(WITH_TESTING)

@@ -74,9 +74,9 @@ TEST(MulOp, DDDMatrixMul) {
}
/**
* C += A * B, B, C dense, A sparse
* dense = sparse * dense
*/
void testFuncDSparseDMatrix(
size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
real scaleT = 1.0;
@@ -119,9 +119,9 @@ TEST(MuLOp, DSparseDMul) {
}
/**
* C += A * B, A, C dense, B sparse
* dense = dense * sparse
*/
void testFuncDDSparseMatrix(
size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
real scaleT = 1.0;
@@ -165,9 +165,9 @@ TEST(MulOp, DDSparseMul) {
}
/**
* C += A * B, A sparse, B, C dense
* sparse = dense * dense
*/
void testFuncSparseDDMatrix(
size_t dimM, size_t dimN, size_t dimK, size_t nnz, SparseFormat FORMAT) {
real scaleT = 1.0;

@@ -21,7 +21,6 @@ limitations under the License. */
#include "MultiGradientMachine.h"
#include "MultiNetwork.h"
#include "NeuralNetwork.h"
#include "NeuralNetwork.h"
#include "ParallelNeuralNetwork.h"
#include "hl_gpu.h"

@@ -637,7 +637,7 @@ void RecurrentGradientMachine::removeBeamSearchStatisticsCallbacks() {
/* create scattered id infomation for all realLayer of inFrameLines one time.
* If hasSubseq, will also create scattered sequenceStartPositions infomation
* for all realLayer of inFrameLines one time.
*/
void RecurrentGradientMachine::createInFrameInfo(int inlinkId,
const Argument& input,

Some files were not shown because too many files have changed in this diff.