Merge branch 'develop' into mkl_script

8 years ago · d9f2057692
parent b0e4357178 dd92fb2328
commit d9f2057692
75 changed files with 2385 additions and 1212 deletions
--- a/README.md
+++ b/README.md
@@ -2,8 +2,8 @@
 [![Build Status](https://travis-ci.org/PaddlePaddle/Paddle.svg?branch=develop)](https://travis-ci.org/PaddlePaddle/Paddle)
-[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://doc.paddlepaddle.org/develop/doc/)
+[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html)
-[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://doc.paddlepaddle.org/develop/doc_cn/)
+[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html)
 [![Coverage Status](https://coveralls.io/repos/github/PaddlePaddle/Paddle/badge.svg?branch=develop)](https://coveralls.io/github/PaddlePaddle/Paddle?branch=develop)
 [![Release](https://img.shields.io/github/release/PaddlePaddle/Paddle.svg)](https://github.com/PaddlePaddle/Paddle/releases)
 [![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
--- a/doc/api/index_cn.rst
+++ b/doc/api/index_cn.rst
@@ -7,3 +7,4 @@ API
    Model Configuration <v2/model_configs.rst>
    Data Access <v2/data.rst>
    Training and Inference <v2/run_logic.rst>
    v2/fluid.rst
--- a/doc/design/images/multigpu_allreduce.graffle
+++ b/doc/design/images/multigpu_allreduce.graffle
--- a/doc/design/images/multigpu_allreduce.png
+++ b/doc/design/images/multigpu_allreduce.png
--- a/doc/design/images/multigpu_before_convert.graffle
+++ b/doc/design/images/multigpu_before_convert.graffle
--- a/doc/design/images/multigpu_before_convert.png
+++ b/doc/design/images/multigpu_before_convert.png
--- a/doc/design/mkldnn/image/engine.png
+++ b/doc/design/mkldnn/image/engine.png
--- a/doc/design/mkldnn/image/gradients.png
+++ b/doc/design/mkldnn/image/gradients.png
--- a/doc/design/mkldnn/image/layers.png
+++ b/doc/design/mkldnn/image/layers.png
--- a/doc/design/mkldnn/image/matrix.png
+++ b/doc/design/mkldnn/image/matrix.png
--- a/doc/design/mkldnn/image/overview.png
+++ b/doc/design/mkldnn/image/overview.png
--- a/doc/design/mkl/mkl_packed.md
+++ b/doc/design/mkl/mkl_packed.md
@@ -0,0 +1,95 @@
 # Intel® MKL Packed on PaddlePaddle: Design Doc
 ## Contents
 - [Overview](#overview)
 - [Key Points](#key-points) 
   - [Background](#background)
   - [Solution](#solution)
 - [Actions](#actions)
   - [CMake](#cmake)
   - [Layers](#layers)
   - [Unit Tests](#unit-tests)
   - [Python API](#python-api)
   - [Benchmarking](#benchmarking)
 ## Overview
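 We plan to integrate the GEMM Packed APIs\[[1](#references)\] introduced in Intel® MKL into PaddlePaddle, to take full advantage of Intel platforms and improve PaddlePaddle's performance on Intel architecture.
 The current optimization targets the Recurrent Neural Network (RNN) layers (`RecurrentLayer`, `GatedRecurrentLayer`, and `LstmLayer`) and the PaddlePaddle V1 API.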
 ## Key Points
 ### Background
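 PaddlePaddle currently uses the [cblas_?gemm](https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm) function from the Intel® MKL library. Before computing, this function converts the source data into an internal format better suited to Intel platforms.
 1. Conversion overhead \
 This data format conversion (Packing) is relatively expensive when the computation itself is small. For example, in the Vanilla RNN part of DeepSpeech2 \[[2](#references)\], the matrix size is `batch_size * 2048`.
 2. Redundant conversion \
 In some existing cases (such as RNN), multiple calls to cblas_?gemm use the same source data, so re-Packing that data on every call is redundant.
 To minimize the time spent on Packing across repeated cblas_?gemm calls, Intel® MKL introduced the following four APIs: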
   * cblas_?gemm_alloc
   * cblas_?gemm_pack 
   * cblas_?gemm_compute
   * cblas_?gemm_free
 With these APIs, we can pack the source data once and then pass the Packed data to every gemm_compute call that reuses it, which avoids the redundant Packing.
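The pack-once, compute-many pattern can be illustrated with a toy model. This is only a sketch in plain Python and does not call MKL: the `PackedMatrix` class and its methods are hypothetical stand-ins for the alloc/pack, compute, and free steps, and "packing" is modeled as storing a transposed copy.

```python
# Toy model of the pack-once, compute-many pattern. This is NOT the MKL API:
# PackedMatrix and its methods are hypothetical stand-ins for the
# cblas_?gemm_alloc / _pack / _compute / _free sequence.


class PackedMatrix:
    pack_count = 0  # class-wide counter of how many times packing ran

    def __init__(self, rows):
        # "alloc" + "pack": packing is modeled as storing a transposed copy,
        # standing in for the library's opaque internal layout.
        self.columns = [list(col) for col in zip(*rows)]
        PackedMatrix.pack_count += 1

    def compute(self, a):
        # "compute": C = A * B, using the already packed B.
        return [[sum(x * y for x, y in zip(row, col)) for col in self.columns]
                for row in a]

    def free(self):
        # "free": release the packed buffer.
        self.columns = None


weight = [[1, 2], [3, 4]]
packed = PackedMatrix(weight)            # pack once
inputs = [[[1, 0]], [[0, 1]], [[1, 1]]]  # three calls that share the weight
outputs = [packed.compute(a) for a in inputs]
packed.free()
```

All three multiplications reuse the single packed copy, so the packing counter stays at one regardless of how many compute calls follow.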
 ### Solution
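 In an RNN, all time steps within one forward/backward pass share the same weight. For inference only, successive forward passes also use the same weight, so there is no need to re-pack the weight at every time step of every forward pass.
 Using the new GEMM Packed APIs, we pack the weight once when the layer is initialized, reuse the packed weight in the forward and backward passes, and re-pack the new weight after each weight update for use in the next iteration.
 * Before optimization, for a model with sequence length `T`, the number of conversions over `N` iterations is: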
  - `inference`： `N * T`  
  - `training`： `2 * N * T`
 * After optimization, for the same model, the number of conversions drops to:
  - `inference`： `1`    
  - `training`： `2 * N`
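To make the reduction concrete, the counts listed above can be evaluated for sample values. The function below is a minimal sketch that only restates those four formulas; the name `packing_count` and the sample values of `N` and `T` are illustrative assumptions.

```python
def packing_count(n_iterations, seq_len, training, optimized):
    """Number of weight Packing operations, restating the counts listed above."""
    factor = 2 if training else 1  # factor 2 for training, as given in the list
    if not optimized:
        # Every time step of every iteration packs the weight again.
        return factor * n_iterations * seq_len
    # Optimized inference packs once at layer initialization; optimized
    # training re-packs after each weight update.
    return factor * n_iterations if training else 1


N, T = 100, 50  # sample iteration count and sequence length
train_before = packing_count(N, T, training=True, optimized=False)
train_after = packing_count(N, T, training=True, optimized=True)
infer_before = packing_count(N, T, training=False, optimized=False)
infer_after = packing_count(N, T, training=False, optimized=True)
```

With these sample values, training drops from 10000 conversions to 200, and inference drops from 5000 to 1; the saving grows linearly with the sequence length `T`.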
 ## Actions
 The files and directories to be added are as follows:
 ```txt
 PaddlePaddle/Paddle
 ├── ...
 └── paddle/
    ├── ...
    └── gserver/
        ├── ...
        ├── layers/
        │   ├── ...
        │   ├── MKLPackedRecurrentLayer.*
        │   ├── MKLPackedGatedRecurrentLayer.*
        │   ├── MKLPackedLstmLayer.*
        │   └── MKLPackedGemm.h
        └── tests/
            ├── ...
            └── test_MKLPacked.cpp
 ```
 ### CMake
 In the corresponding `CMakeLists.txt`, the MKL Packed features are enabled or disabled depending on whether `WITH_MKL` is turned on.
 ### Layers
 All `MKLPacked*Layer` classes inherit from PaddlePaddle's base class `Layer` and include the header `MKLPackedGemm.h`, which wraps the GEMM Packed APIs.
 ### Unit Tests
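 We will add `test_MKLPacked.cpp` to test the layers optimized with MKL Packed.
 For each newly added RNN layer, we will compare the following two aspects:
 1. For the optimized layer itself, the results of sequence mode (`rnn_use_batch=false`) against batch mode (`rnn_use_batch=true`).
 2. The results of the optimized layer against the corresponding original PaddlePaddle layer in batch mode.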
 ### Python API
 TBD
 ### Benchmarking
 Scripts will be added to test and compare network performance before and after using the MKL Packed recurrent layers.
 ## References 
 1. [Introducing the new Packed APIs for GEMM](https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm)
 2. [DeepSpeech2 on PaddlePaddle](https://github.com/PaddlePaddle/DeepSpeech#deepspeech2-on-paddlepaddle)
--- a/doc/design/mkldnn/README.MD
+++ b/doc/design/mkldnn/README.MD
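@@ -208,4 +208,3 @@ if use_mkldnn
 However, in PaddlePaddle, neither the layers before refactoring nor the ops after refactoring want to know anything about the next layer/op.
 4. MKL-DNN's high-performance format differs from PaddlePaddle's original `NCHW` (the cuDNN part of PaddlePaddle also uses `NCHW`, so this problem does not arise there).
 So a conversion method needs to be introduced, and the format should be converted only when necessary, in order to make better use of MKL-DNN's performance.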
--- a/doc/design/paddle_nccl.md
+++ b/doc/design/paddle_nccl.md
@@ -0,0 +1,65 @@
 # Design Doc: NCCL support in Paddle Fluid
 ## Abstract
 This design doc describes NCCL support in Paddle. We propose an approach to support the NCCL library on both a single machine and multiple machines. We wrap the NCCL primitives `Broadcast`, `Allreduce`, and `Reduce` as operators so that multiple GPUs can be used from one script.
 ## Motivation
 [NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library that supports multi-GPU communication and is optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that can achieve high bandwidth over PCIe and the NVLink high-speed interconnect. With the NCCL library, we can easily accelerate parallel training.
 - Pros
 1. Easy to plug in the [NCCL2](https://developer.nvidia.com/nccl) library.
 1. High performance on NVIDIA GPUs.
 1. MPI-like primitives, which have a low learning cost for users.
 - Cons
 1. Designed only for NVIDIA GPUs; not a general multi-device solution.
 1. Although NCCL1 is open source under the BSD license, NCCL2 is no longer open source.
 At the beginning of training, the framework needs to distribute the same parameters to every GPU, and it needs to merge the gradients whenever the user requires.
 As a result, during training we need operations for peer-to-peer copies between GPUs, for aggregating gradients/parameters from GPUs, and for broadcasting parameters to GPUs. Every GPU only needs to run the operator with the correct place information.
 In addition, interfaces are needed to synchronize model updates across the GPU cards.
 ## Implementation
 As mentioned above, we wrap the NCCL routines as several kinds of operators. Note that NCCL needs to create a Communicator between the GPUs at the beginning, so an NCCLInit operator is created.
 ### Transpiler
 To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the transpiler compiles the user-defined operation graph into sub-graphs to be executed on different devices.
 1. The user-defined model will be a single-device program.
 2. Broadcast/Reduce operators between GPUs will be inserted into the program; in the multi-node case, `Send` and `Recv` operators may also be inserted.
   *Broadcast and AllReduce on a single machine; Broadcast, AllReduce, and [Send, Recv](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md#graph-converter) on multiple machines*
   <img src="images/multigpu_before_convert.png" width="300"/>
 After compiling, the graph looks as follows:
 <img src="images/multigpu_allreduce.png" width="1000"/>
 Operators are added to the sub-graphs. Every GPU is assigned a role such as `rank0`, `rank1`, etc.
 - **Broadcast**. The Broadcast operator distributes the initialized parameters to all GPUs from the GPU that owns them, e.g. from the `rank0` GPU.
 - **AllReduce**. The AllReduce operator synchronizes parameters/gradients between GPUs. AllReduce is implemented with the ring-based communication method, which avoids a bottleneck at a single GPU.
 Note that the AllReduce operator forces the GPUs to synchronize at that point. Whether the whole training process runs in asynchronous or synchronous mode depends on where the AllReduce point is placed in the graph.
 As shown in the picture, each GPU computes the gradient of `W`, which is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; each GPU then runs the optimization process individually and applies the gradient to its own `W`.
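The data flow described above can be simulated on the CPU. The sketch below is not NCCL and not Paddle code: the `broadcast` and `all_reduce_sum` helpers are hypothetical, each list index stands for one GPU rank, and the plain SGD update with learning rate `lr` is an assumption made only for illustration.

```python
def broadcast(value, n_ranks):
    # Every rank receives a copy of the value owned by the source rank.
    return [value for _ in range(n_ranks)]


def all_reduce_sum(per_rank_values):
    # Every rank ends up holding the sum over all ranks.
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


n_ranks = 4
w = broadcast(1.0, n_ranks)      # rank0's initial W is copied to all ranks
dw = [0.5, 1.5, -1.0, 1.0]       # each rank's gradient on its own mini-batch
lr = 0.5                         # learning rate (assumed plain SGD)

dw_full = all_reduce_sum(dw)     # synchronization point: full-batch dW
w = [wi - lr * g for wi, g in zip(w, dw_full)]  # each rank updates its own W
```

Because every rank applies the same accumulated gradient to the same starting value, all copies of `W` remain identical after the update without any further communication.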
 - **AllReduce**
  Note that our AllReduce operator is a ring-based AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU optimizes the full batch of data, which wastes the compute resources of (n-1) GPUs. In addition, the NCCL2 built-in AllReduce only uses the communication resources during synchronization, and the gradient update is a subsequent phase. In fact, we can amortize the gradient update time into the communication phase. The process is:
 1. Every parameter has a root card. That card is responsible for aggregating the gradients from the GPUs.
 2. The whole model's parameters are hashed to different root cards to balance the load between GPUs.
 3. Each card starts sending the parameter to its logical neighbor. After one round, the parameter's root card has aggregated the full gradients.
 4. The root card then optimizes the parameter.
 5. The root card sends its optimized result to its neighbor, which then sends the parameter on to the next card.
 6. The synchronization round finishes.
 The total time cost is 2 * (n-1) * per-parameter-send-time, so we reach the goal of amortizing the update time into the communication phase.
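The send count behind the `2 * (n-1)` figure can be checked with a small simulation of one parameter's round. This is one possible reading of the steps above rather than the actual operator: `ring_sync` is a hypothetical function, each list index stands for one card, and the plain SGD update is an assumption.

```python
def ring_sync(grads, param, root, lr):
    """Simulate one ring round for a single parameter; return (params, sends)."""
    n = len(grads)
    sends = 0

    # Reduce phase: the partial sum starts at the root's neighbor and travels
    # around the ring until it reaches the root, taking n - 1 sends.
    card = (root + 1) % n
    partial = grads[card]
    while card != root:
        card = (card + 1) % n
        partial += grads[card]
        sends += 1

    # The root card optimizes the parameter (plain SGD, assumed here).
    new_param = param - lr * partial

    # Broadcast phase: the optimized value travels around the ring again,
    # taking another n - 1 sends.
    params = [None] * n
    params[root] = new_param
    card = root
    for _ in range(n - 1):
        nxt = (card + 1) % n
        params[nxt] = params[card]
        sends += 1
        card = nxt
    return params, sends


params, sends = ring_sync([1.0, 2.0, 3.0, 4.0], param=10.0, root=2, lr=0.5)
```

With four cards the round takes six sends, matching `2 * (n-1)`, and the optimizer runs once on the root card instead of once per GPU.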
--- a/doc/getstarted/concepts/src/infer.py
+++ b/doc/getstarted/concepts/src/infer.py
@@ -0,0 +1,18 @@
 import paddle.v2 as paddle
 import numpy as np
 paddle.init(use_gpu=False)
 x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(2))
 y_predict = paddle.layer.fc(input=x, size=1, act=paddle.activation.Linear())
 # load the model parameters produced by training
 with open('params_pass_90.tar', 'r') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
 # Feed in multiple samples; the inference results are returned as an array.
 i = [[[1, 2]], [[3, 4]], [[5, 6]]]
 print paddle.infer(output_layer=y_predict, parameters=parameters, input=i)
 # Will print:
 # [[ -3.24491572]
 #  [ -6.94668722]
 #  [-10.64845848]]
--- a/doc/getstarted/concepts/src/train.py
+++ b/doc/getstarted/concepts/src/train.py
@@ -26,6 +26,11 @@ def event_handler(event):
        if event.batch_id % 1 == 0:
            print "Pass %d, Batch %d, Cost %f" % (event.pass_id, event.batch_id,
                                                  event.cost)
    # save the model parameters every 10 passes
    if isinstance(event, paddle.event.EndPass):
        if event.pass_id % 10 == 0:
            with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                trainer.save_parameter_to_tar(f)
 # define training dataset reader
--- a/doc/getstarted/concepts/use_concepts_cn.rst
+++ b/doc/getstarted/concepts/use_concepts_cn.rst
@@ -147,4 +147,9 @@ PaddlePaddle supports different types of input data, mainly four types, and
 ..  literalinclude:: src/train.py
    :linenos:
 Use the model trained above for prediction: take one of the saved models, params_pass_90.tar, feed in the vectors to predict, and print the output:
 ..  literalinclude:: src/infer.py
    :linenos:
 For a practical application of linear regression, see `Chapter 1 <http://book.paddlepaddle.org/index.html>`_ of the PaddlePaddle book.
--- a/doc/howto/read_source.md
+++ b/doc/howto/read_source.md
@@ -6,10 +6,10 @@ Core: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework
 Operator: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators
 Optimizer: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/optimizer
 Memory: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory
 Platform: https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform
 # Compile Time
 The following **defines** the NN. The definition goes into this [protocol buffer](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto).
--- a/doc/mobile/cross_compiling_for_ios_cn.md
+++ b/doc/mobile/cross_compiling_for_ios_cn.md
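@@ -18,11 +18,11 @@ PaddlePaddle provides a toolchain configuration document for cross-compiling [cmake/cross_compiling/
 - `CMAKE_SYSTEM_NAME`: the target platform for the CMake build; must be set to `iOS`. Once `CMAKE_SYSTEM_NAME=iOS` is set, PaddlePaddle's CMake system automatically builds all third-party dependencies and forces the values of some PaddlePaddle parameters (`WITH_C_API=ON`, `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, `WITH_RDMA=OFF`).
 - `WITH_C_API`: whether to build the C-API inference library; must be set to ON. On iOS, inference is supported only through the C-API.
-- `WITH_SWIG_PY`: must be set to ON. Training or inference through swig is not supported on iOS.
+- `WITH_SWIG_PY`: must be set to `OFF`. Training or inference through swig is not supported on iOS.
 Optional configuration parameters for the iOS platform:
-- `IOS_PLATFORM`: can be set to `OS/SIMULATOR`; the default is `OS`.
+- `IOS_PLATFORM`: can be set to `OS` (default) or `SIMULATOR`.
  - `OS`: build for physical devices with the `arm` architecture, such as iPhone or iPad.
  - `SIMULATOR`: build for the simulator platform with the `x86` architecture.
 - `IOS_ARCH`: the target architecture. The architectures available for each `IOS_PLATFORM` are listed in the table below; all architectures are built by default: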
--- a/doc/mobile/cross_compiling_for_ios_en.md
+++ b/doc/mobile/cross_compiling_for_ios_en.md
@@ -0,0 +1,120 @@
 # PaddlePaddle Compiling Guide for iOS
 This tutorial walks you through cross-compiling the PaddlePaddle library for iOS from source on macOS.
 ## Preparation
 Apple provides Xcode, which serves both as the cross-compiling toolchain and as the IDE for iOS development. Download it from the App Store or [here](https://developer.apple.com/cn/xcode/). To verify your installation, run the following command:
 ```bash
 $ xcodebuild -version
 Xcode 9.0
 Build version 9A235
 ```
 ## Cross-compiling configurations
 PaddlePaddle provides a cross-compiling toolchain configuration file, [cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake), which has some default settings for frequently used compilers.
 Some mandatory CMake variables need to be set before cross-compiling PaddlePaddle for iOS:
 - `CMAKE_SYSTEM_NAME`, the CMake target platform name; it has to be `iOS`. When this variable is set to `iOS`, PaddlePaddle's CMake will compile all the third-party dependencies and enforce some parameters (`WITH_C_API=ON`, `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, `WITH_RDMA=OFF`).
 - `WITH_C_API`, whether to compile the inference C-API library; it has to be `ON`, since the C-API is the only supported interface for inference on iOS.
 - `WITH_SWIG_PY`, has to be `OFF`. Inference and training via swig are not supported on iOS.
 Optional variables for iOS are:
 - `IOS_PLATFORM`, either `OS` (default) or `SIMULATOR`.
  - `OS`, build targets ARM-based physical devices like iPhone or iPad.
  - `SIMULATOR`, build targets x86 architecture simulators.
 - `IOS_ARCH`, target architecture. By default, all architecture types will be compiled. If you need to specify the architecture to compile for, please find valid values for different `IOS_PLATFORM` settings from the table below:
    <table class="docutils">
    <colgroup>
      <col width="35%" />
      <col width="65%" />
    </colgroup>
    <thead valign="bottom">
      <tr class="row-odd">
      <th class="head">IOS_PLATFORM</th>
      <th class="head">IOS_ARCH</th>
    </tr>
    </thead>
    <tbody valign="top">
      <tr class="row-even">
      <td>OS</td>
      <td>armv7, armv7s, arm64 </td>
    </tr>
    <tr class="row-odd">
      <td>SIMULATOR</td>
      <td>i386, x86_64 </td>
    </tr>
    </tbody>
    </table>
 - `IOS_DEPLOYMENT_TARGET`, the minimum iOS version to deploy to, `7.0` by default.
 - `IOS_ENABLE_BITCODE`, whether to enable [Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3). Values can be `ON/OFF`, `ON` by default.
 - `IOS_USE_VECLIB_FOR_BLAS`, whether to use the [vecLib](https://developer.apple.com/documentation/accelerate/veclib) framework for BLAS computing. Values can be `ON/OFF`, `OFF` by default.
 - `IOS_DEVELOPMENT_ROOT`, the path to the `Developer` directory; it can be set explicitly to your `/path/to/platform/Developer`. If left blank, PaddlePaddle automatically picks the `Developer` directory of the Xcode `platform` that corresponds to your `IOS_PLATFORM` value.
 - `IOS_SDK_ROOT`, the path to the `SDK` root; it can be set explicitly to your `/path/to/platform/Developer/SDKs/SDK`. If left blank, PaddlePaddle picks the latest SDK in the `IOS_DEVELOPMENT_ROOT` directory.
 Other settings:
 - `USE_EIGEN_FOR_BLAS`, whether to use Eigen for matrix computing. Effective when `IOS_USE_VECLIB_FOR_BLAS=OFF`. Values can be `ON/OFF`, `OFF` by default.
 - `HOST_C/CXX_COMPILER`, the host C/C++ compiler. By default it uses the value of the environment variable `CC/CXX`, or `cc/c++` if `CC/CXX` is not set.
 Some typical CMake configurations:
 ```bash
 cmake -DCMAKE_SYSTEM_NAME=iOS \
      -DIOS_PLATFORM=OS \
      -DIOS_ARCH="armv7;arm64" \
      -DIOS_ENABLE_BITCODE=ON \
      -DIOS_USE_VECLIB_FOR_BLAS=ON \
      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
      -DWITH_C_API=ON \
      -DWITH_TESTING=OFF \
      -DWITH_SWIG_PY=OFF \
      ..
 ```
 ```bash
 cmake -DCMAKE_SYSTEM_NAME=iOS \
      -DIOS_PLATFORM=SIMULATOR \
      -DIOS_ARCH="x86_64" \
      -DIOS_USE_VECLIB_FOR_BLAS=ON \
      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
      -DWITH_C_API=ON \
      -DWITH_TESTING=OFF \
      -DWITH_SWIG_PY=OFF \
      ..
 ```
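The platform and architecture pairing from the table above can be checked before CMake is invoked. The helper below is a minimal sketch, not part of PaddlePaddle: `ios_cmake_args` and `VALID_ARCHS` are hypothetical names, and the function only assembles the flags that appear in the configurations above.

```python
# Valid IOS_ARCH values per IOS_PLATFORM, copied from the table above.
VALID_ARCHS = {
    "OS": {"armv7", "armv7s", "arm64"},
    "SIMULATOR": {"i386", "x86_64"},
}


def ios_cmake_args(platform="OS", archs=None,
                   install_prefix="your/path/to/install"):
    """Build a cmake argument list, rejecting invalid platform/arch pairs."""
    if platform not in VALID_ARCHS:
        raise ValueError("IOS_PLATFORM must be OS or SIMULATOR: %r" % platform)
    args = ["cmake", "-DCMAKE_SYSTEM_NAME=iOS", "-DIOS_PLATFORM=" + platform]
    if archs:  # when omitted, all architectures are built by default
        bad = sorted(set(archs) - VALID_ARCHS[platform])
        if bad:
            raise ValueError("invalid IOS_ARCH for %s: %s"
                             % (platform, ", ".join(bad)))
        args.append("-DIOS_ARCH=" + ";".join(archs))
    args += [
        "-DCMAKE_INSTALL_PREFIX=" + install_prefix,
        "-DWITH_C_API=ON",     # mandatory on iOS
        "-DWITH_SWIG_PY=OFF",  # mandatory on iOS
        "..",
    ]
    return args


device_args = ios_cmake_args("OS", ["armv7", "arm64"])
simulator_args = ios_cmake_args("SIMULATOR", ["x86_64"])
```

The resulting lists can be passed to `subprocess.run` from a build directory; a call such as `ios_cmake_args("SIMULATOR", ["arm64"])` raises `ValueError` instead of starting a misconfigured build.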
 You can set other compiling parameters to suit your needs. For example, if you are trying to minimize the library size, set `CMAKE_BUILD_TYPE` to `MinSizeRel`; if performance is your concern, set `CMAKE_BUILD_TYPE` to `Release`. You can even adjust the PaddlePaddle compiling procedure by manually setting the `CMAKE_C/CXX_FLAGS` values.
 **TIPS for a better performance**:
 - set `CMAKE_BUILD_TYPE` with `Release`
 - set `IOS_USE_VECLIB_FOR_BLAS` with `ON`
 ## Compile and install
 After running CMake, run the following commands. PaddlePaddle will download and compile the third-party dependencies, then compile and install the PaddlePaddle inference library.
 ```
 $ make
 $ make install
 ```
 Please note: if you have previously compiled PaddlePaddle in the source directory for another platform, remove the `third_party` and `build` directories within the source tree with `rm -rf`, so that all third-party dependencies and PaddlePaddle itself are compiled afresh with the current CMake configuration.
 After compiling and installing, the `your/path/to/install` directory will contain the following directories:
 - `include`, contains all the C-API header files.
 - `lib`, contains the PaddlePaddle C-API static library.
 - `third_party`, contains all the third-party libraries.
 Please note: if the PaddlePaddle library needs to support both physical devices and simulators, you will need to compile for each separately and then merge the results into a fat library with `lipo`.
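The merge step can be scripted. The sketch below only assembles a `lipo -create ... -output ...` command line; `lipo_create_command` is a hypothetical helper, and the library paths are placeholders rather than the actual file names produced by the build.

```python
def lipo_create_command(input_libs, output_lib):
    """Return the lipo command that merges per-platform static libraries."""
    if len(input_libs) < 2:
        raise ValueError("a fat library needs at least two input libraries")
    return ["lipo", "-create"] + list(input_libs) + ["-output", output_lib]


# Placeholder paths: one build for devices, one for simulators.
cmd = lipo_create_command(
    ["install_os/lib/libexample.a", "install_simulator/lib/libexample.a"],
    "install_universal/lib/libexample.a",
)
# On macOS the command could then be run with:
#   import subprocess; subprocess.run(cmd, check=True)
```

The two inputs must contain different architectures (for example `arm64` and `x86_64`); `lipo` refuses to merge inputs that share an architecture.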
 You now have the PaddlePaddle library compiled and installed. The fat library can be used in deep-learning-related iOS apps. Please refer to the C-API documentation for usage guides.
--- a/doc/mobile/index_en.rst
+++ b/doc/mobile/index_en.rst
@@ -5,4 +5,5 @@ MOBILE
  :maxdepth: 1
  cross_compiling_for_android_en.md
  cross_compiling_for_ios_en.md
  cross_compiling_for_raspberry_en.md
--- a/paddle/capi/error.cpp
+++ b/paddle/capi/error.cpp
@@ -14,7 +14,7 @@ limitations under the License. */
 #include "error.h"
-const char* paddle_error_string(paddle_error err) {
+extern "C" const char* paddle_error_string(paddle_error err) {
  switch (err) {
    case kPD_NULLPTR:
      return "nullptr error";
--- a/paddle/capi/error.h
+++ b/paddle/capi/error.h
@@ -29,9 +29,17 @@ typedef enum {
  kPD_UNDEFINED_ERROR = -1,
 } paddle_error;
 #ifdef __cplusplus
 extern "C" {
 #endif
 /**
 * Error string for Paddle API.
 */
 PD_API const char* paddle_error_string(paddle_error err);
 #ifdef __cplusplus
 }
 #endif
 #endif
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -58,3 +58,6 @@ cc_test(var_type_inference_test SRCS var_type_inference_test.cc DEPS op_registry
        proto_desc)
 cc_library(selected_rows SRCS selected_rows.cc DEPS tensor)
 cc_test(selected_rows_test SRCS selected_rows_test.cc DEPS selected_rows)
 cc_library(init SRCS init.cc DEPS gflags executor place stringpiece)
 cc_test(init_test SRCS init_test.cc DEPS init)
--- a/paddle/framework/ddim_test.cc
+++ b/paddle/framework/ddim_test.cc
@@ -1,3 +1,16 @@
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
   http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
 #include <sstream>
 #include <vector>