commit f7941dbb74
@ -1 +1,157 @@
./doc/howto/dev/contribute_to_paddle_en.md
# Contribute Code

We sincerely appreciate your contribution. This document explains our workflow and work style.

## Workflow

PaddlePaddle uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/). The following steps guide usual contributions.

1. Fork

   Our development community has been growing fast; it doesn't make sense for everyone to write into the official repo. So, please file pull requests from your fork. To make a fork, just head over to the GitHub page and click the ["Fork" button](https://help.github.com/articles/fork-a-repo/).
1. Clone

   To make a copy of your fork on your local computer, please run

   ```bash
   git clone https://github.com/your-github-account/paddle
   cd paddle
   ```

1. Create the local feature branch

   For daily work like adding a new feature or fixing a bug, please open your feature branch before coding:

   ```bash
   git checkout -b my-cool-stuff
   ```
1. Commit

   Before issuing your first `git commit` command, please install [`pre-commit`](http://pre-commit.com/) by running the following commands:

   ```bash
   pip install pre-commit
   pre-commit install
   ```

   Our pre-commit configuration requires clang-format 3.8 for auto-formatting C/C++ code and yapf for Python.

   Once installed, `pre-commit` checks the style of code and documentation in every commit. You will see something like the following when you run `git commit`:

   ```
   ➜  git commit
   CRLF end-lines remover...............................(no files to check)Skipped
   yapf.................................................(no files to check)Skipped
   Check for added large files..............................................Passed
   Check for merge conflicts................................................Passed
   Check for broken symlinks................................................Passed
   Detect Private Key...................................(no files to check)Skipped
   Fix End of Files.....................................(no files to check)Skipped
   clang-formater.......................................(no files to check)Skipped
   [my-cool-stuff c703c041] add test file
    1 file changed, 0 insertions(+), 0 deletions(-)
    create mode 100644 233
   ```
1. Build and test

   Users can build PaddlePaddle natively on Linux and Mac OS X. But to unify the building environment and to make debugging easy, the recommended way is [using Docker](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/howto/dev/build_en.md).

1. Keep pulling

   An experienced Git user pulls from the official repo often -- daily or even hourly -- so they notice conflicts with others' work early, and smaller conflicts are easier to resolve.

   ```bash
   git remote add upstream https://github.com/PaddlePaddle/Paddle
   git pull upstream develop
   ```
1. Push and file a pull request

   You can "push" your local work into your forked repo:

   ```bash
   git push origin my-cool-stuff
   ```

   The push allows you to create a pull request, requesting owners of this [official repo](https://github.com/PaddlePaddle/Paddle) to pull your change into the official one.

   To create a pull request, please follow [these steps](https://help.github.com/articles/creating-a-pull-request/).

   If your change is for fixing an issue, please write ["Fixes <issue-URL>"](https://help.github.com/articles/closing-issues-using-keywords/) in the description section of your pull request. GitHub will close the issue when the owners merge your pull request.

   Please remember to specify some reviewers for your pull request. If you don't know who the right ones are, please follow GitHub's recommendation.

1. Delete local and remote branches

   To keep your local workspace and your fork clean, you might want to remove merged branches:

   ```bash
   git push origin :my-cool-stuff
   git checkout develop
   git pull upstream develop
   git branch -d my-cool-stuff
   ```
### Code Review

- Please feel free to ping your reviewers by sending them the URL of your pull request via IM or email. Please do this after your pull request passes the CI.

- Please respond to every comment from your reviewers. If you followed a suggestion, please write "Done"; otherwise, please give a reason.

- If you don't want your reviewers to get overwhelmed by email notifications, you can reply to their comments [in a batch](https://help.github.com/articles/reviewing-proposed-changes-in-a-pull-request/).

- Reduce unnecessary commits. Some developers commit often. It is recommended to fold a sequence of small changes into one commit by running `git commit --amend` instead of `git commit`.
## Coding Standard

### Code Style

Our C/C++ code follows the [Google style guide](http://google.github.io/styleguide/cppguide.html).

Our Python code follows the [PEP8 style guide](https://www.python.org/dev/peps/pep-0008/).

Our build process helps to check the code style. In [`build.sh`](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/docker/build.sh#L42), the entry point of our [builder Docker image](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/Dockerfile#L88), the CMake argument `WITH_STYLE_CHECK` is set to `ON` by default.

Please install pre-commit, which automatically reformats the changes to C/C++ and Python code whenever we run `git commit`. To check the whole codebase, we can run the command `pre-commit run -a`, as in the [`check_style.sh` file](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/paddle/scripts/travis/check_style.sh#L30), which is invoked by [our Travis CI configuration](https://github.com/PaddlePaddle/Paddle/blob/b84e8226514b8bb4405c3c28e54aa5077193d179/.travis.yml#L43).
### Unit Tests

Please remember to add related unit tests.

- For C/C++ code, please follow [`google-test` Primer](https://github.com/google/googletest/blob/master/googletest/docs/Primer.md).

- For Python code, please use [Python's standard `unittest` package](http://pythontesting.net/framework/unittest/unittest-introduction/).
### Writing Logs

We use [glog](https://github.com/google/glog) for logging in our C/C++ code.

For general information, please use `LOG`. For debug information, please use [`VLOG`](http://htmlpreview.github.io/?https://github.com/google/glog/blob/master/doc/glog.html#verbose). The reasoning is explained [here](https://groups.google.com/a/chromium.org/d/msg/chromium-dev/3NDNd1KzXeY/AZKMMx37fdQJ).

`VLOG` requires a *verbose level* parameter. For example:

```c++
VLOG(3) << "Operator FC is taking " << num_inputs << " inputs.";
```

When we run a PaddlePaddle application or test, we can specify a verbose threshold. For example:

```bash
GLOG_vmodule=buddy_allocator=2 \
GLOG_v=10 \
python \
../python/paddle/v2/framework/tests/test_recurrent_op.py
```

This sets the verbose threshold to 2 for messages from `buddy_allocator.{h,cc}` and to 10 for all other files, so the example `VLOG(3)` message above will be shown. The convention is to output overall messages at lower verbose levels, so they display more often. When coding C++, please follow the verbose level convention as follows:

- verbose level 1: [framework](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/framework)
- verbose level 3: [operators](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/operators)
- verbose level 5: [memory](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/memory), [platform](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/platform)
- verbose level 7: [math](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/math)
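A toy model of the filtering rule may help (a sketch of glog's behavior, not its real implementation): a `VLOG(level)` message from a given file is shown when `level` is at most the threshold for that file, and `GLOG_vmodule` overrides the global `GLOG_v` per module:

```python
def vlog_enabled(level, module, glog_v=0, vmodule=None):
    """Toy version of glog's check: per-module thresholds override GLOG_v."""
    vmodule = vmodule or {}
    threshold = vmodule.get(module, glog_v)
    return level <= threshold

# Mirrors GLOG_vmodule=buddy_allocator=2 GLOG_v=10 from the command above.
vm = {"buddy_allocator": 2}
print(vlog_enabled(3, "fc_op", glog_v=10, vmodule=vm))            # True
print(vlog_enabled(3, "buddy_allocator", glog_v=10, vmodule=vm))  # False
```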
@ -0,0 +1,48 @@
# Benchmark

Machine:

- Server
  - Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
- Laptop
  - DELL XPS15-9560-R1745: i7-7700HQ, 8G RAM, 256G SSD
  - i5 MacBook Pro (Retina, 13-inch, Early 2015)
- Desktop
  - i7-6700k

System: CentOS release 6.3 (Final), Docker 1.12.1.

PaddlePaddle: paddlepaddle/paddle:latest (TODO: will rerun after 0.11.0)

- MKL-DNN tag v0.10
- MKLML 2018.0.20170720
- OpenBLAS v0.2.20

On each machine, we will test and compare the performance of single-node training with MKL-DNN, MKLML, and OpenBLAS respectively.

## Benchmark Model

### Server
Tested with batch sizes 64, 128, and 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.

Input image size: 3 * 224 * 224. Metric: images/second.

- VGG-19

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 7.82  | 8.62  | 10.34 |
| MKLML     | 11.02 | 12.86 | 15.33 |
| MKL-DNN   | 27.69 | 28.8  | 29.27 |
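As a quick reading of the table (an arithmetic check on the numbers above, not a new measurement), the relative speedup of MKL-DNN over OpenBLAS can be computed per batch size:

```python
# Throughput numbers copied from the VGG-19 table above (images/second).
openblas = {64: 7.82, 128: 8.62, 256: 10.34}
mkldnn = {64: 27.69, 128: 28.8, 256: 29.27}

for bs in (64, 128, 256):
    speedup = mkldnn[bs] / openblas[bs]
    print("batch %d: MKL-DNN is %.2fx OpenBLAS" % (bs, speedup))
```

Note that the advantage narrows as the batch size grows, since OpenBLAS throughput scales with batch size while MKL-DNN is nearly flat.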
chart on batch size 128
TBD

- ResNet
- GoogLeNet

### Laptop
TBD

### Desktop
TBD
@ -0,0 +1,213 @@
#!/usr/bin/env python
from paddle.trainer_config_helpers import *

height = 224
width = 224
num_class = 1000
batch_size = get_config_arg('batch_size', int, 64)
layer_num = get_config_arg("layer_num", int, 50)
is_test = get_config_arg("is_test", bool, False)

args = {'height': height, 'width': width, 'color': True, 'num_class': num_class}
define_py_data_sources2(
    "train.list", None, module="provider", obj="process", args=args)

settings(
    batch_size=batch_size,
    learning_rate=0.01 / batch_size,
    learning_method=MomentumOptimizer(0.9),
    regularization=L2Regularization(0.0005 * batch_size))


####################### Network Configuration #######################
def conv_bn_layer(name,
                  input,
                  filter_size,
                  num_filters,
                  stride,
                  padding,
                  channels=None,
                  active_type=ReluActivation()):
    """
    A wrapper for a conv layer followed by a batch normalization layer.
    Note:
        the conv layer itself has no activation.
    """

    tmp = img_conv_layer(
        name=name + "_conv",
        input=input,
        filter_size=filter_size,
        num_channels=channels,
        num_filters=num_filters,
        stride=stride,
        padding=padding,
        act=LinearActivation(),
        bias_attr=False)
    return batch_norm_layer(
        name=name + "_bn", input=tmp, act=active_type, use_global_stats=is_test)


def bottleneck_block(name, input, num_filters1, num_filters2):
    """
    A wrapper for a bottleneck building block in ResNet.
    The last conv_bn_layer has no activation.
    The addto layer has a relu activation.
    """
    last_name = conv_bn_layer(
        name=name + '_branch2a',
        input=input,
        filter_size=1,
        num_filters=num_filters1,
        stride=1,
        padding=0)
    last_name = conv_bn_layer(
        name=name + '_branch2b',
        input=last_name,
        filter_size=3,
        num_filters=num_filters1,
        stride=1,
        padding=1)
    last_name = conv_bn_layer(
        name=name + '_branch2c',
        input=last_name,
        filter_size=1,
        num_filters=num_filters2,
        stride=1,
        padding=0,
        active_type=LinearActivation())

    return addto_layer(
        name=name + "_addto", input=[input, last_name], act=ReluActivation())


def mid_projection(name, input, num_filters1, num_filters2, stride=2):
    """
    A wrapper for the middle projection in ResNet.
    Projection shortcuts are used for increasing dimensions;
    the other shortcuts are identity.
    branch1: a projection shortcut used for increasing dimensions,
        with no activation.
    branch2x: a bottleneck building block whose shortcut is identity.
    """
    # stride = 2
    branch1 = conv_bn_layer(
        name=name + '_branch1',
        input=input,
        filter_size=1,
        num_filters=num_filters2,
        stride=stride,
        padding=0,
        active_type=LinearActivation())

    last_name = conv_bn_layer(
        name=name + '_branch2a',
        input=input,
        filter_size=1,
        num_filters=num_filters1,
        stride=stride,
        padding=0)
    last_name = conv_bn_layer(
        name=name + '_branch2b',
        input=last_name,
        filter_size=3,
        num_filters=num_filters1,
        stride=1,
        padding=1)

    last_name = conv_bn_layer(
        name=name + '_branch2c',
        input=last_name,
        filter_size=1,
        num_filters=num_filters2,
        stride=1,
        padding=0,
        active_type=LinearActivation())

    return addto_layer(
        name=name + "_addto", input=[branch1, last_name], act=ReluActivation())


img = data_layer(name='image', size=height * width * 3)


def deep_res_net(res2_num=3, res3_num=4, res4_num=6, res5_num=3):
    """
    A wrapper for the 50-, 101-, and 152-layer versions of ResNet.
    res2_num: number of blocks stacked in conv2_x
    res3_num: number of blocks stacked in conv3_x
    res4_num: number of blocks stacked in conv4_x
    res5_num: number of blocks stacked in conv5_x
    """
    # For ImageNet
    # conv1: 112x112
    tmp = conv_bn_layer(
        "conv1",
        input=img,
        filter_size=7,
        channels=3,
        num_filters=64,
        stride=2,
        padding=3)
    tmp = img_pool_layer(name="pool1", input=tmp, pool_size=3, stride=2)

    # conv2_x: 56x56
    tmp = mid_projection(
        name="res2_1", input=tmp, num_filters1=64, num_filters2=256, stride=1)
    for i in xrange(2, res2_num + 1, 1):
        tmp = bottleneck_block(
            name="res2_" + str(i), input=tmp, num_filters1=64, num_filters2=256)

    # conv3_x: 28x28
    tmp = mid_projection(
        name="res3_1", input=tmp, num_filters1=128, num_filters2=512)
    for i in xrange(2, res3_num + 1, 1):
        tmp = bottleneck_block(
            name="res3_" + str(i),
            input=tmp,
            num_filters1=128,
            num_filters2=512)

    # conv4_x: 14x14
    tmp = mid_projection(
        name="res4_1", input=tmp, num_filters1=256, num_filters2=1024)
    for i in xrange(2, res4_num + 1, 1):
        tmp = bottleneck_block(
            name="res4_" + str(i),
            input=tmp,
            num_filters1=256,
            num_filters2=1024)

    # conv5_x: 7x7
    tmp = mid_projection(
        name="res5_1", input=tmp, num_filters1=512, num_filters2=2048)
    for i in xrange(2, res5_num + 1, 1):
        tmp = bottleneck_block(
            name="res5_" + str(i),
            input=tmp,
            num_filters1=512,
            num_filters2=2048)

    tmp = img_pool_layer(
        name='avgpool',
        input=tmp,
        pool_size=7,
        stride=1,
        pool_type=AvgPooling())

    return fc_layer(input=tmp, size=num_class, act=SoftmaxActivation())


if layer_num == 50:
    resnet = deep_res_net(3, 4, 6, 3)
elif layer_num == 101:
    resnet = deep_res_net(3, 4, 23, 3)
elif layer_num == 152:
    resnet = deep_res_net(3, 8, 36, 3)
else:
    print("Wrong layer number.")

lbl = data_layer(name="label", size=num_class)
loss = cross_entropy(name='loss', input=resnet, label=lbl)
inputs(img, lbl)
outputs(loss)
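As a sanity check on the config above (an illustrative helper, not part of the file): each bottleneck block contributes 3 conv layers, and adding `conv1` and the final fc layer yields the advertised ResNet depths:

```python
# Hypothetical helper: verify the block counts passed to deep_res_net()
# add up to the standard ResNet depths (50 / 101 / 152).
def resnet_depth(res2_num, res3_num, res4_num, res5_num):
    # 3 conv layers per bottleneck block, plus conv1 and the final fc layer.
    return 3 * (res2_num + res3_num + res4_num + res5_num) + 2

print(resnet_depth(3, 4, 6, 3))   # 50
print(resnet_depth(3, 4, 23, 3))  # 101
print(resnet_depth(3, 8, 36, 3))  # 152
```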
@ -0,0 +1,36 @@
=====================
Data Reader Interface
=====================


DataTypes
=========

.. automodule:: paddle.v2.data_type
    :members:
    :noindex:

DataFeeder
==========

.. automodule:: paddle.v2.data_feeder
    :members:
    :noindex:

Reader
======

.. automodule:: paddle.v2.reader
    :members:
    :noindex:

.. automodule:: paddle.v2.reader.creator
    :members:
    :noindex:

minibatch
=========

.. automodule:: paddle.v2.minibatch
    :members:
    :noindex:
@ -0,0 +1,75 @@
Dataset
=======

.. automodule:: paddle.v2.dataset
    :members:
    :noindex:

mnist
+++++

.. automodule:: paddle.v2.dataset.mnist
    :members:
    :noindex:

cifar
+++++

.. automodule:: paddle.v2.dataset.cifar
    :members:
    :noindex:

conll05
+++++++

.. automodule:: paddle.v2.dataset.conll05
    :members: get_dict,get_embedding,test
    :noindex:

imdb
++++

.. automodule:: paddle.v2.dataset.imdb
    :members:
    :noindex:

imikolov
++++++++

.. automodule:: paddle.v2.dataset.imikolov
    :members:
    :noindex:

movielens
+++++++++

.. automodule:: paddle.v2.dataset.movielens
    :members:
    :noindex:

.. autoclass:: paddle.v2.dataset.movielens.MovieInfo
    :noindex:

.. autoclass:: paddle.v2.dataset.movielens.UserInfo
    :noindex:

sentiment
+++++++++

.. automodule:: paddle.v2.dataset.sentiment
    :members:
    :noindex:

uci_housing
+++++++++++

.. automodule:: paddle.v2.dataset.uci_housing
    :members:
    :noindex:

wmt14
+++++

.. automodule:: paddle.v2.dataset.wmt14
    :members:
    :noindex:
@ -0,0 +1,5 @@
Image Interface
===============

.. automodule:: paddle.v2.image
    :members:
@ -0,0 +1,60 @@
# Design Doc: float16

## Why float16
Half precision (float16) is a binary floating-point format that occupies 16 bits in memory. float16 is half the size of the traditional 32-bit single precision format (float) and has lower precision and a smaller range.

When high precision computation is not required, using the float16 data type could potentially

- reduce storage space, memory bandwidth, and power usage;
- increase the chance of data fitting into a smaller cache of lower latency;
- provide arithmetic speedup if supported by hardware.

## Survey of current float16 support
A brief survey of float16 support on different compilers, hardware, and libraries can be found below. Interested readers can refer to [link1](https://github.com/PaddlePaddle/Paddle/issues/4853) and [link2](https://github.com/Xreki/Xreki.github.io/blob/master/multi_data_types_in_dl_framework/ppt/float16_and_quantized_type.md) for more info.

The goal of the float16 class is to serve as a key for the executor to find and run the correct version of the compute method specialized for float16 in an operator kernel. It should be compatible with various natively supported float16 implementations, including `__half` for CUDA, `float16_t` for ARM, and `Eigen::half` for Eigen, to make writing customized float16 kernels easier.

### Compiler
- nvcc supports the `__half` data type after CUDA 7.5.
- `__fp16` or `float16_t` is supported as a storage type for gcc >= 6.1 and clang >= 3.4.
- `__fp16` or `float16_t` is supported as an arithmetic type for gcc >= 7.1 and clang >= 3.9.

### Hardware
- `__half` is supported on GPUs with compute capability >= 5.3.
- `__fp16` is supported as a storage type for ARMv7-A, ARMv8-A, and above.
- `__fp16` is supported as an arithmetic type after ARMv8.2-A (currently, the only microarchitecture implementing ARMv8.2-A is ARM Cortex-A75, which was announced in May 2017. There seem to be no application processors currently available on the market that adopt this architecture. It is reported that Qualcomm Snapdragon 845 uses the Cortex-A75 design and will be available in mobile devices in early 2018).

### Libraries
- [Eigen](https://github.com/RLovelett/eigen) >= 3.3 supports float16 calculation on both GPU and CPU using the `Eigen::half` class. It is mostly useful for Nvidia GPUs because of the overloaded arithmetic operators implemented with CUDA intrinsics. It falls back to software emulation on CPU for calculation, and there is no special treatment for ARM processors.
- [ARM compute library](https://github.com/ARM-software/ComputeLibrary) >= 17.02.01 supports NEON FP16 kernels (requires an ARMv8.2-A CPU).


## Implementation
The float16 class holds a 16-bit `uint16_t` data member internally.
```cpp
struct float16 {
  uint16_t x;
};
```

float16 supports the following features:
- constructors / assignment operators that take input from primitive data types including bool, integers of various lengths, float, and double.
- constructors / assignment operators that take input from `__half` on CUDA, `float16_t` on ARM, and `Eigen::half` on Eigen.
- conversion operators to primitive data types and half precision data types on CUDA, ARM, and Eigen.
- overloaded arithmetic operators for CUDA, ARM, and non-ARM CPU, respectively. These operators will take advantage of the CUDA and ARM intrinsics on the corresponding hardware.

To support the above features, two fundamental conversion functions are provided:
```cpp
float16 float_to_half_rn(float f);  // convert to half precision in round-to-nearest-even mode
float half_to_float(float16 h);
```
which provide one-to-one conversion between float32 and float16. These two functions will use different conversion routines based on the current hardware. CUDA/ARM intrinsics will be used when the corresponding hardware is available. If the hardware or compiler level does not support float32 to float16 conversion, software emulation will be performed to do the conversion.
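For intuition, the store-bits-plus-convert scheme can be sketched in Python (a toy model, not Paddle's implementation; the `struct` format `'e'`, available since Python 3.6, performs IEEE 754 binary16 conversion with round-to-nearest-even):

```python
import struct

def float_to_half_bits(f):
    """Round-to-nearest-even float32 -> binary16, returned as a uint16 bit pattern."""
    return struct.unpack('<H', struct.pack('<e', f))[0]

def half_bits_to_float(x):
    """binary16 bit pattern (uint16) -> float."""
    return struct.unpack('<e', struct.pack('<H', x))[0]

bits = float_to_half_bits(0.1)   # 0.1 is not exactly representable in half
print(hex(bits))                 # 0x2e66: the 16 bits a float16 struct would hold
print(half_bits_to_float(bits))  # ~0.0999755859375: the precision lost in rounding
```

This mirrors the design above: the value lives in a plain `uint16_t`-sized bit pattern, and all arithmetic goes through conversions to and from float32 when no native half-precision hardware is available.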
## To do
After the float16 class is available, some of the future items are:

- Update `pybind/tensor_py.h` to bind the C++ float16 with numpy float16.

- Modify the `GetKernelType()` method in `framework/operator.h` to make it compatible with float16.

- Create a type-casting operator that can convert the data type in a tensor between float16 and other types.
@ -0,0 +1,232 @@
## Survey on Graph

Neural network frameworks often provide a symbolic API for users to write network topology conveniently. This doc mainly focuses on the symbolic API in the most popular neural network frameworks, and tries to find out how to parse a symbolic configuration to a portable file, such as protobuf or json.

### Mxnet

The core concept of the symbolic API is `Symbol`. Mxnet implements the `Symbol` class in C++ and exports it to Python using the C API. Please refer to the comments in Mxnet:

`Symbol` is a helper class used to represent the operator node in a graph.
`Symbol` acts as an interface for building graphs from different components like Variable, Functor and Group. `Symbol` is also exported to the python front-end (while Graph is not) to enable quick tests and deployment. Conceptually, a symbol is the final operation of a graph and thus includes all the information required (the graph) to evaluate its output value.

A simple network topology written with Symbol is as follows:

```python
def get_symbol(num_classes=10, **kwargs):
    data = mx.symbol.Variable('data')
    data = mx.symbol.Flatten(data=data)
    fc1 = mx.symbol.FullyConnected(data=data, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data=fc1, name='relu1', act_type="relu")
    fc2 = mx.symbol.FullyConnected(data=act1, name='fc2', num_hidden=64)
    act2 = mx.symbol.Activation(data=fc2, name='relu2', act_type="relu")
    fc3 = mx.symbol.FullyConnected(data=act2, name='fc3', num_hidden=num_classes)
    mlp = mx.symbol.SoftmaxOutput(data=fc3, name='softmax')
    return mlp
```
A Variable here is actually a Symbol. Every basic Symbol corresponds to one Node, and every Node has its own NodeAttr. There is an op field in the NodeAttr class; when a Symbol represents a Variable (often input data), the op field is null.

Symbol contains a data member, `std::vector<NodeEntry> outputs`, and NodeEntry contains a pointer to Node. We can follow the Node pointers to recover the whole graph.

And a Symbol can be saved to a JSON file.

Here is a detailed example:

```
>>> import mxnet as mx
>>> data = mx.symbol.Variable('data')
>>> print data.debug_str()
Variable:data

>>> data = mx.symbol.Flatten(data=data)
>>> print data.debug_str()
Symbol Outputs:
	output[0]=flatten0(0)
Variable:data
--------------------
Op:Flatten, Name=flatten0
Inputs:
	arg[0]=data(0) version=0

>>> fc1 = mx.symbol.FullyConnected(data = data, name='fc1', num_hidden=128)
>>> print fc1.debug_str()
Symbol Outputs:
	output[0]=fc1(0)
Variable:data
--------------------
Op:Flatten, Name=flatten0
Inputs:
	arg[0]=data(0) version=0
Variable:fc1_weight
Variable:fc1_bias
--------------------
Op:FullyConnected, Name=fc1
Inputs:
	arg[0]=flatten0(0)
	arg[1]=fc1_weight(0) version=0
	arg[2]=fc1_bias(0) version=0
Attrs:
	num_hidden=128
```
### TensorFlow

The core concept of the symbolic API is `Tensor`. TensorFlow defines `Tensor` in Python. Please refer to the comments in TensorFlow:

A `Tensor` is a symbolic handle to one of the outputs of an `Operation`. It does not hold the values of that operation's output, but instead provides a means of computing those values in a TensorFlow [Session](https://www.tensorflow.org/api_docs/python/tf/Session).

A simple example is as follows:

```python
# Build a dataflow graph.
c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
e = tf.matmul(c, d)

# Construct a `Session` to execute the graph.
sess = tf.Session()

# Execute the graph and store the value that `e` represents in `result`.
result = sess.run(e)
```

The main properties of `Tensor` are as follows:

```python
@property
def op(self):
    """The `Operation` that produces this tensor as an output."""
    return self._op

@property
def dtype(self):
    """The `DType` of elements in this tensor."""
    return self._dtype

@property
def graph(self):
    """The `Graph` that contains this tensor."""
    return self._op.graph

@property
def name(self):
    """The string name of this tensor."""
    if not self._op.name:
        raise ValueError("Operation was not named: %s" % self._op)
    return "%s:%d" % (self._op.name, self._value_index)

@property
def device(self):
    """The name of the device on which this tensor will be produced, or None."""
    return self._op.device
```
A Tensor can be passed as a target to run by a session. A Tensor contains all the information of the graph and tracks data dependencies.

Here is a detailed example:

```
>>> import tensorflow as tf
>>> c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
>>> print c.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
>>> d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
>>> print d.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
>>> e = tf.matmul(c, d)
>>> print e.graph
<tensorflow.python.framework.ops.Graph object at 0x10f256d50>
```
### Dynet

The core concept of the symbolic API is `Expression`, and Dynet defines the `Expression` class in C++.

A simple example is as follows:

```cpp
ComputationGraph cg;
Expression W = parameter(cg, pW);

Expression in = input(cg, xs[i]);
Expression label = input(cg, ys[i]);
Expression pred = W * in;
Expression loss = square(pred - label);
```

The input data and parameters are also represented by Expressions. Every basic Expression corresponds to a Node, and input data is also a Node.

An Expression has a data member pointing to its ComputationGraph, and the ComputationGraph is modified as users configure the network. An Expression can be a running target, because an Expression contains all of its dependencies.

Here is a detailed example:

write topology in C++

```
ComputationGraph cg;
Expression W = parameter(cg, pW);
cg.print_graphviz();

Expression pred = W * xs[i];
cg.print_graphviz();

Expression loss = square(pred - ys[i]);
cg.print_graphviz();
```
compile and print

```
# first print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
}
# second print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
  N1 [label="v1 = v0 * -0.98"];
  N0 -> N1;
}
# third print
digraph G {
  rankdir=LR;
  nodesep=.05;
  N0 [label="v0 = parameters({1}) @ 0x7ffe4de00110"];
  N1 [label="v1 = v0 * -0.98"];
  N0 -> N1;
  N2 [label="v2 = -1.88387 - v1"];
  N1 -> N2;
  N3 [label="v3 = -v2"];
  N2 -> N3;
  N4 [label="v4 = square(v3)"];
  N3 -> N4;
}
```

### Conclusion

Actually, Symbol/Tensor/Expression in Mxnet/TensorFlow/Dynet are concepts at the same level. We use the unified name Expression here; this level of concept has the following features:

- Users write the topology with a symbolic API, and every return value is an Expression, including input data and parameters.
- An Expression corresponds to a global Graph, and Expressions can be composed.
- An Expression tracks all of its dependencies and can be taken as a run target.
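The features above can be made concrete with a toy sketch (illustrative only, not any framework's actual API): each Expression records its input Expressions, so any Expression can reconstruct the whole graph behind it by following those pointers:

```python
class Expression(object):
    """Toy symbolic node: records an op name and its input Expressions."""
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

    # Composing Expressions with operators builds the graph implicitly.
    def __mul__(self, other):
        return Expression("mul", [self, other])

    def __sub__(self, other):
        return Expression("sub", [self, other])

    def graph(self):
        """Walk input pointers to collect every node this Expression depends on."""
        seen, stack = [], [self]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.append(node)
                stack.extend(node.inputs)
        return seen

W = Expression("parameter")
x = Expression("input")
y = Expression("input")
loss = W * x - y          # loss is an Expression and a valid run target
print(len(loss.graph()))  # 5 nodes: W, x, y, mul, sub
```

Serializing such a graph to protobuf or JSON then amounts to walking `graph()` and dumping each node's op and input list, which is essentially what Mxnet's JSON export does.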